CN111178074B - Chinese named entity recognition method based on deep learning - Google Patents
- Publication number
- CN111178074B CN111178074B CN201911271419.XA CN201911271419A CN111178074B CN 111178074 B CN111178074 B CN 111178074B CN 201911271419 A CN201911271419 A CN 201911271419A CN 111178074 B CN111178074 B CN 111178074B
- Authority
- CN
- China
- Prior art keywords
- training
- layer
- model
- vectors
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00—Computing arrangements based on biological models > G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology > G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/04—Architecture, e.g. interconnection topology > G06N3/045—Combinations of networks
- G06N3/08—Learning methods > G06N3/084—Backpropagation, e.g. using gradient descent
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS > Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE > Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT] > Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a Chinese named entity recognition method based on deep learning. The recognition method comprises the following steps: 1) embedding a mixed word/character/position vector for the data text; 2) inputting the resulting vectors into a Bi-LSTM layer for vector coding, modeling the long-term relations among the vectors captured by the time sequence; 3) inputting the vectors output by the Bi-LSTM layer into a self-attention layer, explicitly learning the dependency between any two characters in the sentence and capturing the internal structural information of the sentence; 4) inputting the output vector sequence into a CRF layer, making independent labeling decisions, and performing label decoding. The invention is scientifically and reasonably designed, runs on multiple data sets, has strong applicability and high accuracy, and can serve as a named entity recognition model for multi-domain text.
Description
Technical Field
The invention belongs to the technical fields of natural language processing, knowledge graphs and sequence labeling, relates to deep learning and sequence labeling technology, and in particular relates to a Chinese named entity recognition method based on deep learning.
Background
Named entity recognition belongs to the field of sequence labeling and is a basic task of natural language processing. Its main goal is to find entities with specific meanings in text, including person names, place names, organization names and certain proper nouns. The recognition task comprises two parts: identifying entity boundaries and determining entity categories (person name, place name, organization name, etc.). Named entities are the basic elements of a text and the basic units for understanding the content of an article. Moreover, named entity recognition serves as an upstream basic task for text data processing such as knowledge graph construction, where its accuracy directly influences the final quality of the constructed graph. A knowledge graph is built on the relations between entities: if entity extraction is wrong, the subsequent determination of entity relations cannot proceed. The same is true of automatic summarization and question-answering systems, where the relevant named entities must be found before semantic analysis of a sentence can be performed. Thus, named entity recognition is extremely critical for text data processing, particularly natural language processing.
Currently, commonly used named entity recognition methods include the CRF model, the LSTM model, and, most popular at present, models combining LSTM with CRF. Compared with a single standalone model, the LSTM-CRF hybrid combines the advantages of both: it can memorize dependencies between long-distance sequence elements while exploiting the labeling strengths of the CRF. It is therefore widely applied in named entity recognition and serves as the basis for further optimization and improvement. Zhang et al. studied a new dynamic meta-embedding method in 2019 and applied it to the Chinese NER task. The method creates dynamic, data-specific and task-specific meta-embeddings, so that the meta-embeddings of the same character differ across sentence sequences. Experiments on the MSRA and LiteratureNER datasets validated the model, and state-of-the-art results were obtained on LiteratureNER.
Although research in recent years has proposed many methods, they generally do not produce good results across multiple data sets; at present there is no universal named entity recognition model that is highly adaptable, accurate, and applicable to multiple fields.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a Chinese named entity recognition method based on deep learning that runs on multiple data sets, has strong applicability and high accuracy, and can serve as a named entity recognition model for multi-domain text.
The invention solves the technical problems by the following technical proposal:
A Chinese named entity recognition method based on deep learning, characterized in that the recognition method comprises the following steps:
1) Embedding the mixed word/character/position vector for the data text;
2) Inputting the vectors obtained in step 1) into a Bi-LSTM layer for vector coding, modeling the long-term relations among the vectors captured by the time sequence;
3) Inputting the vectors output by the Bi-LSTM layer into the self-attention layer, explicitly learning the dependency between any two characters in the sentence, and capturing the internal structural information of the sentence;
4) Inputting the output vector sequence into the CRF layer, making independent labeling decisions, and performing label decoding.
Moreover, the specific operation of step 1) is as follows:
a. A dictionary is built from the training data set and a one-hot vector is obtained for each word; the length of the one-hot vector is the dictionary length V. The look-up layer then maps the one-hot vectors to low-dimensional dense vectors using a pre-trained single-character position vector matrix;
b. The word vector, the character vector of the word-segmented text, and the character position vector are concatenated; the concatenated vectors are used as the input of the network model, giving the Chinese token sequence
X = (x_1, x_2, x_3, ..., x_n)
Whether the token X exists in the word lookup table and in the character lookup table is then checked. If X exists in both tables, i.e. the token consists of a single character, the combination of the two embeddings is taken as the distributed representation of the token; otherwise, only the embedding from one of the lookup tables is used as the output of the embedding layer, and the character position vector is initialized with the word vector of the word in which the character is located.
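As an illustration of this embedding scheme, the following minimal PyTorch sketch assembles the three lookup tables and concatenates their outputs; all vocabulary sizes, dimensions and index values here are assumptions for illustration, not values specified by the invention.

```python
import torch
import torch.nn as nn

class MixedEmbedding(nn.Module):
    """Sketch of the word/character/position mixed embedding of step 1)."""
    def __init__(self, char_vocab=4000, word_vocab=50000,
                 char_dim=100, word_dim=100, pos_dim=50, max_pos=8):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, char_dim)  # character lookup table
        self.word_emb = nn.Embedding(word_vocab, word_dim)  # word lookup table
        self.pos_emb = nn.Embedding(max_pos, pos_dim)       # position of the character in its word

    def forward(self, char_ids, word_ids, pos_ids):
        # Concatenate the three vectors for every token in the sequence.
        return torch.cat([self.char_emb(char_ids),
                          self.word_emb(word_ids),
                          self.pos_emb(pos_ids)], dim=-1)

emb = MixedEmbedding()
x = emb(torch.tensor([[1, 2, 3]]),      # character indices
        torch.tensor([[10, 10, 11]]),   # indices of the words containing each character
        torch.tensor([[0, 1, 0]]))      # position of each character within its word
# x.shape == (1, 3, 250)
```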
Moreover, the specific operation of step 2) is as follows: the word-character mixed vector of each word in the input sequence is fed into the Bi-LSTM layer as one time step of the network, and global features are extracted. The bidirectional LSTM network yields the implicit output sequence of the forward LSTM, (hf_1, hf_2, ..., hf_n), and the implicit output sequence of the reverse LSTM, (hb_1, hb_2, ..., hb_n); the two groups of hidden sequences are spliced position-wise to obtain the complete hidden sequence h_t = [hf_t; hb_t], which is taken as the input of the next layer.
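A minimal sketch of such an encoding layer, assuming a hidden size d_h of 128 and the 250-dimensional mixed embedding from the sketch above, is given below; the bidirectional output concatenates the forward and backward hidden states at each position, so each time step has dimension 2·d_h.

```python
import torch
import torch.nn as nn

d_h = 128                          # assumed Bi-LSTM hidden size
bilstm = nn.LSTM(input_size=250, hidden_size=d_h,
                 batch_first=True, bidirectional=True)

seq = torch.randn(1, 20, 250)      # (batch, sequence length, mixed-embedding dim)
H, _ = bilstm(seq)                 # H: (1, 20, 2*d_h), [forward ; backward] per step
```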
Moreover, the specific operation of step 3) is as follows: for the input at each time step, let H = (h_1, h_2, ..., h_n) denote the output of the Bi-LSTM hidden layer. Following the principle of the multi-head attention mechanism, the input vectors are linearly transformed and the dot product is scaled; the attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d) · V

where: Q is the query matrix;
K is the key matrix;
V is the value matrix;
d is the dimension of the Bi-LSTM hidden unit and is numerically equal to 2d_h.

Setting Q = K = V = H, multi-head attention first projects the query, key and value H linearly with h different linear projections; the h projections then perform scaled dot-product attention in parallel; finally, these attention results are concatenated and projected once more to obtain a new representation.
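The following sketch implements the scaled dot-product attention formula above and then applies PyTorch's multi-head attention layer with Q = K = V = H; the head count of 8 and all dimensions are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ V

H = torch.randn(1, 20, 256)        # Bi-LSTM output, (batch, seq len, 2*d_h)
out1 = scaled_dot_product_attention(H, H, H)

# Multi-head version: project Q = K = V = H with different linear maps,
# attend in parallel, concatenate the heads, and project once more.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
out2, _ = mha(H, H, H)
```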
Moreover, the specific operation of step 4) is as follows: the result is fed into a CRF layer. The CRF layer contains a transition matrix representing the transition scores between labels, so the score of the label for each word in the CRF layer consists of two parts: the emission score output by the previous layer and the transition score from the transition matrix. The transition matrix imposes legality constraints between predicted labels and increases the grammatical plausibility of the label sequence; finally, Viterbi decoding is used to infer the highest-scoring label sequence as the label prediction.
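As an illustration of this decoding step, the following is a minimal sketch of Viterbi decoding over per-token emission scores and a label transition matrix; the random tensors stand in for the actual network outputs and learned transitions.

```python
import torch

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions:   (seq_len, num_tags) per-token label scores from the previous layer.
    transitions: (num_tags, num_tags) transition scores between labels.
    """
    seq_len, _ = emissions.shape
    score = emissions[0].clone()      # best score of a path ending in each tag
    backptr = []
    for t in range(1, seq_len):
        # total[i, j]: extend the best path ending in tag i with tag j at step t.
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, idx = total.max(dim=0)
        backptr.append(idx)
    path = [int(score.argmax())]
    for idx in reversed(backptr):     # walk the back-pointers to recover the path
        path.append(int(idx[path[-1]]))
    return list(reversed(path))

tags = viterbi_decode(torch.randn(5, 4), torch.randn(4, 4))
```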
The invention has the advantages and beneficial effects that:
the Chinese named entity recognition method based on deep learning can run multiple data sets, has strong applicability and high accuracy, and can be applied to named entity recognition models of multi-field texts.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a graph of the number of iterations versus model F1 values on an MSRA data set in accordance with the present invention;
FIG. 3 is a graph of the number of iterations versus model F1 values on a LiteratureNER dataset according to the present invention;
FIG. 4 is a graph of the number of iterations versus model Accuracy value on an MSRA data set in accordance with the present invention;
FIG. 5 is a graph of the number of iterations versus model Accuracy values on a LiteratureNER dataset according to the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are intended to be illustrative only and not limiting in any way.
As shown in FIG. 1, a Chinese named entity recognition method based on deep learning comprises the following steps:
1) Embedding the mixed word/character/position vector for the data text;
a. A dictionary is built from the training data set and a one-hot vector is obtained for each word; the length of the one-hot vector is the dictionary length V. The look-up layer then maps the one-hot vectors to low-dimensional dense vectors using a pre-trained single-character position vector matrix;
b. The word vector, the character vector of the word-segmented text, and the character position vector are concatenated; the concatenated vectors are used as the input of the network model, giving the Chinese token sequence
X = (x_1, x_2, x_3, ..., x_n)
Whether the token X exists in the word lookup table and in the character lookup table is then checked. If X exists in both tables, i.e. the token consists of a single character, the combination of the two embeddings is taken as the distributed representation of the token; otherwise, only the embedding from one of the lookup tables is used as the output of the embedding layer, and the character position vector is initialized with the word vector of the word in which the character is located;
2) Inputting the vectors obtained in step 1) into a Bi-LSTM layer for vector coding, modeling the long-term relations among the vectors captured by the time sequence;
The word-character mixed vector of each word in the input sequence is fed into the Bi-LSTM layer as one time step of the network, and global features are extracted. The bidirectional LSTM network yields the implicit output sequence of the forward LSTM, (hf_1, hf_2, ..., hf_n), and the implicit output sequence of the reverse LSTM, (hb_1, hb_2, ..., hb_n); the two groups of hidden sequences are spliced position-wise to obtain the complete hidden sequence h_t = [hf_t; hb_t], and this implicit sequence is taken as the input of the next layer;
3) Inputting the vectors output by the Bi-LSTM layer into the self-attention layer, explicitly learning the dependency between any two characters in the sentence, and capturing the internal structural information of the sentence;
For the input at each time step, let H = (h_1, h_2, ..., h_n) denote the output of the Bi-LSTM hidden layer. Following the principle of the multi-head attention mechanism, the input vectors are linearly transformed and the dot product is scaled; the attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d) · V

where: Q is the query matrix; K is the key matrix; V is the value matrix; d is the dimension of the Bi-LSTM hidden unit and is numerically equal to 2d_h.

Setting Q = K = V = H, multi-head attention first projects the query, key and value H linearly with h different linear projections; the h projections then perform scaled dot-product attention in parallel; finally, these attention results are concatenated and projected once more to obtain a new representation;
4) Inputting the output vector sequence into the CRF layer, making independent labeling decisions, and performing label decoding. The result is fed into a CRF layer, which contains a transition matrix representing the transition scores between labels; the score of the label for each word in the CRF layer consists of two parts: the emission score output by the previous layer and the transition score from the transition matrix. The transition matrix imposes legality constraints between predicted labels and increases the grammatical plausibility of the label sequence; finally, Viterbi decoding is used to infer the highest-scoring label sequence as the label prediction.
5) Model training:
a. The network reads the training samples and trains, iterating from 1; training stops once the iteration count exceeds the maximum K. For each input training batch, the loss of the current output is computed from the loss function; this loss measures the degree of training of the model. If the loss is larger than the preset minimum loss value, the model still needs further training and adjustment, and the network parameters of each layer are updated in turn using the back-propagation algorithm; if the loss is smaller than the preset minimum loss value, the model has reached the training standard, training ends, and the program exits.
b. After the current batch of training data has been traversed, the validation set is used to check the degree of training of the model. If the current validation result is better than the best historical result, the current training is effective and model performance is still rising; training continues, the current data is recorded, and the best historical result is replaced by the current validation result before the next round of training. If the validation result does not improve over M consecutive training rounds, which may indicate that the learning-rate step is too large and the minimum-loss extremum is being overshot, the learning rate is reduced appropriately and training continues; this repeats until the learning rate falls below the system preset value, at which point training ends and the program exits.
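A minimal sketch of this training schedule, with hypothetical stand-in functions for the actual training and validation passes and assumed values for K, M and the loss and learning-rate thresholds, might look as follows.

```python
import random

def train_one_epoch(lr):       # hypothetical stand-in for forward + backprop over the data
    return random.uniform(0.0, 1.0)

def evaluate_on_validation():  # hypothetical stand-in returning a validation F1 score
    return random.uniform(0.0, 1.0)

K, M = 100, 5                  # assumed max iteration count and patience
min_loss, min_lr = 0.01, 1e-5  # assumed loss floor and learning-rate preset
lr, best_f1, stale = 0.001, 0.0, 0

for epoch in range(1, K + 1):  # stop once the iteration count exceeds K
    loss = train_one_epoch(lr)
    if loss < min_loss:        # model has reached the training standard
        break
    f1 = evaluate_on_validation()
    if f1 > best_f1:           # current training is effective: record the new best
        best_f1, stale = f1, 0
    else:
        stale += 1
        if stale >= M:                # M rounds without improvement:
            lr, stale = lr * 0.5, 0   # reduce the learning rate and keep training
            if lr < min_lr:    # learning rate below the system preset: end training
                break
```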
c. After model training is finished, the training condition of the model is tested, and the test process of the model is as follows:
(1) The network parameters obtained through training are loaded into the model, and a test data set is input.
(2) The network receives the test data set, and obtains the final test output through a forward propagation algorithm.
(3) And comparing the output sequence of the network model with the correct labeling sequence.
(4) Finally, the precision, F1 value and recall rate are calculated.
The experiments of this example were performed on Microsoft's MSRA news dataset and on the public LiteratureNER dataset.
MSRA comes from the SIGHAN 2006 shared task on Chinese named entity recognition. The dataset contains 3 entity types: person, organization, and location. It comprises 48998 sentences for training and 4432 sentences for testing; since the MSRA dataset lacks a validation set, this embodiment takes one tenth of the training set as the validation set.
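A minimal sketch of that split, with a placeholder list standing in for the loaded MSRA training sentences, might look as follows.

```python
# Hold out one tenth of the training sentences as a validation set,
# as described above; `sentences` is a hypothetical placeholder.
def split_validation(sentences, ratio=0.1):
    cut = int(len(sentences) * ratio)
    return sentences[cut:], sentences[:cut]   # (train, validation)

train, valid = split_validation([f"sent{i}" for i in range(48998)])
# len(train) == 44099, len(valid) == 4899
```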
The LiteratureNER dataset was constructed from hundreds of Chinese literature articles, excluding articles that were too short or too cluttered. It covers seven entity types: person, organization, location, abstraction, time, thing, and metric. The dataset is split as follows: 26320 training sentences, 2045 validation sentences and 3016 test sentences.
To demonstrate the superiority of the approach, the experimental methods of several journal publications were selected as baseline results for comparison; the final performance of the model is compared with that of the control models in Tables 1 and 2.
In this example, the same evaluation indexes as in previous work are adopted, namely precision (P), recall (R) and F1-score (F1). Precision reflects the ratio of the number of correctly predicted tokens to the number of predicted tokens; recall reflects the ratio of correctly predicted tokens to all tokens in the data used; F1 is the harmonic mean of precision and recall. The three indexes are calculated as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 × P × R / (P + R)

where: TP is the number of tokens that the model judges positive and that are actually positive;
FP is the number of tokens judged positive but actually negative;
FN is the number of tokens judged negative but actually positive;
TN is the number of tokens judged negative and actually negative.
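The three indexes can be computed directly from these counts, as in the minimal sketch below; the counts in the example call are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    # P = TP / (TP + FP); R = TP / (TP + FN); F1 = 2PR / (P + R)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Illustrative counts only, not results from the experiments.
p, r, f1 = precision_recall_f1(tp=900, fp=100, fn=80)
```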
Finally, the experiments show that without using any hand-crafted feature templates the new model obtains better results on the public MSRA dataset, reaching an F1 value of 91.37%, and reaches an F1 value of 73.23% on the LiteratureNER dataset. These results surpass previous work, improving on Zhang by 0.5% and 0.2% respectively, and represent the current best performance on this task. At the same time, the method runs on multiple data sets, has strong applicability and high accuracy, and can be applied to multi-domain text.
Table 1 Results comparison on the MSRA dataset
Table 2 Results comparison on the LiteratureNER dataset
During model training, researchers can judge the training state of the model from the number of iterations, the curve of labeling performance, and the accuracy curve. Here validation accuracy refers to the ratio of the number of correctly predicted samples in the validation set to the total number of predicted samples, regardless of whether a sample is a positive or negative example; the accuracy is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
thus, this example demonstrates the scaled up 33-round results of the experimental setup in 100 rounds of iterations, and plots the F1 values of the model on the two data sets, respectively, as shown in fig. 2-5.
As can be seen from the figures, the F1 value of the model converges relatively quickly from the start of iteration: on MSRA it stabilizes after about 15 iterations and then fluctuates within a small range, while on the LiteratureNER dataset the model stabilizes in the initial stage of training. This difference relates to the composition and size of the two datasets: the MSRA dataset is comparatively large, so more training is needed before a steady state is reached. Nevertheless, the F1 curves on both datasets show that the model converges quickly without falling into overfitting, and is well suited to the Chinese named entity recognition task.
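A short sketch of producing such a curve, with hypothetical recorded scores standing in for the experimental values, might look as follows.

```python
import random
import matplotlib.pyplot as plt

# Hypothetical per-iteration validation F1 scores; the real curves in
# FIGS. 2-5 come from the experiments, not from this stand-in data.
iters = list(range(1, 34))
f1 = [min(0.9, 0.5 + 0.03 * i) + random.uniform(-0.01, 0.01) for i in iters]

plt.plot(iters, f1)
plt.xlabel("iteration")
plt.ylabel("validation F1")
plt.savefig("f1_curve.png")
```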
Although the embodiments of the present invention and the accompanying drawings have been disclosed for illustrative purposes, those skilled in the art will appreciate that: various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims, and therefore the scope of the invention is not limited to the embodiments and the disclosure of the drawings.
Claims (1)
1. A Chinese named entity recognition method based on deep learning, characterized in that the recognition method comprises the following steps:
1) Embedding the mixed word/character/position vector for the data text;
2) Inputting the vectors obtained in step 1) into a Bi-LSTM layer for vector coding, modeling the long-term relations among the vectors captured by the time sequence;
3) Inputting the vectors output by the Bi-LSTM layer into the self-attention layer, explicitly learning the dependency between any two characters in the sentence, and capturing the internal structural information of the sentence;
4) Inputting the output vector sequence into the CRF layer, making independent labeling decisions, and performing label decoding;
The specific operation of step 1) is as follows:
a. A dictionary is built from the training data set and a one-hot vector is obtained for each word; the length is the dictionary length. The look-up layer then maps the one-hot vectors to low-dimensional dense vectors using a pre-trained single-character position vector matrix;
b. The word vector, the character vector of the word-segmented text, and the character position vector are concatenated; the concatenated vectors are used as the input of the network model, giving the Chinese token sequence
X = (x_1, x_2, x_3, ..., x_n)
Whether the token X exists in the word lookup table and in the character lookup table is then checked. If X exists in both tables, i.e. the token consists of a single character, the combination of the two embeddings is taken as the distributed representation of the token; otherwise, only the embedding from one of the lookup tables is used as the output of the embedding layer, and the character position vector is initialized with the word vector of the word in which the character is located;
The specific operation of step 2) is as follows: the word-character mixed vector of each word in the input sequence is fed into the Bi-LSTM layer as one time step of the network, and global features are extracted. The bidirectional LSTM network yields the implicit output sequence of the forward LSTM, (hf_1, hf_2, ..., hf_n), and the implicit output sequence of the reverse LSTM, (hb_1, hb_2, ..., hb_n); the two groups of hidden sequences are spliced position-wise to obtain the complete hidden sequence h_t = [hf_t; hb_t], and this implicit sequence is taken as the input of the next layer;
The specific operation of step 3) is as follows:
For the input at each time step, let H = (h_1, h_2, ..., h_n) denote the output of the Bi-LSTM hidden layer. Following the principle of the multi-head attention mechanism, the input vectors are linearly transformed and the dot product is scaled; the attention formula is:

Attention(Q, K, V) = softmax(QK^T / √d) · V

where: Q is the query matrix; K is the key matrix; V is the value matrix; d is the dimension of the Bi-LSTM hidden unit and is numerically equal to 2d_h.

Setting Q = K = V = H, multi-head attention first projects the query, key and value H linearly with h different linear projections; the h projections then perform scaled dot-product attention in parallel; finally, these attention results are concatenated and projected once more to obtain a new representation;
The specific operation of step 4) is as follows:
The result is fed into a CRF layer, which contains a transition matrix representing the transition scores between labels; the score of the label for each word in the CRF layer consists of two parts: the emission score output by the previous layer and the transition score from the transition matrix. The transition matrix imposes legality constraints between predicted labels and increases the grammatical plausibility of the label sequence; finally, Viterbi decoding is used to infer the highest-scoring label sequence as the label prediction;
5) Model training
a. The network reads the training samples and trains, iterating from 1; training stops once the iteration count exceeds the maximum K. For each input training batch, the loss of the current output is computed from the loss function; this loss measures the degree of training of the model. If the loss is larger than the preset minimum loss value, the model still needs further training and adjustment, and the network parameters of each layer are updated in turn using the back-propagation algorithm; if the loss is smaller than the preset minimum loss value, the model has reached the training standard, training ends, and the program exits;
b. After the current batch of training data has been traversed, the validation set is used to check the degree of training of the model. If the current validation result is better than the best historical result, the current training is effective and model performance is still rising; training continues, the current data is recorded, and the best historical result is replaced by the current validation result before the next round of training. If the validation result does not improve over M consecutive training rounds, the learning rate is reduced and training continues; this repeats until the learning rate falls below the system preset value, at which point training ends and the program exits;
c. After model training ends, the training condition of the model is tested; the test procedure is as follows:
(1) The network parameters obtained through training are loaded into the model, and the test data set is input;
(2) The network receives the test data set and obtains the final test output through the forward propagation algorithm;
(3) The output sequence of the network model is compared with the correct labeling sequence;
(4) Finally, the precision, F1 value and recall rate are calculated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911271419.XA CN111178074B (en) | 2019-12-12 | 2019-12-12 | Chinese named entity recognition method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911271419.XA CN111178074B (en) | 2019-12-12 | 2019-12-12 | Chinese named entity recognition method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111178074A CN111178074A (en) | 2020-05-19 |
CN111178074B true CN111178074B (en) | 2023-08-25 |
Family
ID=70650181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911271419.XA Active CN111178074B (en) | 2019-12-12 | 2019-12-12 | Chinese named entity recognition method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178074B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111666427B (en) * | 2020-06-12 | 2023-05-12 | 长沙理工大学 | Entity relationship joint extraction method, device, equipment and medium |
CN112037179B (en) * | 2020-08-11 | 2021-05-11 | 深圳大学 | Method, system and equipment for generating brain disease diagnosis model |
CN111967265B (en) * | 2020-08-31 | 2023-09-15 | 广东工业大学 | Chinese word segmentation and entity recognition combined learning method for automatic generation of data set |
CN112084336A (en) * | 2020-09-09 | 2020-12-15 | 浙江综合交通大数据中心有限公司 | Entity extraction and event classification method and device for expressway emergency |
CN112069823B (en) * | 2020-09-17 | 2021-07-09 | 华院计算技术(上海)股份有限公司 | Information processing method and device |
CN112084783B (en) * | 2020-09-24 | 2022-04-12 | 中国民航大学 | Entity identification method and system based on civil aviation non-civilized passengers |
CN112464663A (en) * | 2020-12-01 | 2021-03-09 | 小牛思拓(北京)科技有限公司 | Multi-feature fusion Chinese word segmentation method |
CN112685549B (en) * | 2021-01-08 | 2022-07-29 | 昆明理工大学 | Document-related news element entity identification method and system integrating discourse semantics |
CN113076751A (en) * | 2021-02-26 | 2021-07-06 | 北京工业大学 | Named entity recognition method and system, electronic device and storage medium |
CN113449524B (en) * | 2021-04-01 | 2023-04-07 | 山东英信计算机技术有限公司 | Named entity identification method, system, equipment and medium |
CN113033206B (en) * | 2021-04-01 | 2022-04-22 | 重庆交通大学 | Bridge detection field text entity identification method based on machine reading understanding |
CN113283243B (en) * | 2021-06-09 | 2022-07-26 | 广东工业大学 | Entity and relationship combined extraction method |
CN113486173B (en) * | 2021-06-11 | 2023-09-12 | 南京邮电大学 | Text labeling neural network model and labeling method thereof |
CN113255294B (en) * | 2021-07-14 | 2021-10-12 | 北京邮电大学 | Named entity recognition model training method, recognition method and device |
CN114519355A (en) * | 2021-08-25 | 2022-05-20 | 浙江万里学院 | Medicine named entity recognition and entity standardization method |
CN113919358A (en) * | 2021-11-03 | 2022-01-11 | 厦门市美亚柏科信息股份有限公司 | Named entity identification method and system based on active learning |
CN116151241B (en) * | 2023-04-19 | 2023-07-07 | 湖南马栏山视频先进技术研究院有限公司 | Entity identification method and device |
CN117113997B (en) * | 2023-07-25 | 2024-07-09 | 四川大学 | Chinese named entity recognition method for enhancing dictionary knowledge integration |
CN117744656B (en) * | 2023-12-21 | 2024-07-16 | 湖南工商大学 | Named entity identification method and system combining small sample learning and self-checking |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107145483A (en) * | 2017-04-24 | 2017-09-08 | 北京邮电大学 | A kind of adaptive Chinese word cutting method based on embedded expression |
CN109388807A (en) * | 2018-10-30 | 2019-02-26 | 中山大学 | The method, apparatus and storage medium of electronic health record name Entity recognition |
CN109614614A (en) * | 2018-12-03 | 2019-04-12 | 焦点科技股份有限公司 | A kind of BILSTM-CRF name of product recognition methods based on from attention |
CN110083710A (en) * | 2019-04-30 | 2019-08-02 | 北京工业大学 | It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure |
CN110334339A (en) * | 2019-04-30 | 2019-10-15 | 华中科技大学 | It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism |
CN110399492A (en) * | 2019-07-22 | 2019-11-01 | 阿里巴巴集团控股有限公司 | The training method and device of disaggregated model aiming at the problem that user's question sentence |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021000362A1 (en) * | 2019-07-04 | 2021-01-07 | 浙江大学 | Deep neural network model-based address information feature extraction method |
- 2019-12-12: CN application CN201911271419.XA granted as patent CN111178074B (active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107145483A (en) * | 2017-04-24 | 2017-09-08 | 北京邮电大学 | A kind of adaptive Chinese word cutting method based on embedded expression |
CN109388807A (en) * | 2018-10-30 | 2019-02-26 | 中山大学 | The method, apparatus and storage medium of electronic health record name Entity recognition |
CN109614614A (en) * | 2018-12-03 | 2019-04-12 | 焦点科技股份有限公司 | A kind of BILSTM-CRF name of product recognition methods based on from attention |
CN110083710A (en) * | 2019-04-30 | 2019-08-02 | 北京工业大学 | It is a kind of that generation method is defined based on Recognition with Recurrent Neural Network and the word of latent variable structure |
CN110334339A (en) * | 2019-04-30 | 2019-10-15 | 华中科技大学 | It is a kind of based on location aware from the sequence labelling model and mask method of attention mechanism |
CN110399492A (en) * | 2019-07-22 | 2019-11-01 | 阿里巴巴集团控股有限公司 | The training method and device of disaggregated model aiming at the problem that user's question sentence |
Non-Patent Citations (1)
Title |
---|
Named Entity Recognition Based on Bi-LSTM and Attention Mechanism; Liu Xiaojun et al.; Journal of Luoyang Institute of Science and Technology; 2019-03-25; Sections 1-3, Abstract, Fig. 2 *
Also Published As
Publication number | Publication date |
---|---|
CN111178074A (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||