CN112101027A - Chinese named entity recognition method based on reading understanding - Google Patents
- Publication number: CN112101027A
- Application number: CN202010720804.4A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/295—Named entity recognition
- G06F18/24—Classification techniques
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention relates to a Chinese named entity recognition method based on reading understanding, belonging to the technical field of natural language processing. The method comprises the following steps: performing word segmentation on a document-level corpus to obtain a document-level sequence; constructing a triple composed of a retrieval tag question, the document-level sequence entities and the document-level sequence; taking the retrieval tag question and the document-level sequence as input and generating, through a BERT coding layer, a hidden output fused with document-level context information; passing this hidden output through a convolutional neural network to obtain long-distance context semantic features, capturing the semantic information of the whole document context and compressing it into a feature map; and, using the semantic information of the whole document context, predicting all entities in the document through a prediction layer by predicting the start index and end index of each entity and splicing them to generate the named entity. The invention can identify entities at document level and achieves a good recognition effect.
Description
Technical Field
The invention relates to a Chinese named entity recognition method based on reading understanding, and belongs to the technical field of natural language processing.
Background
Named Entity Recognition (NER), also known as entity identification, entity chunking or entity extraction, is a subtask of information extraction that aims to locate and classify named entity mentions in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values and percentages. It is a fundamental NLP research problem that has been studied for many years. From the perspective of the natural language processing pipeline, NER can be regarded as part of unknown-word recognition in lexical analysis, where unknown words are the most numerous, the hardest to identify, and have the greatest influence on word segmentation quality. NER is also the basis of many NLP tasks such as relation extraction, event extraction, knowledge graph construction, machine translation and question answering.
The NER task is still far from solved. Named entity recognition achieves good results only for limited text types (mainly news corpora) and entity categories (mainly person, location and organization names); compared with other information retrieval fields, entity recognition evaluations carry lower expectations, models overfit easily, and systems that aim to recognize many types of named entities generically perform poorly. Chinese named entity recognition is especially challenging compared with English, and more of its problems remain open. English named entities carry a relatively obvious formal cue (the first letter of each word in an entity is capitalized), so identifying entity boundaries is comparatively easy and the task focuses on determining the entity category. The Chinese task is more complex: compared with the category labelling subtask, recognizing entity boundaries is harder. Research on Chinese named entity recognition therefore remains a hot topic in natural language processing and receives wide attention from academia and industry.
Disclosure of Invention
The invention provides a Chinese named entity recognition method based on reading understanding, to solve the problem that existing recognition methods can only recognize entities within single sentences.
The technical scheme of the invention is as follows. The Chinese named entity recognition method based on reading understanding comprises the following steps:
Step 1, performing word segmentation on the document-level corpus to obtain a document-level sequence;
Step 2, splicing the retrieval tag question, the document-level sequence entities and the document-level sequence to obtain a triple composed of the retrieval tag question, the document-level sequence entities and the document-level sequence;
Step 3, taking the retrieval tag question and the document-level sequence in the triple as input, and generating, through a BERT coding layer, a hidden output fused with document-level context information;
Step 4, passing the hidden output fused with document-level context information through a convolutional neural network to obtain long-distance context semantic features, capturing the semantic information of the whole document context and compressing it into a feature map;
Step 5, using the semantic information of the whole document context, predicting all entities in the document through a prediction layer: predicting the start index and end index of each entity, and splicing them to generate the named entity.
Further, the specific steps of Step 2 are as follows:
Step 2.1, segment the document-level corpus into words to obtain a document-level sequence X = (x_1, x_2, x_3, ..., x_n), where n is the length of the sequence; extract the entities in the document-level sequence and label each entity with a category label y, each entity being represented as x_{start,end}. Letting Y be the set of all entity category labels of the document-level sequence, the category label y of each entity satisfies y ∈ Y.
Step 2.2, retrieval tag question construction: construct the retrieval tag question of each category label from the label explanation, i.e. the description of the category label.
Step 2.3, convert the document-level sequence with category labels into a triple (Question, Answer, Context), where Question is the retrieval tag question q_y, Context is the document-level sequence X, and Answer is the set of document-level sequence entities; an entity with category label y is x_{start,end} = {x_start, x_start+1, ..., x_end-1, x_end}, a subsequence of the document-level sequence X whose elements run consecutively from index start to index end and which is associated with the category label y ∈ Y.
Generating the natural-language retrieval tag question q_y from the category label y yields the triple (q_y, x_{start,end}, X), i.e. the triple (Question, Answer, Context).
Further, in Step 3, the retrieval tag question and the document-level sequence in the triple are used as input; when their combined sequence length exceeds 512 tokens at the BERT coding layer, the sequence is truncated, and the truncated part is treated as a new sample on which Step 2 and Step 3 are executed again.
Further, in Step 4.1, the hidden output fused with document-level context information is fed into the convolutional layer of a CNN to obtain long-distance context semantic features, and a max-pooling operation is applied to these features, yielding a global feature vector of fixed size, independent of the input length.
Further, in Step 5, the prediction layer uses two binary classifiers: one predicts whether each token is a start index, and the other predicts whether each token is an end index.
Further, the specific steps of Step 5 are as follows:
Step 5.1, process the output representation of the context feature extraction layer, retaining only the text information, to obtain the representation matrix E'_CNN.
Step 5.2, predict the start index and end index of each named entity through the two binary classifiers of the prediction layer. The prediction layer first predicts, for each token, the probability of being a start index and of being an end index; the start-index probability is predicted as
P_start = softmax(E'_CNN · T_start) ∈ R^{N×2}
where T_start ∈ R^{d×2} is a learned parameter, d is the vector dimension of the last layer of the BERT coding layer, and N is the length of the document-level sequence. Each row of P_start is, given the retrieval tag question, the probability distribution over whether that index is the start position of an entity (P_end is defined analogously for end positions).
Step 5.3, using the predicted start-index and end-index probabilities, predict the start and end indices, then match each predicted start index with its corresponding end index and splice them to generate the named entity.
The beneficial effects of the invention are:
1. the method performs well on the Chinese document-level named entity recognition task;
2. the reading-understanding-style question encoding serves as prior knowledge, and the prior information it carries effectively guides the overall improvement of the named entity model;
3. entities usually occur in document-level text, and the context of the surrounding sentences supports entity recognition; considering the effective information of an entity in the wider document-level context improves the model's understanding of context.
Drawings
FIG. 1 is a general model architecture diagram of the present invention;
FIG. 2 is a flow chart of the present invention;
fig. 3 is a BERT context coding diagram.
Detailed Description
Example 1: as shown in FIGS. 1-3, the reading-understanding-based Chinese named entity recognition method comprises the following steps:
Step 1, collect and organize the public MSRA dataset, and perform word segmentation on the document-level corpus to obtain a document-level sequence.
In Step 1, the public MSRA dataset is collected and organized, and for each entity type in the dataset the retrieval tag question of that type is constructed from the "annotation specification". The retrieval tag question may be written manually, generated with software, or constructed by other existing methods. For example, if the annotator needs to label all entities whose category label is the location label LOC, and the label description corresponding to LOC is "country, city and mountain", then the retrieval tag question constructed from that description is "find out abstract or concrete locations such as country, city and mountain", as shown in Table 1:
table 1 illustrates an example of a search tag problem
Step 2, splice the retrieval tag question, the document-level sequence entities and the document-level sequence to obtain a triple composed of the three.
Step 2.1, segment the document-level corpus into words to obtain a document-level sequence X = (x_1, x_2, x_3, ..., x_n), where n is the length of the sequence; extract the entities in the document-level sequence and label each entity with a category label y, each entity being represented as x_{start,end}. Letting Y be the set of all entity category labels of the document-level sequence, the category label y of each entity satisfies y ∈ Y.
Step 2.2, retrieval tag question construction: construct the retrieval tag question of each category label from the label explanation, i.e. the description of the category label.
Step 2.3, convert the document-level sequence with category labels into a triple (Question, Answer, Context), where Question is the retrieval tag question q_y, Context is the document-level sequence X, and Answer is the set of document-level sequence entities; an entity with category label y is x_{start,end} = {x_start, x_start+1, ..., x_end-1, x_end}, a subsequence of the document-level sequence X whose elements run consecutively from index start to index end and which is associated with the category label y ∈ Y.
Generating the natural-language retrieval tag question q_y from the category label y yields the triple (q_y, x_{start,end}, X), i.e. the triple (Question, Answer, Context).
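The triple construction of Steps 2.1 to 2.3 can be sketched as follows. The entity-span representation ((start, end) pairs with inclusive end index) and the example question dictionary are assumptions for illustration, not the patent's exact data format:

```python
# Sketch of Step 2: turn a labelled document-level sequence into
# (Question, Answer, Context) triples, one per category label present.
LABEL_QUESTIONS = {  # illustrative wording; only LOC appears in the text
    "LOC": "find out abstract or concrete locations such as country, city and mountain",
}

def make_triples(document, entities):
    """document: list of characters (the document-level sequence X).
    entities: list of (start, end, label) spans, end index inclusive.
    Returns one (q_y, answer spans, X) triple per category label present."""
    triples = []
    for label in sorted({lab for _, _, lab in entities}):
        question = LABEL_QUESTIONS[label]                             # q_y
        answers = [(s, e) for s, e, lab in entities if lab == label]  # x_{start,end}
        triples.append((question, answers, document))                 # (q_y, x_{start,end}, X)
    return triples
```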
Step 3, convert the sequence labelling problem into a reading comprehension problem: take the retrieval tag question and the document-level sequence in the triple as input and generate, through a BERT coding layer, a hidden output fused with document-level context information. Splicing the retrieval tag question with the text and feeding the result into the pre-trained BERT model introduces the reading comprehension idea: the document-level sequence context is encoded, and the question is introduced as prior information.
Further, in Step 3, the retrieval tag question and the document-level sequence in the triple are used as input; when their combined sequence length exceeds 512 tokens at the BERT coding layer, the sequence is truncated, and the truncated part is treated as a new sample on which Step 2 and Step 3 are executed again.
In the conventional approach, the sequence is simply truncated and the portion beyond the length limit is discarded. In the present method, the excess portion is instead retained and treated directly as a new sample, and, to prevent entity words from being cut, rules are used so that characters of the same entity are kept in the same sequence sample whenever possible. For test samples, which by default carry no entity labels, direct truncation by length is used, and the sequence beyond the limit is still retained as a new sample. BERT then receives the combined character string and outputs a representation matrix E ∈ R^{N×d}, where d is the vector dimension of the last layer of the BERT coding layer.
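The training-time splitting rule (cut at the length limit, but move the cut so characters of one entity stay in the same sample) could be sketched like this. The patent does not give the exact rule, so the back-up-to-entity-start heuristic below is an assumption:

```python
def split_long_sequence(chars, entity_spans, max_len=512):
    """Split a sequence longer than max_len into several samples, moving each
    cut point left so that no entity span is bisected. entity_spans are
    (start, end) pairs with inclusive end index. A sketch of the rule
    described in the text, not the patent's exact procedure."""
    samples, start, n = [], 0, len(chars)
    while start < n:
        cut = min(start + max_len, n)
        if cut < n:
            for s, e in entity_spans:
                if s < cut <= e:      # cut falls inside an entity: back up to its start
                    cut = s
                    break
        if cut <= start:              # degenerate case: an entity longer than max_len
            cut = min(start + max_len, n)
        samples.append(chars[start:cut])
        start = cut
    return samples
```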
Step 4, pass the hidden output fused with document-level context information through a convolutional neural network (the context feature convolutional layer in FIG. 1) to obtain long-distance context semantic features, capture the semantic information of the whole document context, and compress it into a feature map.
further, in step4.1, the hidden output merged into the document level context information is input into the convolutional layer of the CNN convolutional neural network to obtain long-distance context semantic features, and the maximum pooling operation is applied to the context semantic features, so that a global feature vector which is fixed in size and independent from the input can be obtained.
Specifically, to prevent the loss of context information, effectively acquire the hierarchical features of the long-distance context, capture the semantic information of the whole context, and compress the valuable semantics into a feature map, a convolutional layer is used to obtain the context semantic features, and a max-pooling operation is applied to them, yielding a global feature vector of fixed size, independent of the input length.
The representation matrix E obtained from the BERT coding layer is convolved, and the output of the convolutional layer is fed into the max-pooling layer; the calculation formula is:
E_CNN = max(f(ωE + b))
where E_CNN is the resulting context-feature representation matrix, ω is the convolution kernel, b is a bias term, and f is a nonlinear activation function.
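The formula E_CNN = max(f(ωE + b)) can be sketched as a 1-D convolution over the BERT representation matrix followed by max-over-time pooling. The window size, ReLU activation and weight shapes here are assumptions, since the patent gives only the formula:

```python
import numpy as np

def context_feature_layer(E, W, b, window=3):
    """E: (N, d) BERT representation matrix. W: (window*d, k) convolution
    weights, b: (k,) bias. Returns a fixed-size (k,) global feature vector,
    i.e. max-pool(ReLU(w*E + b)) over all sliding windows."""
    n, d = E.shape
    # Unroll sliding windows of `window` consecutive token vectors.
    windows = np.stack([E[i:i + window].reshape(-1) for i in range(n - window + 1)])
    conv = np.maximum(windows @ W + b, 0.0)   # f(wE + b) with f = ReLU (assumed)
    return conv.max(axis=0)                   # max pooling -> input-length independent
```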
Step 5, using the semantic information of the whole document context, predict all entities in the document through a prediction layer: predict the start index and end index of each entity, and splice them to generate the named entity.
Further, in Step 5, the prediction layer uses two binary classifiers: one predicts whether each token is a start index, and the other predicts whether each token is an end index.
Further, the specific steps of Step 5 are as follows:
Step 5.1, process the output representation of the context feature extraction layer, retaining only the text information, to obtain the representation matrix E'_CNN.
Step 5.2, predict the start index and end index of each named entity through the two binary classifiers of the prediction layer. The prediction layer first predicts, for each token, the probability of being a start index and of being an end index; the start-index probability is predicted as
P_start = softmax(E'_CNN · T_start) ∈ R^{N×2}
where T_start ∈ R^{d×2} is a learned parameter, d is the vector dimension of the last layer of the BERT coding layer, and N is the length of the document-level sequence. Each row of P_start is, given the retrieval tag question, the probability distribution over whether that index is the start position of an entity (P_end is defined analogously for end positions).
Step 5.3, using the predicted start-index and end-index probabilities, predict the start and end indices, then match each predicted start index with its corresponding end index and splice them to generate the named entity.
In a given context there may be multiple entities of the same category. This means that multiple start indices may be predicted by the start-index model and multiple end indices by the end-index model, so a further step is needed to match each predicted start index with its corresponding end index. Specifically, applying argmax to each row of P_start and P_end yields the predicted indices as two 0-1 sequences of length n, Î_start and Î_end. For every pair of positions i and j with Î_start[i] = 1, Î_end[j] = 1 and i ≤ j, the consecutive characters x_{i,j} form a predicted entity; here P_end is the probability, predicted by the prediction layer, that each token is an end index.
When training the model of the invention, the document-level sequence X is paired with label sequences Y_start and Y_end, which record the true label of each token, i.e. whether it is the start index or end index of any entity. For start- and end-index prediction the following two losses are therefore defined: a start-position loss and an end-position loss. Start- and end-index prediction suffers from a data imbalance problem: given the retrieval tag question q_y and the document-level sequence X, positions marked as a start or end are far fewer than unmarked ones. The Focal Loss used for object detection in images is therefore adopted; it was designed to address the severe positive/negative sample imbalance of one-stage object detection, and it reduces the weight of the many easy negative samples during training, which can also be understood as hard-example mining. In its binary form it is defined as:
FL(y, y') = -α(1 - y')^γ log(y') if y = 1; FL(y, y') = -(1 - α)(y')^γ log(1 - y') if y = 0
where y is the label of the true sample and y' is the predicted value. Finally, during training, the loss function is defined as:
L_start = Focal Loss(p_s, p_y)
L_end = Focal Loss(p_e, p_y)
Final loss = L_start + L_end
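A numpy sketch of the binary Focal Loss described above. The α and γ values are the common defaults from the object-detection literature, not values given by the patent:

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss: -alpha*(1-y')^gamma*log(y') for positive samples and
    -(1-alpha)*y'^gamma*log(1-y') for negatives, averaged over all samples.
    Down-weights easy examples, mitigating the start/end index imbalance."""
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    pos = -alpha * (1.0 - p) ** gamma * np.log(p + eps)
    neg = -(1.0 - alpha) * p ** gamma * np.log(1.0 - p + eps)
    return float(np.mean(y * pos + (1.0 - y) * neg))
```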
The loss is then back-propagated to tune the model.
The F1 score, a metric widely used in information retrieval and statistical classification, is adopted as the evaluation index. The F1 score measures the accuracy of a binary classification model by considering both its precision and its recall: it is the harmonic mean of the two, with a maximum of 1 and a minimum of 0. Our experiments therefore use precision, recall and F1 score to evaluate model performance.
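The harmonic-mean definition of F1 described above, as a one-line sketch:

```python
def f1_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```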
To make the experiments more convincing and to verify the effectiveness of the reading comprehension approach, two groups of baseline models are designed for comparative analysis. Without pre-training: BiLSTM combined with CRF (hereinafter BiLSTM+CRF) and CNN combined with BiLSTM and CRF (hereinafter CNN+BiLSTM+CRF). With pre-training: BERT fine-tuned for sequence labelling (hereinafter BERT) and BERT combined with BiLSTM and CRF (hereinafter BERT+BiLSTM+CRF).
In the experiments, the pre-trained BERT model used is the character-level Chinese RoBERTa-base model released by Harbin Institute of Technology, with fine-tuning parameters of hidden layer size 128, dropout 0.25, 6 training epochs and learning rate 5e-6. Without pre-training, the hidden layer size is 128, dropout is 0.25, the number of training epochs is 6 and the learning rate is 5e-6. For the baseline models, the training corpus is set to character-level sentences, the document-level data being split into sentence representations.
The recognition results of the proposed model and the baseline models on the MSRA dataset are shown in Table 2. The method proposed by the invention clearly achieves the best P, R and F1 values against all baseline models. Its F1 reaches 95.72%, while the F1 values of the two baselines without pre-training are 85.88% and 87.23%. Compared with these Chinese named entity recognition methods without pre-training, precision improves by 8.01% and 6.21%, recall by 11.3% and 10.17%, and F1 by 9.84% and 8.49%; both precision and recall improve greatly. Compared with the pre-trained Chinese named entity recognition methods, precision improves by 1.11% and 0.3%, recall by 3.51% and 3.31%, and F1 by 3.22% and 2.89%; recall improves greatly while precision improves slightly.
The reading-understanding-based Chinese named entity recognition method of the invention performs well on the Chinese named entity recognition task, mainly for the following reasons: 1. the reading-comprehension question encoding is prior knowledge, and the prior information it carries effectively guides the overall improvement of the named entity model; 2. entities usually occur in document-level text, and the context of the surrounding sentences supports entity recognition; the invention considers the effective information of an entity in the wider document-level context, thereby improving the model's understanding of context.
Table 2: F1 values of the named entity recognition methods in the comparison experiment
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (6)
1. A Chinese named entity recognition method based on reading understanding, characterized by comprising the following steps:
Step 1, performing word segmentation on the document-level corpus to obtain a document-level sequence;
Step 2, splicing the retrieval tag question, the document-level sequence entities and the document-level sequence to obtain a triple composed of the retrieval tag question, the document-level sequence entities and the document-level sequence;
Step 3, taking the retrieval tag question and the document-level sequence in the triple as input, and generating, through a BERT coding layer, a hidden output fused with document-level context information;
Step 4, passing the hidden output fused with document-level context information through a convolutional neural network to obtain long-distance context semantic features, capturing the semantic information of the whole document context and compressing it into a feature map;
Step 5, using the semantic information of the whole document context, predicting all entities in the document through a prediction layer: predicting the start index and end index of each entity, and splicing them to generate the named entity.
2. The reading understanding-based Chinese named entity recognition method of claim 1, wherein:
the specific steps of Step 2 are as follows:
Step 2.1, segment the document-level corpus into words to obtain a document-level sequence X = (x_1, x_2, x_3, ..., x_n), where n is the length of the sequence; extract the entities in the document-level sequence and label each entity with a category label y, each entity being represented as x_{start,end}; letting Y be the set of all entity category labels of the document-level sequence, the category label y of each entity satisfies y ∈ Y;
Step 2.2, retrieval tag question construction: construct the retrieval tag question of each category label from the label explanation, i.e. the description of the category label;
Step 2.3, convert the document-level sequence with category labels into a triple (Question, Answer, Context), where Question is the retrieval tag question q_y, Context is the document-level sequence X, and Answer is the set of document-level sequence entities; an entity with category label y is x_{start,end} = {x_start, x_start+1, ..., x_end-1, x_end}, a subsequence of the document-level sequence X whose elements run consecutively from index start to index end and which is associated with the category label y ∈ Y;
generating the natural-language retrieval tag question q_y from the category label y yields the triple (q_y, x_{start,end}, X), i.e. the triple (Question, Answer, Context).
3. The reading understanding-based Chinese named entity recognition method of claim 1, wherein:
in Step 3, the retrieval tag question and the document-level sequence in the triple are used as input; when their combined sequence length exceeds 512 tokens at the BERT coding layer, the sequence is truncated, and the truncated part is treated as a new sample on which Step 2 and Step 3 are executed again.
4. The reading understanding-based Chinese named entity recognition method of claim 1, wherein:
in Step 4.1, the hidden output fused with document-level context information is fed into the convolutional layer of a CNN to obtain long-distance context semantic features, and a max-pooling operation is applied to these features, yielding a global feature vector of fixed size, independent of the input length.
5. The reading understanding-based Chinese named entity recognition method of claim 1, wherein:
in Step5, the prediction layer uses two binary classifiers: one predicts, for each index of the sequence, whether it is a start index, and the other predicts whether it is an end index.
6. The reading understanding-based Chinese named entity recognition method of claim 1, wherein:
the specific steps of Step5 are as follows:
Step5.1, processing the output representation of the context feature extraction layer and retaining only the text information, to obtain a representation matrix E′_CNN;
Step5.2, then predicting the start index and end index of each named entity through the two binary classifiers of the prediction layer; the prediction layer first predicts, for each index, the probability of being a start index or an end index, where the probability of each index being a start index is predicted as:

P_start = softmax(E′_CNN · T_start) ∈ R^{N×2}

where T_start ∈ R^{d×2} is a learned parameter, d denotes the vector dimension of the last layer of the BERT encoding layer, N denotes the length of the document-level sequence, and P_start is the probability distribution, over each index, of that index being the start position of an entity for the given retrieval tag question (the end-index probability P_end is obtained analogously with a learned parameter T_end);
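A toy sketch of this start-index prediction in plain Python (the matrix values are arbitrary illustrations; in practice the representation comes from the feature-extraction layer and T_start is learned during training):

```python
import math

def softmax(row):
    # numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def predict_start(E, T_start):
    """E: N x d representation matrix (one row per index).
    T_start: d x 2 learned parameter.
    Returns P_start (N x 2): row i gives the probabilities that
    index i is / is not a start index for the given query."""
    N, d = len(E), len(E[0])
    scores = [[sum(E[i][k] * T_start[k][j] for k in range(d))
               for j in range(2)]
              for i in range(N)]
    return [softmax(row) for row in scores]

E = [[1.0, 0.0], [0.0, 1.0]]        # toy N=2, d=2 representation
T_start = [[2.0, 0.0], [0.0, 2.0]]  # toy d x 2 parameter
P_start = predict_start(E, T_start)
```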
and Step5.3, using the predicted probabilities of each index being a start index or an end index to determine the start and end indices, and matching and splicing each predicted start index with its corresponding end index to generate the named entities.
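The matching in Step5.3 can be sketched as a nearest-end pairing (pairing each predicted start with the closest end at or after it is one common heuristic; the patent does not spell out the exact matching rule, and the probability threshold is an assumption):

```python
def match_spans(start_probs, end_probs, threshold=0.5):
    """start_probs / end_probs: per-index probabilities of being the
    start or end of an entity for the given retrieval tag question.
    Pairs each predicted start index with the nearest end index at or
    after it, yielding (start, end) entity spans."""
    starts = [i for i, p in enumerate(start_probs) if p > threshold]
    ends = [i for i, p in enumerate(end_probs) if p > threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, min(following)))  # nearest matching end
    return spans
```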
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010720804.4A CN112101027A (en) | 2020-07-24 | 2020-07-24 | Chinese named entity recognition method based on reading understanding |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112101027A true CN112101027A (en) | 2020-12-18 |
Family
ID=73749901
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010720804.4A Pending CN112101027A (en) | 2020-07-24 | 2020-07-24 | Chinese named entity recognition method based on reading understanding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112101027A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108829801A (en) * | 2018-06-06 | 2018-11-16 | 大连理工大学 | Event trigger word extraction method based on a document-level attention mechanism |
CN111190997A (en) * | 2018-10-26 | 2020-05-22 | 南京大学 | Question-answering system implementation method using neural networks and a machine-learning ranking algorithm |
CN110309503A (en) * | 2019-05-21 | 2019-10-08 | 昆明理工大学 | Subjective-question scoring model and scoring method based on deep learning with BERT-CNN |
CN110705272A (en) * | 2019-08-28 | 2020-01-17 | 昆明理工大学 | Named entity recognition method for automobile engine fault diagnosis |
CN111046946A (en) * | 2019-12-10 | 2020-04-21 | 昆明理工大学 | Burmese image text recognition method based on CRNN |
CN111177376A (en) * | 2019-12-17 | 2020-05-19 | 东华大学 | Chinese text classification method based on hierarchical connection of BERT and CNN |
CN111126068A (en) * | 2019-12-25 | 2020-05-08 | 中电云脑(天津)科技有限公司 | Chinese named entity recognition method and device and electronic equipment |
Non-Patent Citations (5)
Title |
---|
JACOB DEVLIN et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805v2 * |
JAKE BOUVRIE: "Notes on Convolutional Neural Networks", http://web.mit.edu/jvb/www/papers/cnn_tutorial.pdf * |
PRATYAY BANERJEE et al.: "Knowledge Guided Named Entity Recognition for BioMedical Text", arXiv:1911.03869v3 * |
XIAOYA LI et al.: "A Unified MRC Framework for Named Entity Recognition", arXiv:1910.11476v6 * |
LIU Yiyang et al.: "Chinese Named Entity Recognition Method Based on Machine Reading Comprehension", Pattern Recognition and Artificial Intelligence * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113033206A (en) * | 2021-04-01 | 2021-06-25 | 重庆交通大学 | Bridge detection field text entity identification method based on machine reading understanding |
CN113033210A (en) * | 2021-05-31 | 2021-06-25 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Drug potential side effect mining method based on social media data analysis |
CN113626577A (en) * | 2021-07-01 | 2021-11-09 | 昆明理工大学 | Chinese cross-language news event element extraction method based on reading understanding |
CN114297987A (en) * | 2022-03-09 | 2022-04-08 | 杭州实在智能科技有限公司 | Document information extraction method and system based on text classification and reading understanding |
CN114580422A (en) * | 2022-03-14 | 2022-06-03 | 昆明理工大学 | Named entity identification method combining two-stage classification of neighbor analysis |
CN114328938A (en) * | 2022-03-16 | 2022-04-12 | 浙江卡易智慧医疗科技有限公司 | Image report structured extraction method |
CN115130435A (en) * | 2022-06-27 | 2022-09-30 | 北京百度网讯科技有限公司 | Document processing method and device, electronic equipment and storage medium |
CN115130435B (en) * | 2022-06-27 | 2023-08-11 | 北京百度网讯科技有限公司 | Document processing method, device, electronic equipment and storage medium |
CN115470354A (en) * | 2022-11-03 | 2022-12-13 | 杭州实在智能科技有限公司 | Method and system for identifying nested and overlapped risk points based on multi-label classification |
CN115470354B (en) * | 2022-11-03 | 2023-08-22 | 杭州实在智能科技有限公司 | Method and system for identifying nested and overlapped risk points based on multi-label classification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112101027A (en) | Chinese named entity recognition method based on reading understanding | |
CN113011533B (en) | Text classification method, apparatus, computer device and storage medium | |
CN109033307B (en) | CRP clustering-based word multi-prototype vector representation and word sense disambiguation method | |
CN112231447B (en) | Method and system for extracting Chinese document events | |
CN108268447B (en) | Labeling method for Tibetan named entities | |
CN113268995B (en) | Chinese academy keyword extraction method, device and storage medium | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN113505200A (en) | Sentence-level Chinese event detection method combining document key information | |
CN112434164B (en) | Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN111581943A (en) | Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph | |
CN115168541A (en) | Chapter event extraction method and system based on frame semantic mapping and type perception | |
CN114970536A (en) | Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition | |
CN114881043A (en) | Deep learning model-based legal document semantic similarity evaluation method and system | |
CN114611520A (en) | Text abstract generating method | |
CN110245234A (en) | A kind of multi-source data sample correlating method based on ontology and semantic similarity | |
CN115587163A (en) | Text classification method and device, electronic equipment and storage medium | |
CN114996442A (en) | Text abstract generation system combining abstract degree judgment and abstract optimization | |
CN114943235A (en) | Named entity recognition method based on multi-class language model | |
CN115017404A (en) | Target news topic abstracting method based on compressed space sentence selection | |
CN114491062B (en) | Short text classification method integrating knowledge graph and topic model | |
CN113076750B (en) | Cross-domain Chinese word segmentation system and method based on new word discovery | |
CN117291192B (en) | Government affair text semantic understanding analysis method and system | |
CN112925886B (en) | Few-sample entity identification method based on field adaptation | |
CN115221871B (en) | Multi-feature fusion English scientific literature keyword extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201218 ||