CN111476031A - Improved Chinese named entity recognition method based on Lattice-LSTM - Google Patents

Improved Chinese named entity recognition method based on Lattice-LSTM Download PDF

Info

Publication number
CN111476031A
Authority
CN
China
Prior art keywords
sentence
LSTM
information
lattice
named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010167070.1A
Other languages
Chinese (zh)
Inventor
甘玲
黄成明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010167070.1A priority Critical patent/CN111476031A/en
Publication of CN111476031A publication Critical patent/CN111476031A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an improved Chinese named entity recognition method based on Lattice-LSTM, which belongs to the technical field of language processing and comprises the following steps: S1, constructing a model; S2, inputting features; S3, extracting features; S4, predicting labels; S5, evaluating results. An improved LSTM structure is adopted to compute the complete semantic hidden information of a sentence, while consideration of the global information of the whole sentence is added. Starting from the aspect of sentence structure, this makes up for the defect that the LSTM structure only attends to the hidden information of the literal meaning and does not consider sentence structure information. After the Transformer structure is fused, the model can understand, to a certain degree, the logic behind complex sentences, which helps recognize the types of named entities in the sentence.

Description

Improved Chinese named entity recognition method based on Lattice-LSTM
Technical Field
The invention belongs to the technical field of language processing, and relates to an improved Chinese named entity recognition method based on Lattice-LSTM.
Background
Named entity recognition was first organized as a task by Grishman and Sundheim at the Sixth Message Understanding Conference (MUC-6) in 1996. Its development can be roughly divided into three stages: an early primary stage in which entity classes were recognized with manual rules; an intermediate stage, around 2000, in which machine learning methods combined with probabilistic models were used to recognize entity classes; and the current deep stage, which combines language models with the now-popular deep learning methods. Each stage has advantages worth learning from, and the three stages are described below.
The early scheme using manual rules combined dictionary construction with rule-based word formation methods. This classification process required a large amount of engineering and was time-consuming, and it was quickly replaced by methods based on probability theory, but it still provided some guidance and prompted thinking about how to build cheaper and better models. In the stage when machine learning methods were applied on a large scale, because predecessors had already established many rich and fairly complete theoretical models, the main work of this stage was to combine probabilistic theory, put theory into practical production and life, and obtain positive results. In this stage, foreign theory and practice were closely combined, and many excellent models appeared, such as the Hidden Markov Model (HMM), the Maximum Entropy model (ME), the Support Vector Machine model (SVM) and the Conditional Random Field model (CRF); these models are related to one another, each mainly making up for some weaknesses of its predecessor. Meanwhile, domestic work also focused on the theory and practical application of Chinese named entity recognition, and some feature engineering built on the characteristics of Chinese characters appeared, with entity label prediction then carried out through machine learning.
Deep learning methods began to be popular on a large scale after 2014. With the maturity of software and hardware, text information on the Internet grew explosively, providing the massive data that deep learning relies on. After conditions in all aspects matured, deep learning gradually moved from theory to practice, and many models and ideas emerged in the named entity recognition direction, further making up for the shortcomings of machine learning and coming closer to people's production and life. Some important models include the BERT model, the GPT-2.0 model and the XLNet model; they are benchmark models of current deep learning, and the ideas behind many of them have influenced the direction of model improvement for named entity recognition and brought further gains.
For the recognition direction of the Chinese named entity, there are the following schemes:
(1) Based on the characteristics of Chinese text, schemes that further extract text feature information are considered. Their essence is similar to machine learning: the hope is to extract more effective information to assist the label prediction of each character. The basic text features include the character vector features of single Chinese characters and the word vector features after a sentence is segmented; later, schemes appeared that compute character label probabilities using character-level features such as pinyin and components (radicals).
(2) Based on the characteristics of Chinese text, the LSTM structure is selected to extract features with sequential language logic: the information of the preceding words in the sentence can be selectively fused with the information of the current word, the hidden state of the current word is obtained through a series of formula computations, and the label probability of the current word is then calculated. Since an LSTM can only compute the unidirectional hidden information of a sentence, a bidirectional LSTM structure, namely the bi-LSTM structure, is adopted; the bidirectional hidden states are spliced along the feature dimension and serve as the input to the final word label probability computation, achieving a better effect.
(3) On the basis of the LSTM structure, researchers have carried out further in-depth research and made further improvements within the LSTM, aiming to fuse character features and word features according to the sentence order, selectively merging word features into the computation of character features and further improving the named entity recognition effect.
(4) In addition, in English named entity recognition tasks, some researchers adopt different methods: the named entity label probability is computed based on a multi-layer Transformer structure, taking a route different from the LSTM structure.
The defects of the prior art are as follows:
(1) Named entity recognition research that adopts a bidirectional LSTM structure makes good use of character and word features, but LSTM has a serious shortcoming: character feature information cannot be transmitted over long distances. Studies have shown that in longer sentences, when an LSTM passes forward the preceding text information, the gating mechanism adopted in the LSTM structure limits how much information can be carried over long distances, and the loss of this information greatly affects the named entity recognition effect. Although researchers later adopted the GRU structure, which reduces the number of gates compared with LSTM, the problem of information being lost during forward and backward propagation through the sequence is still not fundamentally solved.
(2) In named entity recognition networks adopting the Transformer structure, the core of the Transformer is the computation of an attention matrix. Through this attention matrix, important characters or short phrases in a sentence can be used more prominently; such characters tend to express the key information of the whole sentence, which in turn helps predict the named entity class. Because matrix computation is used, the shortcoming of the gating structure in LSTM is avoided, and the character features of the whole sentence can be used more directly through direct matrix computation. However, there is also a drawback: the sequential relationship between preceding and following characters in the sentence cannot be exploited, so some special entities cannot be recognized as well when predicting named entities.
Disclosure of Invention
In view of the above, the present invention provides an improved Chinese named entity recognition method based on Lattice-LSTM, which addresses the accuracy of the Chinese named entity recognition effect. It computes named entity probabilities by combining the advantage of LSTM in processing sequences with the advantage of the Transformer structure in handling long-distance dependencies within sentences, thereby improving the recognition effect.
In order to achieve the purpose, the invention provides the following technical scheme:
an improved Chinese named entity recognition method based on Lattice-LSTM, comprising the following steps:
S1: constructing a model;
S2: inputting features;
S3: extracting features;
S4: predicting labels;
S5: evaluating results.
Optionally, in S1, a Transformer structure encoder part is introduced based on the Lattice-LSTM model, written and debugged using the Python language;
the experimental data set comprises a weibo data set, a Microsoft MSRA data set and a resume data set.
Optionally, the S2 specifically includes:
representing each character in a sentence by high-dimensional numeric vectors, which ultimately participate in the computation, wherein the feature information comprises word vectors, Chinese pinyin features, Chinese character component (radical) features and Chinese character glyph features;
and after concatenation, the multiple features express the characteristics of a Chinese character, wherein the features are high-dimensional vectors trained with different models. A minimal sketch of this feature concatenation is given below.
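For illustration only, the following is a minimal sketch, in Python/PyTorch (the implementation language mentioned later in this description), of how several character-level features could be looked up and concatenated into one input vector per character. The vocabulary sizes, embedding dimensions and module names are assumptions made for this example and are not specified by the patent.

```python
import torch
import torch.nn as nn

# Illustrative sketch: concatenating several character-level features
# (character vector, pinyin feature, component/radical feature) into one
# high-dimensional input vector per character. Vocabulary sizes and
# embedding dimensions below are assumptions for the example only.
class CharFeatureInput(nn.Module):
    def __init__(self, n_chars=6000, n_pinyin=400, n_radicals=300,
                 char_dim=100, pinyin_dim=30, radical_dim=30):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.pinyin_emb = nn.Embedding(n_pinyin, pinyin_dim)
        self.radical_emb = nn.Embedding(n_radicals, radical_dim)

    def forward(self, char_ids, pinyin_ids, radical_ids):
        # Each id tensor has shape (batch, seq_len); the concatenated feature
        # has shape (batch, seq_len, char_dim + pinyin_dim + radical_dim).
        return torch.cat([self.char_emb(char_ids),
                          self.pinyin_emb(pinyin_ids),
                          self.radical_emb(radical_ids)], dim=-1)
```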
Optionally, the S3 specifically comprises feature extraction in two aspects;
on one hand, for the extraction of character information features within a sentence, the Lattice-LSTM structure, an LSTM model improved for Chinese, is used: it fuses the features of single Chinese characters with the features of each word obtained after the sentence is word-segmented, and at the same time adopts a bidirectional structure, extracting sentence features sequentially in the forward direction and in the reverse direction and concatenating the two to obtain the basic character information of the sentence;
on the other hand, a Transformer structure is adopted: the importance degree of different characters in the sentence is calculated, the feature information of the characters in the sentence is then computed, and a feed-forward network structure is used to fully map and fuse the hidden character information, yielding the structural information of the sentence. This structural information is extracted from the sentence as a whole and can summarize the overall characteristics of the sentence; combined with the character information calculated above, it expresses the semantics and structural features of the entire sentence. A simplified sketch of this two-sided feature extraction follows.
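As a simplified, non-authoritative sketch of the character/word fusion and the forward/backward concatenation described above, the fragment below averages the embeddings of dictionary words ending at each character into that character's representation and then runs a bidirectional LSTM. The true Lattice-LSTM uses dedicated word cells and gates rather than simple averaging; all dimensions and tensor layouts here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Simplified sketch of the character-side feature extraction. It is NOT the
# exact Lattice-LSTM cell: matched-word embeddings are simply averaged into
# the character representations before a bidirectional LSTM, which only
# approximates the character/word fusion and the forward/backward
# concatenation. Dimensions are illustrative assumptions.
class CharWordBiLSTM(nn.Module):
    def __init__(self, char_dim=160, word_dim=160, hidden=128):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden, batch_first=True,
                              bidirectional=True)

    def forward(self, char_feats, word_feats, word_mask):
        # char_feats: (batch, seq_len, char_dim) concatenated character features
        # word_feats: (batch, seq_len, max_words, word_dim) embeddings of
        #             dictionary words ending at each character position
        # word_mask:  (batch, seq_len, max_words) float 1.0 where a word matches
        denom = word_mask.sum(-1, keepdim=True).clamp(min=1.0)
        word_avg = (word_feats * word_mask.unsqueeze(-1)).sum(2) / denom
        fused = char_feats + self.word_proj(word_avg)
        out, _ = self.bilstm(fused)   # (batch, seq_len, 2 * hidden)
        return out                    # forward/backward hidden states concatenated
```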
Optionally, the S4 specifically includes: adopting the mainstream conditional random field (CRF) structure, decoding the feature information from the previous part based on the Viterbi algorithm, and computing the globally optimal tag sequence of the whole sentence, wherein the tag sequence is the predicted entity tag category of the whole sentence.
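A minimal sketch of the Viterbi decoding step used with a linear-chain CRF is shown below: given per-character emission scores produced by the feature-extraction layers and a tag-transition score matrix, it recovers the globally optimal tag sequence for one sentence. Batching, start/stop transitions and CRF training are omitted; this is a sketch of the standard algorithm, not the patent's exact implementation.

```python
import torch

# Minimal Viterbi decoding sketch for a linear-chain CRF: given per-character
# emission scores and a tag-transition score matrix, return the globally
# optimal tag sequence for one sentence.
def viterbi_decode(emissions, transitions):
    # emissions:   (seq_len, n_tags) scores from the feature-extraction layers
    # transitions: (n_tags, n_tags) score of moving from tag i to tag j
    seq_len, n_tags = emissions.shape
    score = emissions[0]                  # best score of paths ending in each tag
    backpointers = []
    for t in range(1, seq_len):
        # candidate[i, j] = best path ending in tag i, then tag j at position t
        candidate = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = candidate.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    best_path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = int(best_prev[best_tag])
        best_path.append(best_tag)
    return list(reversed(best_path))
```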
Optionally, the S5 specifically comprises the evaluation indexes of the results: precision, recall and the comprehensive evaluation index F1, calculated as follows (a short computation sketch is given after the definitions):
the precision ratio is as follows: p is TP/(TP + FP)
The recall ratio is as follows: r is TP/(TP + FN)
Comprehensive evaluation indexes are as follows: f1=2PR/(P+R)
Wherein, TP: positive samples are predicted as positive samples; FP: negative samples are predicted as positive samples; FN: positive samples are predicted as negative samples.
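A short computation sketch of these three indexes from the TP/FP/FN counts defined above; the sample counts in the usage example are made up purely for illustration.

```python
def evaluate(tp, fp, fn):
    # Precision, recall and the comprehensive index F1 computed from the
    # counts defined above (true positives, false positives, false negatives).
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example (made-up counts): 80 entities correctly predicted, 20 spurious,
# 30 missed gives P = 0.80, R ≈ 0.727, F1 ≈ 0.762.
print(evaluate(80, 20, 30))
```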
The method has the advantage that, building on the existing mainstream ideas, the improved LSTM structure, namely the Lattice-LSTM structure, is adopted to compute the comprehensive hidden information of a sentence by exploiting the fusion of character and word features, while consideration of the global information of the whole sentence is added. Starting from the sentence structure, this makes up for the defect that the LSTM structure only attends to the hidden information of the literal meaning and ignores sentence structure information; after fusing the Transformer structure, the model can understand, to a certain degree, the logic behind complex sentences, which helps recognize the named entity categories in the sentence.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided only for the purpose of illustrating the invention and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
As shown in fig. 1, after analyzing the LSTM structure and the Transformer structure, the present invention finds that the two models can complement each other to overcome their deficiencies in the named entity recognition task. First, regarding the loss of long-distance information transfer in the LSTM structure: once the Transformer structure is used, there is no loss of text information passed forward and backward through the sentence, because of the way the matrix is computed. This matrix computation is the computation of the attention matrix, and the attention matrix indicates the importance of the text at the current position within the sentence; entity-type words often occupy a very important position in a sentence. However, using the Transformer structure alone cannot exploit the sequential semantic information of the sentence and therefore cannot solve the entity recognition problem well on its own.
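For clarity, the following is a minimal single-head sketch of the scaled dot-product attention matrix referred to above: the normalised entry weights[i, j] expresses how strongly position i attends to position j, computed over the whole sentence in one matrix operation rather than step by step as in an LSTM. Multi-head projections, masking and batching are omitted and are not taken from the patent.

```python
import math
import torch

# Sketch of the scaled dot-product attention matrix at the core of the
# Transformer structure. `weights` is the (seq_len, seq_len) attention matrix
# over the whole sentence; no information has to be carried position by
# position as in an LSTM. Single head, no masking, for clarity.
def attention(query, key, value):
    # query, key, value: (seq_len, d_model) projections of the character features
    d_model = query.size(-1)
    scores = query @ key.transpose(0, 1) / math.sqrt(d_model)
    weights = torch.softmax(scores, dim=-1)   # importance of each position pair
    return weights @ value, weights
```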
The invention is mainly divided into three parts, namely a feature input part, a feature extraction part and a label prediction part; the details are as follows:
(1) In the feature input part, high-dimensional numeric vectors are mainly used to represent each character in a sentence, and these high-dimensional vectors ultimately participate in the computation; at the same time, more feature information such as word vectors, Chinese pinyin features, Chinese character component (radical) features and Chinese character glyph features can be used as feature input. After concatenation, the multiple features express the characteristics of a Chinese character; these features are high-dimensional vectors trained with different models.
(2) The feature extraction part is mainly divided into two aspects. First, for the extraction of character information features within a sentence, the Lattice-LSTM structure, an LSTM model improved for Chinese, is used; this structure fuses the features of single Chinese characters with the features of each word obtained after the sentence is word-segmented, so that the literal meaning of each character in the sentence can be fully expressed. A bidirectional structure is also adopted: sentence features are extracted sequentially in the forward direction and in the reverse direction, and splicing the two yields the basic character information of the sentence. Second, a Transformer structure is adopted: by calculating the importance degree of different characters in the sentence, the overall characteristics of the sentence can be summarized; the feature information of the characters in the sentence is then computed, and a feed-forward network structure is used to fully map and fuse the hidden information. The resulting sentence structure information is extracted from the sentence as a whole and, combined with the character information computed before, fully expresses the feature information, whole-sentence structure information and semantic information of the sentence.
In the label prediction part, the commonly used mainstream conditional random field (CRF) structure is adopted; the feature information from the previous part is decoded based on the Viterbi algorithm, and the globally optimal label sequence of the whole sentence is computed, where this label sequence is the predicted entity label category of the whole sentence.
The invention improves on existing Chinese named entity recognition models and related language models. The main structure is as shown in the figure and is mainly divided into three parts, namely a text word vector input part, a feature extraction and fusion part, and a label prediction part. The three parts are verified on several public Chinese named entity recognition data sets, including the weibo data set, the Microsoft MSRA data set and the resume data set. A Transformer structure encoder part is introduced on top of the Lattice-LSTM structure, so that named entities are effectively computed from the perspective of the full sentence structure, and certain effects are obtained. The functions of each part are described above and are not repeated here.
The implementation process of the invention is as follows:
1. Model construction: based on the Lattice-LSTM model, a Transformer structure encoder part is introduced; the model is written and debugged using the Python language. A hypothetical sketch of such a model skeleton is given below.
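Purely as a hypothetical sketch of such a model skeleton, the fragment below places a PyTorch Transformer encoder next to a bidirectional LSTM over the character features and concatenates the two outputs before the per-character tag scoring that would feed the CRF. The plain BiLSTM stands in for the lattice structure, and all layer sizes, head counts and the use of nn.TransformerEncoder with batch_first (which requires a newer PyTorch than the 1.0 version named in the experiment section) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical model-construction sketch: a Transformer encoder part is added
# next to the character-level recurrent features, and both are concatenated
# before the tag-scoring layer. A plain BiLSTM stands in for the lattice
# structure; hyperparameters are assumptions, not taken from the patent.
class LatticeTransformerNER(nn.Module):
    def __init__(self, input_dim=160, hidden=128, n_heads=4, n_layers=2, n_tags=17):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden, batch_first=True,
                              bidirectional=True)
        encoder_layer = nn.TransformerEncoderLayer(d_model=input_dim,
                                                   nhead=n_heads,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.emission = nn.Linear(2 * hidden + input_dim, n_tags)

    def forward(self, char_feats):
        # char_feats: (batch, seq_len, input_dim) concatenated character features
        lstm_out, _ = self.bilstm(char_feats)    # sequential semantic information
        enc_out = self.encoder(char_feats)       # whole-sentence structure information
        combined = torch.cat([lstm_out, enc_out], dim=-1)
        return self.emission(combined)           # per-character tag scores for the CRF
```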
2. Experimental data set: the data sets comprise a weibo data set, a Microsoft MSRA data set and a resume data set.
3. Evaluation indexes of the experimental results: precision, recall and the comprehensive evaluation index F1, calculated as follows:
the precision ratio is as follows: p is TP/(TP + FP)
The recall ratio is as follows: r is TP/(TP + FN)
Comprehensive evaluation indexes are as follows: f1=2PR/(P+R)
Note:
TP: positive samples are predicted as positive samples;
FP: negative samples are predicted as positive samples;
FN: positive samples are predicted as negative samples;
4. experiment: the experiment is written in a windows 10 system by using python3.6.3 based on the 1.0 edition of the pytorch, and is obtained in an i7-8550U processor and 16GB memory experiment, and the experimental result is as follows:
Model | Data set | Precision/% | Recall/% | Comprehensive evaluation index F1/%
Lattice-LSTM model | weibo data set | 62.25 | 53.04 | 58.79
Lattice-LSTM + Transformer model | weibo data set | 67.65 | 56.79 | 61.74
Lattice-LSTM model | resume data set | 94.81 | 94.11 | 94.46
Lattice-LSTM + Transformer model | resume data set | 96.7 | 96.1 | 96.4
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. An improved Chinese named entity recognition method based on Lattice-LSTM, characterized by comprising the following steps:
S1: constructing a model;
S2: inputting features;
S3: extracting features;
S4: predicting labels;
S5: evaluating results.
2. The improved Chinese named entity recognition method based on Lattice-LSTM according to claim 1, wherein in the S1, a Transformer structure encoder part is introduced based on the Lattice-LSTM model, written and debugged using the Python language;
the experimental data set comprises a weibo data set, a Microsoft MSRA data set and a resume data set.
3. The improved Chinese named entity recognition method based on Lattice-LSTM according to claim 1, wherein the S2 specifically comprises:
representing each character in a sentence by high-dimensional numeric vectors, which ultimately participate in the computation, wherein the feature information comprises word vectors, Chinese pinyin features, Chinese character component (radical) features and Chinese character glyph features;
and after concatenation, the multiple features express the characteristics of a Chinese character, wherein the features are high-dimensional vectors trained with different models.
4. The improved Chinese named entity recognition method based on Lattice-LSTM according to claim 1, wherein the S3 is specifically feature extraction divided into two aspects;
on one hand, for the extraction of character information features within a sentence, the Lattice-LSTM structure, an LSTM model improved for Chinese, is used: it fuses the features of single Chinese characters with the features of each word obtained after the sentence is word-segmented, and at the same time adopts a bidirectional structure, extracting sentence features sequentially in the forward direction and in the reverse direction and concatenating the two to obtain the basic character information of the sentence;
on the other hand, a Transformer structure is adopted: the importance degree of different characters in the sentence is calculated, the feature information of the characters in the sentence is then computed, and a feed-forward network structure is used to fully map and fuse the hidden character information, yielding the structural information of the sentence; this structural information is extracted from the sentence as a whole and can summarize the overall characteristics of the sentence, and, combined with the character information calculated above, it expresses the semantics and structural features of the entire sentence.
5. The improved Chinese named entity recognition method based on Lattice-LSTM according to claim 1, wherein the S4 adopts the mainstream conditional random field (CRF) structure, decodes the feature information from the previous part based on the Viterbi algorithm, and computes the globally optimal tag sequence of the whole sentence, wherein the tag sequence is the predicted entity tag category of the whole sentence.
6. The improved Chinese named entity recognition method based on Lattice-LSTM according to claim 1, wherein the S5 specifically comprises the result evaluation indexes including precision, recall and the comprehensive evaluation index F1, calculated as follows:
the precision ratio is as follows: p is TP/(TP + FP)
The recall ratio is as follows: r is TP/(TP + FN)
Comprehensive evaluation indexes are as follows: f1=2PR/(P+R)
Wherein, TP: positive samples are predicted as positive samples; FP: negative samples are predicted as positive samples; FN: positive samples are predicted as negative samples.
CN202010167070.1A 2020-03-11 2020-03-11 Improved Chinese named entity recognition method based on Lattice-LSTM Pending CN111476031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167070.1A CN111476031A (en) 2020-03-11 2020-03-11 Improved Chinese named entity recognition method based on Lattice-LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010167070.1A CN111476031A (en) 2020-03-11 2020-03-11 Improved Chinese named entity recognition method based on Lattice-LSTM

Publications (1)

Publication Number Publication Date
CN111476031A true CN111476031A (en) 2020-07-31

Family

ID=71747370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167070.1A Pending CN111476031A (en) 2020-03-11 2020-03-11 Improved Chinese named entity recognition method based on Lattice-LSTM

Country Status (1)

Country Link
CN (1) CN111476031A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967267A (en) * 2020-09-23 2020-11-20 中国科学院计算技术研究所厦门数据智能研究院 XLNET-based news text region extraction method and system
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN112395880A (en) * 2020-11-19 2021-02-23 平安科技(深圳)有限公司 Error correction method and device for structured triples, computer equipment and storage medium
CN113157921A (en) * 2021-04-12 2021-07-23 北京语言大学 Chinese text classification method integrating radical semantics
CN113190656A (en) * 2021-05-11 2021-07-30 南京大学 Chinese named entity extraction method based on multi-label framework and fusion features
CN113609861A (en) * 2021-08-10 2021-11-05 北京工商大学 Food literature data-based multi-dimensional feature named entity identification method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284400A (en) * 2018-11-28 2019-01-29 电子科技大学 A kind of name entity recognition method based on Lattice LSTM and language model
US20190251431A1 (en) * 2018-02-09 2019-08-15 Salesforce.Com, Inc. Multitask Learning As Question Answering
CN110705294A (en) * 2019-09-11 2020-01-17 苏宁云计算有限公司 Named entity recognition model training method, named entity recognition method and device
CN110717331A (en) * 2019-10-21 2020-01-21 北京爱医博通信息技术有限公司 Neural network-based Chinese named entity recognition method, device, equipment and storage medium
CN110826334A (en) * 2019-11-08 2020-02-21 中山大学 Chinese named entity recognition model based on reinforcement learning and training method thereof
CN110866098A (en) * 2019-10-29 2020-03-06 平安科技(深圳)有限公司 Machine reading method and device based on transformer and lstm and readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251431A1 (en) * 2018-02-09 2019-08-15 Salesforce.Com, Inc. Multitask Learning As Question Answering
CN109284400A (en) * 2018-11-28 2019-01-29 电子科技大学 A kind of name entity recognition method based on Lattice LSTM and language model
CN110705294A (en) * 2019-09-11 2020-01-17 苏宁云计算有限公司 Named entity recognition model training method, named entity recognition method and device
CN110717331A (en) * 2019-10-21 2020-01-21 北京爱医博通信息技术有限公司 Neural network-based Chinese named entity recognition method, device, equipment and storage medium
CN110866098A (en) * 2019-10-29 2020-03-06 平安科技(深圳)有限公司 Machine reading method and device based on transformer and lstm and readable storage medium
CN110826334A (en) * 2019-11-08 2020-02-21 中山大学 Chinese named entity recognition model based on reinforcement learning and training method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MENGGE XUE et al.: "Porous Lattice-based Transformer Encoder for Chinese NER", 《COMPUTATION AND LANGUAGE》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967267A (en) * 2020-09-23 2020-11-20 中国科学院计算技术研究所厦门数据智能研究院 XLNET-based news text region extraction method and system
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN112395880A (en) * 2020-11-19 2021-02-23 平安科技(深圳)有限公司 Error correction method and device for structured triples, computer equipment and storage medium
CN113157921A (en) * 2021-04-12 2021-07-23 北京语言大学 Chinese text classification method integrating radical semantics
CN113190656A (en) * 2021-05-11 2021-07-30 南京大学 Chinese named entity extraction method based on multi-label framework and fusion features
CN113190656B (en) * 2021-05-11 2023-07-14 南京大学 Chinese named entity extraction method based on multi-annotation frame and fusion features
CN113609861A (en) * 2021-08-10 2021-11-05 北京工商大学 Food literature data-based multi-dimensional feature named entity identification method and system
CN113609861B (en) * 2021-08-10 2024-02-23 北京工商大学 Multi-dimensional feature named entity recognition method and system based on food literature data

Similar Documents

Publication Publication Date Title
CN111476031A (en) Improved Chinese named entity recognition method based on Lattice-LSTM
Liu et al. A survey of CRF algorithm based knowledge extraction of elementary mathematics in Chinese
WO2021082366A1 (en) Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
CN113761890B (en) Multi-level semantic information retrieval method based on BERT context awareness
Zhu et al. Improving Chinese named entity recognition by large-scale syntactic dependency graph
CN114065738B (en) Chinese spelling error correction method based on multitask learning
CN106383814A (en) Word segmentation method of English social media short text
CN110874536A (en) Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
Mai et al. Pronounce differently, mean differently: a multi-tagging-scheme learning method for Chinese NER integrated with lexicon and phonetic features
CN115269834A (en) High-precision text classification method and device based on BERT
CN114742016A (en) Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN114911947A (en) Concept extraction model based on knowledge prompt
CN116757184B (en) Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
CN110210033B (en) Chinese basic chapter unit identification method based on main bit theory
CN116484852A (en) Chinese patent entity relationship joint extraction method based on relationship diagram attention network
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN114548075A (en) Text processing method, text processing device, storage medium and electronic equipment
Ali et al. A subword guided neural word segmentation model for sindhi
Duan et al. New word detection using BiLSTM+ CRF model with features
Luosang et al. Tibetan Automatic Word Segmentation Method Based on Deep Learning
Chen et al. Chinese Financial Text Emotion Mining: GCGTS--A Character Relationship-based Approach for Simultaneous Aspect-Opinion Pair Extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200731