CN113065349A - Named entity recognition method based on conditional random field - Google Patents


Publication number
CN113065349A
CN113065349A
Authority
CN
China
Prior art keywords
word
named entity
module
entity recognition
sequence
Prior art date
Legal status
Pending
Application number
CN202110274547.0A
Other languages
Chinese (zh)
Inventor
刘义江
李云超
姜琳琳
吴彦巧
姜敬
檀小亚
师孜晗
陈蕾
侯栋梁
池建昆
范辉
阎鹏飞
魏明磊
辛锐
陈曦
杨青
沈静文
Current Assignee
Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
State Grid Hebei Electric Power Co Ltd
Original Assignee
Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
State Grid Hebei Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co, State Grid Hebei Electric Power Co Ltd filed Critical Xiongan New Area Power Supply Company State Grid Hebei Electric Power Co
Priority to CN202110274547.0A priority Critical patent/CN113065349A/en
Publication of CN113065349A publication Critical patent/CN113065349A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention belongs to the technical field of natural language processing and relates to a named entity recognition method based on a conditional random field, comprising the following steps: receiving a word sequence comprising Chinese text, with all words arranged in the order they appear in the original sentence; encoding the word sequence into a word vector group using the word vector module of the named entity recognition network, where the word vector group carries the named entity feature information of each word; extracting the sequence features of each word vector in the word vector group using the long and short memory network module of the named entity recognition network, and outputting a state score matrix over the named entity classification space; and searching the state score matrix for the highest-scoring path using the conditional random field module of the named entity recognition network, outputting it as the named entity prediction result for each word in the word sequence. The invention provides an efficient named entity recognition method, in particular for recognizing the names of people, organizations, and places in financial bill pictures.

Description

Named entity recognition method based on conditional random field
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a named entity recognition system and method.
Background
Finding specified information in a huge amount of text is an important technical problem in natural language processing. With the appearance of mass data, the workload of manually searching for information has become too large to meet practical needs, and techniques for rapidly locating relevant information from mass data by computer have arisen. Natural language processing, as a basic method for locating information, has quickly become a research focus in recent years.
The main task of Named Entity Recognition (NER) is to recognize meaningful proper nouns and quantitative phrases in text, such as names of people, places, and organizations, times, dates, currency amounts, etc., which have a significant impact on syntactic analysis, semantic analysis, and the like. Furthermore, named entity recognition is an important basis for information extraction, information filtering, information retrieval, chunk analysis, question-answering systems, machine translation, and other technologies.
English named entity recognition technology has reached a relatively high level, and some systems have been put into practical use. Chinese named entity recognition lags considerably behind English named entity recognition and is more difficult. The main reasons include, but are not limited to: Chinese text has no spaces between words; names are formed in many different ways; Chinese corpora are scarce; and punctuation, document formats, and the like all affect named entity recognition.
As with most natural language processing techniques, named entity recognition methods can be largely divided into two broad categories: rule-based methods and statistics-based methods. In general, the rule-based method is closer to human reasoning, and its representation is more intuitive and natural, which facilitates inference. However, writing rules often depends on a specific language and domain, so portability is poor; the process of writing rules is time-consuming and labor-intensive, complete rule sets are difficult to achieve, and the work can only be done by domain and language experts, so the overall cost-effectiveness is low. Compared with the rule-based method, the statistics-based method is more flexible and objective and does not require much manual intervention or domain knowledge, but it has problems in aspects such as space-time overhead.
Disclosure of Invention
The invention comprehensively considers the shortcomings of prior art schemes and aims to provide an efficient named entity recognition method, in particular for recognizing the person, organization, and place names that dominate the mixed Chinese text of financial bill images.
The technical scheme provided by the invention is a named entity recognition method based on a conditional random field, implemented by a processor executing program instructions, where the program instructions include instructions implementing a named entity recognition network composed of a word vector (word2vec) module, a long short-term memory network (LSTM) module, and a conditional random field (CRF) module. The method comprises the following steps:
receiving a word sequence comprising Chinese text; arranging all words in the word sequence according to the context sequence of the words in the original sentence;
encoding the word sequence into a word vector group using a word vector module of a named entity recognition network; the word vector group comprises named entity characteristic information of each word;
extracting sequence characteristics of each word vector in the word vector group by using a long and short memory network module of the named entity recognition network, and outputting a state score matrix of a named entity classification space;
and searching the score path with the highest score in the state score matrix by using a conditional random field module of the named entity recognition network as a named entity prediction result of each word in the word sequence.
Preferably, the training of the named entity recognition network comprises: training the word vector module; and simultaneously training the long and short memory network module and the conditional random field module.
Preferably, the training of the word vector module comprises: performing coding training based on preset named entity classification on the neural network of the word vector module by using a one-hot coded word stock; performing coding training on the neural network of the word vector by using a word sequence with continuous fixed length in a language database sentence based on whether a word is adjacent to the word in the word sequence or not; further preferably, the word sequence has a fixed length of 3.
Preferably, when the word vector module is trained on the coding task of determining whether a word is adjacent to another word in the word sequence, the loss L is computed by the cross-entropy loss function used for binary classification tasks:

L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)]

where y denotes the true value and ŷ denotes the predicted value, both taking values in {0, 1}.
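As a minimal sketch of the loss above (the function name and the small epsilon clamp are illustrative additions, not from the patent), the binary cross-entropy of a single prediction can be computed as:

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss for the binary "is this word a valid middle
    filler between its neighbours?" task: L = -[y log(ŷ) + (1-y) log(1-ŷ)].
    y is the true label in {0, 1}; y_hat is the predicted probability."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident correct prediction yields a small loss,
# a confident wrong prediction a large one.
low = binary_cross_entropy(1, 0.99)
high = binary_cross_entropy(1, 0.01)
```

The clamp is a common numerical safeguard so that a predicted probability of exactly 0 or 1 does not produce log(0).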
Preferably, the training of the long and short memory network module and the conditional random field module simultaneously comprises: training is performed using the set of word vectors output by the word vector module as samples.
Preferably, the long and short memory network module performs classification scoring on each word vector of the word vector group based on context information to obtain a classification vector of each word vector; the classification vectors are combined into a state score matrix.
Preferably, the conditional random field module is configured to decode each classification vector in the state score matrix into a state score matrix comprising all classification paths using its transition matrix, and to select as output the path with the largest sum of scores in each path.
Preferably, when the long and short memory network module and the conditional random field module are trained, the loss function Loss of the overall neural network is designed as:

Loss = -log( e^(P_real-path) / (e^(P_1) + e^(P_2) + … + e^(P_n)) )

where P_1, P_2, …, P_n are the scores of all paths computed by the conditional random field, and P_real-path is assumed to be the highest-scoring (real) path among them.
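The loss above is a softmax over whole paths: it is the negative log of the real path's share of the exponentiated total. A small sketch under that reading (the function and variable names are illustrative):

```python
import math

def crf_loss(path_scores, real_path_index):
    """Negative log-likelihood of the real path:
    Loss = -log( exp(P_real) / sum_i exp(P_i) )."""
    total = sum(math.exp(s) for s in path_scores)
    real = math.exp(path_scores[real_path_index])
    return -math.log(real / total)

# When the real path's score dominates all others, the loss approaches 0.
loss = crf_loss([10.0, 1.0, 0.5], real_path_index=0)
```

Minimizing this loss therefore pushes the real path's score above the scores of all competing paths, which is exactly the condition the decoder later exploits.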
Preferably, the classification dimensions of the named entity feature information are configured as the PER (person), ORG (organization), LOC (location), and O (other) dimensions.
In some embodiments of the invention, the word vector module assumes that an input sentence consists of n words [w_1, w_2, …, w_n] arranged in order. The word vector module mainly encodes these n words and obtains each encoded word's corresponding vector in word space, [e_1, e_2, …, e_n], where word vector e_i corresponds to word w_i, i ∈ {1, …, n}. Each vector in the group of word vectors is fed in turn into the long and short memory network module, which performs classification scoring based on context information. In some embodiments of the present invention, the long and short memory network module is configured with a bidirectional LSTM (BiLSTM) structure: it sequentially reads each word vector in the group output by the word vector module, takes the context of each word into account, extracts sequence features from the encoded word vectors, scores each word against the classifications, and outputs a score vector in the classification space. In other embodiments of the present invention, the long and short memory network module is configured based on any improved long short-term memory network in the prior art capable of processing context information. In the invention, the classifications are named entity types; for example, person name, organization name, and location are three classification directions, and the higher a classification's score, the more the long and short memory network module considers the word vector to belong to that classification. Treating each classification as a dimension, all preconfigured classifications form the classification space of the present invention. The classification corresponding to the highest-scoring component of the score vector indicates which type of named entity the word belongs to.
To comprehensively account for the realistic significance of the LSTM output as an overall sequence, the invention further evaluates the LSTM output using a conditional random field module, selecting the scoring path with the highest score as the final prediction result.
In some embodiments, feature generation and selection use conditional random field models, and the finally selected features are stored in a feature library. Because the structure of a named entity is highly random, a good recognition effect is difficult to obtain by analyzing only its structure and characters; the contextual information around the named entity must be fully mined. A conditional random field can express long-distance context dependencies and effectively fuse various related or unrelated pieces of information together.
In the prior art, the feature space of a conditional random field model is generally very large: for a training corpus, thousands of features are often generated, but not all of them are useful, and excessive features reduce the operating efficiency of the system, so feature selection is a key problem. Two feature selection methods are common: the incremental method and the threshold method. The basic idea of the incremental method is to compute the information gain of each feature; if adding the feature improves system performance, it is retained, otherwise it is deleted. The incremental method selects features well, but its additional space-time overhead is large. The threshold method counts the frequency of occurrence of each feature: if a feature occurs in the corpus fewer times than a set threshold, it is deleted, otherwise it is kept. The threshold method is simple and practical, but it cannot guarantee that the selected feature set is minimal and complete; some redundant features probably remain, which reduces the operating efficiency of the system. Some preferred embodiments of the present invention employ the threshold method, with the threshold set at 2. Preferably, model parameter training uses the features in the feature library and the training corpus to train parameters by the L-BFGS method and obtain the model parameters; preferably, named entity recognition is performed with the trained conditional random field model, which outputs the text labeled with named entities.
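The threshold method described above amounts to a frequency cutoff over observed features. A minimal sketch, assuming features are represented as strings (the example feature names are made up for illustration):

```python
from collections import Counter

def select_features(observed_features, threshold=2):
    """Threshold-method feature selection: keep a feature only if it
    occurs at least `threshold` times in the corpus (threshold 2, as in
    the preferred embodiments)."""
    counts = Counter(observed_features)
    return {f for f, n in counts.items() if n >= threshold}

# Hypothetical feature occurrences extracted from a corpus.
kept = select_features(
    ["prev=总", "prev=总", "suffix=公司", "suffix=公司", "suffix=公司", "rare=x"])
```

Features seen only once (here "rare=x") are dropped, which is exactly why the method cannot guarantee a minimal-and-complete feature set: it filters by frequency, not by information content.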
The method adopts the statistical conditional random field algorithm to recognize Chinese named entities. The conditional random field inherits the advantages of the maximum entropy model and can effectively integrate various related or unrelated context information; adding external semantic knowledge to the feature selection of the conditional random field model effectively improves the accuracy of named entity recognition. The invention does not need to extract general named entity indicator words from a labeled corpus and then expand their number.
Drawings
FIG. 1 is a schematic structural diagram of a named entity recognition system for implementing a named entity recognition method based on a conditional random field according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a neural network structure of a word vector module of the named entity recognition system according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a neural network structure of a long/short memory network module of the named entity recognition system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a neural network structure of an LSTM module in the long and short memory network modules according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the operation principle of a neural network of the conditional random field module of the named entity recognition system according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a data processing procedure in the named entity recognition system according to an embodiment of the present disclosure;
FIG. 7 is a flowchart illustrating a method for recognizing a named entity based on a conditional random field according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the prior art, a conditional random field model for named entity recognition needs to be trained with a labeled external semantic library. The conditional random field model can integrate long-distance context information; the context information is described by feature functions, and feature selection directly affects the performance of the model. In Chinese, semantic features are important context information, but they are often hidden in the context and require deep mining. In the prior art, an external semantic library is established for the conditional random field model to use; the external semantic libraries established for named entity recognition include a person-name indicator lexicon, a Chinese surname table, and a common-name table; a place-name indicator lexicon and a common place-name table; and an organization-name indicator lexicon, a common organization-name table, organization-name feature suffixes, and the like. The invention considers the performance indicators of a specific application scenario and, given the small number of required classifications, avoids labeling training samples by training a neural network for word vector encoding.
Herein, unless otherwise specified, the term "model" as it relates to deep learning network models, word2vec models, and the like refers to a set of mathematical algorithms implemented by sequences of computer program instructions with various configuration parameters and defined input data which, when read by a processor, implement the processing of computer data to achieve a specified technical effect. In the following embodiments of the inventive concept disclosed herein, the specific function code can be implemented by means of common general knowledge in the art once the specific concept is understood, and such implementation details are therefore not repeated.
Referring to fig. 1 to 7, the embodiment is a named entity recognition method based on conditional random fields, which is implemented by a named entity recognition system 1000 executed on a processor, the processor receives chinese sentence data including chinese vocabulary and executes a program of the named entity recognition system 1000. The program instructions for operation of the named entity recognition system 1000 include program instructions for implementing a named entity recognition network comprised of program modules such as a word vector module 1001, a long short term memory network module 1002, and a conditional random field module 1003.
Exemplarily, the word vector module uses word2vec to encode Chinese characters and alphabetic symbols; the Chinese character library in this embodiment includes about 7000 common Chinese characters, and the encoder has the advantage of performing a reasonable dimension-reduction operation on one-hot encoded vectors. The word vector network structure used in training this embodiment's word vector module is shown in fig. 2, where the Layer1 network layer is used to obtain the word vector of each word in a word sequence. In this embodiment, more than 8 million encyclopedia entry sentences are used as the training corpus, and the open-source component library "jieba" is first used to segment every sentence in the whole corpus into words. We train the word vector network model using the CBOW model.
A preferred scheme of this embodiment is that, during training, the word vector module each time selects the one-hot vectors corresponding to 3 consecutive words in the word sequence as the input of the word vector network's Layer1, and determines through two neural network layers (Layer1 and Layer2) whether the middle word can serve as a filler between the preceding and following words. In training, the loss function adopts the cross-entropy loss function commonly used for classification tasks:
L = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)]

where y denotes the true value and ŷ the predicted value, both taking values in {0, 1}; this realizes the binary classification of Layer2 in fig. 2 during training. After training is finished, only Layer1 is used to output the word vectors of the words in a word sequence. It can be seen that Layer1 has been trained to encode, for each word in a continuous word sequence, its position information with respect to the invention's four-dimensional classification space and the degree to which it fits as a filler between its adjacent words.
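After training, using only Layer1 amounts to an embedding lookup: multiplying a one-hot vector by the Layer1 weight matrix simply selects one row of it. A minimal sketch with illustrative sizes (10 words, 4 dimensions; not the patent's actual vocabulary or dimensionality):

```python
import random

# Layer1 as a weight matrix: one row per vocabulary word.
VOCAB_SIZE, EMBED_DIM = 10, 4
random.seed(0)
layer1 = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
          for _ in range(VOCAB_SIZE)]

def one_hot(index, size=VOCAB_SIZE):
    v = [0.0] * size
    v[index] = 1.0
    return v

def embed(word_index):
    # The one-hot x Layer1 product reduces to picking out row `word_index`.
    oh = one_hot(word_index)
    return [sum(oh[i] * layer1[i][d] for i in range(VOCAB_SIZE))
            for d in range(EMBED_DIM)]

vec = embed(3)
```

This is why the trained Layer1 alone suffices for inference: the expensive binary-classification head (Layer2) is only needed to shape the rows of `layer1` during training.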
In this embodiment, referring to fig. 3, the BiLSTM in the long and short memory network module takes as input the word vectors of each segmented sentence and outputs the score vector of each word. The score vector is 4-dimensional, each dimension corresponding to one of the four named entity classifications of this embodiment; the embodiment only considers feature extraction for three named entity types, PER, ORG, and LOC, and the possible named entities of all remaining words are represented by O, forming a 4-dimensional named entity classification space. For an input vector sequence [x_1, x_2, …, x_i] ordered in time, the LSTM modules in layer A cyclically consider each input's dependency on its preceding context, i.e., the iteration of the hidden state s is performed in forward order, while the LSTM modules in layer A′ cyclically consider each input's dependency on its following context, i.e., the iteration of the hidden state s′ is performed in reverse order. It can be seen that, in practice, owing to the special model structure of LSTM-type recurrent neural networks, the input layer imposes no limit on the sequence length of the input itself.
Specifically, a single LSTM module of this embodiment is shown in fig. 4: at step t, the module receives the cell state C_{t-1} and hidden state H_{t-1} from the previous step and outputs the next hidden state H_t and cell state C_t. In the figure, M denotes the sigmoid activation function:

M(x) = 1 / (1 + e^(-x))
the remaining part of the calculation formula is as follows:
Ft=M(Wf·[ht-1,xt]+bf)
It=M(Wi·[ht-1,xt]+bi)
Lt=tanh(Wc·[ht-1,xt]+bc)
Ct=Ft*Ct-1+It*Lt
Ot=M(Wo·[ht-1,xt]+bo)
Ht=Ot*tanh(Ct)
wherein each W and b is the coefficient and offset in the linear relationship model with its subscript, which is the Hadamard product, xtThe input of step t.
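The six formulas above can be sketched as one LSTM step. For clarity this sketch is scalar: each gate has a single weight pair [w_h, w_x] and a bias rather than the matrices a real cell uses, and all parameter values below are illustrative:

```python
import math

def sigmoid(x):
    """The activation M in the formulas: M(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the formulas above.
    W and b are dicts keyed 'f', 'i', 'c', 'o' (forget, input,
    candidate, output)."""
    def gate(key, act):
        w_h, w_x = W[key]
        return act(w_h * h_prev + w_x * x_t + b[key])
    f_t = gate('f', sigmoid)           # forget gate F_t
    i_t = gate('i', sigmoid)           # input gate I_t
    l_t = gate('c', math.tanh)         # candidate state L_t
    c_t = f_t * c_prev + i_t * l_t     # new cell state C_t
    o_t = gate('o', sigmoid)           # output gate O_t
    h_t = o_t * math.tanh(c_t)         # new hidden state H_t
    return h_t, c_t

W = {k: (0.5, 0.5) for k in 'fico'}    # toy weights
b = {k: 0.0 for k in 'fico'}           # zero biases
h, c = lstm_step(1.0, 0.0, 0.0, W, b)
```

Note that with h_prev = c_prev = 0, the forget gate contributes nothing, so C_t reduces to I_t ∗ L_t, matching the formulas term by term.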
Illustratively, in this embodiment, the processing procedure of the conditional random field module using its conditional random field model is as follows. A Chinese word sequence [w_1, w_2, w_3] is processed by the trained word vector module into a group of word vectors [v_1, v_2, v_3], which, after encoding by the bidirectional LSTM layer, yields each word vector's score on every dimension of the named entity classification, i.e., the score vectors. The conditional random field module receives the LSTM's score vectors as input, computes the scores of all paths in combination with the state transition score matrix, and takes the sequence corresponding to the highest-scoring path as the final prediction result, i.e., it selects from all paths the classification result with the highest overall probability. The invention uses the CRF as a decoder; the working principle of the conditional random field module is shown in fig. 5. Exemplarily, suppose the output of the long and short memory modules comprises three classification vectors: a = [0.1, 0.2, 0.8, 0], b = [0.4, 0.2, 0.2, 0], c = [0.6, 0.1, 0.2, 0.1], where the four values in each classification vector correspond to the four classifications PER, ORG, LOC, and O respectively. The score of the path (ORG, ORG, ORG) is then: a[1] + b[1] + c[1] + FF_start-org + FF_org-org + FF_org-org + FF_org-end. When the classification space dimension is 4, it is easy to see that the state score matrix contains 4 × 4 × 4 = 64 possible paths.
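The path scoring above can be enumerated exhaustively for this small example. In this sketch the transition scores FF are set to a uniform made-up value (the trained CRF would learn them), so the best path is determined by the emission scores alone:

```python
from itertools import product

LABELS = ["PER", "ORG", "LOC", "O"]
# The three classification vectors a, b, c from the example above.
emissions = [[0.1, 0.2, 0.8, 0.0],   # a
             [0.4, 0.2, 0.2, 0.0],   # b
             [0.6, 0.1, 0.2, 0.1]]   # c

# Uniform illustrative transition scores, including START/END transitions.
FF = {(s, t): 0.1 for s in LABELS + ["START"] for t in LABELS + ["END"]}

def path_score(path):
    """Emission scores plus START->first, between-word, last->END transitions."""
    s = sum(emissions[i][LABELS.index(lab)] for i, lab in enumerate(path))
    s += FF[("START", path[0])]
    s += sum(FF[(path[i], path[i + 1])] for i in range(len(path) - 1))
    s += FF[(path[-1], "END")]
    return s

all_paths = list(product(LABELS, repeat=3))   # 4 x 4 x 4 = 64 paths
best = max(all_paths, key=path_score)
```

With uniform transitions, the winning path simply picks the per-position maximum of a, b, c; non-uniform trained transitions are what let the CRF overrule locally plausible but globally inconsistent label sequences.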
It is easy to understand that, in the present invention, the conditional random field module receives the output [y_1, y_2, …, y_n] from the long and short memory network module as the state scores, and the transition score matrix is a 6 × 6 state score matrix, shown below, in which the START and END rows and columns mark the beginning and end of the long and short memory loop; they are insensitive to the training result and serve as the start and end points of the conditional random field's path selection.
        Start  Per  Org  O    Loc  End
Start
Per
Org
O
Loc
End

(Rows index the source state and columns the target state; each entry is a learned transition score FF from the row state to the column state.)
Suppose the predicted path sequence obtained through the LSTM is XL = [ORG, O, PER]. The score of this path is s = F_org + F_o + F_per + FF_start-org + FF_org-o + FF_o-per + FF_per-end, where F denotes a score output by the LSTM and FF denotes an entry of the state transition score matrix (i.e., the transition matrix of the conditional random field in the present invention). In this embodiment, the conditional random field module is configured to obtain the final predicted path sequence by computing the maximum-score path in this way.
In this embodiment, after the word vector module has been trained, it serves as a fixed preprocessing step for the input samples, and each of its outputs is used to train the subsequent long and short memory module and conditional random field module simultaneously: the BiLSTM's per-cell weights and biases and the CRF's state transition score matrix parameters are trained together. During training, the CRF computes each path score P_1, P_2, …, P_n; let P_real-path be the path assumed to have the maximum score among them. The loss function Loss of the overall neural network during training is designed as:
Loss = -log( e^(P_real-path) / (e^(P_1) + e^(P_2) + … + e^(P_n)) )
by minimizing the function, the whole neural network included in the conditional random field module and the long and short memory network module can be supervised and trained, and specifically, in the embodiment, an Adam optimizer is used for parameter learning.
After each neural network in the named entity recognition network of this embodiment has been trained as above, the named entity recognition result of a word sequence can be obtained by the named entity recognition method provided by the invention. The invention is fully disclosed by the following complete implementation of this embodiment, which comprises steps 100 to 400.
Step 100: receive a word sequence containing Chinese text, with the words arranged in the context order of their original sentence. Specifically, the word sequence in this embodiment is obtained as follows: detect and recognize the text content in a bill picture and convert it into a first text string usable for computer processing; segment the text string with an external word segmentation tool to generate a second string carrying word-boundary marks; and, according to those marks, extract three consecutive words of the second string to form a word sequence. Exemplarily, the first text string is a sentence delimited by punctuation, such as "Zhang San is at Tiananmen". The open-source jieba segmentation tool splits the whole sentence into three words, such as "Zhang San", "zai (at)", and "Tiananmen", which are one-hot encoded in the same way as when training the word vector module, yielding a word sequence of the form [w_1, w_2, w_3].
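The window-of-three extraction in step 100 can be sketched as follows. The "/" separator stands in for the word-boundary marks of the second string, and the sample sentence (张三/在/天安门, "Zhang San / at / Tiananmen") is illustrative; real input would come from a tokenizer such as jieba:

```python
def windows_of_three(segmented, sep="/"):
    """Split a boundary-marked string into words, then take every
    window of three consecutive words as one input word sequence."""
    words = [w for w in segmented.split(sep) if w]
    return [words[i:i + 3] for i in range(len(words) - 2)]

seqs = windows_of_three("张三/在/天安门")
```

For a three-word sentence this yields exactly one window; longer sentences yield overlapping windows, one per starting position.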
Step 200: encode the word sequence into a word vector group using the word vector module of the named entity recognition network; the word vector group contains the named entity feature information of each word. The trained word vector module encodes the three words separately, yielding a group of three word vectors [v_1, v_2, v_3].
Step 300: extract the sequence features of each word vector in the word vector group using the long and short memory network module of the named entity recognition network, and output the state score matrix of the named entity classification space. Specifically, the BiLSTM of the trained long and short memory network module produces a score vector for each input word vector; the post-processing part of the module then sends these score vectors, as state scores, together with the corresponding transition matrix to the conditional random field module.
Step 400: search the state score matrix for the highest-scoring path using the conditional random field module of the named entity recognition network, and output it as the named entity prediction result for each word in the word sequence. The trained conditional random field module obtains the state score matrix from the score sequence and the trained transition matrix; the path sequence with the maximum score is the final result, for example the path sequence XL = [PER, O, LOC].
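The highest-scoring-path search in step 400 is typically done with Viterbi dynamic programming rather than enumerating all paths. A minimal sketch under that standard approach (all scores below are illustrative, not trained values, and START/END transitions are omitted for brevity):

```python
LABELS = ["PER", "ORG", "LOC", "O"]

def viterbi(emissions, trans):
    """emissions[t][k]: score of label k at position t;
    trans[j][k]: transition score from label j to label k.
    Returns the highest-scoring label path."""
    n_labels = len(emissions[0])
    score = list(emissions[0])     # best score of a path ending in k at t=0
    back = []                      # backpointers, one list per position t>=1
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            cand = [score[j] + trans[j][k] for j in range(n_labels)]
            j_best = max(range(n_labels), key=lambda j: cand[j])
            new_score.append(cand[j_best] + emissions[t][k])
            ptr.append(j_best)
        score, back = new_score, back + [ptr]
    k = max(range(n_labels), key=lambda j: score[j])
    path = [k]
    for ptr in reversed(back):     # follow backpointers to the start
        k = ptr[k]
        path.append(k)
    return [LABELS[i] for i in reversed(path)]

emissions = [[0.9, 0.1, 0.1, 0.1],
             [0.1, 0.1, 0.1, 0.9],
             [0.1, 0.1, 0.9, 0.1]]
trans = [[0.0] * 4 for _ in range(4)]   # neutral transitions
path = viterbi(emissions, trans)
```

With neutral transitions the result is the per-position argmax, matching the example path XL = [PER, O, LOC] above; the dynamic program's cost is O(n·k²) instead of the O(kⁿ) of full enumeration.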
The implementation steps above are only exemplary; the moment each step is carried out depends on its preconditions rather than on the numbering. For example, the training of the neural network in the word vector module can be done in advance and need not take place after the word sequence is received.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is only one logical division, and other divisions may be used in practice; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. On the other hand, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, such as calls to external neural network units, and may take a local, remote, or mixed resource configuration form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing device, or each module may exist alone physically, or two or more modules are integrated into one processing device. The integrated module can be realized in a form of hardware or a form of a software functional unit.
The integrated module, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The named entity recognition method based on a conditional random field provided by the present invention has been described in detail above; specific examples are applied herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and their scope of application. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (10)

1. A method for named entity recognition based on conditional random fields, implemented by a processor executing program instructions, the method comprising:
receiving a word sequence comprising Chinese text; arranging all words in the word sequence according to the context sequence of the words in the original sentence;
encoding the word sequence into a word vector group using a word vector module of a named entity recognition network; the word vector group comprises named entity characteristic information of each word;
extracting sequence characteristics of each word vector in the word vector group by using a long and short memory network module of the named entity recognition network, and outputting a state score matrix of a named entity classification space;
and searching the score path with the highest score in the state score matrix by using a conditional random field module of the named entity recognition network, and outputting the score path as a named entity prediction result of each word in the word sequence.
2. The named entity recognition method of claim 1, wherein training the named entity recognition network comprises: training the word vector module; and simultaneously training the long and short memory network module and the conditional random field module.
3. The named entity recognition method of claim 2, wherein training the word vector module comprises: performing coding training, based on preset named entity classifications, on the neural network of the word vector module by using a one-hot encoded lexicon; and performing coding training on the neural network of the word vector module by using continuous fixed-length word sequences from corpus sentences, based on judging whether the words in each word sequence are adjacent.
4. The named entity recognition method of claim 3, wherein the word sequence has a fixed length of 3.
5. The named entity recognition method of claim 4, wherein, when the word vector module is trained based on the coding that judges whether the words in the word sequence are adjacent, the loss L is configured to be calculated by a cross-entropy loss function based on the classification task:

L = −[y·log ŷ + (1 − y)·log(1 − ŷ)]

wherein y represents the true value, ŷ represents the predicted value, and both take values in {0, 1}.
6. The named entity recognition method of claim 2, wherein the simultaneous training of the long and short memory network module and the conditional random field module comprises: training is performed using the set of word vectors output by the word vector module as samples.
7. The named entity recognition method of any one of claims 1-6, wherein each word vector of the set of word vectors is classified and scored based on context information by a long-short memory network module to obtain a classification vector of each word vector; the classification vectors are combined into a state score matrix.
8. The named entity recognition method of any one of claims 1-6, wherein the conditional random field module is configured to decode the classification vectors in the state score matrix, using its transition matrix, into the scores of all classification paths, and to select the path with the highest score among them as the output.
9. The named entity recognition method of claim 6, wherein, when the long and short memory network module and the conditional random field module are trained, the loss function Loss of the overall neural network is designed as:

Loss = −log( P_real-path / (P_1 + P_2 + … + P_n) )

wherein P_1, P_2, …, P_n are the scores of the paths computed from the conditional random field, and P_real-path, the score of the true path, is assumed to be the largest among them.
10. The named entity recognition method of claim 1, wherein the classification dimensions of the named entity feature information are configured as PER, ORG, LOC, and O dimensions.
CN202110274547.0A 2021-03-15 2021-03-15 Named entity recognition method based on conditional random field Pending CN113065349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110274547.0A CN113065349A (en) 2021-03-15 2021-03-15 Named entity recognition method based on conditional random field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110274547.0A CN113065349A (en) 2021-03-15 2021-03-15 Named entity recognition method based on conditional random field

Publications (1)

Publication Number Publication Date
CN113065349A true CN113065349A (en) 2021-07-02

Family

ID=76561387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110274547.0A Pending CN113065349A (en) 2021-03-15 2021-03-15 Named entity recognition method based on conditional random field

Country Status (1)

Country Link
CN (1) CN113065349A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591480A (en) * 2021-07-23 2021-11-02 深圳供电局有限公司 Named entity identification method and device for power metering and computer equipment
CN113591479A (en) * 2021-07-23 2021-11-02 深圳供电局有限公司 Named entity identification method and device for power metering and computer equipment
CN116720520A (en) * 2023-08-07 2023-09-08 烟台云朵软件有限公司 Text data-oriented alias entity rapid identification method and system

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108182279A (en) * 2018-01-26 2018-06-19 有米科技股份有限公司 Object classification method, device and computer equipment based on text feature
US20190163742A1 (en) * 2017-11-28 2019-05-30 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating information
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN111832303A (en) * 2019-04-12 2020-10-27 普天信息技术有限公司 Named entity identification method and device
CN111832302A (en) * 2019-04-10 2020-10-27 北京京东尚科信息技术有限公司 Named entity identification method and device

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20190163742A1 (en) * 2017-11-28 2019-05-30 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating information
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN108182279A (en) * 2018-01-26 2018-06-19 有米科技股份有限公司 Object classification method, device and computer equipment based on text feature
CN111832302A (en) * 2019-04-10 2020-10-27 北京京东尚科信息技术有限公司 Named entity identification method and device
CN111832303A (en) * 2019-04-12 2020-10-27 普天信息技术有限公司 Named entity identification method and device

Non-Patent Citations (1)

Title
CHAO LIU ET AL.: "Chinese Named Entity Recognition Based on BERT with Whole Word Masking", 《PROCEEDINGS OF THE 2020 6TH INTERNATIONAL CONFERENCE ON COMPUTING AND ARTIFICIAL INTELLIGENCE》 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN113591480A (en) * 2021-07-23 2021-11-02 深圳供电局有限公司 Named entity identification method and device for power metering and computer equipment
CN113591479A (en) * 2021-07-23 2021-11-02 深圳供电局有限公司 Named entity identification method and device for power metering and computer equipment
WO2023000725A1 (en) * 2021-07-23 2023-01-26 深圳供电局有限公司 Named entity identification method and apparatus for electric power measurement, and computer device
CN116720520A (en) * 2023-08-07 2023-09-08 烟台云朵软件有限公司 Text data-oriented alias entity rapid identification method and system
CN116720520B (en) * 2023-08-07 2023-11-03 烟台云朵软件有限公司 Text data-oriented alias entity rapid identification method and system

Similar Documents

Publication Publication Date Title
CN109657239B (en) Chinese named entity recognition method based on attention mechanism and language model learning
CN111382565B (en) Emotion-reason pair extraction method and system based on multiple labels
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN110263325B (en) Chinese word segmentation system
CN110705296A (en) Chinese natural language processing tool system based on machine learning and deep learning
Fonseca et al. A two-step convolutional neural network approach for semantic role labeling
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN113065349A (en) Named entity recognition method based on conditional random field
CN111274804A (en) Case information extraction method based on named entity recognition
CN113961685A (en) Information extraction method and device
CN110222338B (en) Organization name entity identification method
CN112800184B (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN112612871A (en) Multi-event detection method based on sequence generation model
CN114969275A (en) Conversation method and system based on bank knowledge graph
CN113033183A (en) Network new word discovery method and system based on statistics and similarity
CN111581943A (en) Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure
CN113158659B (en) Case-related property calculation method based on judicial text
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
Dilawari et al. Neural attention model for abstractive text summarization using linguistic feature space
CN111813927A (en) Sentence similarity calculation method based on topic model and LSTM
CN113312903B (en) Method and system for constructing word stock of 5G mobile service product
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210702

RJ01 Rejection of invention patent application after publication