CN115906855A - Word information fused Chinese address named entity recognition method and device

Info

Publication number
CN115906855A
Authority
CN
China
Prior art keywords
vector
label
word
character
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211690568.1A
Other languages
Chinese (zh)
Inventor
汪陈笑
鲍迪恩
蒋炜
邓静
陈盼盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Bangrui Technology Co ltd
Zhejiang Bangsheng Technology Co ltd
Original Assignee
Hangzhou Bangrui Technology Co ltd
Zhejiang Bangsheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Bangrui Technology Co ltd, Zhejiang Bangsheng Technology Co ltd filed Critical Hangzhou Bangrui Technology Co ltd
Priority to CN202211690568.1A
Publication of CN115906855A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for Chinese address named entity recognition that fuse word information. The method comprises three main parts: construction of a vocabulary information generation network, of a label distribution learning network, and of a character label learning network. The invention acquires vocabulary information and integrates it into the text representation, representing vocabulary through n-gram fragments to overcome the character-level model's lack of sufficient context information. Because the fused word information is grounded in the original data, the model obtains the specific words faster and its accuracy is improved.

Description

Word information fused Chinese address named entity recognition method and device
Technical Field
The invention relates to the field of recognition of named entities of Chinese addresses, in particular to a method and a device for recognizing named entities of Chinese addresses with fused word information.
Background
With the rapid development of informatization, fields highly dependent on addresses, such as food delivery, postal services, and financial risk control, are also being digitalized. Chinese address named entity recognition refers to identifying the various address-related entities in text; since subsequent work builds on these entities, recognition quality strongly affects downstream task execution. The Chinese domain is especially difficult because, unlike English, there are no explicit separators such as spaces, so a single character carries little distinctive semantic information. A Chinese named entity recognition task must first segment a sentence into correct words, which is very difficult without human prior knowledge. For example, in the sentence "她说的确实在理", both "的确" and "确实" are valid words in isolation from the word segmentation point of view, but based on human prior knowledge the sentence should be segmented in context as "她/说的/确实/在理". Named entity recognition must additionally classify the recognized words according to their context and attributes.
At present, the methods for integrating additional information into character vectors for Chinese named entity recognition fall into three main types: first, searching a word list for words that end with the current character and feeding all matched words into the model as extra information alongside the characters; second, searching the word list for vocabulary vectors containing the character, merging them by some rule, fusing the resulting word vector into the character vector, and feeding that into the model; third, aggregating the label probabilities of the current character over all data and fusing the probability vector into the character vector before input.
The first method must search words for every character, so the number of word units added to each data item varies, which prevents batch training and slows training. The second method searches the word list, but the retrieved words do not necessarily match the vocabulary actually present in the text, so erroneous vocabulary noise is very likely introduced. The third method simply appends label information to the character information and lacks the most critical vocabulary information.
In summary, existing Chinese named entity recognition methods cannot simultaneously satisfy the following requirements:
1) Quickly acquiring vocabulary information, and fusing the vocabulary, character, and label probability information into the input character representation.
2) Providing the model with more information to improve its named entity recognition accuracy while preserving training and online prediction efficiency, so as to balance model accuracy against model speed.
Disclosure of Invention
In the Chinese address field, starting from the text data, the invention effectively introduces vocabulary information into the model and integrates character information with vocabulary information for named entity recognition; n-gram fragment representations selected by character lexeme information serve as the generation source of the vocabulary information, providing the model with sufficient information for the named entity recognition task.
The purpose of the invention is realized by the following technical scheme: in a first aspect, the invention provides a method for Chinese address named entity recognition with fused word information, which comprises the following steps:
(1) Obtain the n-gram fragment vector of the Chinese address, expressed as $X = (x_1, x_2, \ldots, x_n)$, and obtain the corresponding real vocabulary fragment $Y = (y_1, y_2, \ldots, y_m)$, where n is the number of characters in the n-gram fragment and m is the number of characters in the real vocabulary fragment;
(2) Construct a vocabulary information generation network adopting the structure of a two-tower model; the network specifically operates as follows:
(2.1) input the n-gram fragments and the real vocabulary fragments into the vocabulary information generation network, and acquire random character vector codes through an Embedding layer;
(2.2) pass the character vector codes through the ELMO layer and the Dense layer to learn the character vector representation;
(2.3) pass the character vector representation through an average pooling layer (mean pooling) to characterize the text fragment as a word vector;
(2.4) in a classification learner, splice the word vector of the n-gram fragment and the word vector of the real vocabulary fragment, then further splice the difference and the element-wise product of the two word vectors to obtain relational features between the words; after a fully connected layer, map the vector dimension into a two-dimensional space and judge the similarity between the two vectors;
(3) Construct a vocabulary information acquisition network, comprising a label distribution learning network and a character label learning network;
the label distribution learning network obtains the character vector representation of the n-gram fragment in the same way as the vocabulary information generation network, extracts the text feature codes, uses a fully connected layer as the decoder to obtain the probability distribution $P_{label}$ of the labels corresponding to the vocabulary as the state matrix of a conditional random field, and performs label inference through the conditional random field;
the character label learning network specifically operates as follows:
(3.1) select the character vector $E_C$ output by the label distribution learning network through the Embedding layer as part of the embedding layer output;
(3.2) according to the positions of the current character in the n-grams and the lexeme label type q, obtain from the vocabulary information generation network the word vector set before the last Dense layer, $E_\tau = \{E_q\}$, where $E_q$ is the word vector of lexeme label type q;
(3.3) according to the label probability distribution $P_{label}$ obtained by the label distribution learning network, learn the probability $P_{pos}$ that each character label belongs to each lexeme label;
(3.4) from the word vector set $E_\tau$ obtained in step (3.2) and the lexeme labeling probability $P_{pos}$ obtained in step (3.3), obtain the vocabulary information $E_W$ in the embedding layer via the tensor product $E_W = P_{pos} \otimes E_\tau$;
(3.5) combine the character vector $E_C$ and the embedding-layer vocabulary information $E_W$ and input them into the WP-LSTM model; then use the Dense layer and a conditional random field as the decoder and label inference layer, outputting $Z = (z_1, z_2, \ldots, z_n)$ as the predicted labels, finally learning the character relations in Chinese address named entity recognition and realizing Chinese address named entity recognition.
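As an illustration of step (1), the following is a minimal sketch of enumerating n-gram fragments from a Chinese address string; the function name, the fragment-length bound, and the sample address are assumptions for illustration, not details fixed by the invention:

```python
# Minimal sketch of step (1): enumerate candidate n-gram fragments of a
# Chinese address. n_max and the sample address are illustrative assumptions.
def ngram_fragments(text, n_max=4):
    """Return all contiguous character n-grams of length 2..n_max."""
    fragments = []
    for n in range(2, n_max + 1):
        for i in range(len(text) - n + 1):
            fragments.append(text[i:i + n])
    return fragments

# A fragment such as "浙江省" that coincides with a real vocabulary fragment
# would form a positive (X, Y) training pair for the two-tower network.
print(ngram_fragments("浙江省杭州市", n_max=3))
```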
Further, the ELMO is a network structure composed of two bidirectional LSTMs (BiLSTM); the final ELMO layer vector is expressed as:

$$ELMo_i^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{i,j}$$

where $h_{i,0}$ is the character vector of the i-th position, $\gamma^{task}$ is the coefficient related to the pre-training task, L is the number of layers, $s_j^{task}$ is the normalized weight coefficient of layer j, and $h_{i,j} = [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}]$ is the output vector of the j-th BiLSTM layer, with $\overrightarrow{h}_{i,j}$ containing the preceding information and $\overleftarrow{h}_{i,j}$ containing the following information.
Further, during training, the sum of the forward and backward language-model losses of the ELMO is the training target, i.e. the following loss is optimized:

$$loss = \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \theta_x, \overrightarrow{\theta}_{LSTM}, \theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \theta_x, \overleftarrow{\theta}_{LSTM}, \theta_s) \Big)$$

where $\theta_x$ denotes the character input vector, $\overrightarrow{\theta}_{LSTM}$ denotes the forward LSTM parameters, $\overleftarrow{\theta}_{LSTM}$ denotes the backward LSTM parameters, $\theta_s$ denotes the softmax layer, p denotes the probability, and $t_k$ denotes the text at position k.
Further, a text fragment is characterized as a word vector specifically as:

$$E_{vector} = mean(sum(H_{vector}))$$

where $H_{vector}$ denotes the output vector of the previous layer and vector is X or Y; that is, X and Y are encoded by this formula into word vector features $E_X$ and $E_Y$ of the same dimension.
Further, the splicing operation of step (2.4) is specifically:

$$E = [E_X, E_Y, E_X - E_Y, E_X \odot E_Y]$$

where E is the spliced vector.
Further, the conditional random field learns the transition probabilities among labels from the label probability distribution $P_{label}$ and the real labels $P_{gold}$, and infers the probabilities of all labels by the following equation:

$$p(y \mid s) = \frac{\exp\big(\sum_i (O_{i, y_i} + T_{y_{i-1}, y_i})\big)}{\sum_{\tilde{y} \in Y_s} \exp\big(\sum_i (O_{i, \tilde{y}_i} + T_{\tilde{y}_{i-1}, \tilde{y}_i})\big)}$$

where $Y_s$ denotes all possible label sequences of the text input, $O = W_o H + b_o$ denotes the current character label probabilities, with $W_o$ and $b_o$ a parameter matrix and a parameter vector respectively and H the output of the previous layer; $T_{y_{i-1}, y_i}$ denotes the transition probability for the label pair $(y_{i-1}, y_i)$; and $p(y \mid s)$ denotes the probability of the label sequence y given the input s. The goal of the conditional random field is to obtain the label sequence $y^*$ with the greatest score given the text input.
Further, in step (3.2), for a lexeme label appearing multiple times, one word vector is selected with equal probability from all the corresponding positions as the word vector of that lexeme label.
In a second aspect, the invention provides a word information fused Chinese address named entity recognition device, comprising a memory and one or more processors, wherein the memory stores executable code and the processors, when executing the executable code, implement the word information fused Chinese address named entity recognition method.
In a third aspect, the invention provides a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the method for Chinese address named entity recognition with fused word information.
The beneficial effects of the invention are as follows:
1. The vocabulary information carries little noise and is fast to retrieve: in the vocabulary information generation stage, each n-gram fragment containing specific characters is input into the vocabulary information generation network together with its vocabulary, so that the n-gram fragment and the corresponding real vocabulary obtain similar word vector representations and their association can be learned. Through this network, the model can obtain the vocabulary vector corresponding to each character in the training data in batch, and the obtained vocabulary vectors are based on the real vocabulary in the text, without relying on lookups in an additional large external word list.
2. The coding information is rich: in the vocabulary information acquisition stage, vocabulary vectors are merged based on the lexeme distribution probabilities and the corresponding n-gram fragments. During training, the model first predicts classification soft labels and character vectors through the label distribution learning network, and trains the classification accuracy against the real labels; the vocabulary information acquisition network then derives the lexeme distribution probabilities of the characters from the soft labels, obtains vocabulary vectors from the vocabulary information generation network based on those probabilities, and predicts the final classification after combining the character vectors. The network not only fuses the vocabulary vectors of the characters but also learns the weights of different vocabulary vectors through the distribution of the characters' lexeme labels, finally obtaining the vocabulary information and providing richer word segmentation and labeling knowledge for the Chinese named entity recognition task.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a method for identifying a named entity of a Chinese address with fused word information according to the present invention;
FIG. 2 is a schematic diagram of a lexical information generation network;
FIG. 3 is a schematic diagram of a vocabulary information acquisition network;
FIG. 4 is a schematic view of the WP-LSTM model structure;
fig. 5 is a structural diagram of a recognition apparatus for a named entity of chinese address with fused word information according to the present invention.
Detailed Description
The following description will explain embodiments of the present invention in further detail with reference to the accompanying drawings.
The invention provides a method for fusing character and vocabulary information in Chinese text to assist accurate named entity recognition. The invention is mainly applicable to technical fields such as financial risk control and user marketing.
The invention provides a Chinese address named entity recognition method with word information fusion, divided into two parts: the first part is a vocabulary information generation network for generating vocabulary information, and the second part is a vocabulary information acquisition network for acquiring vocabulary information; the relationship between the two parts is shown in FIG. 1.
in the vocabulary information generation network, the training target is to learn the similarity between n-gram fragments and real vocabulary vectors, and a structure of a double-tower model is adopted. The specific structure is shown in fig. 2:
in this network the input consists of two parts, the n-gram fragment vector is denoted x = (x) 1 ,x 2 ,...,x n ) Where n is the number of characters in the segment, the real vocabulary vector is denoted as Y = (Y) 1 ,y 2 ,...,y m ) Where m is the number of characters in the real vocabulary. The learning steps are as follows:
1) And acquiring the random character vector code through an Embedding layer.
2) Learn the character vector representation through an ELMO layer and a Dense layer. The ELMO is a network structure composed of two bidirectional LSTMs (BiLSTM). The final ELMO layer vector is expressed as:

$$ELMo_i^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{i,j}$$

where $h_{i,0}$ is the character vector of the i-th position, $\gamma^{task}$ is the coefficient related to the pre-training task, L is the number of layers, $s_j^{task}$ is the normalized weight coefficient of layer j, and $h_{i,j} = [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}]$ is the output vector of the j-th BiLSTM layer, with $\overrightarrow{h}_{i,j}$ containing the preceding information and $\overleftarrow{h}_{i,j}$ containing the following information.
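For concreteness, a minimal PyTorch sketch of this weighted layer combination is given below; it assumes the character embedding layer and the L BiLSTM layer outputs have been stacked into one tensor, and all names are illustrative:

```python
import torch
import torch.nn as nn

class ELMoCombine(nn.Module):
    """Task-specific weighted sum of layer outputs, as in the ELMO formula.

    A sketch under the assumption that layer_outputs stacks the character
    embedding layer (j = 0) and the L BiLSTM layers (j = 1..L).
    """
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers + 1))  # pre-softmax s_j
        self.gamma = nn.Parameter(torch.ones(1))            # task coefficient

    def forward(self, layer_outputs):  # shape (L + 1, batch, seq, dim)
        weights = torch.softmax(self.s, dim=0)              # normalized s_j
        combined = (weights.view(-1, 1, 1, 1) * layer_outputs).sum(dim=0)
        return self.gamma * combined
```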
During training, the sum of the forward and backward language-model losses of the ELMO is the training target, i.e. the following loss is optimized:

$$loss = \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \theta_x, \overrightarrow{\theta}_{LSTM}, \theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \theta_x, \overleftarrow{\theta}_{LSTM}, \theta_s) \Big)$$

where $\theta_x$ denotes the character input vector, $\overrightarrow{\theta}_{LSTM}$ denotes the forward LSTM parameters, $\overleftarrow{\theta}_{LSTM}$ denotes the backward LSTM parameters, $\theta_s$ denotes the softmax layer, p denotes the probability, and $t_k$ denotes the text at position k.
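Equivalently, this objective can be minimized as the sum of forward and backward cross-entropy losses; the sketch below assumes logits of shape (batch, seq, vocab) whose position k predicts the next (respectively previous) character, which is one illustrative reading of the formula above:

```python
import torch.nn.functional as F

def bilm_loss(fwd_logits, bwd_logits, char_ids):
    # Forward LM: position k predicts character k + 1 from the left context.
    fwd = F.cross_entropy(fwd_logits[:, :-1].flatten(0, 1),
                          char_ids[:, 1:].flatten())
    # Backward LM: position k predicts character k - 1 from the right context.
    bwd = F.cross_entropy(bwd_logits[:, 1:].flatten(0, 1),
                          char_ids[:, :-1].flatten())
    return fwd + bwd  # minimizing this maximizes the bidirectional likelihood
```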
3) After an average pooling layer (mean pooling), a text fragment can be characterized as a word vector, i.e.:

$$E_{vector} = mean(sum(H_{vector}))$$

where $H_{vector}$ denotes the output vector of the previous layer and vector is X or Y; that is, X and Y are encoded by this formula into word vector features $E_X$ and $E_Y$ of the same dimension.
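A one-function sketch of this pooling step, assuming the previous layer's output for one fragment is a (seq_len, dim) tensor and padding positions may be masked out:

```python
import torch

def pool_word_vector(h, mask=None):
    """Average-pool character vectors (seq_len, dim) into one word vector,
    read here as the E_vector = mean(sum(H_vector)) step; the optional
    boolean mask drops padding rows before averaging."""
    if mask is not None:
        h = h[mask]
    return h.mean(dim=0)
```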
4) In the classification learner, on the basis of splicing the two word vectors of the n-gram fragment and the real vocabulary fragment, splice the difference and the element-wise product of the two word vectors to obtain relational features between the words, i.e.:

$$E = [E_X, E_Y, E_X - E_Y, E_X \odot E_Y]$$

where E is the spliced vector.
The final vector constructed in this way contains both direct and indirect features between the words and is used to judge the similarity between the two vectors.
Finally, after a fully connected layer maps the vector dimension into a two-dimensional space, the parameters are updated based on the two-class cross-entropy loss. The trained classification learner can then accurately judge the degree of similarity of two vocabulary fragments, and the word vector results taken from the encoder can likewise be considered similar. Based on the vocabulary information generation network, the word vector of an n-gram can behave similarly to the real vocabulary vector in subsequent tasks, without searching a word list for the real vocabulary.
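The classification learner can be sketched as below, assuming encoded word vectors $E_X$ and $E_Y$ of equal dimension; the module name is illustrative and the two-class output is trained with cross-entropy:

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Two-tower classification learner: splice [E_X, E_Y, E_X - E_Y,
    E_X ⊙ E_Y] and map it into a two-dimensional (similar / dissimilar)
    space through a fully connected layer."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(4 * dim, 2)

    def forward(self, e_x, e_y):
        feats = torch.cat([e_x, e_y, e_x - e_y, e_x * e_y], dim=-1)
        return self.fc(feats)  # logits for the two-class cross-entropy loss
```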
The vocabulary information acquisition network mainly learns how to merge the generated vocabulary information; it comprises (a) a label distribution learning network and (b) a character label learning network, with the structure shown in FIG. 3.
In part (a) of the vocabulary information acquisition network, the label distribution learning network, the learning steps are as follows:
1) After acquiring the character vector codes, pass them through the ELMO layer and the Dense layer, using the BiLSTM as the text feature encoder to extract the text feature codes.
2) Using a fully connected layer as the decoder, obtain the label probability distribution $P_{label}$, which serves as the state matrix of a conditional random field; perform label inference through the conditional random field and learn the label transition probabilities in the transition matrix. The conditional random field learns the transition probabilities among labels from the label probability distribution $P_{label}$ and the real labels $P_{gold}$, and infers the probabilities of all labels by the following equation:

$$p(y \mid s) = \frac{\exp\big(\sum_i (O_{i, y_i} + T_{y_{i-1}, y_i})\big)}{\sum_{\tilde{y} \in Y_C} \exp\big(\sum_i (O_{i, \tilde{y}_i} + T_{\tilde{y}_{i-1}, \tilde{y}_i})\big)}$$

where $Y_C$ denotes all possible label sequences of the text input C, $O = W_o H + b_o$ denotes the current character label probabilities, with $W_o$ and $b_o$ a parameter matrix and a parameter vector respectively and H the output of the previous layer; $T_{y_{i-1}, y_i}$ denotes the transition probability for the label pair $(y_{i-1}, y_i)$; and $p(y \mid s)$ denotes the probability of the label sequence y given the input s. The goal of the conditional random field is to obtain the label sequence $y^*$ with the greatest score given the text input C.
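The numerator of this inference formula scores one label sequence from the emission matrix O and the transition matrix T; a minimal sketch of that score follows, with the tensor shapes as assumptions:

```python
import torch

def crf_sequence_score(emissions, transitions, labels):
    """Unnormalized score of one label sequence: the sum of O[i, y_i] and
    T[y_{i-1}, y_i]; p(y|s) is exp(score) normalized over all sequences.

    emissions: (seq_len, num_labels), the matrix O = W_o H + b_o
    transitions: (num_labels, num_labels) transition matrix T
    labels: (seq_len,) integer label ids
    """
    score = emissions[0, labels[0]]
    for i in range(1, emissions.size(0)):
        score = score + transitions[labels[i - 1], labels[i]] \
                      + emissions[i, labels[i]]
    return score
```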
In part (b) of the vocabulary information acquisition network, the character label learning network, the learning steps are as follows:
1) Select the character vector $E_C$ output by the label distribution learning network through the Embedding layer as part of the embedding layer output.
2) According to the positions of the current character in the n-grams and the lexeme label type q, obtain from the vocabulary information generation network the word vector set before the last Dense layer, $E_\tau = \{E_q\}$, where $E_q$ is the word vector of lexeme label type q and q is the category of the lexeme label; for a lexeme label appearing multiple times, one word vector is selected with equal probability from all the corresponding positions as the word vector of that lexeme label.
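A sketch of assembling $E_\tau$ is given below; the BMES lexeme label set and the dictionary layout are assumptions used only for illustration:

```python
import random

def build_word_vector_set(vectors_by_label):
    """Build E_tau = {E_q}: one word vector per lexeme label type q.

    vectors_by_label maps a lexeme label (assumed here to be one of
    'B', 'M', 'E', 'S') to all candidate word vectors taken before the
    last Dense layer; when a label occurs at several positions, one
    candidate is chosen uniformly at random, as in step 2).
    """
    return {q: random.choice(candidates)
            for q, candidates in vectors_by_label.items()}
```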
3) According to the label probability distribution $P_{label}$ obtained by the label distribution learning network, learn the probability $P_{pos}$ that each character label belongs to each lexeme label.
4) From the word vector set $E_\tau$ obtained in step 2) and the lexeme labeling probability $P_{pos}$ obtained in step 3), obtain the vocabulary information $E_W$ in the embedding layer via the tensor product $E_W = P_{pos} \otimes E_\tau$.
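One plausible reading of this tensor product, with $E_\tau$ stacked into a matrix and $P_{pos}$ a probability vector over lexeme labels, is the probability-weighted combination sketched below; the exact fusion form is an assumption:

```python
import torch

def vocabulary_information(e_tau, p_pos):
    """Fuse the word vector set with the lexeme label probabilities.

    e_tau: (num_labels, dim) stacked word vectors E_tau
    p_pos: (num_labels,) lexeme label probabilities P_pos
    Returns E_W as the P_pos-weighted sum of the word vectors.
    """
    return torch.einsum('q,qd->d', p_pos, e_tau)
```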
5) Combine the character vector $E_C$ and the embedding-layer vocabulary information $E_W$, i.e. $E = [E_C, E_W]$, as the input of the WP-LSTM. The WP-LSTM model structure is shown in FIG. 4, where <pad> is the character padding symbol, filled with a fixed randomly initialized vector; when the acquired position of an n-gram fragment exceeds the text length, <pad> is used. Characters pointing to the c part in the figure, such as "浙" (Zhe), "江" (Jiang) and "省" (province), are encoded as $E_C$; fragments pointing to the w part are encoded as $E_W$; the two are finally fused together.
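The input fusion of the WP-LSTM can be sketched as follows; the bidirectionality and hidden size are assumptions, not details fixed by FIG. 4:

```python
import torch
import torch.nn as nn

class WPLSTM(nn.Module):
    """Sketch of the WP-LSTM input stage: concatenate character vectors
    E_C with vocabulary information E_W and encode the sequence with an
    LSTM."""
    def __init__(self, char_dim, word_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(char_dim + word_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, e_c, e_w):           # both (batch, seq, dim)
        e = torch.cat([e_c, e_w], dim=-1)  # E = [E_C, E_W]
        out, _ = self.lstm(e)
        return out  # decoded downstream by the Dense layer and CRF
```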
6) Using the Dense layer and a conditional random field as the decoder and label inference layer, output $Z = (z_1, z_2, \ldots, z_n)$ as the predicted labels, finally learning the character relations in Chinese address named entity recognition and realizing Chinese address named entity recognition.
Corresponding to the embodiment of the method for identifying the named entity of the Chinese address fused with the word information, the invention also provides an embodiment of the device for identifying the named entity of the Chinese address fused with the word information.
Referring to fig. 5, the apparatus for identifying a named entity with a chinese address fused with word information according to an embodiment of the present invention includes a memory and one or more processors, where the memory stores executable codes, and the processors execute the executable codes to implement the method for identifying a named entity with a chinese address fused with word information according to the above embodiment.
The embodiment of the word information fused Chinese address named entity recognition device can be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, as a logical device it is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, FIG. 5 shows a hardware structure diagram of the equipment with data processing capability on which the word information fused Chinese address named entity recognition device is located; in addition to the processor, memory, network interface, and non-volatile memory shown in FIG. 5, the equipment may generally include other hardware according to its actual function, which is not repeated here.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the method for identifying a named entity of a chinese address based on word information fusion in the above embodiment is implemented.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above-described embodiments are intended to illustrate rather than to limit the invention, and any modifications and variations of the present invention are within the spirit of the invention and the scope of the appended claims.

Claims (9)

1. A method for Chinese address named entity recognition with fused word information, characterized by comprising the following steps:
(1) Obtain the n-gram fragment vector of the Chinese address, expressed as $X = (x_1, x_2, \ldots, x_n)$, and obtain the corresponding real vocabulary fragment $Y = (y_1, y_2, \ldots, y_m)$, where n is the number of characters in the n-gram fragment and m is the number of characters in the real vocabulary fragment;
(2) Construct a vocabulary information generation network adopting the structure of a two-tower model; the network specifically operates as follows:
(2.1) input the n-gram fragments and the real vocabulary fragments into the vocabulary information generation network, and acquire random character vector codes through an Embedding layer;
(2.2) pass the character vector codes through the ELMO layer and the Dense layer to learn the character vector representation;
(2.3) pass the character vector representation through an average pooling layer (mean pooling) to characterize the text fragment as a word vector;
(2.4) in a classification learner, splice the word vector of the n-gram fragment and the word vector of the real vocabulary fragment, then further splice the difference and the element-wise product of the two word vectors to obtain relational features between the words; after a fully connected layer, map the vector dimension into a two-dimensional space and judge the similarity between the two vectors;
(3) Construct a vocabulary information acquisition network, comprising a label distribution learning network and a character label learning network;
the label distribution learning network obtains the character vector representation of the n-gram fragment in the same way as the vocabulary information generation network, extracts the text feature codes, uses a fully connected layer as the decoder to obtain the probability distribution $P_{label}$ of the labels corresponding to the vocabulary as the state matrix of a conditional random field, and performs label inference through the conditional random field;
the character label learning network specifically operates as follows:
(3.1) select the character vector $E_C$ output by the label distribution learning network through the Embedding layer as part of the embedding layer output;
(3.2) according to the positions of the current character in the n-grams and the lexeme label type q, obtain from the vocabulary information generation network the word vector set before the last Dense layer, $E_\tau = \{E_q\}$, where $E_q$ is the word vector of lexeme label type q;
(3.3) according to the label probability distribution $P_{label}$ obtained by the label distribution learning network, learn the probability $P_{pos}$ that each character label belongs to each lexeme label;
(3.4) from the word vector set $E_\tau$ obtained in step (3.2) and the lexeme labeling probability $P_{pos}$ obtained in step (3.3), obtain the vocabulary information $E_W$ in the embedding layer via the tensor product $E_W = P_{pos} \otimes E_\tau$;
(3.5) combine the character vector $E_C$ and the embedding-layer vocabulary information $E_W$ and input them into the WP-LSTM model; then use the Dense layer and a conditional random field as the decoder and label inference layer, outputting $Z = (z_1, z_2, \ldots, z_n)$ as the predicted labels, finally learning the character relations in Chinese address named entity recognition and realizing Chinese address named entity recognition.
2. The method of claim 1, wherein the ELMO is a network structure composed of two bidirectional LSTMs (BiLSTM); the final ELMO layer vector is expressed as:

$$ELMo_i^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{i,j}$$

where $h_{i,0}$ is the character vector of the i-th position, $\gamma^{task}$ is the coefficient related to the pre-training task, L is the number of layers, $s_j^{task}$ is the normalized weight coefficient of layer j, and $h_{i,j} = [\overrightarrow{h}_{i,j}; \overleftarrow{h}_{i,j}]$ is the output vector of the j-th BiLSTM layer, with $\overrightarrow{h}_{i,j}$ containing the preceding information and $\overleftarrow{h}_{i,j}$ containing the following information.
3. The method of claim 2, wherein during training the sum of the forward and backward language-model losses of the ELMO is the training target, that is, the following loss is optimized:

$$loss = \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \theta_x, \overrightarrow{\theta}_{LSTM}, \theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \theta_x, \overleftarrow{\theta}_{LSTM}, \theta_s) \Big)$$

where $\theta_x$ denotes the character input vector, $\overrightarrow{\theta}_{LSTM}$ denotes the forward LSTM parameters, $\overleftarrow{\theta}_{LSTM}$ denotes the backward LSTM parameters, $\theta_s$ denotes the softmax layer, p denotes the probability, and $t_k$ denotes the text at position k.
4. The method for Chinese address named entity recognition with fused word information according to claim 1, wherein the text fragment is characterized as a word vector specifically as:

$$E_{vector} = mean(sum(H_{vector}))$$

where $H_{vector}$ denotes the output vector of the previous layer and vector is X or Y; that is, X and Y are encoded by this formula into word vector features $E_X$ and $E_Y$ of the same dimension.
5. The method for Chinese address named entity recognition with fused word information according to claim 1, wherein the splicing operation in step (2.4) is specifically:

$$E = [E_X, E_Y, E_X - E_Y, E_X \odot E_Y]$$

where E is the spliced vector.
6. The method as claimed in claim 1, wherein the conditional random field learns the transition probabilities among labels from the label probability distribution $P_{label}$ and the real labels $P_{gold}$, and infers the probabilities of all labels by the following equation:

$$p(y \mid s) = \frac{\exp\big(\sum_i (O_{i, y_i} + T_{y_{i-1}, y_i})\big)}{\sum_{\tilde{y} \in Y_s} \exp\big(\sum_i (O_{i, \tilde{y}_i} + T_{\tilde{y}_{i-1}, \tilde{y}_i})\big)}$$

where $Y_s$ denotes all possible label sequences of the text input, $O = W_o H + b_o$ denotes the current character label probabilities, with $W_o$ and $b_o$ a parameter matrix and a parameter vector respectively and H the output of the previous layer; $T_{y_{i-1}, y_i}$ denotes the transition probability for the label pair $(y_{i-1}, y_i)$; and $p(y \mid s)$ denotes the probability of the label sequence y given the input s. The goal of the conditional random field is to obtain the label sequence $y^*$ with the greatest score given the text input.
7. The method as claimed in claim 1, wherein in step (3.2), for a lexeme label appearing multiple times, one word vector is selected with equal probability from all the corresponding positions as the word vector of that lexeme label.
8. A word information fused Chinese address named entity recognition device, comprising a memory and one or more processors, wherein the memory stores executable code, and the processors, when executing the executable code, implement the method for Chinese address named entity recognition with fused word information according to any one of claims 1 to 7.
9. A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, implements the method for Chinese address named entity recognition with fused word information according to any one of claims 1 to 7.
CN202211690568.1A 2022-12-27 2022-12-27 Word information fused Chinese address named entity recognition method and device Pending CN115906855A (en)

Priority Applications (1)

Application Number: CN202211690568.1A
Priority/Filing Date: 2022-12-27
Title: Word information fused Chinese address named entity recognition method and device (CN115906855A)


Publications (1)

Publication Number: CN115906855A
Publication Date: 2023-04-04

Family

ID=86496938

Family Applications (1)

Application Number: CN202211690568.1A (status: Pending)
Title: Word information fused Chinese address named entity recognition method and device
Priority/Filing Date: 2022-12-27

Country Status (1)

Country Link
CN (1) CN115906855A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117852974A (en) * 2024-03-04 2024-04-09 禾辰纵横信息技术有限公司 Online evaluation score assessment method based on artificial intelligence


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination