CN114298047A - Chinese named entity recognition method and system based on stroke convolution and word vectors

Info

Publication number: CN114298047A
Application number: CN202111641955.1A
Authority: CN
Inventors: 何东之; 张震; 王鹏飞; 孙亚茹; 郭隆杭
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-04-08

Abstract

The invention provides a Chinese named entity recognition method and system based on stroke convolution and word vectors, relating to the technical field of named entity recognition. The method comprises the following steps: acquiring the stroke sequence corresponding to each Chinese character in a text and the character feature vector of each Chinese character; inputting the stroke sequence into a stroke convolution neural network to obtain a stroke feature vector; setting a sliding window according to the maximum length of an entity in the text, and acquiring a word vector of each word in the sliding window through a self-attention mechanism; concatenating the stroke feature vector, the word vector and the character feature vector of each Chinese character in the text, and inputting the concatenated vector into a BiLSTM network to obtain the score of each Chinese character for each entity label; and determining an optimal entity label for each Chinese character in the text using a CRF model. The method takes into account the influence of a Chinese character's stroke sequence on the character, and combines the stroke feature vector, the word vector and the character feature vector of each Chinese character before performing named entity recognition, thereby improving the effect of named entity recognition.

Description

Chinese named entity recognition method and system based on stroke convolution and word vectors

Technical Field

The invention relates to the technical field of named entity recognition, in particular to a Chinese named entity recognition method and system based on stroke convolution and word vectors.

Background

With the rapid development of internet technology, unstructured data is growing continuously, and the world is in a massive unstructured data era. How to efficiently manage data and extract effective information from unstructured data becomes a problem which needs to be solved urgently.

The purpose of Named Entity Recognition (NER) is to identify predefined named entities, such as person names, place names and organization names, in unstructured text; it is a basic core task for information retrieval and information extraction. Chinese NER is the branch of NER for the Chinese language, and it still faces a number of problems arising from the characteristics of Chinese characters. The main difficulties of Chinese NER are the following: 1) Chinese characters are often polysemous, and their meanings may differ greatly between contexts; 2) unlike English text, Chinese text has no explicit entity boundary markers such as spaces; 3) research on Chinese NER started late, so labeled data sets are few and tend to cover a single domain.

Existing Chinese named entity recognition usually follows one of two approaches: word-based sequence labeling and character-based sequence labeling. A word-based sequence labeling method first segments the text with a word segmentation tool and then performs entity recognition. In this method the word boundaries are also the entity boundaries, so if an error occurs in the word segmentation stage, the subsequent NER model cannot recognize the entity correctly. A character-based sequence labeling method generally suffers from insufficient semantics, so research has focused mainly on how to make better use of word information. Some approaches build on the character-based sequence labeling method by introducing external vocabulary information and integrating it into the vector representation at the input layer; this requires changing the model, the introduced external word vectors lower the training efficiency of the model, and the accuracy of named entity recognition is ultimately reduced. Other approaches build on the character-based sequence labeling method with only an ELMo model based on stroke sequences, and fall short in the effectiveness and accuracy of named entity recognition.

Disclosure of Invention

In order to solve the above problems, the invention provides a Chinese named entity recognition method and system based on stroke convolution and word vectors.

In order to achieve the above object, the present invention provides a Chinese named entity recognition method based on stroke convolution and word vectors, comprising:

acquiring a stroke sequence corresponding to each Chinese character in the text and a character feature vector of each Chinese character;

inputting the stroke sequence into a stroke convolution neural network to obtain a stroke feature vector;

setting a sliding window according to the maximum length of the entity in the text, and acquiring a word vector of each word in the sliding window through a self-attention mechanism;

concatenating the stroke feature vector, the word vector and the character feature vector of each Chinese character in the text, and inputting the concatenated vector into a BiLSTM network to obtain the score of each Chinese character for each entity label;

and determining an optimal entity label for each Chinese character in the text by adopting a CRF model.

As a further improvement of the invention, a mapping table from Chinese characters to stroke sequences is constructed, and the stroke sequences corresponding to the Chinese characters are obtained through the mapping table.

As a further improvement of the present invention, the stroke convolution neural network convolves the stroke sequence by convolution kernels of different window sizes to obtain the stroke feature vector.

As a further improvement of the invention, the stroke convolution neural network obtains a stroke feature map through convolution with convolution kernels of different window sizes, and performs max pooling and full connection on the feature map to obtain the stroke feature vector, wherein the formula is as follows:

wherein:

w represents the weights in convolutional neural network training;

M_{t,t+k-1} represents the input features;

b represents the bias in convolutional neural network training.

As a further improvement of the invention, a classification loss function L(cls) is added in the training process of the stroke convolution neural network:

L(cls) = -log P(z|X) = -log softmax(w * semb)

wherein:

X represents the input stroke sequence;

z represents the Chinese character label corresponding to the stroke sequence;

w represents a parameter of the network;

semb represents the stroke feature vector.

As a further improvement of the present invention, acquiring the word vector of each word in the sliding window through the self-attention mechanism comprises:

calculating the similarity between every two words in the sliding window through the self-attention mechanism;

and acquiring the word vector of each word in the sliding window according to the similarity by adopting a softmax function.

As a further improvement of the present invention,

for each Chinese character in the sliding window, generating a corresponding Query vector, a corresponding Key vector and a corresponding Value vector according to the character feature vector;

and calculating the dot product of the Query vector and the Key vector to obtain the score of each word, and multiplying the score by the Value vector of each word to obtain the word vector of the word in the sliding window.

As a further improvement of the present invention, determining an optimal entity label for each Chinese character in the text by adopting the CRF model comprises:

defining the character sequence of the input text as x = (x_1, x_2, ..., x_n) and the predicted tag sequence as y = (y_1, y_2, ..., y_n);

defining the predicted score with which the BiLSTM network model labels the i-th character as label y_i;

defining a label transition matrix, an entry of which represents the score of the transition from label y_i to label y_{i+1};

calculating a final score for each of the predicted tag sequences;

and taking the predicted tag sequence with the highest score as a final tag sequence, and acquiring the Chinese named entity according to the tag.

As a further improvement of the present invention,

the conditional probability of each of said predicted tag sequences is calculated;

and if the conditional probability of the predicted tag sequence with the highest score is also the highest, the predicted tag sequence with the highest score is taken as the final tag sequence.

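The invention also provides a Chinese named entity recognition system based on stroke convolution and word vectors, which comprises a pre-preparation module, a stroke feature acquisition module, a word vector acquisition module, a label prediction module and an optimal label acquisition module;

the pre-preparation module is configured to: acquire a stroke sequence corresponding to each Chinese character in the text and a character feature vector of each Chinese character;

the stroke feature acquisition module is configured to: input the stroke sequence into a stroke convolution neural network to obtain a stroke feature vector;

the word vector acquisition module is configured to: set a sliding window according to the maximum length of an entity in the text, and acquire a word vector of each word in the sliding window through a self-attention mechanism;

the label prediction module is configured to: concatenate the stroke feature vector, the word vector and the character feature vector of each Chinese character in the text, and input the concatenated vector into a BiLSTM network to obtain the score of each Chinese character for each entity label;

the optimal label acquisition module is configured to: determine an optimal entity label for each Chinese character in the text by adopting a CRF model.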

Compared with the prior art, the invention has the beneficial effects that:

Building on the character-based sequence labeling method for named entity recognition, the invention takes into account the influence of a Chinese character's stroke sequence, and combines the stroke feature vector, the word vector and the character feature vector of each Chinese character before performing named entity recognition, thereby improving the effect of named entity recognition.

In obtaining the stroke feature vector, the invention extracts the stroke features of a Chinese character by convolution, which is well suited to the range of stroke counts of Chinese characters; meanwhile, convolution kernels with multiple window sizes are used to convolve the stroke sequence, so as to obtain the most effective stroke feature vector.

In obtaining the word vector of a Chinese character, the word vector information within the sliding window is acquired through a self-attention mechanism, which compensates for the semantic insufficiency and avoids the drop in prediction accuracy that the prior art suffers when external vocabulary is introduced.

In the training process of the stroke convolution neural network, a classification loss function is added, which improves the training accuracy of the stroke convolution neural network.

Drawings

FIG. 1 is a flow chart of a Chinese named entity recognition method based on stroke convolution and word vectors according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a Chinese named entity recognition system based on stroke convolution and word vectors according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a stroke convolution neural network according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a model of a self-attention mechanism according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a bidirectional timing model and a CRF model according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention is described in further detail below with reference to the attached drawing figures:

As shown in FIG. 1, the Chinese named entity recognition method based on stroke convolution and word vectors provided by the present invention includes:

S1, acquiring a stroke sequence corresponding to each Chinese character in the text and a character feature vector of each Chinese character;

wherein:

during training, the stroke sequence of each Chinese character in the training set is acquired from a Chinese dictionary website, and a mapping table from Chinese characters to stroke sequences is constructed; the stroke sequence corresponding to each Chinese character in the text is then obtained through the mapping table.

For example, as shown in FIG. 3, the stroke sequence obtained from the mapping table consists of stroke types such as "left-falling", "turning" and "horizontal".
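The mapping-table lookup of S1 can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the names `STROKE_TABLE` and `stroke_sequences`, the romanized stroke-type names and the three sample characters are assumptions chosen for the example, and a real table would be built from a dictionary source covering the training set.

```python
# Hypothetical mapping table from Chinese characters to stroke sequences.
# Stroke names (assumed): heng = horizontal, shu = vertical,
# pie = left-falling, na = right-falling.
STROKE_TABLE = {
    "十": ["heng", "shu"],
    "人": ["pie", "na"],
    "大": ["heng", "pie", "na"],
}


def stroke_sequences(text, table=STROKE_TABLE):
    """Return one stroke sequence per character; unknown characters give []."""
    return [table.get(ch, []) for ch in text]


sequences = stroke_sequences("大人")
```

Returning an empty list for characters missing from the table is one possible design choice; a padding stroke or an error would work equally well.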

S2, inputting the stroke sequence into a stroke convolution neural network to obtain a stroke feature vector;

wherein:

as shown in FIG. 3, the stroke convolution neural network convolves the stroke sequence with convolution kernels of different window sizes to obtain a stroke feature map, and performs max pooling and full connection on the feature map to obtain the stroke feature vector, according to the following formula:

wherein:

w represents the weights in convolutional neural network training;

M_{t,t+k-1} represents the input features;

b represents the bias in convolutional neural network training.
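Multi-window convolution followed by max pooling can be sketched in plain Python as below. The sketch makes several assumptions that are not stated in the text: each stroke is already embedded as a small vector, the activation is tanh, short sequences are zero-padded, and the final fully connected layer is omitted. The function names `conv_max_pool` and `stroke_feature_vector` are hypothetical.

```python
import math


def conv_max_pool(seq, kernel, bias):
    """Slide one kernel (k x d weights) over seq (T x d) and max-pool the result."""
    k, dim = len(kernel), len(seq[0])
    # Zero-pad so that at least one window of size k exists (assumption).
    padded = seq + [[0.0] * dim] * max(0, k - len(seq))
    responses = []
    for t in range(len(padded) - k + 1):
        total = bias
        for w_row, m_row in zip(kernel, padded[t:t + k]):
            total += sum(w * m for w, m in zip(w_row, m_row))
        responses.append(math.tanh(total))  # tanh is an assumed activation
    return max(responses)


def stroke_feature_vector(seq, kernels):
    """Apply kernels of different window sizes; one pooled feature per kernel."""
    return [conv_max_pool(seq, kernel, bias) for kernel, bias in kernels]


# Three strokes embedded in 2 dimensions; one kernel of window 1, one of window 2.
strokes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [([[1.0, 1.0]], 0.0), ([[1.0, 0.0], [0.0, 1.0]], 0.5)]
features = stroke_feature_vector(strokes, kernels)
```

Each kernel contributes one pooled value, so the length of the resulting vector equals the number of kernels regardless of how many strokes the character has.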

In the invention, a classification loss function is added in the training process of the stroke convolution neural network to improve the training accuracy, and the classification loss function is expressed as follows:

L(cls) = -log P(z|X) = -log softmax(w * semb)

wherein:

X represents the input stroke sequence;

z represents the Chinese character label corresponding to the stroke sequence;

w represents a parameter of the network;

semb represents the stroke feature vector.
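The classification loss above is a negative log-softmax over character labels. A minimal sketch, assuming w is a matrix with one row per candidate character label and z is the index of the correct label (the function name `classification_loss` is hypothetical):

```python
import math


def classification_loss(w, semb, z):
    """Return -log softmax(w * semb)[z], computed with the log-sum-exp trick."""
    logits = [sum(wi * si for wi, si in zip(row, semb)) for row in w]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[z]


# With all-zero weights both labels are equally likely, so the loss is log 2.
loss = classification_loss([[0.0, 0.0], [0.0, 0.0]], [1.0, 2.0], 0)
```

Subtracting the largest logit before exponentiating avoids overflow and does not change the result.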

S3, setting a sliding window according to the maximum length of an entity in the text, and acquiring a word vector of each word in the sliding window through a self-attention mechanism;

wherein:

the character-based sequence labeling method generally suffers from insufficient semantics; to make better use of word vector information, a self-attention (SA) mechanism is used to acquire the word vector information within a sliding window.

During training, the maximum length of an entity in the training set is acquired and used as the size of the sliding window, and the similarity between every two characters in the sliding window is calculated through the self-attention mechanism; a softmax function is then applied to obtain the word vector of each word in the sliding window according to the similarity.
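Choosing the window size and cutting the text into windows can be sketched as follows. The function names and the stride of one character are assumptions made for illustration; the text does not specify how the window advances.

```python
def max_entity_length(entities):
    """Length, in characters, of the longest entity in the training set."""
    return max(len(entity) for entity in entities)


def sliding_windows(text, size):
    """All windows of `size` consecutive characters, moving one character at a time."""
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(len(text) - size + 1)]


window_size = max_entity_length(["北京市", "张三"])
windows = sliding_windows("我在北京市", window_size)
```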

Specifically, for each Chinese character in the sliding window, a corresponding Query vector, Key vector and Value vector are generated from the character feature vector.

For example:

as shown in FIG. 4, if the text content is "Beijing City", e¹, e² and e³ are the character feature vectors of the respective characters. A Query vector, a Key vector and a Value vector are generated for each character by multiplying its character feature vector e¹, e² or e³ by three weight matrices created in the training process. A score for each character is calculated as the dot product between the Query vector and the Key vector, and the score is then multiplied by the corresponding Value vector to obtain the word vector of each character in the sliding window, according to the following formula:
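The Query/Key/Value computation described above can be sketched in plain Python. This is an illustrative sketch only: the helper names are hypothetical, the scores are unscaled dot products normalized with softmax (many implementations also divide by the square root of the vector dimension, which is not stated here), and the weight matrices in the usage example are identity matrices rather than trained parameters.

```python
import math


def matvec(vec, matrix):
    """Multiply a row vector of length d by a d x m matrix."""
    return [sum(v * matrix[i][j] for i, v in enumerate(vec))
            for j in range(len(matrix[0]))]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings, w_q, w_k, w_v):
    """Return one attended vector per character in the window."""
    queries = [matvec(e, w_q) for e in embeddings]
    keys = [matvec(e, w_k) for e in embeddings]
    values = [matvec(e, w_v) for e in embeddings]
    outputs = []
    for q in queries:
        # Similarity of this character to every character in the window.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        # Weighted sum of the Value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention([[1.0, 0.0], [1.0, 0.0]], identity, identity, identity)
```

With two identical character vectors and identity weights, the attention weights are uniform and each output equals the input vector.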

S4, concatenating the stroke feature vector, the word vector and the character feature vector of each Chinese character in the text, and inputting the concatenated vector into a BiLSTM network to obtain the score of each Chinese character for each entity label;

wherein:

the concatenation is a direct concatenation along the vector dimension: if the stroke feature vector of a Chinese character is of size 1 × 20, the word vector 1 × 30, and the character feature vector 1 × 60, the concatenated feature vector is of size 1 × 110.
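The dimension arithmetic of this example can be checked with a few lines of Python. The zero-filled vectors and the helper name `concatenate` are placeholders for illustration only.

```python
def concatenate(*vectors):
    """Join vectors end to end along the feature dimension."""
    joined = []
    for vector in vectors:
        joined.extend(vector)
    return joined


stroke_vec = [0.0] * 20  # stroke feature vector, 1 x 20
word_vec = [0.0] * 30    # word vector, 1 x 30
char_vec = [0.0] * 60    # character feature vector, 1 x 60
fused = concatenate(stroke_vec, word_vec, char_vec)
```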

BiLSTM (Bi-directional Long Short-Term Memory) is a bidirectional long short-term memory network. LSTM (Long Short-Term Memory) is an improved sequential network that alleviates the vanishing gradient problem and makes effective use of long-distance information, but it can only capture sequence information in one direction. Because context on both sides has an important influence on the NER task, the present application adopts a BiLSTM network to capture context information;

as shown in FIG. 5, taking "Beijing Smith" as an example, the score of each character for each of multiple labels is obtained through forward LSTM and backward LSTM computation, where the labels are preset and may include: address, time, person name, book name, etc.

S5, determining an optimal entity label for each Chinese character in the text by adopting a CRF model.

wherein:

there are strong constraints between adjacent tags in the NER task; for example, a B-LOC tag (the start tag of an address) can only be followed by an I-LOC tag or an O tag, and not by other tags such as a B-PER tag (the start tag of a person name). Therefore, after sequence modeling by the BiLSTM network, a Conditional Random Field (CRF) is used to predict the tags of the entire sequence, specifically:

defining the character sequence of the input text as x = (x_1, x_2, ..., x_n) and the predicted tag sequence as y = (y_1, y_2, ..., y_n); Y(x) represents the set of all possible tag sequences for the text;

defining the predicted score with which the BiLSTM network model labels the i-th character as label y_i;

defining a label transition matrix, an entry of which represents the score of the transition from label y_i to label y_{i+1};

calculating a final score for each predicted tag sequence;
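A score of this kind is commonly written as follows in BiLSTM-CRF models. The symbols P (the matrix of scores output by the BiLSTM network) and A (the label transition matrix) are assumed names used only for this illustration, and the summation limits are one common convention rather than a statement of the invention's exact formula:

```latex
s(x, y) = \sum_{i=1}^{n} P_{i,\,y_i} + \sum_{i=1}^{n-1} A_{y_i,\,y_{i+1}}
```

Some formulations additionally include transitions from a start tag and to an end tag, which would extend the second sum by two terms.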

Further,

a loss function may be set, for example by:

calculating the conditional probability of each predicted tag sequence;

and if the conditional probability of the predicted tag sequence with the highest score is also the maximum, taking the predicted tag sequence with the highest score as the final tag sequence.
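The conditional probability of a tag sequence can be illustrated as a softmax over the scores of all sequences in Y(x). The sketch below enumerates Y(x) by brute force, which is only practical for tiny inputs; it assumes the sequence score is the sum of per-character scores and transition scores, with labels represented as integer indices. The names `emissions`, `transitions`, `sequence_score` and `conditional_probability` are hypothetical.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, tags):
    """Sum of per-character label scores plus label-to-label transition scores."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def conditional_probability(emissions, transitions, tags):
    """p(y | x) = exp(s(x, y)) / sum over y' in Y(x) of exp(s(x, y'))."""
    num_labels = len(emissions[0])
    all_scores = [sequence_score(emissions, transitions, y)
                  for y in product(range(num_labels), repeat=len(emissions))]
    top = max(all_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return math.exp(sequence_score(emissions, transitions, tags) - log_z)


# Two characters, two labels, all scores zero: each of the 4 sequences has p = 0.25.
zeros = [[0.0, 0.0], [0.0, 0.0]]
p = conditional_probability(zeros, zeros, (0, 1))
```

Because the normalizer is shared by all sequences, the sequence with the highest score is also the one with the highest conditional probability. In training, the negative logarithm of this probability for the reference tag sequence is a typical loss.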

Finally, the optimal label sequence is found through a Viterbi algorithm, and the formula is as follows:
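The Viterbi search named above can be sketched in plain Python. It returns the tag sequence that maximizes the sum of per-character scores and transition scores without enumerating all sequences. The names `viterbi`, `emissions` and `transitions` are hypothetical, labels are integer indices, and start/end transitions are omitted as a simplifying assumption.

```python
def viterbi(emissions, transitions):
    """Return (best tag sequence, its score) by dynamic programming."""
    num_labels = len(emissions[0])
    best = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for i in range(1, len(emissions)):
        new_best, pointers = [], []
        for cur in range(num_labels):
            prev = max(range(num_labels),
                       key=lambda p: best[p] + transitions[p][cur])
            new_best.append(best[prev] + transitions[prev][cur] + emissions[i][cur])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(range(num_labels), key=lambda t: best[t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


emissions = [[1.0, 0.0], [0.0, 2.0], [0.0, 1.0]]
transitions = [[0.5, 0.0], [-5.0, 0.5]]
best_path, best_score = viterbi(emissions, transitions)
```

The strongly negative transition from label 1 to label 0 in this example plays the role of a tag constraint: the search avoids any path that contains that transition.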

As shown in FIG. 2, the present invention further provides a Chinese named entity recognition system based on stroke convolution and word vectors, which includes a pre-preparation module, a stroke feature acquisition module, a word vector acquisition module, a label prediction module, and an optimal label acquisition module;

a pre-preparation module to:

acquiring a stroke sequence corresponding to each Chinese character in the text and a character feature vector of each Chinese character;

a stroke feature acquisition module to:

inputting the stroke sequence into a stroke convolution neural network to obtain a stroke feature vector;

a word vector acquisition module to:

setting a sliding window according to the maximum length of an entity in the text, and acquiring a word vector of each word in the sliding window through a self-attention mechanism;

a label prediction module to:

concatenating the stroke feature vector, the word vector and the character feature vector of each Chinese character in the text, and inputting the concatenated vector into a BiLSTM network to obtain the score of each Chinese character for each entity label;

an optimal label acquisition module to:

determining an optimal entity label for each Chinese character in the text by adopting a CRF model.

The invention has the advantages that:

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A Chinese named entity recognition method based on stroke convolution and word vectors, characterized by comprising the following steps:

2. The method of claim 1, wherein the method comprises: and constructing a mapping table from the Chinese characters to the stroke sequences, and acquiring the stroke sequences corresponding to the Chinese characters through the mapping table.

3. The method of claim 1, wherein the method comprises: the stroke convolution neural network convolves the stroke sequence with convolution kernels of different window sizes to obtain the stroke feature vector.

4. The method of claim 3, wherein the method comprises: the stroke convolution neural network obtains a stroke feature map through convolution with convolution kernels of different window sizes, and performs max pooling and full connection on the feature map to obtain the stroke feature vector, wherein the formula is as follows:

wherein:

w represents the weights in convolutional neural network training;

M_{t,t+k-1} represents the input features;

b represents the bias in convolutional neural network training.

5. The method of claim 1, wherein the method comprises: adding a classification loss function L(cls) in the training process of the stroke convolution neural network:

L(cls) = -log P(z|X) = -log softmax(w * semb)

wherein:

X represents the input stroke sequence;

z represents the Chinese character label corresponding to the stroke sequence;

w represents a parameter of the network;

semb represents the stroke feature vector.

6. The method of claim 1, wherein the method comprises: acquiring a word vector of each word in the sliding window through a self-attention mechanism; the method comprises the following steps:

and acquiring a word vector of each word in the sliding window according to the similarity by adopting a softmax function.

7. The method of claim 6, wherein the method comprises:

8. The method for identifying named entities as claimed in claim 1, wherein the CRF model is used to determine an optimal entity label for each Chinese character in the text; the method comprises the following steps:

defining the character sequence of the input text as x = (x_1, x_2, ..., x_n) and the predicted tag sequence as y = (y_1, y_2, ..., y_n);

defining the predicted score with which the BiLSTM network model labels the i-th character as label y_i;

defining a label transition matrix, an entry of which represents the score of the transition from label y_i to label y_{i+1};

calculating a final score for each of the predicted tag sequences;

9. The method of claim 8, wherein the method comprises:

calculating the conditional probability of each of said predicted tag sequences

10. A system for implementing the method for identifying a named entity in chinese according to any one of claims 1 to 9, comprising a pre-preparation module, a stroke feature acquisition module, a word vector acquisition module, a label prediction module, and an optimal label acquisition module;

the pre-preparation module is configured to:

the stroke characteristic acquisition module is used for:

the word vector acquisition module is configured to:

the label prediction module is configured to:

the best tag obtaining module is configured to: