CN114781380A - Chinese named entity recognition method, equipment and medium fusing multi-granularity information - Google Patents

Info

Publication number
CN114781380A
CN114781380A (application CN202210277553.6A)
Authority
CN
China
Prior art keywords
characters
word
sequence
named entity
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210277553.6A
Other languages
Chinese (zh)
Inventor
李丽洁
胡双阳
韩启龙
宋洪涛
王也
马志强
张海涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202210277553.6A
Publication of CN114781380A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention provides a Chinese named entity recognition method, equipment and medium fusing multi-granularity information. The method comprises the following steps: (1) acquiring a field corpus data set, preprocessing the data set, and dividing it into a training set, a test set and a verification set; (2) extracting and fusing the character, soft-word and radical pre-training vectors of the corpus data preprocessed in step (1); (3) constructing a Chinese named entity recognition model fusing multi-granularity information; (4) inputting the data obtained in step (2) into the model for training; (5) processing and calculating the data to be recognized with the recognition model obtained in step (4) to obtain the named entity recognition result. Aiming at the shortcomings of existing Chinese named entity recognition, the method fuses radical-level information to capture the semantic information inherent within the characters of a sequence, uses an expanded soft-word module to obtain word-level semantic information, and fuses both into the character embedding vector, thereby improving the accuracy of Chinese named entity recognition.

Description

Chinese named entity recognition method, equipment and medium fusing multi-granularity information
Technical Field
The invention belongs to the technical field of named entity recognition, and particularly relates to a method, equipment and medium for Chinese named entity recognition fusing multi-granularity information.
Background
With the continuous development of the economy and of computer technology, vast amounts of text appear on the network at every moment. These texts contain information about many aspects of society, the economy, daily life, and science and technology, but because of their volume and variety, the information they contain is often difficult to use effectively.
Named entity recognition solves this problem to a certain extent. Its core task is: for a given text, identify and extract the named entities in its sentences that carry key information, such as person names, place names, and organization names. Named entity recognition is a fundamental task in the field of natural language processing; many downstream tasks, such as question-answering systems, knowledge graph construction, and information extraction, all depend on it. Because Chinese has no obvious word boundaries such as spaces and its characters are packed closely together, Chinese named entity recognition is generally harder than English named entity recognition. Named entity recognition is usually treated as a sequence labeling problem, and most traditional named entity recognition models are linear statistical models, such as hidden Markov models, maximum-entropy hidden Markov models, conditional random fields, and support vector machines. In recent years, deep learning methods have achieved good results on the named entity recognition task by virtue of their powerful representational ability, and are gradually becoming the mainstream approach. Existing Chinese named entity recognition models fall into character-based models and word-based models. A word-based model first requires a Chinese word segmentation system: the input sequence is segmented into words before being fed to the model. However, because of the complexity of Chinese, segmentation systems cannot avoid segmentation errors, and such errors propagate to the end of the sequence and degrade model performance. Character-based models avoid this problem but find it difficult to utilize the word information in the sequence. A named entity is typically composed of one or more consecutive words, and word boundaries often coincide with entity boundaries, so effectively exploiting the word information in a sequence can greatly improve named entity recognition performance. To incorporate word information into character-based models, many researchers match potential words in the sequence against an external dictionary. Typically, Zhang and Yang proposed the lattice model, which uses a gating mechanism to weight the potential words in the sequence that match a dictionary and incorporate them into a character-based model. Later, Ma et al. proposed a simplified lattice model to address the facts that the directed acyclic graph structure of the lattice model cannot be trained in batches, making it inefficient, and that the lattice model suffers from degradation. The simplified lattice model replaces the gating mechanism of the lattice model with a soft-word strategy and a weight-fusion mechanism, which effectively avoids model degradation, fixes the sentence length, and greatly improves efficiency. Although the simplified lattice model uses word information directly and effectively, the following problems remain. On the one hand, for longer words the soft-word method loses some of the information in the middle group. For example, for the input sequence "中国足球队" (Chinese football team), the dictionary candidate words in which the character "球" (ball) occupies a middle position include "足球队" (football team) and "中国足球队" (Chinese football team).
The soft-word method lumps such words into a single middle group and does not distinguish the specific position each word places the character in, yet this relative position information is very important for recognizing named entities, and the problem becomes more severe as entity length increases. On the other hand, the radical-level semantic information inside the characters of a sequence goes unexplored and unexploited: a pre-trained language model can effectively capture the contextual semantic information of the sequence, but it cannot obtain the inherent semantic information carried by the pictographic structure of a character, which manifests concretely as its radical, its structural composition, and its writing sequence. In summary, the main problems of current research are that models are either susceptible to word segmentation errors or under-utilize word information, that the radical-level semantic information inside characters is not considered, and that the lattice family of models fails to fully exploit the semantic information of the three granularities in a sequence, namely word, character, and radical, so recognition accuracy still has room for improvement.
Disclosure of Invention
The invention aims to solve the problems that traditional Chinese named entity recognition methods find it difficult to fully utilize the information in a sequence and therefore recognize entities poorly, and provides a Chinese named entity recognition method, equipment and medium fusing multi-granularity information.
The invention is realized by the following technical scheme. The invention provides a Chinese named entity recognition method fusing multi-granularity information, which specifically comprises the following steps:
step 1: acquiring a field corpus data set, preprocessing the data set, and dividing the data set into a training set, a test set and a verification set;
step 2: extracting characters, soft words and radical pre-training vectors in the corpus data preprocessed in the step 1 for vector fusion, and constructing a Chinese named entity recognition model fusing multi-granularity information;
step 3: inputting the data obtained in step 2 into the model for training;
step 4: processing and calculating the data to be recognized by using the Chinese named entity recognition model fusing multi-granularity information obtained in step 3 to obtain a named entity recognition result.
Further, the step 1 specifically includes the following steps:
step 1.1: identifying the named entities in sentence-level corpus data and marking them as predefined types, wherein the types comprise person names, place names and organization names;
step 1.2: dividing the labeled result into character-level corpus data using the BMESO tagging scheme, in the form: character + its entity-position tag + the predefined type to which it belongs;
step 1.3: and dividing the preprocessed data set into a training set, a test set and a verification set according to a certain proportion.
Further, the step 2 specifically includes the following steps:
step 2.1: for the characters in the sequence, performing character mapping on the character sequence one by using a pre-training language model, and encoding each character in the input sequence into a low-dimensional dense embedded vector;
step 2.2: for candidate words corresponding to characters in the sequence: establishing a vocabulary lookup tree based on an external dictionary, matching candidate words corresponding to characters in a sentence, establishing an expanded soft word set, and weighting the expanded soft word set corresponding to the characters by using a weight fusion strategy to obtain word-level vectors corresponding to the characters;
step 2.3: for the radical-level features corresponding to characters in the sequence: constructing a radical level feature lookup table for common Chinese characters, expressing features as pre-trained embedded vectors, and extracting radical level feature embedded vectors by using a convolutional neural network;
step 2.4: sequentially splicing characters, soft words and radical level feature vectors;
step 2.5: performing a padding/truncation operation on each sentence in the data set so that it has a fixed length: for sentences longer than the prescribed length, discarding the part beyond that length; for sentences shorter than the prescribed length, padding them up to the prescribed length;
step 2.6: inputting the fixed-length sentences into the model in groups of size Batch_Size, each subsequence in a batch being one sentence;
step 2.7: performing forward LSTM encoding and backward LSTM encoding on the feature vectors in the batch at the hidden layer, and concatenating the forward and backward hidden vectors to obtain the bidirectional feature vectors of the data.
Further, the step 2.2 specifically includes the following steps:
step 2.2.1: traversing an external dictionary and constructing a vocabulary prefix search tree;
step 2.2.2: matching candidate words in the sentence by using a vocabulary searching tree, and constructing a soft word set for the characters according to the positions of the characters in the candidate words;
step 2.2.3: counting the total times of the candidate words appearing in the corpus data and the times of the candidate words appearing in each position of the soft word set to obtain the weight of the candidate words in each position of the soft word set;
step 2.2.4: and weighting the candidate words at all positions corresponding to the characters, and splicing soft word level vectors.
Further, the step 2.3 specifically includes the following steps:
step 2.3.1: constructing a radical-level feature lookup table for common Chinese characters, whose radical-level features comprise: the simplified/traditional radical of the character, the structural composition of the character and the writing sequence of the character, in the form: character-radical-structural composition-writing-order sequence;
step 2.3.2: searching the pre-trained embedded-vector lookup table and representing each radical-level feature corresponding to the character as an embedded vector of dimension d, whereupon the radical-level features corresponding to the character are represented as an embedded-vector matrix;
step 2.3.3: fixing the dimension of the embedding matrix to k × d: for matrices longer than k, performing a truncation operation and keeping the first k features; for matrices shorter than k, randomly initializing padding up to length k;
step 2.3.4: performing x consecutive one-dimensional convolutions and a max-pooling operation on the fixed-dimension radical-level feature embedding matrix to obtain a d-dimensional embedded vector representing the radical-level features corresponding to the character.
Further, the step 3 specifically includes the following steps:
step 3.1: performing iterative update calculation on the bidirectional feature vector in the hidden layer;
step 3.2: inputting the result into a CRF layer, iteratively updating the emission probability and the transition probability, and calculating a maximum score sequence;
step 3.3: and updating and storing the parameters of the trained model.
Further, the step 4 specifically includes the following steps:
step 4.1: taking the Chinese text sequence to be recognized, in units of characters, as the input of the model;
step 4.2: calculating and outputting the entity recognition result.
Further, the corpus data over which the total number of occurrences of the candidate words is counted refers to the training set plus the test set.
The invention provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above Chinese named entity recognition method fusing multi-granularity information.
The invention provides a computer readable storage medium for storing computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the Chinese named entity recognition method fusing multi-granularity information.
Compared with the prior art, the invention has the beneficial effect of fully utilizing the potential information in the sequence on the basis of the BiLSTM model: it mines the radical-level semantic information in the sequence and expands the original soft-word method, thereby better meeting the challenge brought by increasing entity length and improving the accuracy of named entity recognition.
Drawings
FIG. 1 is a flow chart of a Chinese named entity recognition method incorporating multi-granularity information;
FIG. 2 is a model framework diagram of a Chinese named entity recognition method incorporating multi-granularity information;
FIG. 3 is a schematic diagram of a method for expanding soft words;
FIG. 4 is a detailed view of the radical-level information of the character "热" (hot);
fig. 5 is a block diagram of radical level feature extraction.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
As shown in fig. 1 to 5, the present invention provides a method for identifying a chinese named entity by fusing multi-granularity information, which specifically includes the following steps:
step 1: acquiring a field corpus data set, preprocessing the data set and dividing the data set into a training set, a testing set and a verification set;
the step 1 specifically comprises the following steps:
step 1.1: named entities in the sentence-level corpus data are identified and labeled as predefined entity types, such as: person name, place name, organization name, etc.;
step 1.2: dividing the labeled result into character-level corpus data using the BMESO tagging scheme, in the form: character + its entity-position tag + the predefined type to which it belongs;
step 1.3: the preprocessed data set is divided into a training set, a test set and a verification set at a ratio of 6:2:2.
The character sequence s in the preprocessed data set is as follows:

s = [c_1, c_2, c_3, …, c_n]

where c_i denotes the i-th character in the character sequence, i ∈ [1, n]; c_{i,j} denotes the word consisting of the i-th through j-th characters of the sequence, i, j ∈ [1, n] and i < j (some words may consist of a single character);
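In code, the BMESO tagging of step 1.2 can be sketched as follows (a minimal Python sketch; the example sentence, the span annotation and the type name ORG are hypothetical illustrations, not data from the patent):

```python
def bmeso_tags(sentence, entities):
    """Convert span-level entity annotations to per-character BMESO tags.

    entities: list of (start, end, type) spans, end exclusive.
    """
    tags = ["O"] * len(sentence)            # O = outside any entity
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"      # single-character entity
        else:
            tags[start] = f"B-{etype}"      # begin of entity
            for k in range(start + 1, end - 1):
                tags[k] = f"M-{etype}"      # middle of entity
            tags[end - 1] = f"E-{etype}"    # end of entity
    return tags

# e.g. "哈尔滨工程大学" annotated as one organization entity:
print(bmeso_tags("哈尔滨工程大学", [(0, 7, "ORG")]))
# ['B-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'M-ORG', 'E-ORG']
```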
step 2: extracting characters, soft words and radical pre-training vectors in the corpus data preprocessed in the step 1 for vector fusion and constructing a Chinese named entity recognition model fusing multi-granularity information;
step 2.1: for the characters in the sequence, character mapping is performed character by character using the pre-trained language model BERT-wwm, and each character c_i of the input sequence is encoded as a 768-dimensional embedded vector, as follows:

x_i^c = e^c(c_i)

where e^c is the embedded-vector lookup table for characters.
As shown in FIG. 2, for the input sequence, each character c_i is first processed with BERT-wwm, and its Token Embedding, Segment Embedding and Position Embedding are combined as the pre-training vector of the character.
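For concreteness, the character-vector extraction of step 2.1 can be sketched with the HuggingFace transformers library as follows. This is a minimal sketch under stated assumptions: the checkpoint name "hfl/chinese-bert-wwm" and the use of the final hidden states as the 768-dimensional character vectors are illustrative assumptions; the patent itself only specifies BERT-wwm and one 768-dimensional vector per character.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint; the patent only names "BERT-wwm".
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-bert-wwm")
bert = BertModel.from_pretrained("hfl/chinese-bert-wwm")

sentence = "中国足球队"
# Chinese BERT tokenizes Chinese text character by character, matching the
# patent's one-character-one-vector mapping.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # (1, n + 2, 768) incl. [CLS]/[SEP]
char_vectors = hidden[0, 1:-1]                 # one 768-dim vector per character
```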
Step 2.2: for candidate words corresponding to characters in the sequence: establishing a vocabulary lookup tree based on an external dictionary, matching candidate words corresponding to characters in a sentence, establishing an expanded soft word set, and then weighting the expanded soft word set corresponding to the characters by using a weight fusion strategy to obtain word-level vectors corresponding to the characters, wherein the method specifically comprises the following steps:
step 2.2.1: traversing an external dictionary, and constructing a vocabulary prefix search tree;
step 2.2.2: using the vocabulary search tree to match the candidate words in a sentence, and constructing an extended soft-word set for each character of the sequence according to the positions the character occupies in the candidate words (comprising the begin position B, the intermediate group (first intermediate position M1, second intermediate position M2, other intermediate positions Mo), the end position E, and the single-character-word position S);
as shown in fig. 3, for the input sequence "chinese football team", all potential words in the sequence are first found: "Chinese football", "Chinese football team", "football", "team". For the character "ball", the words containing it include "team", "football team", "ChinaFootball team, football, ball, and the like (considering the word formation of single character), the word of ball is in the initial position in the word of team, so that the word of team is added to the word set of Begin position in the extended soft word set of ball, and the word of ball is in the word of team1Extended soft word set M of location, hence "ball1The position word set is added with a word ' football team ' and the ' ball ' word is positioned in M in the word ' football team2Extended soft word set M of location, hence "ball2The position word set is added with a word ' Chinese football team ', and the word ' ball ' is positioned in M in the word ' Chinese football teamoExtended soft word set M of location, hence "balloThe position word set is added with a word "Chinese football team", and the above operation is executed to obtain an extended soft word set as shown in fig. 3.
Step 2.2.3: count the number of occurrences z(w) of each candidate word w in the corpus data (training set + test set) and at each position of the soft-word sets, and weight the candidate words of each position set to obtain the weighted word embedding v^w(W) of that position:

v^w(W) = (1/Z) Σ_{w ∈ W} z(w) e^w(w)

Z = Σ_{w ∈ B ∪ M1 ∪ M2 ∪ Mo ∪ E ∪ S} z(w)

where W denotes one of the soft-word sets "B, M1, M2, Mo, E, S", w denotes a candidate word in W, and e^w is the word embedded-vector lookup table.
Step 2.2.4: perform vector concatenation of the soft-word vectors of the character's different positions to obtain the soft-word-level vector z^w corresponding to the character. Specifically, z^w is calculated by the following formula:

z^w = v^w(B) ⊕ v^w(M1) ⊕ v^w(M2) ⊕ v^w(Mo) ⊕ v^w(E) ⊕ v^w(S)

where ⊕ denotes the concatenation operation of vectors and v^w(·) denotes the weighted word embedding vector of a soft-word position.
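A sketch of the weighting and concatenation in steps 2.2.3 and 2.2.4, following the formulas above: the corpus counts z and the word-embedding table word_vec are assumed to be available, and the embedding dimension is illustrative.

```python
import numpy as np

POSITIONS = ("B", "M1", "M2", "Mo", "E", "S")

def soft_word_vector(char_sets, z, word_vec, dim=50):
    """char_sets: the extended soft-word sets of one character;
    z: corpus counts z(w) over training + test sets;
    word_vec: pre-trained word embedding lookup table e^w."""
    # Z sums the counts of every matched word across all six position sets
    Z = sum(z[w] for key in POSITIONS for w in char_sets[key])
    parts = []
    for key in POSITIONS:
        v = np.zeros(dim)
        for w in char_sets[key]:
            v += z[w] * np.asarray(word_vec[w])   # z(w) * e^w(w)
        parts.append(v / Z if Z > 0 else v)
    return np.concatenate(parts)                  # z^w, 6 * dim dimensions
```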
Step 2.3: for the radical-level features corresponding to characters in the sequence: the method comprises the following steps of constructing a radical level feature lookup table for common Chinese characters, expressing features as pre-trained embedded vectors, and extracting the radical level feature embedded vectors by using a convolutional neural network, wherein the method specifically comprises the following steps:
step 2.3.1: build a radical-level feature lookup table e^r for common Chinese characters, whose radical-level features include: the simplified/traditional radical of the character, the structural composition of the character and the writing sequence of the character, in the form: character-radical-structural composition-writing-order sequence;
As shown in fig. 4, for the character "热" (hot), its radical is the fire radical "灬", its structural composition is "执" over "灬", and its writing sequence lists these components in writing order; these radical-level features reflect the meaning of the character.
Step 2.3.2: look up the pre-trained embedded-vector lookup table and represent each radical-level feature r_j, j ∈ [1, m], of the character c_i = {r_1, r_2, …, r_m} as an embedded vector of dimension d; the radical-level features corresponding to the character can then be represented as an embedded-vector matrix O:

O = [e^r(r_1); e^r(r_2); …; e^r(r_m)] ∈ R^(m×d)

where e^r(r_j) denotes the radical-level embedded-vector lookup table, obtained by training Word2vec on the Chinese Gigaword corpus.
Step 2.3.3: fix the dimension of the embedding matrix to k × d: for matrices longer than k, perform a truncation operation and keep the first k features; for matrices shorter than k, randomly initialize padding up to length k. After this step, the embedded-vector matrix becomes O′ ∈ R^(k×d).
step 2.3.4: as shown in fig. 5, x times of continuous one-dimensional convolution and maximum pooling operations are performed on the radical-level feature embedding matrix of fixed dimension, and a d-dimensional embedding vector representing the radical-level feature corresponding to the character is obtained.
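A PyTorch sketch of steps 2.3.2 to 2.3.4: the radical-level embedding matrix is padded or truncated to k rows and passed through x consecutive one-dimensional convolutions followed by max pooling, yielding one d-dimensional vector per character. The hyperparameter values d, k, x and the kernel size are placeholders; the patent does not fix them.

```python
import torch
import torch.nn as nn

class RadicalEncoder(nn.Module):
    def __init__(self, d=50, k=8, x=2):
        super().__init__()
        # x consecutive 1-D convolutions over the k radical-feature positions
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size=3, padding=1) for _ in range(x)])

    def forward(self, feats):            # feats: (batch, k, d), padded/truncated
        h = feats.transpose(1, 2)        # (batch, d, k) as Conv1d expects
        for conv in self.convs:
            h = torch.relu(conv(h))
        return h.max(dim=2).values       # max pooling over positions -> (batch, d)
```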
Step 2.4: concatenate the character, soft-word and radical-level feature vectors in order, as in the following formula:

x = x^c ⊕ y^r ⊕ z^w

where x^c is the character embedded vector, y^r is the radical-level embedded vector, z^w is the soft-word-level embedded vector, and ⊕ denotes the concatenation operation of vectors; the concatenated result is a character embedding vector containing word-level information and radical-level information.
Step 2.5: perform a padding/truncation operation on each sentence in the data set so that it has a fixed length; specifically, for sentences longer than the prescribed length, discard the part beyond that length; for sentences shorter than the prescribed length, pad them up to the prescribed length;
Step 2.6: input the fixed-length sentences into the model in groups of size Batch_Size, each subsequence in a batch being one sentence;
step 2.7: perform forward LSTM encoding and backward LSTM encoding on the feature vectors in the batch at the hidden layer, and concatenate the forward and backward hidden vectors to obtain the bidirectional feature vectors of the data:

[i_t; o_t; f_t; c̃_t] = [σ; σ; σ; tanh](W [x_t; h_{t-1}] + b)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

h_t = o_t ⊙ tanh(c_t)

where σ is the element-wise sigmoid function, ⊙ denotes the element-wise product, and W and b are trainable parameters. The memory cell c can be regarded as long-term memory, and the hidden state h as short-term memory. The backward LSTM shares the same definition as the forward LSTM but models the sequence in reverse order. The forward and backward hidden states of step i are concatenated to form the context-dependent representation of c_i.
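Steps 2.5 to 2.7 can be sketched in PyTorch as follows; the sentence length, batch size, hidden size and fused dimension (768 for the character vector, 6 × 50 for the soft-word vector, 50 for the radical vector) are illustrative assumptions.

```python
import torch
import torch.nn as nn

max_len, fused_dim, hidden = 128, 768 + 6 * 50 + 50, 256

bilstm = nn.LSTM(input_size=fused_dim, hidden_size=hidden,
                 batch_first=True, bidirectional=True)

# A padded batch of Batch_Size = 32 fixed-length sentences of fused vectors.
batch = torch.zeros(32, max_len, fused_dim)
outputs, _ = bilstm(batch)   # (32, max_len, 2 * hidden): the forward and
                             # backward hidden vectors concatenated per character
```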
Step 3: inputting the data obtained in step 2 into the model for training, which specifically comprises the following steps:
step 3.1: performing iterative update calculation on the bidirectional feature vector in the hidden layer;
step 3.2: inputting the result into a CRF layer, iteratively updating the emission probability and the transition probability, and calculating a maximum score sequence;
s(X, y) = Σ_{i=1}^{n} P_{i, y_i} + Σ_{i=0}^{n} T_{y_i, y_{i+1}}

where P is the output of the BiLSTM, P_{i, y_i} representing the emission score of tag y_i at the i-th character, and the transition matrix T represents the transition probability from tag y_i to tag y_{i+1}.
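A minimal sketch of the score above: the emission matrix P is assumed to come from a linear layer over the BiLSTM output, T is the learned transition matrix, and the special start/end boundary transitions are omitted for brevity.

```python
import torch

def crf_score(P, T, tags):
    """P: (n, num_tags) emission scores; T: (num_tags, num_tags) transitions;
    tags: (n,) LongTensor of gold tag indices. Returns s(X, y)."""
    emit = P[torch.arange(len(tags)), tags].sum()   # sum of P_{i, y_i}
    trans = T[tags[:-1], tags[1:]].sum()            # sum of T_{y_i, y_{i+1}}
    return emit + trans
```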
Step 3.3: and updating and storing the parameters of the trained model.
Specifically, the model is trained using a negative log-likelihood loss function, and overfitting is mitigated using L2 regularization, as follows:
L = -Σ_{(X, y)} log p(y | X) + (λ/2) ‖θ‖²
where θ represents a parameter set and λ is a regularization parameter.
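Using the crf_score sketch above, the training objective can be sketched as follows; log_partition (the CRF forward-algorithm normalizer over all tag sequences) is assumed to be implemented elsewhere, and the L2 term follows the formula above.

```python
def nll_loss(P, T, tags, params, lam):
    """Negative log-likelihood of the gold sequence plus L2 regularization."""
    # -log p(y|X) = log Z(X) - s(X, y) for a linear-chain CRF
    nll = log_partition(P, T) - crf_score(P, T, tags)
    l2 = lam / 2 * sum((p ** 2).sum() for p in params)   # (λ/2) ||θ||²
    return nll + l2
```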
Step 4: process and calculate the data to be recognized with the Chinese named entity recognition model fusing multi-granularity information obtained in step 3 to obtain the named entity recognition result, which specifically comprises the following steps:
step 4.1: take the Chinese text sequence to be recognized, in units of characters, as the input of the model;
step 4.2: calculate and output the entity recognition result.
The invention provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the steps of the above Chinese named entity recognition method fusing multi-granularity information.
The invention provides a computer readable storage medium for storing computer instructions, which when executed by a processor implement the steps of the Chinese named entity recognition method fusing multi-granularity information.
The method, equipment and medium for Chinese named entity recognition fusing multi-granularity information provided by the invention are described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A Chinese named entity recognition method fusing multi-granularity information is characterized by comprising the following steps:
step 1: acquiring a field corpus data set, preprocessing the data set, and dividing the data set into a training set, a test set and a verification set;
step 2: extracting characters, soft words and radical pre-training vectors from the corpus data preprocessed in step 1 for vector fusion, and constructing a Chinese named entity recognition model fusing multi-granularity information;
step 3: inputting the data obtained in step 2 into the model for training;
step 4: processing and calculating the data to be recognized by using the Chinese named entity recognition model fusing multi-granularity information obtained in step 3 to obtain a named entity recognition result.
2. The method according to claim 1, characterized in that said step 1 comprises in particular the steps of:
step 1.1: identifying the named entities in sentence-level corpus data and marking them as predefined types, wherein the types comprise person names, place names and organization names;
step 1.2: dividing the labeled result into character-level corpus data using the BMESO tagging scheme, in the form: character + its entity-position tag + the predefined type to which it belongs;
step 1.3: and dividing the preprocessed data set into a training set, a testing set and a verification set according to a certain proportion.
3. The method according to claim 2, wherein step 2 comprises in particular the steps of:
step 2.1: for the characters in the sequence, performing character mapping on the character sequence one by using a pre-training language model, and encoding each character in the input sequence into a low-dimensional dense embedded vector;
step 2.2: for candidate words corresponding to characters in the sequence: establishing a vocabulary lookup tree based on an external dictionary, matching candidate words corresponding to characters in a sentence, establishing an expanded soft word set, and weighting the expanded soft word set corresponding to the characters by using a weight fusion strategy to obtain word-level vectors corresponding to the characters;
step 2.3: for the radical level features corresponding to characters in the sequence: constructing a radical level feature lookup table for common Chinese characters, expressing features as pre-trained embedded vectors, and extracting radical level feature embedded vectors by using a convolutional neural network;
step 2.4: sequentially splicing characters, soft words and radical level feature vectors;
step 2.5: performing a padding/truncation operation on each sentence in the data set so that it has a fixed length: for sentences longer than the prescribed length, discarding the part beyond that length; for sentences shorter than the prescribed length, padding them up to the prescribed length;
step 2.6: inputting the fixed-length sentences into the model in groups of size Batch_Size, each subsequence in a batch being one sentence;
step 2.7: performing forward LSTM encoding and backward LSTM encoding on the feature vectors in the batch at the hidden layer, and concatenating the forward and backward hidden vectors to obtain the bidirectional feature vectors of the data.
4. The method according to claim 3, characterized in that said step 2.2 comprises in particular the steps of:
step 2.2.1: traversing an external dictionary, and constructing a vocabulary prefix search tree;
step 2.2.2: using a vocabulary search tree to match candidate words in a sentence, and constructing a soft word set for the characters according to the positions of the characters in the candidate words;
step 2.2.3: counting the total times of the candidate words appearing in the corpus data and the times of the candidate words appearing in each position of the soft word set to obtain the weight of the candidate words in each position of the soft word set;
step 2.2.4: and weighting the candidate words at all positions corresponding to the characters, and splicing soft word level vectors.
5. The method according to claim 4, characterized in that said step 2.3 comprises in particular the steps of:
step 2.3.1: a radical level characteristic lookup table is constructed for common Chinese characters, and the radical level characteristics of the lookup table comprise: the simplified/traditional radicals of characters, the structural composition of characters and the writing sequence of characters are in the form: character-radical-construct composition-writing order sequence;
step 2.3.2: searching the pre-trained embedded vector lookup table, and representing each radical-level feature corresponding to the character as an embedded vector of dimension d, whereupon the radical-level features corresponding to the character are represented as an embedded vector matrix;
step 2.3.3: fixing the dimension of the embedding matrix to k × d: for matrices longer than k, performing a truncation operation and keeping the first k features; for matrices shorter than k, randomly initializing padding up to length k;
step 2.3.4: and carrying out x times of continuous one-dimensional convolution on the radical level feature embedding matrix with fixed dimensionality and carrying out maximum pooling operation to obtain a d-dimensional embedding vector to represent the radical level features corresponding to the characters.
6. The method according to claim 5, wherein the step 3 comprises the following steps:
step 3.1: performing iterative update calculation on the bidirectional feature vector in the hidden layer;
step 3.2: inputting the result into a CRF layer, iteratively updating the emission probability and the transition probability, and calculating a maximum score sequence;
step 3.3: and updating and storing the parameters of the trained model.
7. The method according to claim 6, wherein the step 4 comprises the steps of:
step 4.1: taking the Chinese text sequence to be recognized, in units of characters, as the input of the model;
step 4.2: calculating and outputting the entity recognition result.
8. The method of claim 4, wherein the corpus data over which the total number of occurrences of the candidate words is counted refers to the training set plus the test set.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1-8 when executing the computer program.
10. A computer-readable storage medium storing computer instructions, which when executed by a processor implement the steps of the method of any one of claims 1 to 8.
CN202210277553.6A 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information Pending CN114781380A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210277553.6A CN114781380A (en) 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210277553.6A CN114781380A (en) 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information

Publications (1)

Publication Number Publication Date
CN114781380A (en)

Family

ID=82426218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210277553.6A Pending CN114781380A (en) 2022-03-21 2022-03-21 Chinese named entity recognition method, equipment and medium fusing multi-granularity information

Country Status (1)

Country Link
CN (1) CN114781380A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116579343A (en) * 2023-05-17 2023-08-11 成都信息工程大学 Named entity identification method for Chinese text travel class
CN116579343B (en) * 2023-05-17 2024-06-04 成都信息工程大学 Named entity identification method for Chinese text travel class
CN116629267A (en) * 2023-07-21 2023-08-22 云筑信息科技(成都)有限公司 Named entity identification method based on multiple granularities
CN116629267B (en) * 2023-07-21 2023-12-08 云筑信息科技(成都)有限公司 Named entity identification method based on multiple granularities
CN117744787A (en) * 2024-02-20 2024-03-22 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN117744787B (en) * 2024-02-20 2024-05-07 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality

Similar Documents

Publication Publication Date Title
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
CN106980683B (en) Blog text abstract generating method based on deep learning
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN109871538A (en) A kind of Chinese electronic health record name entity recognition method
CN112711948B (en) Named entity recognition method and device for Chinese sentences
CN110263325B (en) Chinese word segmentation system
CN114781380A (en) Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN114003698B (en) Text retrieval method, system, equipment and storage medium
CN112541356B (en) Method and system for recognizing biomedical named entities
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN114676234A (en) Model training method and related equipment
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN110472062A (en) The method and device of identification name entity
CN112464669A (en) Stock entity word disambiguation method, computer device and storage medium
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN112784603A (en) Patent efficacy phrase identification method
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114510946A (en) Chinese named entity recognition method and system based on deep neural network
CN117291265B (en) Knowledge graph construction method based on text big data
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN116720519B (en) Seedling medicine named entity identification method
CN114626378B (en) Named entity recognition method, named entity recognition device, electronic equipment and computer readable storage medium
CN117371447A (en) Named entity recognition model training method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination