CN114943230A - Chinese specific field entity linking method fusing common knowledge - Google Patents

Chinese specific field entity linking method fusing common knowledge

Info

Publication number
CN114943230A
CN114943230A (application CN202210400706.1A)
Authority
CN
China
Prior art keywords
entity
layer
sequence
text
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210400706.1A
Other languages
Chinese (zh)
Other versions
CN114943230B (en)
Inventor
王柱 (Wang Zhu)
康天雨 (Kang Tianyu)
刘囡囡 (Liu Nannan)
郭斌 (Guo Bin)
於志文 (Yu Zhiwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210400706.1A
Publication of CN114943230A
Application granted
Publication of CN114943230B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese specific-field entity linking method fusing common-sense knowledge. The method first acquires and preprocesses common-sense knowledge, then constructs and completes an encyclopedia corpus knowledge base for a specified domain, then recognizes named entities with a BERT-BiGRU-CRF model combined with a bidirectional matching strategy, and finally carries out entity linking based on knowledge representation learning. The invention effectively addresses entity boundary recognition errors and incomplete entity recognition, and greatly improves the accuracy of the named entity recognition task and the entity linking task.

Description

Chinese specific field entity linking method fusing common knowledge
Technical Field
The invention belongs to the technical field of deep learning, and in particular relates to an entity linking method for specific Chinese domains.
Background
In recent years, cross-domain entity linking has been studied extensively both in China and abroad. The most common approach at present is the end-to-end method, which divides the task into two steps: named entity recognition (NER) and linking. Improving the accuracy of the former strongly influences the linking accuracy of the latter. Judging from the current state of research, traditional NER work focuses mainly on recognizing general named entities such as person names, place names, organization names, and time expressions, while research on recognizing entities in specific domains is lacking. In addition, owing to peculiarities of the Chinese language itself, such as ambiguity, polyphonic characters, and inexact Chinese-English conversion, accuracy on general Chinese named entity recognition is usually about 10% lower than on English named entity recognition. Named entity recognition for Chinese domain-specific knowledge is therefore a major challenge in this field.
The most common model in the NER task at present is BERT-CRF, an end-to-end deep learning method that requires no manual feature engineering. However, this method uses only the information in the short text: it can realize only a one-way match from the short text to the knowledge base and does not exploit the entity information stored in the knowledge base. Problems such as incorrect entity boundary recognition and incomplete recognition of entities in sentences also remain.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a Chinese specific-field entity linking method fusing common-sense knowledge. The method first acquires and preprocesses common-sense knowledge, then constructs and completes an encyclopedia corpus knowledge base for a specified domain, then recognizes named entities with a BERT-BiGRU-CRF model combined with a bidirectional matching strategy, and finally carries out entity linking based on knowledge representation learning. The invention effectively addresses entity boundary recognition errors and incomplete entity recognition, and greatly improves the accuracy of the named entity recognition task and the entity linking task.
The technical solution adopted by the invention comprises the following steps:
Step 1: construct a common-sense knowledge corpus for the specified domain: crawl documents in the specified fields, including psychology and sociology; extract the texts of the abstract and summary parts; apply sentence segmentation, punctuation removal and stop-word removal to the extracted texts; and take each processed text field text, the entity mentions mention_data in the text, and the number kb_id of each entity in the encyclopedia knowledge base as a training sample, obtaining the common-sense knowledge corpus of the specified domain; mention_data contains the entities to be linked after recognition;
Step 2: construct and complete the encyclopedia knowledge base: first, based on the encyclopedic knowledge related to social-network user behavior in the encyclopedia, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then revise and complete the encyclopedia knowledge graph, specifically:
(1) convert to lower case the proper entity names composed of capital English letters;
(2) convert special symbols, including quotation marks, commas and periods, into their English equivalents, and add the converted names to the alias field of the corresponding entity;
(3) crawl terminology entries for technical terms in the specified fields, including psychology and sociology, convert the data into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: rebuild the entity description texts: concatenate all predicates and objects of each entity in the encyclopedia knowledge base to obtain its entity description text; if the length of a description text exceeds d, truncate it in units of d, where d is a preset length; five dictionaries are constructed:
(1) using the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entry_id;
(2) using the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entry;
(3) using the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) using the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) using the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
Step 4: construct the Chinese named entity recognition BERT-BiGRU-CRF model, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent unit (BiGRU) layer, and a conditional random field (CRF) layer;
Step 4-1: the BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and the BERT layer produces word vectors based on context information;
Step 4-2: the BiGRU layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer; the word vectors output by the BERT layer are fed into the forward and backward GRU networks, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ for each entity mention, which are concatenated to obtain $H_{con}$; a max pooling operation then yields the global semantic information $H_{max}$ of the words in the text, which is passed to the conditional random field (CRF) layer for sequence labeling; the hidden state $h_t$ of the BiGRU layer at time $t$ is computed as follows:
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

where $\overrightarrow{h_t}$ denotes the hidden state of the forward GRU network at time $t$, $\overleftarrow{h_t}$ the hidden state of the backward GRU network at time $t$, $w_t$ the weight of the forward hidden state at time $t$, $v_t$ the weight of the backward hidden state at time $t$, and $b_t$ the bias of the hidden state at time $t$; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are computed as:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overrightarrow{h_{t-1}}\big) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overleftarrow{h_{t-1}}\big) \tag{3}$$

where $\mathrm{GRU}$ denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and backward GRU networks at time $t-1$;
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the computation proceeds as follows:
First, the prediction score $s(X, Y)$ of a predicted sequence $Y$ for an input sequence $X$ is calculated as:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e. the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence, with $y_i$ the predicted label of the $i$-th word; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.
Then the probability $p(Y \mid X)$ of generating the predicted sequence $Y$ is calculated as:

$$p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)} \tag{5}$$

where $\tilde{Y}$ denotes a candidate annotation sequence, $Y_X$ the set of all possible annotation sequences, and $s(X, \tilde{Y})$ the prediction score of annotation sequence $\tilde{Y}$ for the input sequence $X$.
Taking the logarithm of both sides of equation (5) yields the log-likelihood $\ln p(Y \mid X)$ of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln\Big(\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)\Big) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained by equation (7):

$$Y^* = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s\big(X, \tilde{Y}\big) \tag{7}$$
Step 5: train the named entity recognition model: randomly divide the training samples of the specified-domain common-sense corpus obtained in step 1 into 9 folds, input them into the BERT-BiGRU-CRF model constructed in step 4, and train the model with 9-fold cross-validation, obtaining the trained BERT-BiGRU-CRF model;
Step 6: common-sense text named entity recognition: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRU-CRF model to obtain the label sequence of each text;
Step 7: bidirectional matching between mentions in the common-sense corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain the entity vector representation, concatenate it with the text label sequence obtained in step 6, and output the named entity recognition result through one convolutional layer and an activation function; the output is a one-dimensional 0/1 vector in which 0 means not recognized and 1 means successfully recognized;
Step 8: entity linking model training: among the training samples, select some correctly linked entities as positive examples and use the remaining, incorrectly linked candidate entities as negative examples;
the entity linking model concatenates the segment vectors corresponding to the '1' outputs of step 7 with the entity description texts of the matched candidate entities in the encyclopedia knowledge base, and inputs the concatenated vectors into the BERT layer of the deep bidirectional pre-trained language model to obtain vector representations; these pass through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate with the highest probability is selected to establish the link; finally, a linking sequence is output in which a correct link has value 1 and otherwise 0;
Step 9: concatenate the vector of each recognized mention in the common-sense corpus texts with the vector of the entity description text to be linked in the encyclopedia knowledge base, input the concatenation into the trained entity linking model, output the sequence indicating whether a link is finally established, and keep the concatenated vectors whose value is 1, thereby obtaining the domain knowledge base with completed entity links.
The invention has the following beneficial effects:
the BERT-BiGRU-CRF algorithm and the bidirectional matching strategy adopted in the invention are a method which utilizes context information in a short text and entity description information in a knowledge base, so that a two-phase matching process is realized, the problems of entity boundary identification error and entity identification completion can be effectively solved, and the accuracy of a named entity identification task and an entity link task is greatly improved.
Drawings
FIG. 1 is the overall process framework of the invention;
FIG. 2 is the network structure of the named entity recognition and matching process of the invention;
FIG. 3 illustrates the entity representation learning process over the encyclopedia knowledge base of the invention;
FIG. 4 is the entity linking model applied after successful bidirectional entity matching.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in FIG. 1, a Chinese specific-field entity linking method fusing common-sense knowledge comprises the following steps:
Step 1: construct a common-sense knowledge corpus for the specified domain: crawl documents in the specified fields, including psychology and sociology; extract the texts of the abstract and summary parts; apply sentence segmentation, punctuation removal and stop-word removal to the extracted texts; and take each processed text field text, the entity mentions mention_data in the text, and the number kb_id of each entity in the encyclopedia knowledge base as a training sample, obtaining the common-sense knowledge corpus of the specified domain; mention_data contains the entities to be linked after recognition; a preprocessing sketch in code follows.
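By way of illustration, the following minimal Python sketch walks a toy sentence through this preprocessing; jieba is an assumed segmenter, the stopword list is a toy placeholder, and only the sample fields text / mention_data / kb_id come from the description above.

```python
# A minimal sketch of the step-1 preprocessing, assuming documents are already
# crawled to plain text. jieba is assumed for word segmentation so stop words
# can be removed; STOPWORDS is a placeholder list.
import re
import json
import jieba

STOPWORDS = {"的", "了", "是", "在", "和"}  # placeholder; use a full list in practice

def preprocess(raw_text):
    samples = []
    for sent in re.split(r"[。！？]", raw_text):             # sentence segmentation
        sent = re.sub(r"[，、；：“”‘’（）《》\s]", "", sent)  # punctuation removal
        words = [w for w in jieba.lcut(sent) if w not in STOPWORDS]
        if words:
            samples.append("".join(words))
    return samples

def make_sample(text, mentions):
    # mentions: (mention_string, kb_id) pairs for the entities in `text`.
    return {"text": text,
            "mention_data": [{"mention": m, "kb_id": k} for m, k in mentions]}

if __name__ == "__main__":
    for s in preprocess("社会认同理论由Tajfel提出。群体行为是其研究对象。"):
        print(json.dumps(make_sample(s, []), ensure_ascii=False))
```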
Step 2: construct and complete the encyclopedia knowledge base: first, based on the encyclopedic knowledge related to social-network user behavior in the encyclopedia, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then revise and complete the encyclopedia knowledge graph, specifically:
(1) convert to lower case the proper entity names composed of capital English letters;
(2) convert special symbols, including quotation marks, commas and periods, into their English equivalents, and add the converted names to the alias field of the corresponding entity;
(3) crawl terminology entries for technical terms in the specified fields, including psychology and sociology, convert the data into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: rebuild the entity description texts: concatenate all predicates and objects of each entity in the encyclopedia knowledge base to obtain its entity description text; if the length of a description text exceeds d, truncate it in units of d, where d is a preset length; five dictionaries are constructed, as sketched in code after this list:
(1) using the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entry_id;
(2) using the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entry;
(3) using the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) using the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) using the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
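A possible construction of the description texts and the five dictionaries is sketched below; the triple layout follows step 2, while truncation to the first d characters and the fallback type "其他" are illustrative assumptions.

```python
# Sketch of step 3: rebuild entity description texts from (h, r, t) triples and
# build the five dictionaries described above.
from collections import defaultdict

def build_dicts(triples, entity_types, d=256):
    """triples: (head, predicate, object) tuples; entity_types: {entity: type}."""
    desc = defaultdict(str)
    for h, r, t in triples:                  # concatenate predicate-object pairs
        desc[h] += f"{r}:{t}；"

    entry_id = {e: i for i, e in enumerate(desc)}             # (1) entity -> index
    id_entry = {i: e for e, i in entry_id.items()}            # (2) index -> entity
    id_text = {i: desc[e][:d] for e, i in entry_id.items()}   # (3) index -> description
    id_type = {i: entity_types.get(e, "其他") for e, i in entry_id.items()}   # (4)
    type_index = {t: j for j, t in enumerate(sorted(set(id_type.values())))}  # (5)
    return entry_id, id_entry, id_text, id_type, type_index

if __name__ == "__main__":
    triples = [("社会认同", "提出者", "Tajfel"), ("社会认同", "学科", "社会心理学")]
    entry_id, _, id_text, _, _ = build_dicts(triples, {"社会认同": "理论"})
    print(entry_id, id_text)
```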
Step 4: construct the Chinese named entity recognition BERT-BiGRU-CRF model, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent unit (BiGRU) layer, and a conditional random field (CRF) layer;
Step 4-1: the BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and the BERT layer produces word vectors based on context information;
Step 4-2: the BiGRU layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer; the word vectors output by the BERT layer are fed into the forward and backward GRU networks, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ for each entity mention, which are concatenated to obtain $H_{con}$; a max pooling operation then yields the global semantic information $H_{max}$ of the words in the text, which is passed to the conditional random field (CRF) layer for sequence labeling; the hidden state $h_t$ of the BiGRU layer at time $t$ is computed as follows (a PyTorch sketch follows formulas (1)-(3)):
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

where $\overrightarrow{h_t}$ denotes the hidden state of the forward GRU network at time $t$, $\overleftarrow{h_t}$ the hidden state of the backward GRU network at time $t$, $w_t$ the weight of the forward hidden state at time $t$, $v_t$ the weight of the backward hidden state at time $t$, and $b_t$ the bias of the hidden state at time $t$; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are computed as:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overrightarrow{h_{t-1}}\big) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overleftarrow{h_{t-1}}\big) \tag{3}$$

where $\mathrm{GRU}$ denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and backward GRU networks at time $t-1$;
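The BiGRU layer above can be sketched in PyTorch as follows; treating $w_t$, $v_t$ and $b_t$ as learned elementwise parameters shared across time steps is a simplifying assumption, and all dimensions are illustrative.

```python
# PyTorch sketch of the step 4-2 BiGRU layer following formulas (1)-(3).
import torch
import torch.nn as nn

class BiGRULayer(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256):
        super().__init__()
        self.fwd = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.bwd = nn.GRU(input_dim, hidden_dim, batch_first=True)
        # Combination weights of formula (1), shared across time (assumption).
        self.w = nn.Parameter(torch.ones(hidden_dim))
        self.v = nn.Parameter(torch.ones(hidden_dim))
        self.b = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x):                        # x: (batch, seq_len, input_dim)
        h_fwd, _ = self.fwd(x)                   # formula (2)
        h_bwd, _ = self.bwd(torch.flip(x, [1]))  # formula (3): reversed input
        h_bwd = torch.flip(h_bwd, [1])           # realign to forward time order
        h = self.w * h_fwd + self.v * h_bwd + self.b  # formula (1)
        h_con = torch.cat([h_fwd, h_bwd], dim=-1)     # concatenation H_con
        h_max = torch.max(h_con, dim=1).values        # max pooling -> H_max
        return h, h_con, h_max

if __name__ == "__main__":
    out, con, pooled = BiGRULayer()(torch.randn(2, 10, 768))
    print(out.shape, con.shape, pooled.shape)  # (2,10,256) (2,10,512) (2,512)
```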
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the computation proceeds as follows:
First, the prediction score $s(X, Y)$ of a predicted sequence $Y$ for an input sequence $X$ is calculated as:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e. the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence, with $y_i$ the predicted label of the $i$-th word; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.
Then the probability $p(Y \mid X)$ of generating the predicted sequence $Y$ is calculated as:

$$p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)} \tag{5}$$

where $\tilde{Y}$ denotes a candidate annotation sequence, $Y_X$ the set of all possible annotation sequences, and $s(X, \tilde{Y})$ the prediction score of annotation sequence $\tilde{Y}$ for the input sequence $X$.
Taking the logarithm of both sides of equation (5) yields the log-likelihood $\ln p(Y \mid X)$ of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln\Big(\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)\Big) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained by equation (7):

$$Y^* = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s\big(X, \tilde{Y}\big) \tag{7}$$
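The CRF quantities above can be sketched as follows; the start/end tag bookkeeping and the tag count are illustrative assumptions, the forward algorithm computes the log partition term of formula (6), and the Viterbi decode realizes formula (7).

```python
# Sketch of the step 4-3 CRF computations: formula (4) scoring, formula (6)
# log-likelihood via the forward algorithm, and formula (7) Viterbi decoding.
# A is (k+2, k+2) with a start tag at index -2 and an end tag at index -1.
import torch

def crf_score(P, y, A):
    """P: (n, k) emission scores; y: (n,) gold tags; returns formula (4)."""
    n = P.size(0)
    score = P[torch.arange(n), y].sum()          # sum of P_{i, y_i}
    score += A[-2, y[0]] + A[y[-1], -1]          # start/end transitions
    return score + A[y[:-1], y[1:]].sum()        # sum of A_{y_i, y_{i+1}}

def crf_log_likelihood(P, y, A):
    n, k = P.shape
    alpha = A[-2, :k] + P[0]                     # forward algorithm over Y_X
    for i in range(1, n):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A[:k, :k], dim=0) + P[i]
    log_z = torch.logsumexp(alpha + A[:k, -1], dim=0)
    return crf_score(P, y, A) - log_z            # formula (6)

def viterbi_decode(P, A):
    n, k = P.shape
    score, back = A[-2, :k] + P[0], []
    for i in range(1, n):
        total = score.unsqueeze(1) + A[:k, :k] + P[i]
        score, idx = total.max(dim=0)
        back.append(idx)
    best = int((score + A[:k, -1]).argmax())
    path = [best]
    for idx in reversed(back):                   # backtrack the best path
        best = int(idx[best])
        path.append(best)
    return path[::-1]                            # formula (7): Y*

if __name__ == "__main__":
    P, A = torch.randn(6, 3), torch.randn(5, 5)
    y = torch.tensor([0, 1, 1, 2, 0, 1])
    print(float(crf_log_likelihood(P, y, A)), viterbi_decode(P, A))
```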
Step 5: as shown in FIG. 2, train the named entity recognition model: randomly divide the training samples of the specified-domain common-sense corpus obtained in step 1 into 9 folds, input them into the BERT-BiGRU-CRF model constructed in step 4, and train the model with 9-fold cross-validation, obtaining the trained BERT-BiGRU-CRF model;
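A minimal sketch of the 9-fold split this step assumes; only the index bookkeeping is shown, and the actual BERT-BiGRU-CRF training and validation loops are omitted.

```python
# Sketch of the step-5 data split: 9 random folds for 9-fold cross-validation.
from sklearn.model_selection import KFold
import numpy as np

def nine_fold_indices(n_samples, seed=42):
    """Yield (train, val) index arrays; each fold validates on 1/9 of the data."""
    kf = KFold(n_splits=9, shuffle=True, random_state=seed)
    yield from kf.split(np.arange(n_samples))

if __name__ == "__main__":
    for fold, (tr, va) in enumerate(nine_fold_indices(900)):
        print(f"fold {fold}: {len(tr)} train / {len(va)} val")
```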
Step 6: as shown in FIG. 3, common-sense text named entity recognition: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRU-CRF model to obtain the label sequence of each text; the entity mentions in the text are tagged with the BIO scheme, i.e. {B (Begin), I (Inside), O (Outside)}, and the consecutive segments labeled 'B' and 'I' are the recognized mention segments;
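Recovering mention segments from a BIO label sequence can be sketched as follows; tolerating an "I" without a preceding "B" is an assumption.

```python
# Sketch of step-6 span recovery: consecutive "B"/"I" positions form one mention.
def bio_spans(tags):
    """tags: per-character labels; returns (start, end_exclusive) mention spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                  # a new mention begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":                # continue the current mention
            if start is None:
                start = i               # tolerate an I without a leading B
        else:                           # "O" closes any open mention
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

if __name__ == "__main__":
    text = "社会认同影响群体行为"
    tags = ["B", "I", "I", "I", "O", "O", "B", "I", "I", "I"]
    print([text[s:e] for s, e in bio_spans(tags)])  # ['社会认同', '群体行为']
```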
Step 7: bidirectional matching between mentions in the common-sense corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain the entity vector representation, concatenate it with the text label sequence obtained in step 6, and output the named entity recognition result through one convolutional layer and an activation function; the output is a one-dimensional 0/1 vector in which 0 means not recognized and 1 means successfully recognized;
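One way to realize the matching head of this step is sketched below, assuming 768-dimensional BERT vectors, a single 1-D convolution, and a 0.5 threshold; all of these are illustrative choices rather than fixed by the description.

```python
# Sketch of the step-7 bidirectional matching head: entity description vectors
# are concatenated with the labeled text sequence, passed through one 1-D
# convolution and a sigmoid, giving the per-position 0/1 recognition vector.
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.out = nn.Linear(dim, 1)

    def forward(self, text_seq, desc_vec):
        # text_seq: (batch, seq_len, dim); desc_vec: (batch, dim) pooled BERT
        # vector of the entity description, broadcast along the sequence.
        desc = desc_vec.unsqueeze(1).expand(-1, text_seq.size(1), -1)
        x = torch.cat([text_seq, desc], dim=-1).transpose(1, 2)
        h = torch.relu(self.conv(x)).transpose(1, 2)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # per-position probability

if __name__ == "__main__":
    probs = MatchHead()(torch.randn(1, 12, 768), torch.randn(1, 768))
    print((probs > 0.5).long())   # the 0/1 vector: 1 = successfully recognized
```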
Step 8: entity linking model training: among the training samples, select some correctly linked entities as positive examples and use the remaining, incorrectly linked candidate entities as negative examples;
the entity linking model concatenates the segment vectors corresponding to the '1' outputs of step 7 with the entity description texts of the matched candidate entities in the encyclopedia knowledge base, and inputs the concatenated vectors into the BERT layer of the deep bidirectional pre-trained language model to obtain vector representations; these pass through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate with the highest probability is selected to establish the link; finally, a linking sequence is output in which a correct link has value 1 and otherwise 0, as shown in FIG. 4;
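The candidate scoring of the linking model might look as follows, assuming the transformers library and the bert-base-chinese checkpoint (both assumptions); each mention is paired with every candidate's description and the highest-probability candidate is linked, as described above.

```python
# Sketch of the step-8 linking model: encode "mention [SEP] description" pairs
# with BERT, score each pair with a fully connected layer plus sigmoid, and
# link the candidate with the highest probability.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class Linker(nn.Module):
    def __init__(self, name="bert-base-chinese"):   # assumed checkpoint
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.fc = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, mention, candidate_descs, tokenizer):
        enc = tokenizer([mention] * len(candidate_descs), candidate_descs,
                        padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] vectors
        return torch.sigmoid(self.fc(cls)).squeeze(-1)   # probability per candidate

if __name__ == "__main__":
    tok = BertTokenizer.from_pretrained("bert-base-chinese")
    probs = Linker()("社会认同", ["提出者:Tajfel；学科:社会心理学", "类型:其他"], tok)
    print("linked candidate index:", int(probs.argmax()))
```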
Step 9: concatenate the vector of each recognized mention in the common-sense corpus texts with the vector of the entity description text to be linked in the encyclopedia knowledge base, input the concatenation into the trained entity linking model, output the sequence indicating whether a link is finally established, and keep the concatenated vectors whose value is 1, thereby obtaining the domain knowledge base with completed entity links.

Claims (1)

1. A Chinese specific-field entity linking method fusing common-sense knowledge, characterized by comprising the following steps:
Step 1: construct a common-sense knowledge corpus for the specified domain: crawl documents in the specified fields, including psychology and sociology; extract the texts of the abstract and summary parts; apply sentence segmentation, punctuation removal and stop-word removal to the extracted texts; and take each processed text field text, the entity mentions mention_data in the text, and the number kb_id of each entity in the encyclopedia knowledge base as a training sample, obtaining the common-sense knowledge corpus of the specified domain; mention_data contains the entities to be linked after recognition;
Step 2: construct and complete the encyclopedia knowledge base: first, based on the encyclopedic knowledge related to social-network user behavior in the encyclopedia, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then revise and complete the encyclopedia knowledge graph, specifically:
(1) convert to lower case the proper entity names composed of capital English letters;
(2) convert special symbols, including quotation marks, commas and periods, into their English equivalents, and add the converted names to the alias field of the corresponding entity;
(3) crawl terminology entries for technical terms in the specified fields, including psychology and sociology, convert the data into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: rebuild the entity description texts: concatenate all predicates and objects of each entity in the encyclopedia knowledge base to obtain its entity description text; if the length of a description text exceeds d, truncate it in units of d, where d is a preset length; five dictionaries are constructed:
(1) using the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entry_id;
(2) using the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entry;
(3) using the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) using the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) using the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
Step 4: construct the Chinese named entity recognition BERT-BiGRU-CRF model, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent unit (BiGRU) layer, and a conditional random field (CRF) layer;
Step 4-1: the BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and the BERT layer produces word vectors based on context information;
Step 4-2: the BiGRU layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer; the word vectors output by the BERT layer are fed into the forward and backward GRU networks, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ for each entity mention, which are concatenated to obtain $H_{con}$; a max pooling operation then yields the global semantic information $H_{max}$ of the words in the text, which is passed to the conditional random field (CRF) layer for sequence labeling; the hidden state $h_t$ of the BiGRU layer at time $t$ is computed as follows:
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

where $\overrightarrow{h_t}$ denotes the hidden state of the forward GRU network at time $t$, $\overleftarrow{h_t}$ the hidden state of the backward GRU network at time $t$, $w_t$ the weight of the forward hidden state at time $t$, $v_t$ the weight of the backward hidden state at time $t$, and $b_t$ the bias of the hidden state at time $t$; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are computed as:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overrightarrow{h_{t-1}}\big) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overleftarrow{h_{t-1}}\big) \tag{3}$$

where $\mathrm{GRU}$ denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and backward GRU networks at time $t-1$;
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the computation proceeds as follows:
First, the prediction score $s(X, Y)$ of a predicted sequence $Y$ for an input sequence $X$ is calculated as:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e. the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence, with $y_i$ the predicted label of the $i$-th word; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.
Then the probability $p(Y \mid X)$ of generating the predicted sequence $Y$ is calculated as:

$$p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)} \tag{5}$$

where $\tilde{Y}$ denotes a candidate annotation sequence, $Y_X$ the set of all possible annotation sequences, and $s(X, \tilde{Y})$ the prediction score of annotation sequence $\tilde{Y}$ for the input sequence $X$.
Taking the logarithm of both sides of equation (5) yields the log-likelihood $\ln p(Y \mid X)$ of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln\Big(\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)\Big) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained by equation (7):

$$Y^* = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s\big(X, \tilde{Y}\big) \tag{7}$$
Step 5: train the named entity recognition model: randomly divide the training samples of the specified-domain common-sense corpus obtained in step 1 into 9 folds, input them into the BERT-BiGRU-CRF model constructed in step 4, and train the model with 9-fold cross-validation, obtaining the trained BERT-BiGRU-CRF model;
Step 6: common-sense text named entity recognition: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRU-CRF model to obtain the label sequence of each text;
Step 7: bidirectional matching between mentions in the common-sense corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain the entity vector representation, concatenate it with the text label sequence obtained in step 6, and output the named entity recognition result through one convolutional layer and an activation function; the output is a one-dimensional 0/1 vector in which 0 means not recognized and 1 means successfully recognized;
Step 8: entity linking model training: among the training samples, select some correctly linked entities as positive examples and use the remaining, incorrectly linked candidate entities as negative examples;
the entity linking model concatenates the segment vectors corresponding to the '1' outputs of step 7 with the entity description texts of the matched candidate entities in the encyclopedia knowledge base, and inputs the concatenated vectors into the BERT layer of the deep bidirectional pre-trained language model to obtain vector representations; these pass through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate with the highest probability is selected to establish the link; finally, a linking sequence is output in which a correct link has value 1 and otherwise 0;
Step 9: concatenate the vector of each recognized mention in the common-sense corpus texts with the vector of the entity description text to be linked in the encyclopedia knowledge base, input the concatenation into the trained entity linking model, output the sequence indicating whether a link is finally established, and keep the concatenated vectors whose value is 1, thereby obtaining the domain knowledge base with completed entity links.
CN202210400706.1A 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge Active CN114943230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210400706.1A CN114943230B (en) 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210400706.1A CN114943230B (en) 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge

Publications (2)

Publication Number Publication Date
CN114943230A (en) 2022-08-26
CN114943230B CN114943230B (en) 2024-02-20

Family

ID=82908036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210400706.1A Active CN114943230B (en) 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge

Country Status (1)

Country Link
CN (1) CN114943230B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422369A (en) * 2022-08-30 2022-12-02 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN115796280A (en) * 2023-01-31 2023-03-14 南京万得资讯科技有限公司 Entity identification entity linking system suitable for high efficiency and controllability in financial field
CN116010583A (en) * 2023-03-17 2023-04-25 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cascade coupling knowledge enhancement dialogue generation method
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN117151220A (en) * 2023-10-27 2023-12-01 北京长河数智科技有限责任公司 Industry knowledge base system and method based on entity link and relation extraction
CN117743568A (en) * 2024-02-19 2024-03-22 中国电子科技集团公司第十五研究所 Content generation method and system based on fusion of resource flow and confidence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100356A (en) * 2020-09-17 2020-12-18 武汉纺织大学 Knowledge base question-answer entity linking method and system based on similarity
AU2020103654A4 (en) * 2019-10-28 2021-01-14 Nanjing Normal University Method for intelligent construction of place name annotated corpus based on interactive and iterative learning
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020103654A4 (en) * 2019-10-28 2021-01-14 Nanjing Normal University Method for intelligent construction of place name annotated corpus based on interactive and iterative learning
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112100356A (en) * 2020-09-17 2020-12-18 武汉纺织大学 Knowledge base question-answer entity linking method and system based on similarity
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiao; LI Yegang; WANG Dong; SHI Shumin: "Named Entity Recognition Based on ERNIE", Intelligent Computer and Applications, no. 03, 1 March 2020 (2020-03-01) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422369A (en) * 2022-08-30 2022-12-02 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN115422369B (en) * 2022-08-30 2023-11-03 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN115796280A (en) * 2023-01-31 2023-03-14 南京万得资讯科技有限公司 Entity identification entity linking system suitable for high efficiency and controllability in financial field
CN116010583A (en) * 2023-03-17 2023-04-25 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cascade coupling knowledge enhancement dialogue generation method
CN116010583B (en) * 2023-03-17 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cascade coupling knowledge enhancement dialogue generation method
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN117151220A (en) * 2023-10-27 2023-12-01 北京长河数智科技有限责任公司 Industry knowledge base system and method based on entity link and relation extraction
CN117151220B (en) * 2023-10-27 2024-02-02 北京长河数智科技有限责任公司 Entity link and relationship based extraction industry knowledge base system and method
CN117743568A (en) * 2024-02-19 2024-03-22 中国电子科技集团公司第十五研究所 Content generation method and system based on fusion of resource flow and confidence
CN117743568B (en) * 2024-02-19 2024-04-26 中国电子科技集团公司第十五研究所 Content generation method and system based on fusion of resource flow and confidence

Also Published As

Publication number Publication date
CN114943230B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN114943230A (en) Chinese specific field entity linking method fusing common knowledge
Zhu et al. Knowledge-based question answering by tree-to-sequence learning
CN109508459B (en) Method for extracting theme and key information from news
CN111400455A (en) Relation detection method of question-answering system based on knowledge graph
CN112101014B (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
CN113360667B (en) Biomedical trigger word detection and named entity identification method based on multi-task learning
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN111274829A (en) Sequence labeling method using cross-language information
CN113723103A (en) Chinese medical named entity and part-of-speech combined learning method integrating multi-source knowledge
CN113095074A (en) Word segmentation method and system for Chinese electronic medical record
CN114580639A (en) Knowledge graph construction method based on automatic extraction and alignment of government affair triples
CN115019906A (en) Multi-task sequence labeled drug entity and interaction combined extraction method
CN113312922A (en) Improved chapter-level triple information extraction method
CN112800184A (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN111444720A (en) Named entity recognition method for English text
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN114880994B (en) Text style conversion method and device from direct white text to irony text
Fei et al. GFMRC: A machine reading comprehension model for named entity recognition
CN116306653A (en) Regularized domain knowledge-aided named entity recognition method
CN112784576B (en) Text dependency syntactic analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant