CN114943230A - Chinese specific field entity linking method fusing common knowledge - Google Patents
Chinese specific field entity linking method fusing common knowledge
- Publication number: CN114943230A
- Application number: CN202210400706.1A
- Authority
- CN
- China
- Prior art keywords: entity, layer, sequence, text, representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/295—Named entity recognition
- G06F40/242—Dictionaries
- G06F40/30—Semantic analysis
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a Chinese specific-field entity linking method that fuses common-sense knowledge. The method first acquires and preprocesses common-sense knowledge, then constructs and completes an encyclopedia corpus knowledge base for a specified field, then recognizes named entities with a BERT-BiGRU-CRF model and a bidirectional matching strategy, and finally performs entity linking based on knowledge representation learning. The invention effectively solves the problems of entity boundary recognition errors and incomplete entity recognition, and greatly improves the accuracy of both the named entity recognition task and the entity linking task.
Description
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a Chinese specific field entity linking method.
Background
In recent years, cross-domain entity linking has been studied extensively both in China and abroad. The most common approach today is an end-to-end method consisting of two steps: named entity recognition (NER) and linking. Improving the accuracy of the former strongly affects the linking accuracy of the latter. Traditional NER research has focused mainly on general named entities such as person names, place names, organization names, and time expressions, while entity recognition in specific domains remains under-studied. Moreover, because of peculiarities of Chinese itself, such as word ambiguity, polyphonic characters, and inaccurate or incomplete Chinese-English conversion, accuracy on general Chinese named entity recognition is typically about 10% lower than on English named entity recognition. Named entity recognition for Chinese domain-specific knowledge is therefore a major challenge in this field.
The most common model in current NER tasks is the BERT-CRF model, an end-to-end deep learning method that requires no manual feature engineering. However, this method uses only the information in the short text, realizes only a one-way matching link from the short text to the knowledge base, and ignores the entity information in the knowledge base. Problems such as wrong entity boundary recognition and incomplete entity recognition within sentences also remain.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a Chinese specific-field entity linking method fusing common-sense knowledge. The method first acquires and preprocesses common-sense knowledge, then constructs and completes an encyclopedia corpus knowledge base for a specified field, then recognizes named entities with a BERT-BiGRU-CRF model and a bidirectional matching strategy, and finally performs entity linking based on knowledge representation learning. The invention effectively solves the problems of entity boundary recognition errors and incomplete entity recognition, and greatly improves the accuracy of both the named entity recognition task and the entity linking task.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: construct a common-sense knowledge corpus of the specified field: crawl documents in the specified fields, including psychology and sociology; extract the text of the abstract and summary parts of each document; perform sentence segmentation, punctuation removal, and stop-word removal on the extracted text; and take each processed text field text, the entity mentions mention_data in the text, and the entity's number kb_id in the encyclopedia knowledge base as a training sample, yielding the common-sense knowledge corpus of the specified field. The mention_data field of a text contains the entity mentions to be linked after recognition;
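As an illustration of Step 1, the preprocessing and training-sample construction might be sketched as follows in Python; the stop-word list, punctuation set, and example sentence are placeholders, and only the field names text, mention_data, and kb_id come from the patent:

```python
import re

STOP_WORDS = {"的", "了", "在"}  # illustrative stop-word list, not the patent's

def preprocess(raw_text):
    """Sentence segmentation, punctuation removal, stop-word removal (Step 1 sketch)."""
    sentences = re.split(r"[。！？]", raw_text)                 # sentence segmentation
    cleaned = []
    for s in sentences:
        s = re.sub(r"[，、“”‘’：；（）,.!?;:()\"']", "", s)     # remove punctuation
        tokens = [ch for ch in s if ch not in STOP_WORDS]       # remove stop words
        if tokens:
            cleaned.append("".join(tokens))
    return cleaned

def make_sample(text, mentions):
    """One training sample: text, mention_data, kb_id (field names from the patent)."""
    return {
        "text": text,
        "mention_data": [{"mention": m, "kb_id": kb_id} for m, kb_id in mentions],
    }

sample = make_sample("从众心理是一种社会心理现象", [("从众心理", "KB0001")])
```

A crawler would produce one such sample per processed document text.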
Step 2: construct and complete the encyclopedia knowledge base: first, from the encyclopedia knowledge related to social network user behaviors, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then correct and complete the encyclopedia knowledge graph, specifically:
(1) convert proper entity names consisting of uppercase English letters to lowercase;
(2) convert special symbols, including quotation marks, commas, and periods, to their English (half-width) equivalents, and add the converted names to the alias list of the corresponding entity;
(3) crawl term entries for technical nouns in the specified fields, including psychology and sociology, convert the data format into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
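The normalization and triple-construction operations of Step 2 could be sketched as follows; the full-width punctuation map and the example infobox are illustrative assumptions, not the patent's actual data:

```python
def normalize_name(name):
    """Lower-case all-uppercase English names; map full-width punctuation to ASCII."""
    fullwidth = {"“": '"', "”": '"', "，": ",", "。": "."}     # assumed mapping
    out = "".join(fullwidth.get(ch, ch) for ch in name)
    if out.isascii() and out.isupper():
        out = out.lower()           # aliases get the lower-cased form
    return out

def entry_to_triples(entity, infobox):
    """Convert one encyclopedia entry's predicate/object pairs into (h, r, t) triples."""
    return [(entity, predicate, obj) for predicate, obj in infobox.items()]

triples = entry_to_triples("从众心理", {"类型": "社会心理现象", "提出者": "所罗门·阿希"})
alias = normalize_name("NASA")      # lower-cased form added to the entity's alias list
```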
Step 3: reconstruct entity description texts: concatenate all predicates and objects in the encyclopedia knowledge base to obtain an entity description text; if the length of the description text exceeds d, truncate it to length d, where d is a preset length. Five dictionaries are then constructed:
(1) with the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entity_id;
(2) with the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entity;
(3) with the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) with the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) with the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
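A minimal sketch of building the five dictionaries of Step 3; the input record format (name/description/type keys) is an assumption made for illustration, while the dictionary names follow the patent:

```python
def build_dictionaries(kb_entities):
    """Build the five lookup dictionaries of Step 3 from a list of entity records."""
    entity_id, id_entity, id_text, id_type, type_index = {}, {}, {}, {}, {}
    for idx, e in enumerate(kb_entities):
        entity_id[e["name"]] = idx           # entity name -> index
        id_entity[idx] = e["name"]           # index -> entity name
        id_text[idx] = e["description"]      # index -> description text
        id_type[idx] = e["type"]             # index -> entity category
        type_index.setdefault(e["type"], []).append(idx)   # category -> indices
    return entity_id, id_entity, id_text, id_type, type_index

kb = [{"name": "从众心理", "description": "一种社会心理现象", "type": "心理学"}]
entity_id, id_entity, id_text, id_type, type_index = build_dictionaries(kb)
```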
Step 4: construct the Chinese named entity recognition model BERT-BiGRUs-CRF, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent neural network (BiGRUs) layer, and a conditional random field (CRF) layer;
Step 4-1: the deep bidirectional pre-trained language model (BERT) layer consists of an embedding layer, an encoder, and a pooling layer; a text from the common-sense knowledge corpus is input, and after passing through the BERT layer, context-aware word vectors are generated;
Step 4-2: the bidirectional gated recurrent neural network (BiGRUs) layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer. The word vectors output by the BERT layer are fed into the forward and reverse GRU networks respectively, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ corresponding to each entity mention; the two vectors are concatenated to obtain $H_{con}$. A max-pooling operation in the pooling layer then produces the global semantic information $H_{max}$ of the words in the text, which is input to the conditional random field (CRF) layer to produce the sequence labeling result. The hidden state $h_t$ of the BiGRUs layer at time $t$ is computed as:

$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

$$\overrightarrow{h_t} = \mathrm{GRU}(x_t, \overrightarrow{h_{t-1}}) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}(x_t, \overleftarrow{h_{t-1}}) \tag{3}$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden states of the forward and reverse GRU networks at time $t$, $w_t$ and $v_t$ are the weights of the forward and reverse hidden states at time $t$, and $b_t$ is the corresponding bias; $\mathrm{GRU}(\cdot)$ denotes the nonlinear transformation that encodes an input word vector into the GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and reverse GRU networks at time $t-1$;
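The BiGRU computation of Step 4-2 can be sketched with a toy NumPy implementation of the standard GRU cell; the random weights and dimensions are placeholders, and the code is purely illustrative, not the patented model:

```python
import numpy as np

def gru_step(x, h_prev, W, U, b):
    """One standard GRU step: update gate, reset gate, candidate state."""
    z = 1 / (1 + np.exp(-(W[0] @ x + U[0] @ h_prev + b[0])))   # update gate
    r = 1 / (1 + np.exp(-(W[1] @ x + U[1] @ h_prev + b[1])))   # reset gate
    h_tilde = np.tanh(W[2] @ x + U[2] @ (r * h_prev) + b[2])   # candidate state
    return (1 - z) * h_prev + z * h_tilde

def bigru(xs, params_fwd, params_bwd, hidden):
    """Forward and reverse GRU passes, concatenation (H_con), then max pooling (H_max)."""
    hf, hb = np.zeros(hidden), np.zeros(hidden)
    fwd, bwd = [], []
    for x in xs:                       # forward pass
        hf = gru_step(x, hf, *params_fwd)
        fwd.append(hf)
    for x in reversed(xs):             # backward pass
        hb = gru_step(x, hb, *params_bwd)
        bwd.append(hb)
    H_con = np.stack([np.concatenate([f, b])
                      for f, b in zip(fwd, reversed(bwd))])
    H_max = H_con.max(axis=0)          # global max pooling over positions
    return H_con, H_max

rng = np.random.default_rng(0)
d, hidden, T = 4, 3, 5                 # toy dimensions
params_fwd = (rng.normal(size=(3, hidden, d)),
              rng.normal(size=(3, hidden, hidden)),
              rng.normal(size=(3, hidden)))
params_bwd = (rng.normal(size=(3, hidden, d)),
              rng.normal(size=(3, hidden, hidden)),
              rng.normal(size=(3, hidden)))
xs = [rng.normal(size=d) for _ in range(T)]
H_con, H_max = bigru(xs, params_fwd, params_bwd, hidden)
```

In practice this layer would be a framework primitive (e.g. a bidirectional GRU module) rather than hand-rolled.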
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text, computed as follows:

First, the prediction score $s$ of a predicted sequence $Y$ for an input sequence $X$ is

$$s(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e., the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ is the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence and $y_i$ is the predicted label of the $i$-th word; $s(X, Y)$ is the prediction score of $Y$ for $X$; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; and $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.

The probability of generating the predicted sequence $Y$ is then

$$p(Y \mid X) = \frac{\exp(s(X, Y))}{\sum_{\tilde{Y} \in Y_X} \exp(s(X, \tilde{Y}))} \tag{5}$$

where $\tilde{Y}$ is a candidate labeling sequence, $Y_X$ is the set of all possible labeling sequences, and $s(X, \tilde{Y})$ is the prediction score of $\tilde{Y}$ for the input sequence $X$.

Taking the logarithm of both sides of equation (5) gives the log-likelihood of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln \sum_{\tilde{Y} \in Y_X} \exp(s(X, \tilde{Y})) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained from equation (7):

$$Y^* = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y}) \tag{7}$$
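The CRF scoring and decoding described in Step 4-3 can be checked with a small brute-force sketch; a real implementation would use Viterbi decoding and the forward algorithm for the partition sum, so this is only a verification of equations (4), (5), and (7) on tiny inputs:

```python
import itertools
import numpy as np

def crf_score(P, A, y):
    """s(X, Y): sum of emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}]."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def crf_decode(P, A):
    """Y*: the label sequence with the highest score (brute force; Viterbi in practice)."""
    n, k = P.shape
    return max(itertools.product(range(k), repeat=n),
               key=lambda y: crf_score(P, A, y))

def crf_prob(P, A, y):
    """p(Y|X) = exp(s(X, Y)) / sum over all candidate sequences of exp(s(X, Y~))."""
    n, k = P.shape
    z = sum(np.exp(crf_score(P, A, yy))
            for yy in itertools.product(range(k), repeat=n))
    return np.exp(crf_score(P, A, y)) / z
```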
Step 5: train the named entity recognition model: randomly split the training samples of the specified-field common-sense knowledge corpus obtained in Step 1 into 9 folds, input them into the BERT-BiGRUs-CRF model built in Step 4, and train the model with 9-fold cross-validation to obtain the trained BERT-BiGRUs-CRF model;
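The 9-fold split of Step 5 might be sketched as follows (the seed and fold construction are illustrative choices):

```python
import random

def kfold_indices(n_samples, k=9, seed=0):
    """Randomly split sample indices into k folds for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)           # random division of the data set
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]                         # one fold held out for validation
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, val))
    return splits

splits = kfold_indices(90, k=9)                # 90 toy samples, 9 folds
```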
Step 6: named entity recognition on common-sense text: process the texts of the common-sense knowledge corpus from Step 1 with the trained BERT-BiGRUs-CRF model to obtain the label sequence of each text;
Step 7: bidirectional matching between mentions in the common-sense knowledge corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain an entity vector representation; concatenate it with the text label sequence obtained in Step 6; and finally output the named entity recognition result through one convolutional neural network layer and an activation function. The output is a one-dimensional 0/1 vector, where 0 means not recognized and 1 means successfully recognized;
Step 8: train the entity linking model: in the training samples, select some correctly linked entities as positive examples and the remaining incorrect candidate entities as negative examples;
The entity linking model concatenates the span vector corresponding to each '1' output in Step 7 with the entity description text of the matched candidate entity in the encyclopedia knowledge base, and inputs the vector into the BERT layer of the deep bidirectional pre-trained language model to obtain a vector representation; this passes through a fully connected layer and an activation function to yield a probability score for each candidate entity, and the candidate with the highest probability is selected for linking; finally, a link-establishment sequence is output, whose value is 1 for a correct link and 0 otherwise;
Step 9: concatenate the recognized mention vectors of texts in the common-sense knowledge corpus with the description text vectors of the entities to be linked in the encyclopedia knowledge base, input them into the trained entity linking model, output the sequence indicating whether each link is finally established, and keep the concatenated vectors whose value is 1; this yields the domain knowledge base with entity linking completed.
The invention has the following beneficial effects:
the BERT-BiGRU-CRF algorithm and the bidirectional matching strategy adopted in the invention are a method which utilizes context information in a short text and entity description information in a knowledge base, so that a two-phase matching process is realized, the problems of entity boundary identification error and entity identification completion can be effectively solved, and the accuracy of a named entity identification task and an entity link task is greatly improved.
Drawings
FIG. 1 is a diagram of a process framework of the present invention;
fig. 2 is a diagram of an algorithmic network structure for the named entity recognition matching process of the present invention.
FIG. 3 is a diagram of the process of entity representation learning by the encyclopedic knowledge base of the present invention.
FIG. 4 is a model diagram of entity linking after successful bidirectional entity matching according to the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in FIG. 1, a Chinese specific-field entity linking method fusing common-sense knowledge comprises the following steps:
Step 1: construct a common-sense knowledge corpus of the specified field: crawl documents in the specified fields, including psychology and sociology; extract the text of the abstract and summary parts of each document; perform sentence segmentation, punctuation removal, and stop-word removal on the extracted text; and take each processed text field text, the entity mentions mention_data in the text, and the entity's number kb_id in the encyclopedia knowledge base as a training sample, yielding the common-sense knowledge corpus of the specified field. The mention_data field of a text contains the entity mentions to be linked after recognition;
Step 2: construct and complete the encyclopedia knowledge base: first, from the encyclopedia knowledge related to social network user behaviors, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then correct and complete the encyclopedia knowledge graph, specifically:
(1) convert proper entity names consisting of uppercase English letters to lowercase;
(2) convert special symbols, including quotation marks, commas, and periods, to their English (half-width) equivalents, and add the converted names to the alias list of the corresponding entity;
(3) crawl term entries for technical nouns in the specified fields, including psychology and sociology, convert the data format into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: reconstruct entity description texts: concatenate all predicates and objects in the encyclopedia knowledge base to obtain an entity description text; if the length of the description text exceeds d, truncate it to length d, where d is a preset length. Five dictionaries are then constructed:
(1) with the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entity_id;
(2) with the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entity;
(3) with the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) with the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) with the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
Step 4: construct the Chinese named entity recognition model BERT-BiGRUs-CRF, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent neural network (BiGRUs) layer, and a conditional random field (CRF) layer;
Step 4-1: the deep bidirectional pre-trained language model (BERT) layer consists of an embedding layer, an encoder, and a pooling layer; a text from the common-sense knowledge corpus is input, and after passing through the BERT layer, context-aware word vectors are generated;
Step 4-2: the bidirectional gated recurrent neural network (BiGRUs) layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer. The word vectors output by the BERT layer are fed into the forward and reverse GRU networks respectively, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ corresponding to each entity mention; the two vectors are concatenated to obtain $H_{con}$. A max-pooling operation in the pooling layer then produces the global semantic information $H_{max}$ of the words in the text, which is input to the conditional random field (CRF) layer to produce the sequence labeling result. The hidden state $h_t$ of the BiGRUs layer at time $t$ is computed as:

$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

$$\overrightarrow{h_t} = \mathrm{GRU}(x_t, \overrightarrow{h_{t-1}}) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}(x_t, \overleftarrow{h_{t-1}}) \tag{3}$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden states of the forward and reverse GRU networks at time $t$, $w_t$ and $v_t$ are the weights of the forward and reverse hidden states at time $t$, and $b_t$ is the corresponding bias; $\mathrm{GRU}(\cdot)$ denotes the nonlinear transformation that encodes an input word vector into the GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and reverse GRU networks at time $t-1$;
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text, computed as follows:

First, the prediction score $s$ of a predicted sequence $Y$ for an input sequence $X$ is

$$s(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e., the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ is the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence and $y_i$ is the predicted label of the $i$-th word; $s(X, Y)$ is the prediction score of $Y$ for $X$; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; and $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.

The probability of generating the predicted sequence $Y$ is then

$$p(Y \mid X) = \frac{\exp(s(X, Y))}{\sum_{\tilde{Y} \in Y_X} \exp(s(X, \tilde{Y}))} \tag{5}$$

where $\tilde{Y}$ is a candidate labeling sequence, $Y_X$ is the set of all possible labeling sequences, and $s(X, \tilde{Y})$ is the prediction score of $\tilde{Y}$ for the input sequence $X$.

Taking the logarithm of both sides of equation (5) gives the log-likelihood of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln \sum_{\tilde{Y} \in Y_X} \exp(s(X, \tilde{Y})) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained from equation (7):

$$Y^* = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y}) \tag{7}$$
Step 5: train the named entity recognition model (as shown in FIG. 2): randomly split the training samples of the specified-field common-sense knowledge corpus obtained in Step 1 into 9 folds, input them into the BERT-BiGRUs-CRF model built in Step 4, and train the model with 9-fold cross-validation to obtain the trained BERT-BiGRUs-CRF model;
Step 6: named entity recognition on common-sense text (as shown in FIG. 3): process the texts of the common-sense knowledge corpus from Step 1 with the trained BERT-BiGRUs-CRF model to obtain the label sequence of each text. The entity mentions in the text are labeled with the BIO scheme, i.e., {B (Begin), I (Inside), O (Outside)}; the consecutive segments labeled 'B' and 'I' are the recognized mention spans;
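Extracting mention spans from the BIO labels described above could look like this minimal sketch; the example sentence is illustrative:

```python
def bio_spans(tokens, tags):
    """Extract mention spans from a BIO label sequence: 'B' starts a span, 'I' extends it."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append("".join(tokens[start:i]))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append("".join(tokens[start:i]))
            start = None
        # tag == "I": continue the current span
    if start is not None:
        spans.append("".join(tokens[start:]))
    return spans

mentions = bio_spans(list("从众心理很常见"), ["B", "I", "I", "I", "O", "O", "O"])
# mentions == ["从众心理"]
```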
Step 7: bidirectional matching between mentions in the common-sense knowledge corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain an entity vector representation; concatenate it with the text label sequence obtained in Step 6; and finally output the named entity recognition result through one convolutional neural network layer and an activation function. The output is a one-dimensional 0/1 vector, where 0 means not recognized and 1 means successfully recognized;
Step 8: train the entity linking model: in the training samples, select some correctly linked entities as positive examples and the remaining incorrect candidate entities as negative examples;
The entity linking model concatenates the span vector corresponding to each '1' output in Step 7 with the entity description text of the matched candidate entity in the encyclopedia knowledge base, and inputs the vector into the BERT layer of the deep bidirectional pre-trained language model to obtain a vector representation; this passes through a fully connected layer and an activation function to yield a probability score for each candidate entity, and the candidate with the highest probability is selected for linking; finally, a link-establishment sequence is output, whose value is 1 for a correct link and 0 otherwise, as shown in FIG. 4;
Step 9: concatenate the recognized mention vectors of texts in the common-sense knowledge corpus with the description text vectors of the entities to be linked in the encyclopedia knowledge base, input them into the trained entity linking model, output the sequence indicating whether each link is finally established, and keep the concatenated vectors whose value is 1; this yields the domain knowledge base with entity linking completed.
Claims (1)
1. A Chinese specific field entity linking method fusing common knowledge is characterized by comprising the following steps:
Step 1: construct a common-sense knowledge corpus of the specified field: crawl documents in the specified fields, including psychology and sociology; extract the text of the abstract and summary parts of each document; perform sentence segmentation, punctuation removal, and stop-word removal on the extracted text; and take each processed text field text, the entity mentions mention_data in the text, and the entity's number kb_id in the encyclopedia knowledge base as a training sample, yielding the common-sense knowledge corpus of the specified field. The mention_data field of a text contains the entity mentions to be linked after recognition;
Step 2: construct and complete the encyclopedia knowledge base: first, from the encyclopedia knowledge related to social network user behaviors, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then correct and complete the encyclopedia knowledge graph, specifically:
(1) convert entity names consisting of uppercase English letters to lowercase;
(2) convert special symbols, including quotation marks, commas, and periods, to their English (half-width) equivalents, and add the converted names to the alias list of the corresponding entity;
(3) crawl term entries for technical nouns in the specified fields, including psychology and sociology, convert the data format into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: reconstruct entity description texts: concatenate all predicates and objects in the encyclopedia knowledge base to obtain an entity description text; if the length of the description text exceeds d, truncate it to length d, where d is a preset length. Five dictionaries are then constructed:
(1) take the entity name in the encyclopedia knowledge base as the primary key to construct the entity-index dictionary entry_id;
(2) take the index of an entity in the encyclopedia knowledge base as the primary key to construct the index-entity dictionary id_entry;
(3) take the index of an entity description text in the encyclopedia knowledge base as the primary key to construct the entity description text dictionary id_text;
(4) take the index of an entity in the encyclopedia knowledge base as the primary key to construct the entity type dictionary id_type;
(5) take the index of an entity category in the encyclopedia knowledge base as the primary key to construct the category dictionary type_index;
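A minimal sketch of Step 3, assuming a toy record format for the knowledge base (the real base stores encyclopedia entries, and the exact key direction of `type_index` is an interpretation of the text):

```python
def build_indexes(kb_records, d=64):
    """kb_records: list of dicts with keys 'name', 'triples' [(r, t), ...]
    and 'type' (an assumed format). Returns the five Step-3 dictionaries."""
    entry_id, id_entry, id_text, id_type, type_index = {}, {}, {}, {}, {}
    for idx, rec in enumerate(kb_records):
        # Concatenate all predicate-object pairs into the description text.
        desc = "，".join(f"{r}：{t}" for r, t in rec["triples"])
        if len(desc) > d:                       # truncate over-long descriptions
            desc = desc[:d]
        entry_id[rec["name"]] = idx             # (1) entity name -> index
        id_entry[idx] = rec["name"]             # (2) index -> entity name
        id_text[idx] = desc                     # (3) index -> description text
        id_type[idx] = rec["type"]              # (4) index -> entity type
        if rec["type"] not in type_index.values():
            type_index[len(type_index)] = rec["type"]  # (5) category index -> category
    return entry_id, id_entry, id_text, id_type, type_index
```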
Step 4: construct the Chinese named entity recognition BERT-BiGRUs-CRF model, comprising an input layer, a deep bidirectional pre-training language model BERT layer, a bidirectional gated recurrent neural network BiGRUs layer, and a conditional random field CRF layer;
Step 4-1: the deep bidirectional pre-training language model BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and after passing through the BERT layer, word vectors based on context information are generated;
Step 4-2: the bidirectional gated recurrent neural network BiGRUs layer comprises 2 gated recurrent unit (GRU) networks running in opposite directions and 1 global pooling layer. The word vectors output by the BERT layer are input into the forward GRU network and the reverse GRU network respectively, obtaining the forward semantic information vector h_t^→ and the backward semantic information vector h_t^← corresponding to the entity mention; the two vectors are spliced to obtain H_con. A max-pooling operation is then performed in the pooling layer to obtain the global semantic information H_max of the words in the text, which is input to the conditional random field CRF layer to output the sequence labeling result. The hidden state h_t of the BiGRUs layer at time t is calculated as:

h_t = w_t · h_t^→ + v_t · h_t^← + b_t
where h_t^→ denotes the hidden state of the forward GRU network at time t, h_t^← denotes the hidden state of the reverse GRU network at time t, w_t denotes the weight of the forward hidden state at time t, v_t denotes the weight of the reverse hidden state at time t, and b_t denotes the bias corresponding to the hidden state at time t. The two directional hidden states are obtained through the GRU update:

h_t^→ = GRU(x_t, h_{t-1}^→),  h_t^← = GRU(x_t, h_{t-1}^←)

where GRU(·) denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state; x_t denotes the currently input word vector; h_{t-1}^→ denotes the hidden state of the forward GRU network at time t-1, and h_{t-1}^← denotes the hidden state of the reverse GRU network at time t-1;
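The GRU update and the weighted combination of forward and reverse hidden states can be illustrated with a toy numpy sketch. The weights here are random placeholders; a real model learns them during training.

```python
import numpy as np

def gru_step(x, h_prev, Wz, Wr, Wh, Uz, Ur, Uh):
    """One gated-recurrent-unit update encoding word vector x and the
    previous hidden state into a new hidden state."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev)) # candidate state
    return (1 - z) * h_prev + z * h_tilde

def bigru_hidden(h_fwd, h_bwd, w_t, v_t, b_t):
    """Weighted combination h_t = w_t*h_fwd + v_t*h_bwd + b_t of the
    forward and reverse hidden states at time t."""
    return w_t * h_fwd + v_t * h_bwd + b_t
```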
Step 4-3: the conditional random field CRF layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the calculation proceeds as follows:
First, the prediction score s(X, Y) of a prediction sequence Y for an input sequence X is calculated according to formula (4):

s(X, Y) = Σ_{i=1}^{n} P_{i,y_i} + Σ_{i=0}^{n} A_{y_i,y_{i+1}}    (4)

where X = (x_1, x_2, …, x_n) denotes the sequence of word vectors input to the CRF layer, i.e. the global semantic information H_max; x_i denotes the i-th input word vector and n the total number of input word vectors; Y = (y_1, y_2, …, y_n) denotes a prediction sequence and y_i the predicted labeling result of the i-th word; s(X, Y) denotes the prediction score of the prediction sequence Y for the input sequence X; P_{i,y_i} denotes the score of the i-th word being labeled y_i; A denotes the transition score matrix, and A_{y_i,y_{i+1}} denotes the score of transferring from label y_i to label y_{i+1};
Then the probability p(Y|X) of generating the prediction sequence Y is calculated according to formula (5):

p(Y|X) = exp(s(X, Y)) / Σ_{Ỹ ∈ Y_X} exp(s(X, Ỹ))    (5)

where Ỹ denotes a candidate annotation sequence, Y_X denotes the set of all possible annotation sequences, and s(X, Ỹ) denotes the prediction score of the candidate annotation sequence Ỹ for the input sequence X;
Taking the logarithm of both sides of formula (5) yields the log-likelihood ln(p(Y|X)) of the prediction sequence Y:

ln(p(Y|X)) = s(X, Y) − ln( Σ_{Ỹ ∈ Y_X} exp(s(X, Ỹ)) )    (6)
Finally, the output sequence Y* with the highest prediction score is calculated according to formula (7):

Y* = argmax_{Ỹ ∈ Y_X} s(X, Ỹ)    (7)
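Formulas (4) and (7) can be illustrated with a small numpy sketch. Exhaustive search over label sequences is used here for clarity; practical CRF layers decode with the Viterbi algorithm.

```python
import numpy as np
from itertools import product

def score(P, A, y):
    """s(X, Y): sum of emission scores P[i, y_i] plus transition
    scores A[y_i, y_{i+1}] along the label sequence y."""
    s = sum(P[i, y[i]] for i in range(len(y)))
    s += sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s

def best_sequence(P, A):
    """Y* = argmax over all candidate label sequences of s(X, Y)."""
    n, k = P.shape
    return max(product(range(k), repeat=n), key=lambda y: score(P, A, y))
```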
Step 5: train the named entity recognition model: randomly divide the training samples of the specified-field common-sense knowledge corpus obtained in step 1 into 9 subsets, input them into the BERT-BiGRUs-CRF model constructed in step 4, and train the model by 9-fold cross-validation to obtain a trained BERT-BiGRUs-CRF model;
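The 9-fold split of Step 5 can be sketched as follows (the shuffling seed is an illustrative assumption): the corpus is shuffled and divided into 9 folds, and each round trains on 8 folds and validates on the remaining one.

```python
import random

def kfold_indices(n_samples, k=9, seed=42):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # random division of the corpus
    folds = [idx[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val
```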
Step 6: named entity recognition on the common-sense text: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRUs-CRF model to obtain the labeling sequence of each text;
Step 7: bidirectional matching between mentions in the common-sense knowledge corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-training language model to obtain the entity vector representation, splice it with the text labeling sequence obtained in step 6, and finally output the named entity recognition result through one convolutional neural network layer and an activation function; the output is a one-dimensional 0/1 vector, where 0 indicates no match and 1 indicates a successful match;
Step 8: train the entity linking model: among the training samples used for model training, select some correctly linked entities as positive examples and use the remaining incorrectly linked candidate entities as negative examples.
The entity linking model splices the mention vector corresponding to each '1' output in step 7 with the entity description text of the matched candidate entity in the encyclopedia knowledge base, and inputs the spliced vector into the BERT layer of the deep bidirectional pre-training language model to obtain a vector representation; this representation then passes through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate entity with the highest probability is selected to establish the link; finally, a link-establishment sequence is output, in which a value of 1 indicates a correct link and 0 otherwise;
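The final link decision, selecting the candidate with the highest probability and outputting 1 for a correct link, can be sketched as below. The scoring model itself (BERT plus a fully connected layer) is replaced here by precomputed probabilities, so this only illustrates the selection and labeling logic.

```python
def pick_candidate(candidates):
    """candidates: list of (kb_id, probability score) pairs; the link is
    established with the highest-probability candidate entity."""
    return max(candidates, key=lambda c: c[1])[0]

def link_label(predicted_id, gold_id):
    """Output 1 if the established link is the correct one, 0 otherwise."""
    return 1 if predicted_id == gold_id else 0
```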
Step 9: splice the mention vector recognized in the common-sense corpus text with the description text vector of the entity to be linked in the encyclopedia knowledge base, input the spliced vector into the trained entity linking model, and output the sequence indicating whether a link is finally established; the spliced vectors corresponding to a value of 1 are retained, yielding the domain knowledge base with completed entity links.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210400706.1A CN114943230B (en) | 2022-04-17 | 2022-04-17 | Method for linking entities in Chinese specific field by fusing common sense knowledge |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210400706.1A CN114943230B (en) | 2022-04-17 | 2022-04-17 | Method for linking entities in Chinese specific field by fusing common sense knowledge |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114943230A true CN114943230A (en) | 2022-08-26 |
CN114943230B CN114943230B (en) | 2024-02-20 |
Family
ID=82908036
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210400706.1A Active CN114943230B (en) | 2022-04-17 | 2022-04-17 | Method for linking entities in Chinese specific field by fusing common sense knowledge |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114943230B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115422369A (en) * | 2022-08-30 | 2022-12-02 | 中国人民解放军国防科技大学 | Knowledge graph completion method and device based on improved TextRank |
CN115796280A (en) * | 2023-01-31 | 2023-03-14 | 南京万得资讯科技有限公司 | Entity identification entity linking system suitable for high efficiency and controllability in financial field |
CN116010583A (en) * | 2023-03-17 | 2023-04-25 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Cascade coupling knowledge enhancement dialogue generation method |
CN116451690A (en) * | 2023-03-21 | 2023-07-18 | 麦博(上海)健康科技有限公司 | Medical field named entity identification method |
CN117151220A (en) * | 2023-10-27 | 2023-12-01 | 北京长河数智科技有限责任公司 | Industry knowledge base system and method based on entity link and relation extraction |
CN117743568A (en) * | 2024-02-19 | 2024-03-22 | 中国电子科技集团公司第十五研究所 | Content generation method and system based on fusion of resource flow and confidence |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112100356A (en) * | 2020-09-17 | 2020-12-18 | 武汉纺织大学 | Knowledge base question-answer entity linking method and system based on similarity |
AU2020103654A4 (en) * | 2019-10-28 | 2021-01-14 | Nanjing Normal University | Method for intelligent construction of place name annotated corpus based on interactive and iterative learning |
CN113779992A (en) * | 2021-07-19 | 2021-12-10 | 西安理工大学 | Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2020103654A4 (en) * | 2019-10-28 | 2021-01-14 | Nanjing Normal University | Method for intelligent construction of place name annotated corpus based on interactive and iterative learning |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
CN112100356A (en) * | 2020-09-17 | 2020-12-18 | 武汉纺织大学 | Knowledge base question-answer entity linking method and system based on similarity |
CN113779992A (en) * | 2021-07-19 | 2021-12-10 | 西安理工大学 | Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training |
Non-Patent Citations (1)
Title |
---|
Zhang Xiao; Li Yegang; Wang Dong; Shi Shumin: "Named Entity Recognition Based on ERNIE", Intelligent Computer and Applications, no. 03, 1 March 2020 (2020-03-01) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115422369A (en) * | 2022-08-30 | 2022-12-02 | 中国人民解放军国防科技大学 | Knowledge graph completion method and device based on improved TextRank |
CN115422369B (en) * | 2022-08-30 | 2023-11-03 | 中国人民解放军国防科技大学 | Knowledge graph completion method and device based on improved TextRank |
CN115796280A (en) * | 2023-01-31 | 2023-03-14 | 南京万得资讯科技有限公司 | Entity identification entity linking system suitable for high efficiency and controllability in financial field |
CN116010583A (en) * | 2023-03-17 | 2023-04-25 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Cascade coupling knowledge enhancement dialogue generation method |
CN116010583B (en) * | 2023-03-17 | 2023-07-18 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Cascade coupling knowledge enhancement dialogue generation method |
CN116451690A (en) * | 2023-03-21 | 2023-07-18 | 麦博(上海)健康科技有限公司 | Medical field named entity identification method |
CN117151220A (en) * | 2023-10-27 | 2023-12-01 | 北京长河数智科技有限责任公司 | Industry knowledge base system and method based on entity link and relation extraction |
CN117151220B (en) * | 2023-10-27 | 2024-02-02 | 北京长河数智科技有限责任公司 | Entity link and relationship based extraction industry knowledge base system and method |
CN117743568A (en) * | 2024-02-19 | 2024-03-22 | 中国电子科技集团公司第十五研究所 | Content generation method and system based on fusion of resource flow and confidence |
CN117743568B (en) * | 2024-02-19 | 2024-04-26 | 中国电子科技集团公司第十五研究所 | Content generation method and system based on fusion of resource flow and confidence |
Also Published As
Publication number | Publication date |
---|---|
CN114943230B (en) | 2024-02-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11501182B2 (en) | Method and apparatus for generating model | |
CN110110054B (en) | Method for acquiring question-answer pairs from unstructured text based on deep learning | |
CN114943230A (en) | Chinese specific field entity linking method fusing common knowledge | |
Zhu et al. | Knowledge-based question answering by tree-to-sequence learning | |
CN109508459B (en) | Method for extracting theme and key information from news | |
CN111400455A (en) | Relation detection method of question-answering system based on knowledge graph | |
CN112101014B (en) | Chinese chemical industry document word segmentation method based on mixed feature fusion | |
CN111143574A (en) | Query and visualization system construction method based on minority culture knowledge graph | |
CN113360667B (en) | Biomedical trigger word detection and named entity identification method based on multi-task learning | |
CN113268576B (en) | Deep learning-based department semantic information extraction method and device | |
CN111274829A (en) | Sequence labeling method using cross-language information | |
CN113723103A (en) | Chinese medical named entity and part-of-speech combined learning method integrating multi-source knowledge | |
CN113095074A (en) | Word segmentation method and system for Chinese electronic medical record | |
CN114580639A (en) | Knowledge graph construction method based on automatic extraction and alignment of government affair triples | |
CN115019906A (en) | Multi-task sequence labeled drug entity and interaction combined extraction method | |
CN113312922A (en) | Improved chapter-level triple information extraction method | |
CN112800184A (en) | Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
CN111444720A (en) | Named entity recognition method for English text | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN112507717A (en) | Medical field entity classification method fusing entity keyword features | |
CN114880994B (en) | Text style conversion method and device from direct white text to irony text | |
Fei et al. | GFMRC: A machine reading comprehension model for named entity recognition | |
CN116306653A (en) | Regularized domain knowledge-aided named entity recognition method | |
CN112784576B (en) | Text dependency syntactic analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |