CN114943230A - Chinese specific field entity linking method fusing common knowledge - Google Patents

Chinese specific field entity linking method fusing common knowledge

Info

Publication number
CN114943230A
CN114943230A (application CN202210400706.1A)
Authority
CN
China
Prior art keywords
entity
layer
sequence
text
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210400706.1A
Other languages
Chinese (zh)
Other versions
CN114943230B (en)
Inventor
王柱 (Wang Zhu)
康天雨 (Kang Tianyu)
刘囡囡 (Liu Nannan)
郭斌 (Guo Bin)
於志文 (Yu Zhiwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210400706.1A
Publication of CN114943230A
Application granted
Publication of CN114943230B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese specific-field entity linking method fusing common-sense knowledge. The method first acquires and preprocesses common-sense knowledge, then constructs and completes an encyclopedia corpus knowledge base for a specified domain, then recognizes named entities with a BERT-BiGRU-CRF model combined with a bidirectional matching strategy, and finally carries out entity linking based on knowledge representation learning. The invention effectively addresses entity boundary recognition errors and incomplete entity recognition, and greatly improves the accuracy of the named entity recognition task and the entity linking task.

Description

Chinese specific field entity linking method fusing common knowledge
Technical Field
The invention belongs to the technical field of deep learning, and in particular relates to an entity linking method for specific Chinese domains.
Background
In recent years, cross-domain entity linking has been studied extensively both in China and abroad. The most common approach at present is the end-to-end method, which divides the task into two steps: named entity recognition (NER) and linking. Improving the accuracy of the former strongly influences the linking accuracy of the latter. Judging from the current state of research, traditional NER work focuses mainly on recognizing general named entities such as person names, place names, organization names, and time expressions, while research on recognizing entities in specific domains is lacking. In addition, owing to peculiarities of the Chinese language itself, such as ambiguity, polyphonic characters, and inexact Chinese-English conversion, accuracy on general Chinese named entity recognition is usually about 10% lower than on English named entity recognition. Named entity recognition for Chinese domain-specific knowledge is therefore a major challenge in this field.
The most common model in the NER task at present is BERT-CRF, an end-to-end deep learning method that requires no manual feature engineering. However, this method uses only the information in the short text: it can realize only a one-way match from the short text to the knowledge base and does not exploit the entity information stored in the knowledge base. Problems such as incorrect entity boundary recognition and incomplete recognition of entities in sentences also remain.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a Chinese specific-field entity linking method fusing common-sense knowledge. The method first acquires and preprocesses common-sense knowledge, then constructs and completes an encyclopedia corpus knowledge base for a specified domain, then recognizes named entities with a BERT-BiGRU-CRF model combined with a bidirectional matching strategy, and finally carries out entity linking based on knowledge representation learning. The invention effectively addresses entity boundary recognition errors and incomplete entity recognition, and greatly improves the accuracy of the named entity recognition task and the entity linking task.
The technical solution adopted by the invention comprises the following steps:
Step 1: construct a common-sense knowledge corpus for the specified domain: crawl documents in the specified fields, including psychology and sociology; extract the texts of the abstract and summary parts; apply sentence segmentation, punctuation removal and stop-word removal to the extracted texts; and take each processed text field text, the entity mentions mention_data in the text, and the number kb_id of each entity in the encyclopedia knowledge base as a training sample, obtaining the common-sense knowledge corpus of the specified domain; mention_data contains the entities to be linked after recognition;
Step 2: construct and complete the encyclopedia knowledge base: first, based on the encyclopedic knowledge related to social-network user behavior in the encyclopedia, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then revise and complete the encyclopedia knowledge graph, specifically:
(1) convert to lower case the proper entity names composed of capital English letters;
(2) convert special symbols, including quotation marks, commas and periods, into their English equivalents, and add the converted names to the alias field of the corresponding entity;
(3) crawl terminology entries for technical terms in the specified fields, including psychology and sociology, convert the data into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: rebuild the entity description texts: concatenate all predicates and objects of each entity in the encyclopedia knowledge base to obtain its entity description text; if the length of a description text exceeds d, truncate it in units of d, where d is a preset length; five dictionaries are constructed:
(1) using the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entry_id;
(2) using the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entry;
(3) using the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) using the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) using the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
Step 4: construct the Chinese named entity recognition BERT-BiGRU-CRF model, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent unit (BiGRU) layer, and a conditional random field (CRF) layer;
Step 4-1: the BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and the BERT layer produces word vectors based on context information;
Step 4-2: the BiGRU layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer; the word vectors output by the BERT layer are fed into the forward and backward GRU networks, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ for each entity mention, which are concatenated to obtain $H_{con}$; a max pooling operation then yields the global semantic information $H_{max}$ of the words in the text, which is passed to the conditional random field (CRF) layer for sequence labeling; the hidden state $h_t$ of the BiGRU layer at time $t$ is computed as follows:
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

where $\overrightarrow{h_t}$ denotes the hidden state of the forward GRU network at time $t$, $\overleftarrow{h_t}$ the hidden state of the backward GRU network at time $t$, $w_t$ the weight of the forward hidden state at time $t$, $v_t$ the weight of the backward hidden state at time $t$, and $b_t$ the bias of the hidden state at time $t$; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are computed as:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overrightarrow{h_{t-1}}\big) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overleftarrow{h_{t-1}}\big) \tag{3}$$

where $\mathrm{GRU}$ denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and backward GRU networks at time $t-1$;
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the computation proceeds as follows:
First, the prediction score $s(X, Y)$ of a predicted sequence $Y$ for an input sequence $X$ is calculated as:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e. the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence, with $y_i$ the predicted label of the $i$-th word; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.
Then the probability $p(Y \mid X)$ of generating the predicted sequence $Y$ is calculated as:

$$p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)} \tag{5}$$

where $\tilde{Y}$ denotes a candidate annotation sequence, $Y_X$ the set of all possible annotation sequences, and $s(X, \tilde{Y})$ the prediction score of annotation sequence $\tilde{Y}$ for the input sequence $X$.
Taking the logarithm of both sides of equation (5) yields the log-likelihood $\ln p(Y \mid X)$ of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln\Big(\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)\Big) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained by equation (7):

$$Y^* = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s\big(X, \tilde{Y}\big) \tag{7}$$
Step 5: train the named entity recognition model: randomly divide the training samples of the specified-domain common-sense corpus obtained in step 1 into 9 folds, input them into the BERT-BiGRU-CRF model constructed in step 4, and train the model with 9-fold cross-validation, obtaining the trained BERT-BiGRU-CRF model;
Step 6: common-sense text named entity recognition: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRU-CRF model to obtain the label sequence of each text;
Step 7: bidirectional matching between mentions in the common-sense corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain the entity vector representation, concatenate it with the text label sequence obtained in step 6, and output the named entity recognition result through one convolutional layer and an activation function; the output is a one-dimensional 0/1 vector in which 0 means not recognized and 1 means successfully recognized;
Step 8: entity linking model training: among the training samples, select some correctly linked entities as positive examples and use the remaining, incorrectly linked candidate entities as negative examples;
the entity linking model concatenates the segment vectors corresponding to the '1' outputs of step 7 with the entity description texts of the matched candidate entities in the encyclopedia knowledge base, and inputs the concatenated vectors into the BERT layer of the deep bidirectional pre-trained language model to obtain vector representations; these pass through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate with the highest probability is selected to establish the link; finally, a linking sequence is output in which a correct link has value 1 and otherwise 0;
Step 9: concatenate the vector of each recognized mention in the common-sense corpus texts with the vector of the entity description text to be linked in the encyclopedia knowledge base, input the concatenation into the trained entity linking model, output the sequence indicating whether a link is finally established, and keep the concatenated vectors whose value is 1, thereby obtaining the domain knowledge base with completed entity links.
The invention has the following beneficial effects:
the BERT-BiGRU-CRF algorithm and the bidirectional matching strategy adopted in the invention are a method which utilizes context information in a short text and entity description information in a knowledge base, so that a two-phase matching process is realized, the problems of entity boundary identification error and entity identification completion can be effectively solved, and the accuracy of a named entity identification task and an entity link task is greatly improved.
Drawings
FIG. 1 is the overall process framework of the invention;
FIG. 2 is the network structure of the named entity recognition and matching process of the invention;
FIG. 3 illustrates the entity representation learning process over the encyclopedia knowledge base of the invention;
FIG. 4 is the entity linking model applied after successful bidirectional entity matching.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in FIG. 1, a Chinese specific-field entity linking method fusing common-sense knowledge comprises the following steps:
Step 1: construct a common-sense knowledge corpus for the specified domain: crawl documents in the specified fields, including psychology and sociology; extract the texts of the abstract and summary parts; apply sentence segmentation, punctuation removal and stop-word removal to the extracted texts; and take each processed text field text, the entity mentions mention_data in the text, and the number kb_id of each entity in the encyclopedia knowledge base as a training sample, obtaining the common-sense knowledge corpus of the specified domain; mention_data contains the entities to be linked after recognition; a preprocessing sketch in code follows.
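By way of illustration, the following minimal Python sketch walks a toy sentence through this preprocessing; jieba is an assumed segmenter, the stopword list is a toy placeholder, and only the sample fields text / mention_data / kb_id come from the description above.

```python
# A minimal sketch of the step-1 preprocessing, assuming documents are already
# crawled to plain text. jieba is assumed for word segmentation so stop words
# can be removed; STOPWORDS is a placeholder list.
import re
import json
import jieba

STOPWORDS = {"的", "了", "是", "在", "和"}  # placeholder; use a full list in practice

def preprocess(raw_text):
    samples = []
    for sent in re.split(r"[。！？]", raw_text):             # sentence segmentation
        sent = re.sub(r"[，、；：“”‘’（）《》\s]", "", sent)  # punctuation removal
        words = [w for w in jieba.lcut(sent) if w not in STOPWORDS]
        if words:
            samples.append("".join(words))
    return samples

def make_sample(text, mentions):
    # mentions: (mention_string, kb_id) pairs for the entities in `text`.
    return {"text": text,
            "mention_data": [{"mention": m, "kb_id": k} for m, k in mentions]}

if __name__ == "__main__":
    for s in preprocess("社会认同理论由Tajfel提出。群体行为是其研究对象。"):
        print(json.dumps(make_sample(s, []), ensure_ascii=False))
```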
Step 2: construct and complete the encyclopedia knowledge base: first, based on the encyclopedic knowledge related to social-network user behavior in the encyclopedia, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then revise and complete the encyclopedia knowledge graph, specifically:
(1) convert to lower case the proper entity names composed of capital English letters;
(2) convert special symbols, including quotation marks, commas and periods, into their English equivalents, and add the converted names to the alias field of the corresponding entity;
(3) crawl terminology entries for technical terms in the specified fields, including psychology and sociology, convert the data into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: rebuild the entity description texts: concatenate all predicates and objects of each entity in the encyclopedia knowledge base to obtain its entity description text; if the length of a description text exceeds d, truncate it in units of d, where d is a preset length; five dictionaries are constructed, as sketched in code after this list:
(1) using the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entry_id;
(2) using the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entry;
(3) using the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) using the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) using the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
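A possible construction of the description texts and the five dictionaries is sketched below; the triple layout follows step 2, while truncation to the first d characters and the fallback type "其他" are illustrative assumptions.

```python
# Sketch of step 3: rebuild entity description texts from (h, r, t) triples and
# build the five dictionaries described above.
from collections import defaultdict

def build_dicts(triples, entity_types, d=256):
    """triples: (head, predicate, object) tuples; entity_types: {entity: type}."""
    desc = defaultdict(str)
    for h, r, t in triples:                  # concatenate predicate-object pairs
        desc[h] += f"{r}:{t}；"

    entry_id = {e: i for i, e in enumerate(desc)}             # (1) entity -> index
    id_entry = {i: e for e, i in entry_id.items()}            # (2) index -> entity
    id_text = {i: desc[e][:d] for e, i in entry_id.items()}   # (3) index -> description
    id_type = {i: entity_types.get(e, "其他") for e, i in entry_id.items()}   # (4)
    type_index = {t: j for j, t in enumerate(sorted(set(id_type.values())))}  # (5)
    return entry_id, id_entry, id_text, id_type, type_index

if __name__ == "__main__":
    triples = [("社会认同", "提出者", "Tajfel"), ("社会认同", "学科", "社会心理学")]
    entry_id, _, id_text, _, _ = build_dicts(triples, {"社会认同": "理论"})
    print(entry_id, id_text)
```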
Step 4: construct the Chinese named entity recognition BERT-BiGRU-CRF model, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent unit (BiGRU) layer, and a conditional random field (CRF) layer;
Step 4-1: the BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and the BERT layer produces word vectors based on context information;
Step 4-2: the BiGRU layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer; the word vectors output by the BERT layer are fed into the forward and backward GRU networks, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ for each entity mention, which are concatenated to obtain $H_{con}$; a max pooling operation then yields the global semantic information $H_{max}$ of the words in the text, which is passed to the conditional random field (CRF) layer for sequence labeling; the hidden state $h_t$ of the BiGRU layer at time $t$ is computed as follows (a PyTorch sketch follows formulas (1)-(3)):
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

where $\overrightarrow{h_t}$ denotes the hidden state of the forward GRU network at time $t$, $\overleftarrow{h_t}$ the hidden state of the backward GRU network at time $t$, $w_t$ the weight of the forward hidden state at time $t$, $v_t$ the weight of the backward hidden state at time $t$, and $b_t$ the bias of the hidden state at time $t$; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are computed as:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overrightarrow{h_{t-1}}\big) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overleftarrow{h_{t-1}}\big) \tag{3}$$

where $\mathrm{GRU}$ denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and backward GRU networks at time $t-1$;
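The BiGRU layer above can be sketched in PyTorch as follows; treating $w_t$, $v_t$ and $b_t$ as learned elementwise parameters shared across time steps is a simplifying assumption, and all dimensions are illustrative.

```python
# PyTorch sketch of the step 4-2 BiGRU layer following formulas (1)-(3).
import torch
import torch.nn as nn

class BiGRULayer(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256):
        super().__init__()
        self.fwd = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.bwd = nn.GRU(input_dim, hidden_dim, batch_first=True)
        # Combination weights of formula (1), shared across time (assumption).
        self.w = nn.Parameter(torch.ones(hidden_dim))
        self.v = nn.Parameter(torch.ones(hidden_dim))
        self.b = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x):                        # x: (batch, seq_len, input_dim)
        h_fwd, _ = self.fwd(x)                   # formula (2)
        h_bwd, _ = self.bwd(torch.flip(x, [1]))  # formula (3): reversed input
        h_bwd = torch.flip(h_bwd, [1])           # realign to forward time order
        h = self.w * h_fwd + self.v * h_bwd + self.b  # formula (1)
        h_con = torch.cat([h_fwd, h_bwd], dim=-1)     # concatenation H_con
        h_max = torch.max(h_con, dim=1).values        # max pooling -> H_max
        return h, h_con, h_max

if __name__ == "__main__":
    out, con, pooled = BiGRULayer()(torch.randn(2, 10, 768))
    print(out.shape, con.shape, pooled.shape)  # (2,10,256) (2,10,512) (2,512)
```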
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the computation proceeds as follows:
First, the prediction score $s(X, Y)$ of a predicted sequence $Y$ for an input sequence $X$ is calculated as:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e. the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence, with $y_i$ the predicted label of the $i$-th word; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.
Then the probability $p(Y \mid X)$ of generating the predicted sequence $Y$ is calculated as:

$$p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)} \tag{5}$$

where $\tilde{Y}$ denotes a candidate annotation sequence, $Y_X$ the set of all possible annotation sequences, and $s(X, \tilde{Y})$ the prediction score of annotation sequence $\tilde{Y}$ for the input sequence $X$.
Taking the logarithm of both sides of equation (5) yields the log-likelihood $\ln p(Y \mid X)$ of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln\Big(\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)\Big) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained by equation (7):

$$Y^* = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s\big(X, \tilde{Y}\big) \tag{7}$$
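The CRF quantities above can be sketched as follows; the start/end tag bookkeeping and the tag count are illustrative assumptions, the forward algorithm computes the log partition term of formula (6), and the Viterbi decode realizes formula (7).

```python
# Sketch of the step 4-3 CRF computations: formula (4) scoring, formula (6)
# log-likelihood via the forward algorithm, and formula (7) Viterbi decoding.
# A is (k+2, k+2) with a start tag at index -2 and an end tag at index -1.
import torch

def crf_score(P, y, A):
    """P: (n, k) emission scores; y: (n,) gold tags; returns formula (4)."""
    n = P.size(0)
    score = P[torch.arange(n), y].sum()          # sum of P_{i, y_i}
    score += A[-2, y[0]] + A[y[-1], -1]          # start/end transitions
    return score + A[y[:-1], y[1:]].sum()        # sum of A_{y_i, y_{i+1}}

def crf_log_likelihood(P, y, A):
    n, k = P.shape
    alpha = A[-2, :k] + P[0]                     # forward algorithm over Y_X
    for i in range(1, n):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A[:k, :k], dim=0) + P[i]
    log_z = torch.logsumexp(alpha + A[:k, -1], dim=0)
    return crf_score(P, y, A) - log_z            # formula (6)

def viterbi_decode(P, A):
    n, k = P.shape
    score, back = A[-2, :k] + P[0], []
    for i in range(1, n):
        total = score.unsqueeze(1) + A[:k, :k] + P[i]
        score, idx = total.max(dim=0)
        back.append(idx)
    best = int((score + A[:k, -1]).argmax())
    path = [best]
    for idx in reversed(back):                   # backtrack the best path
        best = int(idx[best])
        path.append(best)
    return path[::-1]                            # formula (7): Y*

if __name__ == "__main__":
    P, A = torch.randn(6, 3), torch.randn(5, 5)
    y = torch.tensor([0, 1, 1, 2, 0, 1])
    print(float(crf_log_likelihood(P, y, A)), viterbi_decode(P, A))
```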
Step 5: as shown in FIG. 2, train the named entity recognition model: randomly divide the training samples of the specified-domain common-sense corpus obtained in step 1 into 9 folds, input them into the BERT-BiGRU-CRF model constructed in step 4, and train the model with 9-fold cross-validation, obtaining the trained BERT-BiGRU-CRF model;
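A minimal sketch of the 9-fold split this step assumes; only the index bookkeeping is shown, and the actual BERT-BiGRU-CRF training and validation loops are omitted.

```python
# Sketch of the step-5 data split: 9 random folds for 9-fold cross-validation.
from sklearn.model_selection import KFold
import numpy as np

def nine_fold_indices(n_samples, seed=42):
    """Yield (train, val) index arrays; each fold validates on 1/9 of the data."""
    kf = KFold(n_splits=9, shuffle=True, random_state=seed)
    yield from kf.split(np.arange(n_samples))

if __name__ == "__main__":
    for fold, (tr, va) in enumerate(nine_fold_indices(900)):
        print(f"fold {fold}: {len(tr)} train / {len(va)} val")
```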
Step 6: as shown in FIG. 3, common-sense text named entity recognition: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRU-CRF model to obtain the label sequence of each text; the entity mentions in the text are tagged with the BIO scheme, i.e. {B (Begin), I (Inside), O (Outside)}, and the consecutive segments labeled 'B' and 'I' are the recognized mention segments;
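Recovering mention segments from a BIO label sequence can be sketched as follows; tolerating an "I" without a preceding "B" is an assumption.

```python
# Sketch of step-6 span recovery: consecutive "B"/"I" positions form one mention.
def bio_spans(tags):
    """tags: per-character labels; returns (start, end_exclusive) mention spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                  # a new mention begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "I":                # continue the current mention
            if start is None:
                start = i               # tolerate an I without a leading B
        else:                           # "O" closes any open mention
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

if __name__ == "__main__":
    text = "社会认同影响群体行为"
    tags = ["B", "I", "I", "I", "O", "O", "B", "I", "I", "I"]
    print([text[s:e] for s, e in bio_spans(tags)])  # ['社会认同', '群体行为']
```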
Step 7: bidirectional matching between mentions in the common-sense corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain the entity vector representation, concatenate it with the text label sequence obtained in step 6, and output the named entity recognition result through one convolutional layer and an activation function; the output is a one-dimensional 0/1 vector in which 0 means not recognized and 1 means successfully recognized;
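One way to realize the matching head of this step is sketched below, assuming 768-dimensional BERT vectors, a single 1-D convolution, and a 0.5 threshold; all of these are illustrative choices rather than fixed by the description.

```python
# Sketch of the step-7 bidirectional matching head: entity description vectors
# are concatenated with the labeled text sequence, passed through one 1-D
# convolution and a sigmoid, giving the per-position 0/1 recognition vector.
import torch
import torch.nn as nn

class MatchHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.out = nn.Linear(dim, 1)

    def forward(self, text_seq, desc_vec):
        # text_seq: (batch, seq_len, dim); desc_vec: (batch, dim) pooled BERT
        # vector of the entity description, broadcast along the sequence.
        desc = desc_vec.unsqueeze(1).expand(-1, text_seq.size(1), -1)
        x = torch.cat([text_seq, desc], dim=-1).transpose(1, 2)
        h = torch.relu(self.conv(x)).transpose(1, 2)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # per-position probability

if __name__ == "__main__":
    probs = MatchHead()(torch.randn(1, 12, 768), torch.randn(1, 768))
    print((probs > 0.5).long())   # the 0/1 vector: 1 = successfully recognized
```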
Step 8: entity linking model training: among the training samples, select some correctly linked entities as positive examples and use the remaining, incorrectly linked candidate entities as negative examples;
the entity linking model concatenates the segment vectors corresponding to the '1' outputs of step 7 with the entity description texts of the matched candidate entities in the encyclopedia knowledge base, and inputs the concatenated vectors into the BERT layer of the deep bidirectional pre-trained language model to obtain vector representations; these pass through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate with the highest probability is selected to establish the link; finally, a linking sequence is output in which a correct link has value 1 and otherwise 0, as shown in FIG. 4;
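The candidate scoring of the linking model might look as follows, assuming the transformers library and the bert-base-chinese checkpoint (both assumptions); each mention is paired with every candidate's description and the highest-probability candidate is linked, as described above.

```python
# Sketch of the step-8 linking model: encode "mention [SEP] description" pairs
# with BERT, score each pair with a fully connected layer plus sigmoid, and
# link the candidate with the highest probability.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class Linker(nn.Module):
    def __init__(self, name="bert-base-chinese"):   # assumed checkpoint
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.fc = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, mention, candidate_descs, tokenizer):
        enc = tokenizer([mention] * len(candidate_descs), candidate_descs,
                        padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [CLS] vectors
        return torch.sigmoid(self.fc(cls)).squeeze(-1)   # probability per candidate

if __name__ == "__main__":
    tok = BertTokenizer.from_pretrained("bert-base-chinese")
    probs = Linker()("社会认同", ["提出者:Tajfel；学科:社会心理学", "类型:其他"], tok)
    print("linked candidate index:", int(probs.argmax()))
```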
Step 9: concatenate the vector of each recognized mention in the common-sense corpus texts with the vector of the entity description text to be linked in the encyclopedia knowledge base, input the concatenation into the trained entity linking model, output the sequence indicating whether a link is finally established, and keep the concatenated vectors whose value is 1, thereby obtaining the domain knowledge base with completed entity links.

Claims (1)

1. A Chinese specific-field entity linking method fusing common-sense knowledge, characterized by comprising the following steps:
Step 1: construct a common-sense knowledge corpus for the specified domain: crawl documents in the specified fields, including psychology and sociology; extract the texts of the abstract and summary parts; apply sentence segmentation, punctuation removal and stop-word removal to the extracted texts; and take each processed text field text, the entity mentions mention_data in the text, and the number kb_id of each entity in the encyclopedia knowledge base as a training sample, obtaining the common-sense knowledge corpus of the specified domain; mention_data contains the entities to be linked after recognition;
Step 2: construct and complete the encyclopedia knowledge base: first, based on the encyclopedic knowledge related to social-network user behavior in the encyclopedia, build an encyclopedia knowledge graph from encyclopedia entries structured as triples (h, r, t), where h is the Entity, r is the Predicate, and t is the Object; then revise and complete the encyclopedia knowledge graph, specifically:
(1) convert to lower case the proper entity names composed of capital English letters;
(2) convert special symbols, including quotation marks, commas and periods, into their English equivalents, and add the converted names to the alias field of the corresponding entity;
(3) crawl terminology entries for technical terms in the specified fields, including psychology and sociology, convert the data into the triple structure (h, r, t), and add them to the constructed encyclopedia knowledge graph;
Step 3: rebuild the entity description texts: concatenate all predicates and objects of each entity in the encyclopedia knowledge base to obtain its entity description text; if the length of a description text exceeds d, truncate it in units of d, where d is a preset length; five dictionaries are constructed:
(1) using the entity name in the encyclopedia knowledge base as the primary key, build the entity index dictionary entry_id;
(2) using the index of an entity in the encyclopedia knowledge base as the primary key, build the index entity dictionary id_entry;
(3) using the index of an entity description text in the encyclopedia knowledge base as the primary key, build the entity description text dictionary id_text;
(4) using the index of an entity in the encyclopedia knowledge base as the primary key, build the entity type dictionary id_type;
(5) using the index of an entity category in the encyclopedia knowledge base as the primary key, build the category dictionary type_index;
Step 4: construct the Chinese named entity recognition BERT-BiGRU-CRF model, comprising an input layer, a deep bidirectional pre-trained language model (BERT) layer, a bidirectional gated recurrent unit (BiGRU) layer, and a conditional random field (CRF) layer;
Step 4-1: the BERT layer consists of an embedding layer, an encoder and a pooling layer; a text from the common-sense knowledge corpus is input, and the BERT layer produces word vectors based on context information;
Step 4-2: the BiGRU layer comprises two oppositely directed gated recurrent unit (GRU) networks and one global pooling layer; the word vectors output by the BERT layer are fed into the forward and backward GRU networks, yielding the forward and backward semantic information vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ for each entity mention, which are concatenated to obtain $H_{con}$; a max pooling operation then yields the global semantic information $H_{max}$ of the words in the text, which is passed to the conditional random field (CRF) layer for sequence labeling; the hidden state $h_t$ of the BiGRU layer at time $t$ is computed as follows:
$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t \tag{1}$$

where $\overrightarrow{h_t}$ denotes the hidden state of the forward GRU network at time $t$, $\overleftarrow{h_t}$ the hidden state of the backward GRU network at time $t$, $w_t$ the weight of the forward hidden state at time $t$, $v_t$ the weight of the backward hidden state at time $t$, and $b_t$ the bias of the hidden state at time $t$; $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are computed as:

$$\overrightarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overrightarrow{h_{t-1}}\big) \tag{2}$$

$$\overleftarrow{h_t} = \mathrm{GRU}\big(x_t,\ \overleftarrow{h_{t-1}}\big) \tag{3}$$

where $\mathrm{GRU}$ denotes the nonlinear transformation that encodes an input word vector into the corresponding GRU hidden state, $x_t$ is the current input word vector, and $\overrightarrow{h_{t-1}}$ and $\overleftarrow{h_{t-1}}$ are the hidden states of the forward and backward GRU networks at time $t-1$;
Step 4-3: the conditional random field (CRF) layer performs optimal sequence prediction using the relations between adjacent labels of the words in the text; the computation proceeds as follows:
First, the prediction score $s(X, Y)$ of a predicted sequence $Y$ for an input sequence $X$ is calculated as:

$$s(X, Y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i} \tag{4}$$

where $X = (x_1, x_2, \ldots, x_n)$ is the sequence of word vectors input to the CRF layer, i.e. the global semantic information $H_{max}$; $x_i$ is the $i$-th input word vector and $n$ the total number of input word vectors; $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence, with $y_i$ the predicted label of the $i$-th word; $P_{i, y_i}$ is the score of labeling the $i$-th word with label $y_i$; $A$ is the transition score matrix, with $A_{y_i, y_{i+1}}$ the score of transitioning from label $y_i$ to label $y_{i+1}$.
Then the probability $p(Y \mid X)$ of generating the predicted sequence $Y$ is calculated as:

$$p(Y \mid X) = \frac{\exp\big(s(X, Y)\big)}{\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)} \tag{5}$$

where $\tilde{Y}$ denotes a candidate annotation sequence, $Y_X$ the set of all possible annotation sequences, and $s(X, \tilde{Y})$ the prediction score of annotation sequence $\tilde{Y}$ for the input sequence $X$.
Taking the logarithm of both sides of equation (5) yields the log-likelihood $\ln p(Y \mid X)$ of the predicted sequence $Y$:

$$\ln p(Y \mid X) = s(X, Y) - \ln\Big(\sum_{\tilde{Y} \in Y_X} \exp\big(s(X, \tilde{Y})\big)\Big) \tag{6}$$

Finally, the output sequence $Y^*$ with the highest prediction score is obtained by equation (7):

$$Y^* = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s\big(X, \tilde{Y}\big) \tag{7}$$
Step 5: train the named entity recognition model: randomly divide the training samples of the specified-domain common-sense corpus obtained in step 1 into 9 folds, input them into the BERT-BiGRU-CRF model constructed in step 4, and train the model with 9-fold cross-validation, obtaining the trained BERT-BiGRU-CRF model;
Step 6: common-sense text named entity recognition: process the texts in the common-sense knowledge corpus of step 1 with the trained BERT-BiGRU-CRF model to obtain the label sequence of each text;
Step 7: bidirectional matching between mentions in the common-sense corpus and entities in the encyclopedia knowledge base: input the entity description text into the BERT layer of the deep bidirectional pre-trained language model to obtain the entity vector representation, concatenate it with the text label sequence obtained in step 6, and output the named entity recognition result through one convolutional layer and an activation function; the output is a one-dimensional 0/1 vector in which 0 means not recognized and 1 means successfully recognized;
Step 8: entity linking model training: among the training samples, select some correctly linked entities as positive examples and use the remaining, incorrectly linked candidate entities as negative examples;
the entity linking model concatenates the segment vectors corresponding to the '1' outputs of step 7 with the entity description texts of the matched candidate entities in the encyclopedia knowledge base, and inputs the concatenated vectors into the BERT layer of the deep bidirectional pre-trained language model to obtain vector representations; these pass through a fully connected layer and an activation function to obtain the probability score of each candidate entity, and the candidate with the highest probability is selected to establish the link; finally, a linking sequence is output in which a correct link has value 1 and otherwise 0;
Step 9: concatenate the vector of each recognized mention in the common-sense corpus texts with the vector of the entity description text to be linked in the encyclopedia knowledge base, input the concatenation into the trained entity linking model, output the sequence indicating whether a link is finally established, and keep the concatenated vectors whose value is 1, thereby obtaining the domain knowledge base with completed entity links.
CN202210400706.1A 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge Active CN114943230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210400706.1A CN114943230B (en) 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210400706.1A CN114943230B (en) 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge

Publications (2)

Publication Number Publication Date
CN114943230A (en) 2022-08-26
CN114943230B CN114943230B (en) 2024-02-20

Family

ID=82908036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210400706.1A Active CN114943230B (en) 2022-04-17 2022-04-17 Method for linking entities in Chinese specific field by fusing common sense knowledge

Country Status (1)

Country Link
CN (1) CN114943230B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422369A (en) * 2022-08-30 2022-12-02 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN115796280A (en) * 2023-01-31 2023-03-14 南京万得资讯科技有限公司 Entity identification entity linking system suitable for high efficiency and controllability in financial field
CN116010583A (en) * 2023-03-17 2023-04-25 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cascade coupling knowledge enhancement dialogue generation method
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN117151220A (en) * 2023-10-27 2023-12-01 北京长河数智科技有限责任公司 Industry knowledge base system and method based on entity link and relation extraction
CN117743568A (en) * 2024-02-19 2024-03-22 中国电子科技集团公司第十五研究所 Content generation method and system based on fusion of resource flow and confidence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100356A (en) * 2020-09-17 2020-12-18 武汉纺织大学 Knowledge base question-answer entity linking method and system based on similarity
AU2020103654A4 (en) * 2019-10-28 2021-01-14 Nanjing Normal University Method for intelligent construction of place name annotated corpus based on interactive and iterative learning
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020103654A4 (en) * 2019-10-28 2021-01-14 Nanjing Normal University Method for intelligent construction of place name annotated corpus based on interactive and iterative learning
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
CN112100356A (en) * 2020-09-17 2020-12-18 武汉纺织大学 Knowledge base question-answer entity linking method and system based on similarity
CN113779992A (en) * 2021-07-19 2021-12-10 西安理工大学 Method for realizing BcBERT-SW-BilSTM-CRF model based on vocabulary enhancement and pre-training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG Xiao; LI Yegang; WANG Dong; SHI Shumin: "Named Entity Recognition Based on ERNIE", Intelligent Computer and Applications, no. 03, 1 March 2020 (2020-03-01) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115422369A (en) * 2022-08-30 2022-12-02 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN115422369B (en) * 2022-08-30 2023-11-03 中国人民解放军国防科技大学 Knowledge graph completion method and device based on improved TextRank
CN115796280A (en) * 2023-01-31 2023-03-14 南京万得资讯科技有限公司 Entity identification entity linking system suitable for high efficiency and controllability in financial field
CN116010583A (en) * 2023-03-17 2023-04-25 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cascade coupling knowledge enhancement dialogue generation method
CN116010583B (en) * 2023-03-17 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cascade coupling knowledge enhancement dialogue generation method
CN116451690A (en) * 2023-03-21 2023-07-18 麦博(上海)健康科技有限公司 Medical field named entity identification method
CN117151220A (en) * 2023-10-27 2023-12-01 北京长河数智科技有限责任公司 Industry knowledge base system and method based on entity link and relation extraction
CN117151220B (en) * 2023-10-27 2024-02-02 北京长河数智科技有限责任公司 Entity link and relationship based extraction industry knowledge base system and method
CN117743568A (en) * 2024-02-19 2024-03-22 中国电子科技集团公司第十五研究所 Content generation method and system based on fusion of resource flow and confidence
CN117743568B (en) * 2024-02-19 2024-04-26 中国电子科技集团公司第十五研究所 Content generation method and system based on fusion of resource flow and confidence

Also Published As

Publication number Publication date
CN114943230B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
US11501182B2 (en) Method and apparatus for generating model
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN114943230A (en) Chinese specific field entity linking method fusing common knowledge
Zhu et al. Knowledge-based question answering by tree-to-sequence learning
CN109508459B (en) Method for extracting theme and key information from news
CN111400455A (en) Relation detection method of question-answering system based on knowledge graph
CN112101014B (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN111143574A (en) Query and visualization system construction method based on minority culture knowledge graph
CN113360667B (en) Biomedical trigger word detection and named entity identification method based on multi-task learning
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN111274829A (en) Sequence labeling method using cross-language information
CN113723103A (en) Chinese medical named entity and part-of-speech combined learning method integrating multi-source knowledge
CN113095074A (en) Word segmentation method and system for Chinese electronic medical record
CN114580639A (en) Knowledge graph construction method based on automatic extraction and alignment of government affair triples
CN115019906A (en) Multi-task sequence labeled drug entity and interaction combined extraction method
CN113312922A (en) Improved chapter-level triple information extraction method
CN112800184A (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN111444720A (en) Named entity recognition method for English text
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112507717A (en) Medical field entity classification method fusing entity keyword features
CN114880994B (en) Text style conversion method and device from direct white text to irony text
Fei et al. GFMRC: A machine reading comprehension model for named entity recognition
CN116306653A (en) Regularized domain knowledge-aided named entity recognition method
CN112784576B (en) Text dependency syntactic analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant