CN114580639A - Knowledge graph construction method based on automatic extraction and alignment of government affair triples - Google Patents

Knowledge graph construction method based on automatic extraction and alignment of government affair triples Download PDF

Info

Publication number
CN114580639A
CN114580639A (application CN202210166232.9A)
Authority
CN
China
Prior art keywords
government
knowledge
term
attribute
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210166232.9A
Other languages
Chinese (zh)
Inventor
王德军
张雪诚
孙贝尔
姬美琳
孟博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202210166232.9A priority Critical patent/CN114580639A/en
Publication of CN114580639A publication Critical patent/CN114580639A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph construction method based on automatic extraction and alignment of government affair triples. A relational database cannot meet the efficiency and knowledge-reasoning requirements of an intelligent government question-answering system, while building a knowledge graph manually is labour-intensive and inefficient. The aim of the invention is to construct, efficiently and automatically, a government knowledge graph that supports timely update iteration. The technical route is as follows: first, a government ontology is built with the seven-step method; government knowledge, including but not limited to policies and regulations and open government information, is then crawled periodically. Structured data are mapped into the government knowledge graph under the constraints of the ontology, and government triples are extracted from unstructured data with a BERT-BiLSTM-CRF model. The similarity between the extracted entities and the entities already in the knowledge graph is computed automatically with a BERT-DNN-Softmax model, and whether to update the knowledge graph is decided from the data-update time and the credibility scores that domain experts assign to each data source.

Description

Knowledge graph construction method based on automatic extraction and alignment of government affair triples
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a knowledge graph construction method based on automatic extraction and alignment of government affair triples.
Background
A traditional relational database cannot meet the efficiency and knowledge-reasoning requirements of intelligent government question answering. At the same time, the government domain covers a wide range of sub-fields, such as taxation, medical care, education, industry and commerce, and food and drugs, which makes manually building a government knowledge graph too labour-intensive and inefficient. Government knowledge also comes from numerous sources and is updated at different times, while domain knowledge-base applications demand high knowledge accuracy. A domain ontology gives a formal description of the domain's entity concepts, their interrelations, and the domain's characteristics and rules; extracting government knowledge under the constraints of a domain ontology can therefore improve extraction accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a knowledge graph construction method based on automatic extraction and alignment of government affair triples, characterized by comprising the following steps:
Step 1: build a government affair knowledge data set;
Step 2: construct a government term data set, a government term relationship data set and a government term attribute data set, which together define the government ontology;
Step 3: obtain government triples from the structured data of step 1 by mapping;
Step 4: label the government entities in the knowledge data set of step 1 with the BIO scheme to train a BERT model, and convert the data set into word vectors with the trained BERT model; feed the word vectors into a BiLSTM model, train it to extract feature information and obtain feature vectors, and pass the feature vectors to a CRF layer to compute the optimal label sequence; constrain the entities in the label sequence with the rules of step 3;
Step 5: build rule templates with the predefined-relation method;
Step 6: compute the similarity between the entities of the triples extracted in step 4 and the entities already present in the existing government knowledge graph, and import the entities into the triple set T according to the result;
Preferably, the government knowledge data set in step 1 is defined as:
K = (k_1, k_2, ..., k_n)
type_i = {data_{i,t}, data_{i,c}}
where k_i is the i-th piece of government knowledge, i ∈ [1, n], and n is the number of items in the data set; type_i is the attribute set of the i-th item; data_{i,t} is its time attribute, generated from the release time of the i-th piece of knowledge; and data_{i,c} is its confidence attribute, a confidence score assigned to the knowledge source by organized government experts. Government knowledge is divided into structured and unstructured data: structured data are logically expressed and realized in a two-dimensional table structure and strictly follow the data format and length specification; all other data that do not satisfy this definition are collectively called unstructured data.
Preferably, the government term data set in step 2 is:
J = (j_1, j_2, ..., j_M)
where J is the government term data set, j_i is the i-th government term, i ∈ [1, M], and M is the number of terms in the set;
the government term relationship data set in step 2 is:
R = (r_{i,j}), i ∈ [1, M], j ∈ [1, M], i ≠ j
where R is the government term relationship data set, r_{i,j} is the link between the i-th and the j-th government term, and M is the number of terms in the set; if no link exists between the i-th and the j-th term, then
r_{i,j} = ∅
the government term attribute data set in step 2 is constructed as:
P = (p_1, p_2, ..., p_M)
p_i = (p_{i,1}, p_{i,2}, ..., p_{i,N_i})
where P is the government term attribute data set, p_i is the attribute set of the i-th government term, and p_{i,j} is the j-th attribute in that set;
Preferably, step 3 specifically comprises:
mapping the structured data into a triple set with a D2R tool under the constraints of the government ontology of step 2, where the table header of the structured data is mapped to a government term in the ontology and each field name is mapped to an attribute name of the corresponding entity;
Preferably, the BERT layer in step 4 converts the text of the government knowledge data set of step 1 into word vectors, as follows:
The BERT (Bidirectional Encoder Representations from Transformers) layer converts text into word vectors. A sentence input to BERT is written as:
S = (s_1, s_2, ..., s_n)
where s_i is the i-th character of the sentence, i ∈ [1, n], and its corresponding BIO labels are written as:
L = (l_1, l_2, ..., l_n)
where l_i is the label of the i-th character, i ∈ [1, n]. After the sentence S is fed into the BERT model, its vectorized output S_v is obtained:
S_v = (c_1, c_2, ..., c_n)
where c_i is the word vector of the i-th character.
S_v is input into the BiLSTM model of step 4.
The core of an LSTM cell consists of a forget gate, an input gate, an output gate and a memory cell; the forget gate and the input gate together clear invalid information while passing valid information on to the next time step.
The output of the whole cell is the memory-cell state multiplied by the output gate; the structure is expressed by the following formulas:
i_t = σ(x_t W_xi + h_{t-1} W_hi + c_{t-1} W_ci + b_i)
z_t = tanh(x_t W_xc + h_{t-1} W_hc + b_c)
f_t = σ(x_t W_xf + h_{t-1} W_hf + c_{t-1} W_cf + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t
o_t = σ(x_t W_xo + h_{t-1} W_ho + c_t W_co + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ is the sigmoid function, the W are weight matrices, the b are bias vectors; i_t, f_t and o_t are the outputs of the input gate, forget gate and output gate, z_t is the candidate content to be added, c_t is the updated cell state at time t, and h_t is the output of the whole LSTM unit at time t.
The basic idea of BiLSTM is to encode the input vector S_v in the forward and the backward direction separately, combine the outputs at each time step, and output a predicted label for each character; the label sequence is written as:
Y = (y_1, y_2, ..., y_n)
where y_i is the predicted label of the i-th character in the sentence.
Y is input into the CRF layer of step 4, which computes the probability of a label sequence with the following formulas:
p(y|x) = exp(score(x, y)) / Z(x)
Z(x) = Σ_{y'} exp(score(x, y'))
score(x, y) = Σ_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i)
The corresponding loss function is:
-log p(y|x) = -score(x, y) + log Z(x)
The output is the predicted label sequence. A predicted entity's labels start with B and continue with I, and the set of predicted entities is denoted N.
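As a concrete illustration of the CRF formulas above, the following sketch evaluates score(x, y), the partition function Z(x) and p(y|x) by brute-force enumeration over a two-character toy input; all emission and transition scores are invented for the example, not taken from the invention:

```python
import itertools
import math

# score(x, y) = sum_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i); Z(x) sums
# exp(score) over every label sequence; p(y|x) = exp(score)/Z(x).
LABELS = ("B", "I", "O")
EMIT = [{"B": 2.0, "I": 0.1, "O": 0.5},       # Emit(x_0, y_0)
        {"B": 0.2, "I": 1.5, "O": 0.3}]       # Emit(x_1, y_1)
TRANS = {("B", "I"): 1.0, ("B", "O"): 0.1, ("I", "O"): 0.2}  # Trans(y_{i-1}, y_i)

def score(y):
    s = sum(EMIT[i][y[i]] for i in range(len(y)))
    return s + sum(TRANS.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))

sequences = list(itertools.product(LABELS, repeat=2))
Z = sum(math.exp(score(y)) for y in sequences)        # the partition function Z(x)
p = {y: math.exp(score(y)) / Z for y in sequences}    # p(y|x) for every sequence
best = max(p, key=p.get)                              # the optimal label sequence
```

A real CRF layer computes Z(x) with dynamic programming rather than enumeration; the brute-force form is only meant to make the three formulas concrete.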
The schema constraint layer uses the rules in the ontology O to constrain the entities in the entity set N; the subset of entities that satisfy the constraints is denoted N'.
The time attribute and the confidence attribute of each entity in N' are set equal to the time and confidence attributes of its source government knowledge:
n'_i(p_t) = k_j(p_t), n'_i(p_c) = k_j(p_c)
where n'_i and k_j are an entity of N' and the corresponding government knowledge, and p_t and p_c are the time and confidence attributes, respectively.
Preferably, the step 5 specifically comprises:
extracting a government affair triple t from the government affair knowledge K based on the template, and recording as:
ti=(nx1,rl,nx2)|ti=(nx3,py,vz)
wherein n, r, p, v represent entity, relation, attribute value respectively. And enabling the time attribute and the confidence attribute of the government affair triple to be respectively equal to the time attribute and the confidence attribute of the source government affair knowledge, and recording as follows:
ti(pt)=kj(pt),ti(pc)=kj(pc)
wherein t isi、kjRespectively, a government triple and corresponding government knowledge, pt、pcTime and confidence attributes, respectively.
Preferably, the step 6 specifically includes:
the three layers of the entity similarity calculation model are a pre-training language model BERT, a fully-connected neural network (DNN) and Softmax respectively, and an entity-pair two-classification model is constructed. Model input "entity pair" (n)1,n2) A standard input sequence is constructed by character processing:
[CLS]Q1...Qn[SEP]R1...Rm[SEP]
wherein [ CLS ] and [ SEP ] are two special sign flag bits of BERT, which are used for distinguishing different entities; [ CLS ] appears at the top of the entity pair, [ SEP ] appears at the two entity boundaries and at the end of the entity pair. The model inputs the vector of the last layer of BERT into DNN (fully-connected neural network) for dimension reduction and feature extraction, then performs secondary classification through Softmax to obtain a similarity probability distribution result, and the calculation formula is as follows:
Figure BDA0003516081780000051
where i represents a certain class in k, k ∈ (0, 2)],giA value representing the classification. The loss function of the model uses a cross-entropy loss function, which is as follows:
Figure BDA0003516081780000052
wherein y isiLabel for sample i, with a positive class of 1, a negative class of 0, piRepresenting the probability that sample i is predicted to be positive. Finally, setting a threshold value in a prediction stage to calculate to obtain a similar entity pair. The time attributes (t) of the entities extracted in step 4 and the corresponding entities in the knowledge graphi(pt) And a confidence attribute (t)j(pc) ) were compared. And updating the entity with higher time update and confidence coefficient to the government affair knowledge map.
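The Softmax and cross-entropy formulas above can be checked with a plain-Python sketch; the score values and labels below are invented examples:

```python
import math

# Softmax over the k = 2 classification scores g_i, plus binary cross-entropy.
def softmax(g):
    m = max(g)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in g]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, p):
    # y_i = 1 for a similar ("positive") entity pair, 0 otherwise;
    # p_i is the predicted probability of the positive class.
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

probs = softmax([2.0, 0.5])   # probability distribution over {similar, dissimilar}
loss = cross_entropy([1, 0], [0.9, 0.2])
```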
By marking each extracted government triple with its credibility and update time, the method guarantees the accuracy and timeliness of the knowledge graph, and it can be used to efficiently and automatically construct a government knowledge graph that supports rapid update iteration.
Drawings
FIG. 1: flow chart of the present invention;
FIG. 2: flow chart of the schema layer (government ontology construction) and of the data layer (the construction process from government data to the government knowledge graph);
FIG. 3: local graph of the government ontology and instances created with the seven-step method;
FIG. 4: example of mapping structured data into triples;
FIG. 5: example of the annotated pre-training data set for government named-entity recognition;
FIG. 6: diagram of the BERT-BiLSTM-CRF model used for entity extraction;
FIG. 7: structure diagram of an LSTM unit;
FIG. 8: comparison of extracting government entities directly versus extracting them based on the ontology;
FIG. 9: diagram of the BERT-DNN-Softmax model used for entity similarity calculation.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
The practice of the present invention is further illustrated with reference to FIGS. 1-9.
As shown in FIG. 1, an embodiment of the present invention is a knowledge graph construction method based on automatic extraction and alignment of government affair triples. The specific steps are as follows:
Step 1: construct a government affair knowledge data set.
Government knowledge is crawled from government websites, including but not limited to national, provincial and municipal government service networks and the official websites of departments at all levels; it includes structured data such as information-disclosure guides and government-service handling guides, and text data such as policy documents and legal documents.
The government knowledge data set of step 1 is defined as:
K = (k_1, k_2, ..., k_n)
type_i = {data_{i,t}, data_{i,c}}
where k_i is the i-th piece of government knowledge, i ∈ [1, n], and n is the number of items in the data set; type_i is the attribute set of the i-th item; data_{i,t} is its time attribute, generated from the release time of the i-th piece of knowledge; and data_{i,c} is its confidence attribute, a confidence score assigned to the knowledge source by organized government experts. Government knowledge is divided into structured and unstructured data: structured data are logically expressed and realized in a two-dimensional table structure and strictly follow the data format and length specification; all other data that do not satisfy this definition are collectively called unstructured data.
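A minimal sketch of this data structure follows; the field names (text, release_time, confidence, structured) and the example items are illustrative assumptions, with only the time and confidence attributes taken from the definition above:

```python
from dataclasses import dataclass

# Hypothetical container for one item k_i of the knowledge data set K.
@dataclass
class KnowledgeItem:
    text: str            # the government knowledge k_i
    release_time: str    # data_{i,t}: release time of the knowledge
    confidence: float    # data_{i,c}: expert confidence score for the source
    structured: bool     # True when the item is a two-dimensional table

K = [
    KnowledgeItem("Company registration guide ...", "2021-09-01", 0.9, True),
    KnowledgeItem("Tax policy circular ...", "2022-01-15", 0.8, False),
]

# The method treats the two kinds of data differently (mapping vs. extraction).
structured = [k for k in K if k.structured]
unstructured = [k for k in K if not k.structured]
```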
Step 2: construct a government term data set, a government term relationship data set and a government term attribute data set, which together define the government ontology.
The government term data set of step 2 is:
J = (j_1, j_2, ..., j_M)
where J is the government term data set, j_i is the i-th government term, i ∈ [1, M], and M is the number of terms in the set;
the government term relationship data set of step 2 is:
R = (r_{i,j}), i ∈ [1, M], j ∈ [1, M], i ≠ j
where R is the government term relationship data set, r_{i,j} is the link between the i-th and the j-th government term, and M is the number of terms in the set; if no link exists between the i-th and the j-th term, then
r_{i,j} = ∅
the government term attribute data set of step 2 is constructed as:
P = (p_1, p_2, ..., p_M)
p_i = (p_{i,1}, p_{i,2}, ..., p_{i,N_i})
where P is the government term attribute data set, p_i is the attribute set of the i-th government term, and p_{i,j} is the j-th attribute in that set.
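One possible in-memory encoding of the three ontology data sets is sketched below; the concrete terms, relations and attributes are invented examples, not taken from the patent:

```python
# J (terms), R (term relations) and P (term attributes) from step 2.
J = ["Enterprise", "Permit", "Agency"]                 # government terms j_1 .. j_M
R = {("Enterprise", "Permit"): "applies_for",          # r_{i,j}: link between two terms
     ("Agency", "Permit"): "issues"}                   # pairs absent from R have no link
P = {"Enterprise": ["name", "registered_address"],     # p_i: attributes of the i-th term
     "Permit": ["name", "valid_until"],
     "Agency": ["name"]}

def related(a, b):
    """Return the relation r_{a,b}, or None for the empty (no-link) case."""
    return R.get((a, b))
```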
and step 3: and (4) obtaining a government affair triple by adopting a mapping mode for the structured data obtained in the step (1).
And mapping the structured data into a three-tuple set according to the constraint of the government body in the step 2 by using a D2R tool, mapping the table header of the structured data into government terms in the government body, and mapping the field names of the structured data into attribute names of corresponding entities. FIG. 4 is an example of mapping structured data to triples.
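The mapping can be sketched as follows; the column names, the 'is_a' predicate and the example row are all assumptions for illustration, standing in for the D2R tool's output:

```python
# The table header becomes a government term (the entity's type) and each
# field name becomes an attribute name of the row's entity.
def map_table_to_triples(term, rows):
    triples = []
    for row in rows:
        entity = row["name"]                     # assume a 'name' field identifies the entity
        triples.append((entity, "is_a", term))   # link the entity to its ontology term
        for field, value in row.items():
            if field != "name":
                triples.append((entity, field, value))
    return triples

table = [{"name": "Business licence",
          "handling_agency": "Market Supervision Bureau",
          "time_limit": "5 working days"}]
triples = map_table_to_triples("GovernmentServiceItem", table)
```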
Step 4: entities are extracted with the BERT-BiLSTM-CRF model, whose structure is shown in FIG. 6. The specific flow is as follows: label the government entities in the government knowledge data set of step 1 with the BIO scheme to train a BERT model; FIG. 5 is a labelling example (B marks the beginning of a noun phrase, I the inside of a noun phrase, and O a character that is not part of a noun phrase). Convert the government knowledge data set of step 1 into word vectors with the trained BERT model; feed the word vectors into the BiLSTM model, train it to extract feature information and obtain feature vectors, and pass the feature vectors to the CRF layer to compute the optimal label sequence; finally, constrain the entities in the label sequence with the rules of step 3.
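The BIO labelling step can be sketched at character level as below; the sentence and entity are stand-in examples:

```python
# B = beginning of an entity, I = inside, O = outside; character-level tags
# as used to prepare NER training data.
def bio_labels(sentence, entities):
    labels = ["O"] * len(sentence)
    for ent in entities:
        start = sentence.find(ent)          # label only the first occurrence
        if start >= 0:
            labels[start] = "B"
            for i in range(start + 1, start + len(ent)):
                labels[i] = "I"
    return labels

labels = bio_labels("ABCDE", ["BCD"])
# labels == ["O", "B", "I", "I", "O"]
```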
(1) The BERT layer converts the text of the government knowledge data set of step 1 into word vectors. The specific process is as follows:
The BERT (Bidirectional Encoder Representations from Transformers) layer converts text into word vectors. A sentence input to BERT is written as:
S = (s_1, s_2, ..., s_n)
where s_i is the i-th character of the sentence, i ∈ [1, n]; the corresponding BIO labels are written as:
L = (l_1, l_2, ..., l_n)
where l_i is the label of the i-th character, i ∈ [1, n]. After the sentence S is fed into the BERT model, its vectorized output S_v is obtained:
S_v = (c_1, c_2, ..., c_n)
where c_i is the word vector of the i-th character.
(2) S_v is input into the BiLSTM model. FIG. 7 shows the structure of an LSTM unit. The core of an LSTM cell consists of a forget gate, an input gate, an output gate and a memory cell; the forget gate and the input gate together clear invalid information while passing valid information on to the next time step. The output of the whole cell is the memory-cell state multiplied by the output gate; the structure is expressed by the following formulas:
i_t = σ(x_t W_xi + h_{t-1} W_hi + c_{t-1} W_ci + b_i)
z_t = tanh(x_t W_xc + h_{t-1} W_hc + b_c)
f_t = σ(x_t W_xf + h_{t-1} W_hf + c_{t-1} W_cf + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t
o_t = σ(x_t W_xo + h_{t-1} W_ho + c_t W_co + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ is the sigmoid function, the W are weight matrices, the b are bias vectors; i_t, f_t and o_t are the outputs of the input gate, forget gate and output gate, z_t is the candidate content to be added, c_t is the updated cell state at time t, and h_t is the output of the whole LSTM unit at time t.
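One step of the LSTM cell can be implemented directly from these formulas; the weight shapes are illustrative, and the peephole terms (c_{t-1} W_ci and the like) are simplified to element-wise products, a common variant:

```python
import numpy as np

def sigma(x):
    # the logistic sigmoid σ
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigma(x_t @ W["xi"] + h_prev @ W["hi"] + c_prev * W["ci"] + b["i"])
    z_t = np.tanh(x_t @ W["xc"] + h_prev @ W["hc"] + b["c"])
    f_t = sigma(x_t @ W["xf"] + h_prev @ W["hf"] + c_prev * W["cf"] + b["f"])
    c_t = f_t * c_prev + i_t * z_t                 # c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t
    o_t = sigma(x_t @ W["xo"] + h_prev @ W["ho"] + c_t * W["co"] + b["o"])
    h_t = o_t * np.tanh(c_t)                       # h_t = o_t ⊙ tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                   # illustrative dimensions
W = {k: rng.normal(size=(d_in, d_h)) for k in ("xi", "xc", "xf", "xo")}
W |= {k: rng.normal(size=(d_h, d_h)) for k in ("hi", "hc", "hf", "ho")}
W |= {k: rng.normal(size=(d_h,)) for k in ("ci", "cf", "co")}   # peephole weights
b = {k: np.zeros(d_h) for k in ("i", "c", "f", "o")}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```

Since o_t lies in (0, 1) and tanh(c_t) in (-1, 1), every component of h_t stays strictly inside (-1, 1).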
The basic idea of BiLSTM is to encode the input vector S_v in the forward and the backward direction separately, combine the outputs at each time step, and output a predicted label for each character; the label sequence is written as:
Y = (y_1, y_2, ..., y_n)
where y_i is the predicted label of the i-th character in the sentence.
(3) Y is input into the CRF layer, which computes the probability of a label sequence with the following formulas:
p(y|x) = exp(score(x, y)) / Z(x)
Z(x) = Σ_{y'} exp(score(x, y'))
score(x, y) = Σ_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i)
The corresponding loss function is:
-log p(y|x) = -score(x, y) + log Z(x)
The output is the predicted label sequence. A predicted entity's labels start with B and continue with I, and the set of predicted entities is denoted N.
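Decoding the predicted BIO sequence into the entity set N can be sketched as follows; the input characters and tags are stand-in examples:

```python
# Each entity starts at a B tag and extends over the I tags that follow it.
def decode_entities(chars, tags):
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            current.append(ch)
        else:                       # O tag, or a stray I with no open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

N = decode_entities(list("ABCDE"), ["O", "B", "I", "I", "O"])
# N == ["BCD"]
```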
(4) The Schema Constraint (SC) layer uses the rules in the ontology O to constrain the entities in the entity set N; an example is shown in FIG. 8. The subset of entities that satisfy the constraints is denoted N'.
The time attribute and the confidence attribute of each entity in N' are set equal to the time and confidence attributes of its source government knowledge:
n'_i(p_t) = k_j(p_t), n'_i(p_c) = k_j(p_c)
where n'_i and k_j are an entity of N' and the corresponding government knowledge, and p_t and p_c are the time and confidence attributes, respectively.
Step 5: domain relation extraction means extracting the semantic relations between entities in one or more domains. Extraction patterns fall into predefined relations and relations extracted directly from the text. Step 2 has already defined the relation set R (and the attribute set P) between the terms of the ontology O, and step 4 links entities to terms during entity extraction, so the rule templates are built with the predefined-relation (predefined-attribute) method.
Government triples t_i are extracted from the government knowledge K based on the templates, written as:
t_i = (n_x1, r_l, n_x2) | t_i = (n_x3, p_y, v_z)
where n, r, p and v denote entity, relation, attribute and attribute value, respectively. The time attribute and the confidence attribute of a government triple are set equal to those of its source government knowledge:
t_i(p_t) = k_j(p_t), t_i(p_c) = k_j(p_c)
where t_i and k_j are a government triple and the corresponding government knowledge, and p_t and p_c are the time and confidence attributes, respectively.
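A template-based extractor in the spirit of step 5 can be sketched with pattern matching; the patterns, relation names and sentences below are invented English stand-ins for the original Chinese rule templates:

```python
import re

# Each template is a pattern whose groups become the head entity and the
# attribute value of the extracted triple.
TEMPLATES = [
    (r"(?P<head>[\w ]+?) is handled by (?P<tail>[\w ]+)", "handling_agency"),
    (r"(?P<head>[\w ]+?) takes (?P<tail>[\w ]+ days)", "time_limit"),
]

def extract_triples(text):
    triples = []
    for pattern, relation in TEMPLATES:
        for m in re.finditer(pattern, text):
            triples.append((m["head"].strip(), relation, m["tail"].strip()))
    return triples

t = extract_triples("Business licence is handled by Market Supervision Bureau")
# t == [("Business licence", "handling_agency", "Market Supervision Bureau")]
```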
Step 6: compute the similarity between the entities of the triples extracted in step 4 and the entities already present in the existing government knowledge graph, and import the entities into the triple set T according to the result.
The entity-similarity calculation model is shown in FIG. 9. Its three layers are the pre-trained language model BERT, a fully connected neural network (DNN) and Softmax; together they form a binary classification model over entity pairs. An input entity pair (n_1, n_2) is processed character by character into the standard input sequence:
[CLS] Q_1 ... Q_n [SEP] R_1 ... R_m [SEP]
where [CLS] and [SEP] are the two special marker tokens of BERT used to separate the entities: [CLS] appears at the start of the entity pair, and [SEP] at the boundary between the two entities and at the end of the pair. The model feeds the vector of the last BERT layer into the DNN for dimensionality reduction and feature extraction, and then performs binary classification through Softmax to obtain a similarity probability distribution, computed as:
p_i = exp(g_i) / Σ_{j=1}^{k} exp(g_j)
where i indexes one of the k = 2 classes and g_i is the classification score. The loss function of the model is the cross-entropy loss:
Loss = -(1/N) Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
where y_i is the label of sample i (1 for the positive class, 0 for the negative class) and p_i is the probability that sample i is predicted positive. Finally, a threshold is set in the prediction stage to obtain similar entity pairs. The time attribute t_i(p_t) and the confidence attribute t_i(p_c) of each entity extracted in step 4 are compared with those of the corresponding entity in the knowledge graph, and the entity with the more recent time and the higher confidence is updated into the government knowledge graph.
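The final update decision of step 6 can be sketched as below; the exact tie-breaking policy is an assumption, since the text only states that the newer, higher-confidence entity is written into the government knowledge graph:

```python
from datetime import date

# When an extracted entity is judged similar to an existing one, keep the
# version whose source is newer and at least as trustworthy.
def should_update(new_time, new_conf, old_time, old_conf):
    return new_time > old_time and new_conf >= old_conf

update = should_update(date(2022, 2, 1), 0.9, date(2021, 6, 1), 0.8)
# update == True
```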
Finally, it should be noted that the scope of the present invention is not limited to the above-mentioned embodiments, and it is apparent that those skilled in the art can make various modifications and variations to the present invention without departing from the scope and spirit of the invention. It is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (7)

1. A knowledge graph construction method based on automatic extraction and alignment of government affair triples, characterized by comprising the following steps:
Step 1: build a government affair knowledge data set;
Step 2: construct a government term data set, a government term relationship data set and a government term attribute data set, which together define the government ontology;
Step 3: obtain government triples from the structured data of step 1 by mapping;
Step 4: label the government entities in the knowledge data set of step 1 with the BIO scheme to train a BERT model, and convert the data set into word vectors with the trained BERT model; feed the word vectors into a BiLSTM model, train it to extract feature information and obtain feature vectors, and pass the feature vectors to a CRF layer to compute the optimal label sequence; constrain the entities in the label sequence with the rules of step 3;
Step 5: build rule templates with the predefined-relation method;
Step 6: compute the similarity between the entities of the triples extracted in step 4 and the entities already present in the existing government knowledge graph, and import the entities into the triple set T according to the result.
2. The knowledge graph construction method based on automatic extraction and alignment of government affair triples according to claim 1, characterized in that the government knowledge data set of step 1 is defined as:
K = (k_1, k_2, ..., k_n)
type_i = {data_{i,t}, data_{i,c}}
where k_i is the i-th piece of government knowledge, i ∈ [1, n], and n is the number of items in the data set; type_i is the attribute set of the i-th item; data_{i,t} is its time attribute, generated from the release time of the i-th piece of knowledge; and data_{i,c} is its confidence attribute, a confidence score assigned to the knowledge source by organized government experts; government knowledge is divided into structured and unstructured data: structured data are logically expressed and realized in a two-dimensional table structure and strictly follow the data format and length specification; all other data that do not satisfy this definition are collectively called unstructured data.
3. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein the government term data set in step 2 is:
J=(j1,j2,...,jM)
where J is the government term data set, j_i is the i-th government term, i ∈ [1, M], and M is the number of government terms in the set;
the government term relationship data set in step 2 is:

R = (r_{i,j}), i ∈ [1, M], j ∈ [1, M], i ≠ j

where R is the government term relationship data set, r_{i,j} is the link between the i-th government term and the j-th government term, and M is the number of government terms in the set; if there is no link between the i-th government term and the j-th government term, r_{i,j} is empty;
the government term attribute data set constructed in step 2 is:

P = (p_1, p_2, ..., p_M)

p_i = (p_{i,1}, p_{i,2}, ...)

where P is the government term attribute data set, p_i is the attribute data set of the i-th government term, and p_{i,j} is the j-th attribute in the attribute data set of the i-th government term.
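A compact sketch of the term set J, relation set R, and attribute set P of claim 3, with the sparse links r_{i,j} held in a dict keyed by (i, j); all term and relation names are hypothetical examples.

```python
# Government term set J (terms j_1..j_M), relation set R, attribute set P.
J = ["社会保障", "医疗保险", "行政审批"]              # j_i: government terms
M = len(J)                                          # number of terms
R = {(0, 1): "包含"}                                 # r_{i,j}: link term i -> term j
P = {0: ["主管部门", "适用范围"], 1: ["缴费基数"]}    # p_{i,j}: attributes of term i

def linked(i, j):
    """Return r_{i,j}, or None when no link exists between terms i and j."""
    return R.get((i, j))

print(linked(0, 1), linked(1, 2))  # 包含 None
```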
4. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein step 3 specifically comprises:
using a D2R tool, mapping the structured data into a triple set under the constraints of the government ontology from step 2: the table header of the structured data is mapped to a government term in the ontology, and the field names of the structured data are mapped to attribute names of the corresponding entities.
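The patent specifies a D2R tool for this step; the hand-rolled mapper below only sketches the mapping rules (table header → government term, field name → attribute name). The field names, the name-field convention, and the sample row are all hypothetical.

```python
# Sketch of the step-3 mapping: each table row becomes attribute triples
# (entity, field name, value); one field is assumed to name the entity.
def table_to_triples(term, rows):
    """rows: list of dicts (field name -> value); returns (entity, attribute, value) triples."""
    triples = []
    for row in rows:
        entity = row.get("名称", term)      # assumption: a "名称" field names the entity
        for field, value in row.items():
            if field != "名称":
                triples.append((entity, field, value))
    return triples

rows = [{"名称": "医保报销", "办理时限": "5个工作日", "收费标准": "免费"}]
for t in table_to_triples("行政审批事项", rows):
    print(t)
```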
5. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein in step 4 the BERT layer converts the text in the government knowledge data set obtained in step 1 into word vectors, as follows:
the BERT layer converts the text into word vectors; a sentence input into BERT is denoted:

S = (s_1, s_2, ..., s_n)

where s_i is the i-th word in the sentence, i ∈ [1, n]; the corresponding BIO labels are denoted:

L = (l_1, l_2, ..., l_n)

where l_i is the label of the i-th word in the sentence, i ∈ [1, n]; after the sentence S is input into the BERT model, the vectorized output S_v is obtained:

S_v = (c_1, c_2, ..., c_n)

where c_i is the word vector of the i-th word in the sentence; S_v is then input into the BiLSTM model of step 4;
the core of the LSTM consists of the following structures: a forget gate, an input gate, an output gate and a memory cell; the forget gate and the input gate together discard invalid information while passing valid information on to the next time step;
for the output of the whole structure, the output of the memory cell and the output of the output gate are multiplied, expressed by the following formulas:

i_t = σ(x_t W_xi + h_{t-1} W_hi + c_{t-1} W_ci + b_i)

z_t = tanh(x_t W_xc + h_{t-1} W_hc + b_c)

f_t = σ(x_t W_xf + h_{t-1} W_hf + c_{t-1} W_cf + b_f)

c_t = c_{t-1} ⊙ f_t + i_t ⊙ z_t

o_t = tanh(x_t W_xo + h_{t-1} W_ho + c_{t-1} W_co + b_o)

h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, W are weight matrices, b are bias vectors, i_t, f_t, o_t are the outputs of the input gate, forget gate and output gate respectively, z_t is the content to be added, c_t is the updated cell state at time t, and h_t is the output of the whole LSTM unit at time t;
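The LSTM step of claim 5 can be sketched directly in NumPy, following the claim's equations as written (including the peephole terms on c_{t-1}). The dimensions and random weights are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One LSTM time step per the claim's formulas; shapes: x_t (d,), h/c (u,),
# W_x* (d, u), W_h*/W_c* (u, u), b (u,).
def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(x_t @ W["xi"] + h_prev @ W["hi"] + c_prev @ W["ci"] + b["i"])
    z_t = np.tanh(x_t @ W["xc"] + h_prev @ W["hc"] + b["c"])
    f_t = sigmoid(x_t @ W["xf"] + h_prev @ W["hf"] + c_prev @ W["cf"] + b["f"])
    c_t = c_prev * f_t + i_t * z_t              # elementwise, as in the claim
    o_t = np.tanh(x_t @ W["xo"] + h_prev @ W["ho"] + c_prev @ W["co"] + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d, u = 4, 3  # input and hidden sizes (illustrative)
W = {k: rng.normal(size=(d if k[0] == "x" else u, u)) * 0.1
     for k in ["xi", "hi", "ci", "xc", "hc", "xf", "hf", "cf", "xo", "ho", "co"]}
b = {k: np.zeros(u) for k in ["i", "c", "f", "o"]}
h, c = lstm_step(rng.normal(size=d), np.zeros(u), np.zeros(u), W, b)
print(h.shape, c.shape)  # (3,) (3,)
```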
the basic idea of BiLSTM is to encode the input vector S_v in the forward and backward directions respectively, then combine the outputs at each time step and output a predicted label for each word; the label sequence is denoted:

Y = (y_1, y_2, ..., y_n)

where y_i is the predicted label of the i-th word in the sentence;
Y is input into the CRF layer of step 4, and the probability of a label sequence is computed with the following formulas:

p(y|x) = exp(score(x, y)) / Z(x)

Z(x) = Σ_{y'} exp(score(x, y'))

score(x, y) = Σ_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i)

the corresponding loss function is:

-log p(y|x) = -score(x, y) + log Z(x)
the output is the predicted label sequence; the labels of a predicted entity start with B and continue with I, and the set of predicted entities is denoted N;
the schema constraint layer constrains the entities in the entity set N with the rules in the ontology O, and the set of entities satisfying the constraints is denoted N';
the time attribute and the confidence attribute of each entity in N' are set equal to the time and confidence attributes of the source government knowledge, respectively:

n'_i(p_t) = k_j(p_t),  n'_i(p_c) = k_j(p_c)

where n'_i and k_j are an entity in the set N' and the corresponding item of government knowledge, respectively, and p_t and p_c are the time and confidence attributes, respectively.
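The CRF formulas of claim 5 can be checked with a brute-force toy example (a real CRF layer uses dynamic programming rather than enumerating all sequences; the emission and transition scores here are made up, not trained values):

```python
import itertools
import math

# score(x, y) = sum of emission scores + sum of transition scores,
# Z(x) = sum of exp(score) over all tag sequences, p(y|x) = exp(score)/Z.
TAGS = ["O", "B", "I"]
emit = [{"O": 0.1, "B": 1.0, "I": 0.2},   # emission scores for position 0
        {"O": 0.3, "B": 0.1, "I": 0.9}]   # emission scores for position 1
trans = {("B", "I"): 1.0, ("O", "B"): 0.5}  # unlisted transitions score 0

def score(y):
    s = sum(emit[i][t] for i, t in enumerate(y))
    s += sum(trans.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return s

Z = sum(math.exp(score(y)) for y in itertools.product(TAGS, repeat=2))
p = math.exp(score(("B", "I"))) / Z                 # p(y|x) for sequence (B, I)
best = max(itertools.product(TAGS, repeat=2), key=score)
print(best)  # ('B', 'I') is the optimal labeling sequence here
```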
6. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein step 5 specifically comprises:
extracting government triples t from the government knowledge K based on the templates, denoted:

t_i = (n_{x1}, r_l, n_{x2}) | t_i = (n_{x3}, p_y, v_z)

where n, r, p and v denote entity, relation, attribute and attribute value, respectively; the time attribute and the confidence attribute of each government triple are set equal to the time and confidence attributes of the source government knowledge, respectively:

t_i(p_t) = k_j(p_t),  t_i(p_c) = k_j(p_c)

where t_i and k_j are a government triple and the corresponding item of government knowledge, respectively, and p_t and p_c are the time and confidence attributes, respectively.
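The predefined-relation templates of step 5 can be sketched as regular expressions; the patterns, relation names, and example sentence below are hypothetical illustrations, not the patent's actual templates.

```python
import re

# Each template pairs a pattern with a predefined relation/attribute name;
# matches yield (entity, relation, entity) or (entity, attribute, value) triples.
TEMPLATES = [
    (re.compile(r"(\w+)由(\w+)负责"), "负责部门"),     # "X 由 Y 负责" -> (X, 负责部门, Y)
    (re.compile(r"(\w+)的办理时限为(\w+)"), "办理时限"),
]

def extract_triples(sentence):
    triples = []
    for pattern, relation in TEMPLATES:
        for a, b in pattern.findall(sentence):
            triples.append((a, relation, b))
    return triples

print(extract_triples("医保报销由社保局负责"))  # [('医保报销', '负责部门', '社保局')]
```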
7. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein step 6 specifically comprises:
the three layers of the entity similarity calculation model are a pre-trained language model (BERT), a fully connected neural network (DNN), and a Softmax layer, forming a binary classification model for entity pairs; the model input, an entity pair (n_1, n_2), is processed character by character into a standard input sequence:

[CLS] Q_1 ... Q_n [SEP] R_1 ... R_m [SEP]

where [CLS] and [SEP] are the two special marker tokens of BERT, used to distinguish the entities: [CLS] appears at the beginning of the entity pair, and [SEP] appears at the boundary between the two entities and at the end of the entity pair; the model feeds the vector of the last BERT layer into the DNN for dimensionality reduction and feature extraction, then performs binary classification through Softmax to obtain a similarity probability distribution, computed as:

p_i = exp(g_i) / Σ_{j=1}^{k} exp(g_j)

where i denotes one of the k classes, k = 2, and g_i is the classification score (logit) for class i; the model's loss function is the cross-entropy loss:

Loss = -Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)]

where y_i is the label of sample i (1 for the positive class, 0 for the negative class) and p_i is the probability that sample i is predicted positive; finally, a threshold is set in the prediction stage to obtain similar entity pairs; the time attribute t_i(p_t) and the confidence attribute t_i(p_c) of each entity extracted in step 4 are compared with those of the corresponding entity in the knowledge graph, and the entity with the more recent time and higher confidence is updated into the government knowledge graph.
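The pieces of the claim-7 alignment model (the [CLS]/[SEP] input construction, the binary softmax, and the threshold decision) can be sketched as follows; the logit values are made up, whereas a real model would obtain them from the BERT + DNN layers.

```python
import math

def build_input(e1, e2):
    """Build the character-level [CLS] ... [SEP] ... [SEP] sequence for an entity pair."""
    return ["[CLS]"] + list(e1) + ["[SEP]"] + list(e2) + ["[SEP]"]

def softmax(g):
    m = max(g)                                # subtract max for numerical stability
    e = [math.exp(v - m) for v in g]
    return [v / sum(e) for v in e]

def is_similar(g, threshold=0.5):
    """g: the two classification logits (dissimilar, similar) from the DNN head."""
    return softmax(g)[1] >= threshold

seq = build_input("社保局", "社会保险局")
print(seq[0], seq[4], seq[-1])   # [CLS] [SEP] [SEP]
print(is_similar([0.2, 1.3]))    # True
```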
CN202210166232.9A 2022-02-23 2022-02-23 Knowledge graph construction method based on automatic extraction and alignment of government affair triples Pending CN114580639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210166232.9A CN114580639A (en) 2022-02-23 2022-02-23 Knowledge graph construction method based on automatic extraction and alignment of government affair triples

Publications (1)

Publication Number Publication Date
CN114580639A true CN114580639A (en) 2022-06-03

Family

ID=81771022


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271686A (en) * 2022-09-28 2022-11-01 北京长河数智科技有限责任公司 Intelligent government affair data auditing method and device
CN115757837A (en) * 2023-01-04 2023-03-07 军工保密资格审查认证中心 Confidence evaluation method and device of knowledge graph, electronic equipment and medium
CN116501895A (en) * 2023-06-14 2023-07-28 四创科技有限公司 Typhoon time sequence knowledge graph construction method and terminal
CN116501895B (en) * 2023-06-14 2023-09-01 四创科技有限公司 Typhoon time sequence knowledge graph construction method and terminal
CN116562265A (en) * 2023-07-04 2023-08-08 南京航空航天大学 Information intelligent analysis method, system and storage medium
CN116562265B (en) * 2023-07-04 2023-12-01 南京航空航天大学 Information intelligent analysis method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination