CN114580639A - Knowledge graph construction method based on automatic extraction and alignment of government affair triples - Google Patents

Knowledge graph construction method based on automatic extraction and alignment of government affair triples Download PDF

Info

Publication number
CN114580639A
CN114580639A (application CN202210166232.9A)
Authority
CN
China
Prior art keywords
government
knowledge
term
attribute
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210166232.9A
Other languages
Chinese (zh)
Inventor
王德军
张雪诚
孙贝尔
姬美琳
孟博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities filed Critical South Central University for Nationalities
Priority to CN202210166232.9A priority Critical patent/CN114580639A/en
Publication of CN114580639A publication Critical patent/CN114580639A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph construction method based on automatic extraction and alignment of government affair triples. A relational database cannot meet the efficiency and knowledge-reasoning requirements of an intelligent government question-answering system, while building a knowledge graph manually is labour-intensive and inefficient. The aim of the invention is to construct, efficiently and automatically, a government knowledge graph that supports timely update iteration. The technical route is as follows: first, a government ontology is built with the seven-step method; government knowledge, including but not limited to policies and regulations and open government information, is then crawled periodically. Structured data are mapped into the government knowledge graph under the constraints of the ontology, and government triples are extracted from unstructured data with a BERT-BiLSTM-CRF model. The similarity between the extracted entities and the entities already in the knowledge graph is computed automatically with a BERT-DNN-Softmax model, and whether to update the knowledge graph is decided from the data-update time and the credibility scores that domain experts assign to each data source.

Description

Knowledge graph construction method based on automatic extraction and alignment of government affair triples
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a knowledge graph construction method based on automatic extraction and alignment of government affair triples.
Background
A traditional relational database cannot meet the efficiency and knowledge-reasoning requirements of intelligent government question answering. At the same time, the government domain covers a wide range of sub-fields, such as taxation, medical care, education, industry and commerce, and food and drugs, which makes manually building a government knowledge graph too labour-intensive and inefficient. Government knowledge also comes from numerous sources and is updated at different times, while domain knowledge-base applications demand high knowledge accuracy. A domain ontology gives a formal description of the domain's entity concepts, their interrelations, and the domain's characteristics and rules; extracting government knowledge under the constraints of a domain ontology can therefore improve extraction accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a knowledge graph construction method based on automatic extraction and alignment of government affair triples, characterized by comprising the following steps:
Step 1: build a government affair knowledge data set;
Step 2: construct a government term data set, a government term relationship data set and a government term attribute data set, which together define the government ontology;
Step 3: obtain government triples from the structured data of step 1 by mapping;
Step 4: label the government entities in the knowledge data set of step 1 with the BIO scheme to train a BERT model, and convert the data set into word vectors with the trained BERT model; feed the word vectors into a BiLSTM model, train it to extract feature information and obtain feature vectors, and pass the feature vectors to a CRF layer to compute the optimal label sequence; constrain the entities in the label sequence with the rules of step 3;
Step 5: build rule templates with the predefined-relation method;
Step 6: compute the similarity between the entities of the triples extracted in step 4 and the entities already present in the existing government knowledge graph, and import the entities into the triple set T according to the result;
Preferably, the government knowledge data set in step 1 is defined as:
K = (k_1, k_2, ..., k_n)
type_i = {data_{i,t}, data_{i,c}}
where k_i is the i-th piece of government knowledge, i ∈ [1, n], and n is the number of items in the data set; type_i is the attribute set of the i-th item; data_{i,t} is its time attribute, generated from the release time of the i-th piece of knowledge; and data_{i,c} is its confidence attribute, a confidence score assigned to the knowledge source by organized government experts. Government knowledge is divided into structured and unstructured data: structured data are logically expressed and realized in a two-dimensional table structure and strictly follow the data format and length specification; all other data that do not satisfy this definition are collectively called unstructured data.
Preferably, the government term data set in step 2 is:
J = (j_1, j_2, ..., j_M)
where J is the government term data set, j_i is the i-th government term, i ∈ [1, M], and M is the number of terms in the set;
the government term relationship data set in step 2 is:
R = (r_{i,j}), i ∈ [1, M], j ∈ [1, M], i ≠ j
where R is the government term relationship data set, r_{i,j} is the link between the i-th and the j-th government term, and M is the number of terms in the set; if no link exists between the i-th and the j-th term, then
r_{i,j} = ∅
the government term attribute data set in step 2 is constructed as:
P = (p_1, p_2, ..., p_M)
p_i = (p_{i,1}, p_{i,2}, ..., p_{i,N_i})
where P is the government term attribute data set, p_i is the attribute set of the i-th government term, and p_{i,j} is the j-th attribute in that set;
Preferably, step 3 specifically comprises:
mapping the structured data into a triple set with a D2R tool under the constraints of the government ontology of step 2, where the table header of the structured data is mapped to a government term in the ontology and each field name is mapped to an attribute name of the corresponding entity;
Preferably, the BERT layer in step 4 converts the text of the government knowledge data set of step 1 into word vectors, as follows:
The BERT (Bidirectional Encoder Representations from Transformers) layer converts text into word vectors. A sentence input to BERT is written as:
S = (s_1, s_2, ..., s_n)
where s_i is the i-th character of the sentence, i ∈ [1, n], and its corresponding BIO labels are written as:
L = (l_1, l_2, ..., l_n)
where l_i is the label of the i-th character, i ∈ [1, n]. After the sentence S is fed into the BERT model, its vectorized output S_v is obtained:
S_v = (c_1, c_2, ..., c_n)
where c_i is the word vector of the i-th character.
S_v is input into the BiLSTM model of step 4.
The core of an LSTM cell consists of a forget gate, an input gate, an output gate and a memory cell; the forget gate and the input gate together clear invalid information while passing valid information on to the next time step.
The output of the whole cell is the memory-cell state multiplied by the output gate; the structure is expressed by the following formulas:
i_t = σ(x_t W_xi + h_{t-1} W_hi + c_{t-1} W_ci + b_i)
z_t = tanh(x_t W_xc + h_{t-1} W_hc + b_c)
f_t = σ(x_t W_xf + h_{t-1} W_hf + c_{t-1} W_cf + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t
o_t = σ(x_t W_xo + h_{t-1} W_ho + c_t W_co + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ is the sigmoid function, the W are weight matrices, the b are bias vectors; i_t, f_t and o_t are the outputs of the input gate, forget gate and output gate, z_t is the candidate content to be added, c_t is the updated cell state at time t, and h_t is the output of the whole LSTM unit at time t.
The basic idea of BiLSTM is to encode the input vector S_v in the forward and the backward direction separately, combine the outputs at each time step, and output a predicted label for each character; the label sequence is written as:
Y = (y_1, y_2, ..., y_n)
where y_i is the predicted label of the i-th character in the sentence.
Y is input into the CRF layer of step 4, which computes the probability of a label sequence with the following formulas:
p(y|x) = exp(score(x, y)) / Z(x)
Z(x) = Σ_{y'} exp(score(x, y'))
score(x, y) = Σ_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i)
The corresponding loss function is:
-log p(y|x) = -score(x, y) + log Z(x)
The output is the predicted label sequence. A predicted entity's labels start with B and continue with I, and the set of predicted entities is denoted N.
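As a concrete illustration of the CRF formulas above, the following sketch evaluates score(x, y), the partition function Z(x) and p(y|x) by brute-force enumeration over a two-character toy input; all emission and transition scores are invented for the example, not taken from the invention:

```python
import itertools
import math

# score(x, y) = sum_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i); Z(x) sums
# exp(score) over every label sequence; p(y|x) = exp(score)/Z(x).
LABELS = ("B", "I", "O")
EMIT = [{"B": 2.0, "I": 0.1, "O": 0.5},       # Emit(x_0, y_0)
        {"B": 0.2, "I": 1.5, "O": 0.3}]       # Emit(x_1, y_1)
TRANS = {("B", "I"): 1.0, ("B", "O"): 0.1, ("I", "O"): 0.2}  # Trans(y_{i-1}, y_i)

def score(y):
    s = sum(EMIT[i][y[i]] for i in range(len(y)))
    return s + sum(TRANS.get((y[i - 1], y[i]), 0.0) for i in range(1, len(y)))

sequences = list(itertools.product(LABELS, repeat=2))
Z = sum(math.exp(score(y)) for y in sequences)        # the partition function Z(x)
p = {y: math.exp(score(y)) / Z for y in sequences}    # p(y|x) for every sequence
best = max(p, key=p.get)                              # the optimal label sequence
```

A real CRF layer computes Z(x) with dynamic programming rather than enumeration; the brute-force form is only meant to make the three formulas concrete.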
The schema constraint layer uses the rules in the ontology O to constrain the entities in the entity set N; the subset of entities that satisfy the constraints is denoted N'.
The time attribute and the confidence attribute of each entity in N' are set equal to the time and confidence attributes of its source government knowledge:
n'_i(p_t) = k_j(p_t), n'_i(p_c) = k_j(p_c)
where n'_i and k_j are an entity of N' and the corresponding government knowledge, and p_t and p_c are the time and confidence attributes, respectively.
Preferably, the step 5 specifically comprises:
extracting a government affair triple t from the government affair knowledge K based on the template, and recording as:
ti=(nx1,rl,nx2)|ti=(nx3,py,vz)
wherein n, r, p, v represent entity, relation, attribute value respectively. And enabling the time attribute and the confidence attribute of the government affair triple to be respectively equal to the time attribute and the confidence attribute of the source government affair knowledge, and recording as follows:
ti(pt)=kj(pt),ti(pc)=kj(pc)
wherein t isi、kjRespectively, a government triple and corresponding government knowledge, pt、pcTime and confidence attributes, respectively.
Preferably, the step 6 specifically includes:
the three layers of the entity similarity calculation model are a pre-training language model BERT, a fully-connected neural network (DNN) and Softmax respectively, and an entity-pair two-classification model is constructed. Model input "entity pair" (n)1,n2) A standard input sequence is constructed by character processing:
[CLS]Q1...Qn[SEP]R1...Rm[SEP]
wherein [ CLS ] and [ SEP ] are two special sign flag bits of BERT, which are used for distinguishing different entities; [ CLS ] appears at the top of the entity pair, [ SEP ] appears at the two entity boundaries and at the end of the entity pair. The model inputs the vector of the last layer of BERT into DNN (fully-connected neural network) for dimension reduction and feature extraction, then performs secondary classification through Softmax to obtain a similarity probability distribution result, and the calculation formula is as follows:
Figure BDA0003516081780000051
where i represents a certain class in k, k ∈ (0, 2)],giA value representing the classification. The loss function of the model uses a cross-entropy loss function, which is as follows:
Figure BDA0003516081780000052
wherein y isiLabel for sample i, with a positive class of 1, a negative class of 0, piRepresenting the probability that sample i is predicted to be positive. Finally, setting a threshold value in a prediction stage to calculate to obtain a similar entity pair. The time attributes (t) of the entities extracted in step 4 and the corresponding entities in the knowledge graphi(pt) And a confidence attribute (t)j(pc) ) were compared. And updating the entity with higher time update and confidence coefficient to the government affair knowledge map.
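The Softmax and cross-entropy formulas above can be checked with a plain-Python sketch; the score values and labels below are invented examples:

```python
import math

# Softmax over the k = 2 classification scores g_i, plus binary cross-entropy.
def softmax(g):
    m = max(g)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in g]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, p):
    # y_i = 1 for a similar ("positive") entity pair, 0 otherwise;
    # p_i is the predicted probability of the positive class.
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

probs = softmax([2.0, 0.5])   # probability distribution over {similar, dissimilar}
loss = cross_entropy([1, 0], [0.9, 0.2])
```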
By marking each extracted government triple with its credibility and update time, the method guarantees the accuracy and timeliness of the knowledge graph, and it can be used to efficiently and automatically construct a government knowledge graph that supports rapid update iteration.
Drawings
FIG. 1: flow chart of the present invention;
FIG. 2: flow chart of the schema layer (government ontology construction) and of the data layer (the construction process from government data to the government knowledge graph);
FIG. 3: local graph of the government ontology and instances created with the seven-step method;
FIG. 4: example of mapping structured data into triples;
FIG. 5: example of the annotated pre-training data set for government named-entity recognition;
FIG. 6: diagram of the BERT-BiLSTM-CRF model used for entity extraction;
FIG. 7: structure diagram of an LSTM unit;
FIG. 8: comparison of extracting government entities directly versus extracting them based on the ontology;
FIG. 9: diagram of the BERT-DNN-Softmax model used for entity similarity calculation.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
The practice of the present invention is further illustrated with reference to FIGS. 1-9.
As shown in FIG. 1, an embodiment of the present invention is a knowledge graph construction method based on automatic extraction and alignment of government affair triples. The specific steps are as follows:
Step 1: construct a government affair knowledge data set.
Government knowledge is crawled from government websites, including but not limited to national, provincial and municipal government service networks and the official websites of departments at all levels; it includes structured data such as information-disclosure guides and government-service handling guides, and text data such as policy documents and legal documents.
The government knowledge data set of step 1 is defined as:
K = (k_1, k_2, ..., k_n)
type_i = {data_{i,t}, data_{i,c}}
where k_i is the i-th piece of government knowledge, i ∈ [1, n], and n is the number of items in the data set; type_i is the attribute set of the i-th item; data_{i,t} is its time attribute, generated from the release time of the i-th piece of knowledge; and data_{i,c} is its confidence attribute, a confidence score assigned to the knowledge source by organized government experts. Government knowledge is divided into structured and unstructured data: structured data are logically expressed and realized in a two-dimensional table structure and strictly follow the data format and length specification; all other data that do not satisfy this definition are collectively called unstructured data.
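A minimal sketch of this data structure follows; the field names (text, release_time, confidence, structured) and the example items are illustrative assumptions, with only the time and confidence attributes taken from the definition above:

```python
from dataclasses import dataclass

# Hypothetical container for one item k_i of the knowledge data set K.
@dataclass
class KnowledgeItem:
    text: str            # the government knowledge k_i
    release_time: str    # data_{i,t}: release time of the knowledge
    confidence: float    # data_{i,c}: expert confidence score for the source
    structured: bool     # True when the item is a two-dimensional table

K = [
    KnowledgeItem("Company registration guide ...", "2021-09-01", 0.9, True),
    KnowledgeItem("Tax policy circular ...", "2022-01-15", 0.8, False),
]

# The method treats the two kinds of data differently (mapping vs. extraction).
structured = [k for k in K if k.structured]
unstructured = [k for k in K if not k.structured]
```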
Step 2: construct a government term data set, a government term relationship data set and a government term attribute data set, which together define the government ontology.
The government term data set of step 2 is:
J = (j_1, j_2, ..., j_M)
where J is the government term data set, j_i is the i-th government term, i ∈ [1, M], and M is the number of terms in the set;
the government term relationship data set of step 2 is:
R = (r_{i,j}), i ∈ [1, M], j ∈ [1, M], i ≠ j
where R is the government term relationship data set, r_{i,j} is the link between the i-th and the j-th government term, and M is the number of terms in the set; if no link exists between the i-th and the j-th term, then
r_{i,j} = ∅
the government term attribute data set of step 2 is constructed as:
P = (p_1, p_2, ..., p_M)
p_i = (p_{i,1}, p_{i,2}, ..., p_{i,N_i})
where P is the government term attribute data set, p_i is the attribute set of the i-th government term, and p_{i,j} is the j-th attribute in that set.
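One possible in-memory encoding of the three ontology data sets is sketched below; the concrete terms, relations and attributes are invented examples, not taken from the patent:

```python
# J (terms), R (term relations) and P (term attributes) from step 2.
J = ["Enterprise", "Permit", "Agency"]                 # government terms j_1 .. j_M
R = {("Enterprise", "Permit"): "applies_for",          # r_{i,j}: link between two terms
     ("Agency", "Permit"): "issues"}                   # pairs absent from R have no link
P = {"Enterprise": ["name", "registered_address"],     # p_i: attributes of the i-th term
     "Permit": ["name", "valid_until"],
     "Agency": ["name"]}

def related(a, b):
    """Return the relation r_{a,b}, or None for the empty (no-link) case."""
    return R.get((a, b))
```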
and step 3: and (4) obtaining a government affair triple by adopting a mapping mode for the structured data obtained in the step (1).
And mapping the structured data into a three-tuple set according to the constraint of the government body in the step 2 by using a D2R tool, mapping the table header of the structured data into government terms in the government body, and mapping the field names of the structured data into attribute names of corresponding entities. FIG. 4 is an example of mapping structured data to triples.
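The mapping can be sketched as follows; the column names, the 'is_a' predicate and the example row are all assumptions for illustration, standing in for the D2R tool's output:

```python
# The table header becomes a government term (the entity's type) and each
# field name becomes an attribute name of the row's entity.
def map_table_to_triples(term, rows):
    triples = []
    for row in rows:
        entity = row["name"]                     # assume a 'name' field identifies the entity
        triples.append((entity, "is_a", term))   # link the entity to its ontology term
        for field, value in row.items():
            if field != "name":
                triples.append((entity, field, value))
    return triples

table = [{"name": "Business licence",
          "handling_agency": "Market Supervision Bureau",
          "time_limit": "5 working days"}]
triples = map_table_to_triples("GovernmentServiceItem", table)
```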
Step 4: entities are extracted with the BERT-BiLSTM-CRF model, whose structure is shown in FIG. 6. The specific flow is as follows: label the government entities in the government knowledge data set of step 1 with the BIO scheme to train a BERT model; FIG. 5 is a labelling example (B marks the beginning of a noun phrase, I the inside of a noun phrase, and O a character that is not part of a noun phrase). Convert the government knowledge data set of step 1 into word vectors with the trained BERT model; feed the word vectors into the BiLSTM model, train it to extract feature information and obtain feature vectors, and pass the feature vectors to the CRF layer to compute the optimal label sequence; finally, constrain the entities in the label sequence with the rules of step 3.
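The BIO labelling step can be sketched at character level as below; the sentence and entity are stand-in examples:

```python
# B = beginning of an entity, I = inside, O = outside; character-level tags
# as used to prepare NER training data.
def bio_labels(sentence, entities):
    labels = ["O"] * len(sentence)
    for ent in entities:
        start = sentence.find(ent)          # label only the first occurrence
        if start >= 0:
            labels[start] = "B"
            for i in range(start + 1, start + len(ent)):
                labels[i] = "I"
    return labels

labels = bio_labels("ABCDE", ["BCD"])
# labels == ["O", "B", "I", "I", "O"]
```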
(1) The BERT layer converts the text of the government knowledge data set of step 1 into word vectors. The specific process is as follows:
The BERT (Bidirectional Encoder Representations from Transformers) layer converts text into word vectors. A sentence input to BERT is written as:
S = (s_1, s_2, ..., s_n)
where s_i is the i-th character of the sentence, i ∈ [1, n]; the corresponding BIO labels are written as:
L = (l_1, l_2, ..., l_n)
where l_i is the label of the i-th character, i ∈ [1, n]. After the sentence S is fed into the BERT model, its vectorized output S_v is obtained:
S_v = (c_1, c_2, ..., c_n)
where c_i is the word vector of the i-th character.
(2) S_v is input into the BiLSTM model. FIG. 7 shows the structure of an LSTM unit. The core of an LSTM cell consists of a forget gate, an input gate, an output gate and a memory cell; the forget gate and the input gate together clear invalid information while passing valid information on to the next time step. The output of the whole cell is the memory-cell state multiplied by the output gate; the structure is expressed by the following formulas:
i_t = σ(x_t W_xi + h_{t-1} W_hi + c_{t-1} W_ci + b_i)
z_t = tanh(x_t W_xc + h_{t-1} W_hc + b_c)
f_t = σ(x_t W_xf + h_{t-1} W_hf + c_{t-1} W_cf + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t
o_t = σ(x_t W_xo + h_{t-1} W_ho + c_t W_co + b_o)
h_t = o_t ⊙ tanh(c_t)
where σ is the sigmoid function, the W are weight matrices, the b are bias vectors; i_t, f_t and o_t are the outputs of the input gate, forget gate and output gate, z_t is the candidate content to be added, c_t is the updated cell state at time t, and h_t is the output of the whole LSTM unit at time t.
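One step of the LSTM cell can be implemented directly from these formulas; the weight shapes are illustrative, and the peephole terms (c_{t-1} W_ci and the like) are simplified to element-wise products, a common variant:

```python
import numpy as np

def sigma(x):
    # the logistic sigmoid σ
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigma(x_t @ W["xi"] + h_prev @ W["hi"] + c_prev * W["ci"] + b["i"])
    z_t = np.tanh(x_t @ W["xc"] + h_prev @ W["hc"] + b["c"])
    f_t = sigma(x_t @ W["xf"] + h_prev @ W["hf"] + c_prev * W["cf"] + b["f"])
    c_t = f_t * c_prev + i_t * z_t                 # c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t
    o_t = sigma(x_t @ W["xo"] + h_prev @ W["ho"] + c_t * W["co"] + b["o"])
    h_t = o_t * np.tanh(c_t)                       # h_t = o_t ⊙ tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                   # illustrative dimensions
W = {k: rng.normal(size=(d_in, d_h)) for k in ("xi", "xc", "xf", "xo")}
W |= {k: rng.normal(size=(d_h, d_h)) for k in ("hi", "hc", "hf", "ho")}
W |= {k: rng.normal(size=(d_h,)) for k in ("ci", "cf", "co")}   # peephole weights
b = {k: np.zeros(d_h) for k in ("i", "c", "f", "o")}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```

Since o_t lies in (0, 1) and tanh(c_t) in (-1, 1), every component of h_t stays strictly inside (-1, 1).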
The basic idea of BiLSTM is to encode the input vector S_v in the forward and the backward direction separately, combine the outputs at each time step, and output a predicted label for each character; the label sequence is written as:
Y = (y_1, y_2, ..., y_n)
where y_i is the predicted label of the i-th character in the sentence.
(3) Y is input into the CRF layer, which computes the probability of a label sequence with the following formulas:
p(y|x) = exp(score(x, y)) / Z(x)
Z(x) = Σ_{y'} exp(score(x, y'))
score(x, y) = Σ_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i)
The corresponding loss function is:
-log p(y|x) = -score(x, y) + log Z(x)
The output is the predicted label sequence. A predicted entity's labels start with B and continue with I, and the set of predicted entities is denoted N.
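Decoding the predicted BIO sequence into the entity set N can be sketched as follows; the input characters and tags are stand-in examples:

```python
# Each entity starts at a B tag and extends over the I tags that follow it.
def decode_entities(chars, tags):
    entities, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            if current:
                entities.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            current.append(ch)
        else:                       # O tag, or a stray I with no open entity
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities

N = decode_entities(list("ABCDE"), ["O", "B", "I", "I", "O"])
# N == ["BCD"]
```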
(4) The Schema Constraint (SC) layer uses the rules in the ontology O to constrain the entities in the entity set N; an example is shown in FIG. 8. The subset of entities that satisfy the constraints is denoted N'.
The time attribute and the confidence attribute of each entity in N' are set equal to the time and confidence attributes of its source government knowledge:
n'_i(p_t) = k_j(p_t), n'_i(p_c) = k_j(p_c)
where n'_i and k_j are an entity of N' and the corresponding government knowledge, and p_t and p_c are the time and confidence attributes, respectively.
Step 5: domain relation extraction means extracting the semantic relations between entities in one or more domains. Extraction patterns fall into predefined relations and relations extracted directly from the text. Step 2 has already defined the relation set R (and the attribute set P) between the terms of the ontology O, and step 4 links entities to terms during entity extraction, so the rule templates are built with the predefined-relation (predefined-attribute) method.
Government triples t_i are extracted from the government knowledge K based on the templates, written as:
t_i = (n_x1, r_l, n_x2) | t_i = (n_x3, p_y, v_z)
where n, r, p and v denote entity, relation, attribute and attribute value, respectively. The time attribute and the confidence attribute of a government triple are set equal to those of its source government knowledge:
t_i(p_t) = k_j(p_t), t_i(p_c) = k_j(p_c)
where t_i and k_j are a government triple and the corresponding government knowledge, and p_t and p_c are the time and confidence attributes, respectively.
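A template-based extractor in the spirit of step 5 can be sketched with pattern matching; the patterns, relation names and sentences below are invented English stand-ins for the original Chinese rule templates:

```python
import re

# Each template is a pattern whose groups become the head entity and the
# attribute value of the extracted triple.
TEMPLATES = [
    (r"(?P<head>[\w ]+?) is handled by (?P<tail>[\w ]+)", "handling_agency"),
    (r"(?P<head>[\w ]+?) takes (?P<tail>[\w ]+ days)", "time_limit"),
]

def extract_triples(text):
    triples = []
    for pattern, relation in TEMPLATES:
        for m in re.finditer(pattern, text):
            triples.append((m["head"].strip(), relation, m["tail"].strip()))
    return triples

t = extract_triples("Business licence is handled by Market Supervision Bureau")
# t == [("Business licence", "handling_agency", "Market Supervision Bureau")]
```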
Step 6: compute the similarity between the entities of the triples extracted in step 4 and the entities already present in the existing government knowledge graph, and import the entities into the triple set T according to the result.
The entity-similarity calculation model is shown in FIG. 9. Its three layers are the pre-trained language model BERT, a fully connected neural network (DNN) and Softmax; together they form a binary classification model over entity pairs. An input entity pair (n_1, n_2) is processed character by character into the standard input sequence:
[CLS] Q_1 ... Q_n [SEP] R_1 ... R_m [SEP]
where [CLS] and [SEP] are the two special marker tokens of BERT used to separate the entities: [CLS] appears at the start of the entity pair, and [SEP] at the boundary between the two entities and at the end of the pair. The model feeds the vector of the last BERT layer into the DNN for dimensionality reduction and feature extraction, and then performs binary classification through Softmax to obtain a similarity probability distribution, computed as:
p_i = exp(g_i) / Σ_{j=1}^{k} exp(g_j)
where i indexes one of the k = 2 classes and g_i is the classification score. The loss function of the model is the cross-entropy loss:
Loss = -(1/N) Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
where y_i is the label of sample i (1 for the positive class, 0 for the negative class) and p_i is the probability that sample i is predicted positive. Finally, a threshold is set in the prediction stage to obtain similar entity pairs. The time attribute t_i(p_t) and the confidence attribute t_i(p_c) of each entity extracted in step 4 are compared with those of the corresponding entity in the knowledge graph, and the entity with the more recent time and the higher confidence is updated into the government knowledge graph.
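The final update decision of step 6 can be sketched as below; the exact tie-breaking policy is an assumption, since the text only states that the newer, higher-confidence entity is written into the government knowledge graph:

```python
from datetime import date

# When an extracted entity is judged similar to an existing one, keep the
# version whose source is newer and at least as trustworthy.
def should_update(new_time, new_conf, old_time, old_conf):
    return new_time > old_time and new_conf >= old_conf

update = should_update(date(2022, 2, 1), 0.9, date(2021, 6, 1), 0.8)
# update == True
```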
Finally, it should be noted that the scope of the present invention is not limited to the above-mentioned embodiments, and it is apparent that those skilled in the art can make various modifications and variations to the present invention without departing from the scope and spirit of the invention. It is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (7)

1. A knowledge graph construction method based on automatic extraction and alignment of government affair triples, characterized by comprising the following steps:
Step 1: build a government affair knowledge data set;
Step 2: construct a government term data set, a government term relationship data set and a government term attribute data set, which together define the government ontology;
Step 3: obtain government triples from the structured data of step 1 by mapping;
Step 4: label the government entities in the knowledge data set of step 1 with the BIO scheme to train a BERT model, and convert the data set into word vectors with the trained BERT model; feed the word vectors into a BiLSTM model, train it to extract feature information and obtain feature vectors, and pass the feature vectors to a CRF layer to compute the optimal label sequence; constrain the entities in the label sequence with the rules of step 3;
Step 5: build rule templates with the predefined-relation method;
Step 6: compute the similarity between the entities of the triples extracted in step 4 and the entities already present in the existing government knowledge graph, and import the entities into the triple set T according to the result.
2. The knowledge graph construction method based on automatic extraction and alignment of government affair triples according to claim 1, characterized in that the government knowledge data set of step 1 is defined as:
K = (k_1, k_2, ..., k_n)
type_i = {data_{i,t}, data_{i,c}}
where k_i is the i-th piece of government knowledge, i ∈ [1, n], and n is the number of items in the data set; type_i is the attribute set of the i-th item; data_{i,t} is its time attribute, generated from the release time of the i-th piece of knowledge; and data_{i,c} is its confidence attribute, a confidence score assigned to the knowledge source by organized government experts; government knowledge is divided into structured and unstructured data: structured data are logically expressed and realized in a two-dimensional table structure and strictly follow the data format and length specification; all other data that do not satisfy this definition are collectively called unstructured data.
3. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein the government term data set in step 2 is:
J=(j1,j2,...,jM)
where J is the government term data set, j_i is the i-th government term, i ∈ [1, M], and M is the number of government terms in the set;
the government term relationship data set in step 2 is:

R = (r_{i,j}), i ∈ [1, M], j ∈ [1, M], i ≠ j

where R is the government term relationship data set, r_{i,j} is the link between the i-th government term and the j-th government term, and M is the number of government terms in the set; if there is no link between the i-th government term and the j-th government term, r_{i,j} is empty;
the government term attribute data set constructed in step 2 is:

P = (p_1, p_2, ..., p_M)

p_i = (p_{i,1}, p_{i,2}, ...)

where P is the government term attribute data set, p_i is the attribute data set of the i-th government term, and p_{i,j} is the j-th attribute in the attribute data set of the i-th government term.
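A compact sketch of the term set J, relation set R, and attribute set P of claim 3, with the sparse links r_{i,j} held in a dict keyed by (i, j); all term and relation names are hypothetical examples.

```python
# Government term set J (terms j_1..j_M), relation set R, attribute set P.
J = ["社会保障", "医疗保险", "行政审批"]              # j_i: government terms
M = len(J)                                          # number of terms
R = {(0, 1): "包含"}                                 # r_{i,j}: link term i -> term j
P = {0: ["主管部门", "适用范围"], 1: ["缴费基数"]}    # p_{i,j}: attributes of term i

def linked(i, j):
    """Return r_{i,j}, or None when no link exists between terms i and j."""
    return R.get((i, j))

print(linked(0, 1), linked(1, 2))  # 包含 None
```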
4. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein step 3 specifically comprises:
using a D2R tool, mapping the structured data into a triple set under the constraints of the government ontology from step 2: the table header of the structured data is mapped to a government term in the ontology, and the field names of the structured data are mapped to attribute names of the corresponding entities.
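The patent specifies a D2R tool for this step; the hand-rolled mapper below only sketches the mapping rules (table header → government term, field name → attribute name). The field names, the name-field convention, and the sample row are all hypothetical.

```python
# Sketch of the step-3 mapping: each table row becomes attribute triples
# (entity, field name, value); one field is assumed to name the entity.
def table_to_triples(term, rows):
    """rows: list of dicts (field name -> value); returns (entity, attribute, value) triples."""
    triples = []
    for row in rows:
        entity = row.get("名称", term)      # assumption: a "名称" field names the entity
        for field, value in row.items():
            if field != "名称":
                triples.append((entity, field, value))
    return triples

rows = [{"名称": "医保报销", "办理时限": "5个工作日", "收费标准": "免费"}]
for t in table_to_triples("行政审批事项", rows):
    print(t)
```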
5. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein in step 4 the BERT layer converts the text in the government knowledge data set obtained in step 1 into word vectors, as follows:
the BERT layer converts the text into word vectors; a sentence input into BERT is denoted:

S = (s_1, s_2, ..., s_n)

where s_i is the i-th word in the sentence, i ∈ [1, n]; the corresponding BIO labels are denoted:

L = (l_1, l_2, ..., l_n)

where l_i is the label of the i-th word in the sentence, i ∈ [1, n]; after the sentence S is input into the BERT model, the vectorized output S_v is obtained:

S_v = (c_1, c_2, ..., c_n)

where c_i is the word vector of the i-th word in the sentence; S_v is then input into the BiLSTM model of step 4;
the core of the LSTM consists of the following structures: a forget gate, an input gate, an output gate and a memory cell; the forget gate and the input gate together discard invalid information while passing valid information on to the next time step;
for the output of the whole structure, the output of the memory cell and the output of the output gate are multiplied, expressed by the following formulas:

i_t = σ(x_t W_xi + h_{t-1} W_hi + c_{t-1} W_ci + b_i)

z_t = tanh(x_t W_xc + h_{t-1} W_hc + b_c)

f_t = σ(x_t W_xf + h_{t-1} W_hf + c_{t-1} W_cf + b_f)

c_t = c_{t-1} ⊙ f_t + i_t ⊙ z_t

o_t = tanh(x_t W_xo + h_{t-1} W_ho + c_{t-1} W_co + b_o)

h_t = o_t ⊙ tanh(c_t)

where σ is the sigmoid function, W are weight matrices, b are bias vectors, i_t, f_t, o_t are the outputs of the input gate, forget gate and output gate respectively, z_t is the content to be added, c_t is the updated cell state at time t, and h_t is the output of the whole LSTM unit at time t;
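The LSTM step of claim 5 can be sketched directly in NumPy, following the claim's equations as written (including the peephole terms on c_{t-1}). The dimensions and random weights are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One LSTM time step per the claim's formulas; shapes: x_t (d,), h/c (u,),
# W_x* (d, u), W_h*/W_c* (u, u), b (u,).
def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(x_t @ W["xi"] + h_prev @ W["hi"] + c_prev @ W["ci"] + b["i"])
    z_t = np.tanh(x_t @ W["xc"] + h_prev @ W["hc"] + b["c"])
    f_t = sigmoid(x_t @ W["xf"] + h_prev @ W["hf"] + c_prev @ W["cf"] + b["f"])
    c_t = c_prev * f_t + i_t * z_t              # elementwise, as in the claim
    o_t = np.tanh(x_t @ W["xo"] + h_prev @ W["ho"] + c_prev @ W["co"] + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d, u = 4, 3  # input and hidden sizes (illustrative)
W = {k: rng.normal(size=(d if k[0] == "x" else u, u)) * 0.1
     for k in ["xi", "hi", "ci", "xc", "hc", "xf", "hf", "cf", "xo", "ho", "co"]}
b = {k: np.zeros(u) for k in ["i", "c", "f", "o"]}
h, c = lstm_step(rng.normal(size=d), np.zeros(u), np.zeros(u), W, b)
print(h.shape, c.shape)  # (3,) (3,)
```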
the basic idea of BiLSTM is to encode the input vector S_v in the forward and backward directions respectively, then combine the outputs at each time step and output a predicted label for each word; the label sequence is denoted:

Y = (y_1, y_2, ..., y_n)

where y_i is the predicted label of the i-th word in the sentence;
Y is input into the CRF layer of step 4, and the probability of a label sequence is computed with the following formulas:

p(y|x) = exp(score(x, y)) / Z(x)

Z(x) = Σ_{y'} exp(score(x, y'))

score(x, y) = Σ_i Emit(x_i, y_i) + Trans(y_{i-1}, y_i)

the corresponding loss function is:

-log p(y|x) = -score(x, y) + log Z(x)
the output is the predicted label sequence; the labels of a predicted entity start with B and continue with I, and the set of predicted entities is denoted N;
the schema constraint layer constrains the entities in the entity set N with the rules in the ontology O, and the set of entities satisfying the constraints is denoted N';
the time attribute and the confidence attribute of each entity in N' are set equal to the time and confidence attributes of the source government knowledge, respectively:

n'_i(p_t) = k_j(p_t),  n'_i(p_c) = k_j(p_c)

where n'_i and k_j are an entity in the set N' and the corresponding item of government knowledge, respectively, and p_t and p_c are the time and confidence attributes, respectively.
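The CRF formulas of claim 5 can be checked with a brute-force toy example (a real CRF layer uses dynamic programming rather than enumerating all sequences; the emission and transition scores here are made up, not trained values):

```python
import itertools
import math

# score(x, y) = sum of emission scores + sum of transition scores,
# Z(x) = sum of exp(score) over all tag sequences, p(y|x) = exp(score)/Z.
TAGS = ["O", "B", "I"]
emit = [{"O": 0.1, "B": 1.0, "I": 0.2},   # emission scores for position 0
        {"O": 0.3, "B": 0.1, "I": 0.9}]   # emission scores for position 1
trans = {("B", "I"): 1.0, ("O", "B"): 0.5}  # unlisted transitions score 0

def score(y):
    s = sum(emit[i][t] for i, t in enumerate(y))
    s += sum(trans.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return s

Z = sum(math.exp(score(y)) for y in itertools.product(TAGS, repeat=2))
p = math.exp(score(("B", "I"))) / Z                 # p(y|x) for sequence (B, I)
best = max(itertools.product(TAGS, repeat=2), key=score)
print(best)  # ('B', 'I') is the optimal labeling sequence here
```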
6. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein step 5 specifically comprises:
extracting government triples t from the government knowledge K based on the templates, denoted:

t_i = (n_{x1}, r_l, n_{x2}) | t_i = (n_{x3}, p_y, v_z)

where n, r, p and v denote entity, relation, attribute and attribute value, respectively; the time attribute and the confidence attribute of each government triple are set equal to the time and confidence attributes of the source government knowledge, respectively:

t_i(p_t) = k_j(p_t),  t_i(p_c) = k_j(p_c)

where t_i and k_j are a government triple and the corresponding item of government knowledge, respectively, and p_t and p_c are the time and confidence attributes, respectively.
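The predefined-relation templates of step 5 can be sketched as regular expressions; the patterns, relation names, and example sentence below are hypothetical illustrations, not the patent's actual templates.

```python
import re

# Each template pairs a pattern with a predefined relation/attribute name;
# matches yield (entity, relation, entity) or (entity, attribute, value) triples.
TEMPLATES = [
    (re.compile(r"(\w+)由(\w+)负责"), "负责部门"),     # "X 由 Y 负责" -> (X, 负责部门, Y)
    (re.compile(r"(\w+)的办理时限为(\w+)"), "办理时限"),
]

def extract_triples(sentence):
    triples = []
    for pattern, relation in TEMPLATES:
        for a, b in pattern.findall(sentence):
            triples.append((a, relation, b))
    return triples

print(extract_triples("医保报销由社保局负责"))  # [('医保报销', '负责部门', '社保局')]
```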
7. The knowledge graph construction method based on automatic extraction and alignment of government triples according to claim 1, wherein step 6 specifically comprises:
the three layers of the entity similarity calculation model are a pre-trained language model (BERT), a fully connected neural network (DNN), and a Softmax layer, forming a binary classification model for entity pairs; the model input, an entity pair (n_1, n_2), is processed character by character into a standard input sequence:

[CLS] Q_1 ... Q_n [SEP] R_1 ... R_m [SEP]

where [CLS] and [SEP] are the two special marker tokens of BERT, used to distinguish the entities: [CLS] appears at the beginning of the entity pair, and [SEP] appears at the boundary between the two entities and at the end of the entity pair; the model feeds the vector of the last BERT layer into the DNN for dimensionality reduction and feature extraction, then performs binary classification through Softmax to obtain a similarity probability distribution, computed as:

p_i = exp(g_i) / Σ_{j=1}^{k} exp(g_j)

where i denotes one of the k classes, k = 2, and g_i is the classification score (logit) for class i; the model's loss function is the cross-entropy loss:

Loss = -Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)]

where y_i is the label of sample i (1 for the positive class, 0 for the negative class) and p_i is the probability that sample i is predicted positive; finally, a threshold is set in the prediction stage to obtain similar entity pairs; the time attribute t_i(p_t) and the confidence attribute t_i(p_c) of each entity extracted in step 4 are compared with those of the corresponding entity in the knowledge graph, and the entity with the more recent time and higher confidence is updated into the government knowledge graph.
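The pieces of the claim-7 alignment model (the [CLS]/[SEP] input construction, the binary softmax, and the threshold decision) can be sketched as follows; the logit values are made up, whereas a real model would obtain them from the BERT + DNN layers.

```python
import math

def build_input(e1, e2):
    """Build the character-level [CLS] ... [SEP] ... [SEP] sequence for an entity pair."""
    return ["[CLS]"] + list(e1) + ["[SEP]"] + list(e2) + ["[SEP]"]

def softmax(g):
    m = max(g)                                # subtract max for numerical stability
    e = [math.exp(v - m) for v in g]
    return [v / sum(e) for v in e]

def is_similar(g, threshold=0.5):
    """g: the two classification logits (dissimilar, similar) from the DNN head."""
    return softmax(g)[1] >= threshold

seq = build_input("社保局", "社会保险局")
print(seq[0], seq[4], seq[-1])   # [CLS] [SEP] [SEP]
print(is_similar([0.2, 1.3]))    # True
```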
CN202210166232.9A 2022-02-23 2022-02-23 Knowledge graph construction method based on automatic extraction and alignment of government affair triples Pending CN114580639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210166232.9A CN114580639A (en) 2022-02-23 2022-02-23 Knowledge graph construction method based on automatic extraction and alignment of government affair triples

Publications (1)

Publication Number Publication Date
CN114580639A true CN114580639A (en) 2022-06-03

Family

ID=81771022


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115271686A (en) * 2022-09-28 2022-11-01 北京长河数智科技有限责任公司 Intelligent government affair data auditing method and device
CN115757837A (en) * 2023-01-04 2023-03-07 军工保密资格审查认证中心 Confidence evaluation method and device of knowledge graph, electronic equipment and medium
CN116501895A (en) * 2023-06-14 2023-07-28 四创科技有限公司 Typhoon time sequence knowledge graph construction method and terminal
CN116501895B (en) * 2023-06-14 2023-09-01 四创科技有限公司 Typhoon time sequence knowledge graph construction method and terminal
CN116562265A (en) * 2023-07-04 2023-08-08 南京航空航天大学 Information intelligent analysis method, system and storage medium
CN116562265B (en) * 2023-07-04 2023-12-01 南京航空航天大学 Information intelligent analysis method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination