CN114706951A - Temporal knowledge graph question-answering method based on subgraph - Google Patents

Temporal knowledge graph question-answering method based on subgraph

Info

Publication number
CN114706951A
CN114706951A CN202210347571.7A CN202210347571A CN114706951A CN 114706951 A CN114706951 A CN 114706951A CN 202210347571 A CN202210347571 A CN 202210347571A CN 114706951 A CN114706951 A CN 114706951A
Authority
CN
China
Prior art keywords
time
temporal
question
score
subgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210347571.7A
Other languages
Chinese (zh)
Inventor
陈子阳
胡升泽
王军波
赵翔
徐浩
谭真
黄宏斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210347571.7A priority Critical patent/CN114706951A/en
Publication of CN114706951A publication Critical patent/CN114706951A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a temporal knowledge graph question-answering method based on subgraphs, which comprises the following steps: extracting and parsing the time information hidden in the question using knowledge in the temporal knowledge graph, and replacing the natural-language question with a simplified question by means of regular expressions; converting the time-constrained question into a vector representation and obtaining candidate entities and their semantic scores using a temporal knowledge graph scoring function and a time-sensitivity function; constructing a temporal neighbor subgraph for each question; pruning the temporal neighbor subgraph according to the time constraints; quantitatively scoring each entity in the temporal neighbor subgraph with a time activation function to obtain a subgraph reasoning score; and fusing the semantic score and the subgraph reasoning score to obtain the final answer. The invention improves the ability to identify the time information contained in a question, and infers answers that satisfy the time constraints through the temporal subgraph, thereby obtaining reliable answers to questions.

Description

Temporal knowledge graph question-answering method based on subgraph
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a temporal knowledge graph question-answering method based on subgraphs.
Background
Knowledge graph question answering (KGQA) requires a system to reason over a knowledge graph (KG) to provide an answer to a given natural language question. In real life, however, questions often contain time constraints, such as "Who won the Nobel Prize in 2019?", and traditional knowledge graphs can hardly support the inference of answers because they lack temporal information. In recent years, temporal knowledge graphs (TKGs), in which the facts carry temporal attributes, have flourished; when facing temporal questions, they can serve as a knowledge base that helps locate potential answers, a task known as temporal knowledge graph question answering (TKGQA). TKGQA divides temporal questions into two categories: (1) simple questions, which can be answered with a single fact; and (2) complex questions, which require reasoning over multiple facts.
Consider the complex question "Who was the first president of the United States after World War II?". The model should determine the gold answer from the TKG by retrieving and comparing the relevant entities.
The time dimension makes this task challenging. First, besides containing specific times (simple questions), questions may include implicit temporal expressions (complex questions). In the above example, to answer the complex question we need to figure out the period of World War II, 1939-1945. Second, directly querying the temporal knowledge graph may not answer the question because of time trigger words such as "first" and "after". Even assuming the model already knows that the period of World War II is 1939-1945, it still needs to find the presidents after 1945 and select the earliest among them.
TempoQR, a representative temporal knowledge graph question-answering method, exploits temporal knowledge graph embeddings and answers temporal questions as a link prediction task on the knowledge graph. Specifically, it introduces time-range information for each question, adopts EaE (Entities as Experts) to enrich the semantic information of the question representation, and finally applies a temporal knowledge graph scoring function to predict the answer. Although TempoQR currently achieves the best performance, it does not explore the core of temporal knowledge graph question answering. Moreover, its strong performance is largely due to pseudo-temporal questions in the current CronQuestions dataset. For example, for "What is the first award obtained by Carlo Taverna?", only one fact in the temporal knowledge graph is associated with Carlo Taverna, which renders the time trigger word "first" useless. We define such questions as pseudo-temporal questions; to avoid their influence, we refine the CronQuestions dataset and construct a Complex-CronQuestions dataset that contains only truly complex temporal questions.
How do humans answer temporal questions? According to dual-process theory, one first understands the semantics of the question and searches relevant information in a vast memory, and then selects the answer while considering the specific constraints. In other words, people make full use of the factual knowledge in their minds to understand the implicit time in a question, and then reason over the time constraints to reach a reliable judgment. We therefore apply this theory to temporal knowledge graph question answering, whose flow can be modeled as shown in Fig. 1. For a specific question, e.g., "Who was the first president of the United States after World War II?", a pile of related facts, such as (World War II, occurrence time, 1939-1945), comes to mind. Then, considering the time trigger expressions "first president" and "World War II", we can mine the time information in depth, carry out detailed reasoning, and eliminate the remaining candidates to obtain the correct answer.
Knowledge graph question answering is an important component of intelligent systems; research in this field is highly valuable and has attracted the attention of many researchers. The current mainstream approach is to answer questions using pre-trained KG embeddings. Such methods work well for simple questions but struggle with complex ones.
To address the limitations of the above approaches, some studies solve complex questions with enhanced question representations. Such methods use logical reasoning or exploit available side information in the form of text corpora. However, these KGQA algorithms work on non-temporal knowledge graphs, i.e., knowledge graphs containing triples of the form (subject, relation, object), and are not suited to handling time constraints.
For temporal knowledge graph question answering, answering a temporal question requires reasoning over a temporal knowledge graph that contains facts of the form (subject, relation, object, start time, end time). Current temporal knowledge graph question-answering methods fall mainly into two categories: decomposition methods based on time constraints, and representation methods based on temporal knowledge graph embeddings.
Decomposition methods based on time constraints decompose the original question into a non-temporal question and a time constraint, and then use traditional knowledge graph question-answering methods to retrieve candidate answers to the sub-question. Finally, the candidate answers are filtered by the time constraint to produce the final answer to the original question. A major drawback of this approach is that it relies on pre-specified templates for the decomposition and assumes the time constraint is attached to an entity. In addition, it cannot be directly applied to a truly temporal KG.
Representation-based methods learn entity, relation and timestamp embeddings using a temporal knowledge graph embedding approach, and then score answers using the temporal knowledge graph embeddings and their distance to the question embedding. CronKGQA (Question Answering over Temporal Knowledge Graphs) provides a learnable reasoning process for temporal knowledge graph question answering that does not depend on hand-crafted rules, and shows that temporal knowledge graph embeddings can be applied to the TKGQA task. While CronKGQA performs very well on simple questions, it struggles when complex questions with time constraints must be reasoned about, mainly because it lacks the ability to understand the time underlying the question and to reason over time constraints. To compensate for this drawback, TempoQR (Temporal Question Reasoning over Knowledge Graphs) introduces time-range information for each question and uses the EaE (Entities as Experts) method to enhance the semantic information of the question representation. However, although TempoQR improves performance on complex questions, it still does not take time constraints such as "before/after" and "first/last" into account, leading to unreliable results. CronQuestions consists of two parts: a knowledge graph with time labels, and a set of questions and answers requiring temporal reasoning. TempoQR achieves excellent results without considering complex constraints mainly because of the pseudo-temporal questions in CronQuestions.
Disclosure of Invention
In order to handle complex temporal questions effectively, the invention designs a subgraph-based temporal reasoning method (SubGTR) comprising three modules: (1) an implicit expression parsing module, which explicitly rewrites implicit temporal expressions; (2) a relevant fact search module, which attempts to retrieve all potential entities; and (3) a subgraph logical reasoning module, which takes the time constraints into account to determine the final answer. Specifically, in the implicit expression parsing module, the implicit time information contained in a natural language question is accurately identified and replaced by explicit times for subsequent search and reasoning. In the relevant fact search module, the question is first encoded as an embedding containing question semantics, temporal knowledge graph semantics and time semantics, and then the TKG scoring function and the time-sensitivity function are used to obtain candidate entities and their semantic scores. In the subgraph logical reasoning module, a subgraph is first constructed for each question and pruned according to the time constraints. The invention designs time activation functions to obtain a quantitative score for each entity, which helps locate the final prediction.
The invention converts the time-constrained question into a vector representation containing question semantics, temporal knowledge graph semantics and time semantics as follows:
the semantic information of the natural language question is obtained through a pre-trained language model, i.e., the natural language form of the question q_text is converted into a semantic matrix Q^n by pre-trained RoBERTa:
Q^n = W_n RoBERTa(q_text),
where q_i^n ∈ R^D is the i-th row of Q^n, N is the number of tokens, D is the size of the temporal knowledge graph embedding, W_n is a D × D_roberta projection matrix, and D_roberta is the dimension of the RoBERTa embedding;
the pre-trained temporal knowledge graph embeddings are then substituted into the semantic matrix Q^n at the positions of entity and timestamp tokens:
q_i^e = W_E e_ε if token i is annotated with entity ε, q_i^e = W_E t_τ if token i is annotated with timestamp τ, and q_i^e = q_i^n otherwise,
where W_E is a D × D embedding matrix, and e_ε and t_τ are the pre-trained temporal knowledge graph embeddings;
the time embeddings extracted from the temporal knowledge graph are integrated into the question representation:
q_i^T = q_i^e + t_T1 + t_T2,
where Q^T is the matrix formed by all q_i^T, and T1 and T2 are the time range of the question;
the information is fused into a single question representation using an information fusion layer consisting of a learnable encoder Transformer(·), which fuses together the question, entity and time-aware information; the final token embedding matrix Q is computed as
Q = Transformer(Q^T),
where Q = [q_CLS, q_2, ..., q_N]; the final question representation q selects the final output, i.e., q = q_CLS, where CLS is a special output of the Transformer that aggregates all input information.
Further, the candidate entities and their semantic score Score_semantic are obtained using the temporal knowledge graph scoring function φ(·) and the time-sensitivity function f(·) as follows:
the final score of an entity e_ε ∈ E as the answer is given by the following formula:
max(φ(e_s, P_E q, e_ε, t_τ), φ(e_ε, P_E q, e_o, t_τ)),
where s, o and τ are the subject, object and timestamp annotated in the question, and P_E is a D × D learnable matrix used for entity prediction; the subject and object annotated in the question are used interchangeably, and the max(·) function ensures that the score is ignored when s or o is a dummy entity;
furthermore, the final score of a timestamp τ ∈ T as the answer is given by the following formula:
max(φ(e_s, P_T q, e_o, t_τ), f_sensitivity(P_T q, t_τ)),
where s and o are the subject and object annotated in the question, P_T is a D × D learnable matrix used for time prediction, and f_sensitivity is a function measuring the sensitivity of the question representation to the timestamp:
f_sensitivity(P_T q, t_τ) = P_T q · t_τ;
during training, the scores of the entities and timestamps are concatenated to obtain Score_semantic and converted into probabilities through a softmax function; the parameters of the model are continuously updated to assign higher probability to the correct answers;
Score_semantic is thus computed as the concatenation of the scores of all entities and all timestamps.
furthermore, the temporal neighbor subgraph comprises a central entity ehAnd its L-th order neighbour entities, which are connected by edges containing time information.
Further, for the tagged entity e in questionhIf all relevant facts of the temporal knowledge graph are inquired in the temporal knowledge graph, and relevant entities and time information of the temporal knowledge graph are extracted to construct a time sub-graph, the temporal neighbor sub-graph can be expressed as follows:
Figure RE-GDA0003616254690000062
wherein
Figure RE-GDA0003616254690000063
Is ehFor each edge, for a corresponding L-hop graph
Figure RE-GDA0003616254690000064
It contains corresponding time information tr,NL(eh) Is ehThe set of nodes in the non-indirect neighborhood in the knowledge-graph,
Figure RE-GDA0003616254690000065
are the corresponding edges.
Further, pruning the subgraph according to the time constraints comprises:
extracting all paths p1, p2, p3, ... from the neighbor nodes to the central node;
removing the edges in these paths that do not satisfy the time constraint;
deleting all isolated nodes; the remainder G_f is the subgraph that satisfies the time constraint.
Further, each entity in the subgraph is quantitatively scored using a time activation function to obtain the subgraph reasoning score Score_reasoning, as follows:
entities in the subgraphs of different question types are scored using the following four time activation functions:
equal function: when the question type is "equal", entities whose time equals T are assigned the corresponding score and the rest are assigned 0:
f_equal(e_i) = c_1 if t_i = T, and 0 otherwise,
where c_1 is a positive number;
before/after function: when the question type is before/after, entities closer to T1/T2 are given a higher score, the score decreasing monotonically with the distance |t_i - T1| (before) or |t_i - T2| (after);
first/last function: when the question type is first/last, entities earlier/later in time are assigned a higher score, the score decreasing (first) or increasing (last) monotonically with t_i;
time range function: when the question type is time range, entities whose time lies between T1 and T2 are assigned the corresponding score, and the other entities are assigned 0:
f_range(e_i) = c_2 if T1 ≤ t_i ≤ T2, and 0 otherwise,
where c_2 is a positive number;
combining the four time activation functions, the composite scoring function takes the form:
Score_reasoning(e_i) = Σ_k r_k(q_i, e_i) · f_k(e_i),
where the sum runs over the four question types and r_k(q_i, e_i) is an indicator that equals 1 when the corresponding condition holds and 0 otherwise.
Further, the score obtained by temporal subgraph reasoning and the score obtained by relevant fact search are fused as the final entity score:
Score_fusing = μ · Score_semantic + (1 - μ) · Score_reasoning,
where μ ∈ (0,1) balances the weights of the semantic score Score_semantic and the subgraph reasoning score Score_reasoning;
the scores of all entities and timestamps are concatenated, the answer probabilities are computed from this composite score vector using softmax, and the model is trained with cross-entropy loss.
Further, for each question in the CronQuestions dataset, the relevant facts in the temporal knowledge graph are retrieved according to the entity annotations in the question; if their number is less than 5, the question is removed, yielding the Complex-CronQuestions dataset.
Compared with the prior art, the invention has the following beneficial effects:
a subgraph temporal reasoning framework, SubGTR, is proposed to solve TKGQA, improving the ability to identify the time information contained in a question;
the importance of logical reasoning over time constraints is highlighted, and answers satisfying the time constraints are inferred through the temporal subgraph, yielding reliable answers to questions;
pseudo-temporal questions in the CronQuestions dataset are filtered out and a truly complex temporal dataset, Complex-CronQuestions, is provided, which better evaluates a model's ability to answer complex temporal questions; optimal performance is achieved on both the CronQuestions and Complex-CronQuestions datasets;
cold-start experiments demonstrate the reasoning ability of SubGTR on unseen entities.
Drawings
FIG. 1 is a flow diagram of temporal knowledge graph question answering in the prior art;
FIG. 2 is an architectural diagram of implicit expression parsing and related fact search in the present invention;
FIG. 3 is a diagram of the sub-graph logical inference architecture of the present invention;
FIG. 4 is a schematic diagram of temporal neighbor subgraphs in accordance with the present invention;
FIG. 5 is a diagram illustrating a subgraph clipping process according to the present invention;
FIG. 6 is a diagram of four time activation functions in the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, which are not intended to be limiting in any way, and any alterations or substitutions based on the teachings of the invention are intended to fall within the scope of the invention.
The temporal knowledge graph K = (E, R, T, F) is a multi-relational directed graph with time-stamped edges between entities. A fact in K can be formalized as (s, r, o, τ) ∈ F, where s, o ∈ E denote the subject and object entities, r ∈ R denotes the relation between them, and τ ∈ T is the timestamp associated with the relation. A temporal knowledge graph embedding method learns, for each entity, relation and timestamp of K, a D-dimensional vector (e_s, v_r, e_o, t_τ) such that the score given by the scoring function φ(·) for each true fact (s, r, o, τ) ∈ F is higher than for a corrupted fact, formally φ(e_s, v_r, e_o, t_τ) > φ(e_s', v_r', e_o', t_τ').
Given a temporal knowledge graph K and a natural language question q, temporal knowledge graph question answering aims to obtain an entity s/o ∈ E or a timestamp τ ∈ T that correctly answers the question q. In general, the question q can be formulated as a list of tokens [token_1, ..., token_n], where token_i is the i-th token of the question and n is the length of the question. In a temporal knowledge graph question-answering dataset D, one data sample consists of a question-answer pair (q_i, a_i) together with links between the entities mentioned in q_i and their unique TKG entity identifiers.
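To make this formalization concrete, the following minimal Python sketch shows one possible in-memory layout for temporal facts and question-answer samples; all class names, field names and the toy facts are illustrative assumptions, not part of the invention.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# A temporal fact (s, r, o, [start, end]) as stored in the TKG.
@dataclass(frozen=True)
class TemporalFact:
    subject: str    # entity identifier
    relation: str   # relation identifier
    obj: str        # entity identifier
    start: int      # start year (timestamp)
    end: int        # end year (timestamp)

# One TKGQA sample: question text, gold answers, and links from surface
# mentions to the unique TKG entity identifiers.
@dataclass
class QASample:
    question: str
    answers: List[str]                                          # entity ids or timestamps
    entity_links: Dict[str, str] = field(default_factory=dict)  # mention -> entity id
    time_annotations: List[int] = field(default_factory=list)   # explicit years in the question

if __name__ == "__main__":
    kg = [
        TemporalFact("World_War_II", "occurred_in", "World", 1939, 1945),
        TemporalFact("Harry_S._Truman", "position_held", "President_of_the_USA", 1945, 1953),
    ]
    sample = QASample(
        question="Who was the first president of the United States after World War II?",
        answers=["Harry_S._Truman"],
        entity_links={"World War II": "World_War_II"},
    )
    print(len(kg), sample.question)
```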
Figs. 2 and 3 show the overall architecture of the invention. It consists of three modules: an implicit expression parsing module, a relevant fact search module, and a subgraph logical reasoning module. First, the question is simplified using background knowledge in the temporal knowledge graph, and the corresponding time constraints are obtained. Then, candidate entities are searched within the temporal knowledge graph embeddings and initial scores are obtained. Finally, a question subgraph is constructed, pruned and reasoned over using the time constraints to obtain the answer entity.
In the implicit expression parsing module, the time information hidden in the question is extracted and parsed using knowledge in the temporal knowledge graph, the natural-language question is replaced with a simplified question by regular expressions, and the time constraints T1 and T2 are attached to the question for subsequent reasoning and querying. In the relevant fact search module, the natural language question is first converted into a vector representation containing question semantics, temporal knowledge graph semantics and time semantics, and then the TKG scoring function φ(·) and the time-sensitivity function f(·) are used to obtain candidate entities and their semantic score Score_semantic. In the subgraph temporal reasoning module, a subgraph is first built for each question and then pruned according to the time constraints. Finally, a time activation function is used to quantitatively score each entity in the subgraph, producing Score_reasoning, and the final answer is obtained from the fused score.
Complex questions contain a great deal of implicit time information, which is an obstacle for question-answering systems. Humans naturally recall existing common-sense information to simplify the question. Based on this intuition, the implicit expression parsing module is designed so that the model can simplify the question and make the time information hidden in it explicit. It mainly comprises two parts: generalized knowledge extraction and specific knowledge extraction.
Generalized knowledge extraction: in practice, the names of well-known events are often used as time constraints, e.g., "Who was the first president of the United States after World War II?". World War II has no direct relationship with the American president, but it can be regarded as implicit time information. Due to the lack of background knowledge, existing models tend to miss such time information. Therefore, the implicit event constraint appearing in the natural language question is replaced with explicit time information (here, the period of World War II, 1939-1945) by using the event knowledge in the temporal knowledge graph.
Specific knowledge extraction: besides well-known events, other kinds of hidden temporal information may appear in complex questions, e.g., "Who made a significant contribution to quantum mechanics before Einstein?". The time information cannot be obtained from Einstein alone; the knowledge graph must be queried with the joint information (quantum mechanics, Einstein) to obtain the corresponding time information, which is called specific knowledge extraction.
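The following sketch illustrates, under simplified assumptions, how implicit event mentions could be rewritten into explicit time constraints with regular expressions; the event table, patterns and function names are illustrative only and not the patented implementation.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed lookup table built from event facts in the TKG (generalized knowledge extraction).
EVENT_PERIODS: Dict[str, Tuple[int, int]] = {
    "world war ii": (1939, 1945),
    "world war i": (1914, 1918),
}

def parse_implicit_expression(question: str) -> Tuple[str, Optional[int], Optional[int]]:
    """Replace an implicit event constraint with explicit times; return (simplified question, T1, T2)."""
    q = question.lower()
    for event, (start, end) in EVENT_PERIODS.items():
        # "after <event>" -> the question's time range starts at the event's end.
        m = re.search(rf"\bafter (the )?{re.escape(event)}\b", q)
        if m:
            return q.replace(m.group(0), f"after {end}"), end, None
        # "before <event>" -> the question's time range ends at the event's start.
        m = re.search(rf"\bbefore (the )?{re.escape(event)}\b", q)
        if m:
            return q.replace(m.group(0), f"before {start}"), None, start
    return q, None, None

if __name__ == "__main__":
    simplified, t1, t2 = parse_implicit_expression(
        "Who was the first president of the United States after World War II?")
    print(simplified)   # who was the first president of the united states after 1945?
    print(t1, t2)       # 1945 None
```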
In the relevant fact search module, inspired by EaE and TempoQR, a language model and temporal knowledge graph embeddings are used to convert the natural language question into a vector representation (question encoding) containing question semantics, temporal knowledge graph semantics and time semantics, and then the scoring function and the time-sensitivity function of the temporal knowledge graph are used to obtain candidate entities and their semantic scores.
Encoding the question involves obtaining the semantic information of the natural language question, capturing semantic knowledge about the entities in it, merging the time embeddings of the temporal knowledge graph into the question representation, and fusing the question, entity and time-aware information.
Semantics of natural language: the semantic information of the natural language question is obtained through a pre-trained language model. More precisely, the natural language form of the question q_text is converted into a semantic matrix Q^n by pre-trained RoBERTa:
Q^n = W_n RoBERTa(q_text),
where q_i^n ∈ R^D is the i-th row of Q^n, N is the number of tokens, D is the size of the temporal knowledge graph embedding, W_n is a D × D_roberta projection matrix, and D_roberta is the dimension of the RoBERTa embedding.
Semantics of the entities: to capture semantic knowledge about the entities in the natural language question, the pre-trained temporal knowledge graph embeddings replace the rows of Q^n at the positions of entity and timestamp tokens:
q_i^e = W_E e_ε if token i is annotated with entity ε, q_i^e = W_E t_τ if token i is annotated with timestamp τ, and q_i^e = q_i^n otherwise,
where W_E is a D × D embedding matrix and e_ε and t_τ are the pre-trained temporal knowledge graph embeddings. This enriches the question representation with the entity information of the temporal knowledge graph.
Semantics of time: in the implicit expression parsing module, the time information in the question has been made explicit, yielding T1 and T2. To strengthen the awareness of temporal information, the time embeddings extracted from the temporal knowledge graph are incorporated into the question representation:
q_i^T = q_i^e + t_T1 + t_T2,
where Q^T is the matrix formed by all q_i^T, and T1 and T2 are the time range of the question obtained by the implicit expression parsing module.
Semantic fusion: next, an information fusion layer is used to fuse the information into a single question representation q. The layer consists of a learnable encoder Transformer(·) that fuses together the question, entity and time-aware information. The final token embedding matrix Q is computed as
Q = Transformer(Q^T),
where Q = [q_CLS, q_2, ..., q_N]; the final question representation q selects the final output, i.e., q = q_CLS.
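A condensed PyTorch-style sketch of this encoding pipeline is given below, assuming Hugging Face RoBERTa and frozen pre-trained TKG embeddings are available; tensor shapes, module names and the additive time-fusion step follow the description above but are otherwise illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class QuestionEncoder(nn.Module):
    def __init__(self, kg_emb: nn.Embedding, time_emb: nn.Embedding, d_model: int = 512):
        super().__init__()
        self.lm = RobertaModel.from_pretrained("roberta-base")
        self.w_n = nn.Linear(self.lm.config.hidden_size, d_model)   # D x D_roberta projection W_n
        self.w_e = nn.Linear(d_model, d_model, bias=False)          # D x D entity projection W_E
        self.kg_emb, self.time_emb = kg_emb, time_emb                # frozen pre-trained TKG embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=6)     # information fusion layer

    def forward(self, input_ids, attention_mask, ent_pos, ent_ids, t1, t2):
        # Q^n = W_n RoBERTa(q_text)
        q_n = self.w_n(self.lm(input_ids, attention_mask=attention_mask).last_hidden_state)
        # Q^e: overwrite rows at annotated entity positions with projected KG embeddings
        q_e = q_n.clone()
        batch_idx = torch.arange(q_n.size(0)).unsqueeze(1)
        q_e[batch_idx, ent_pos] = self.w_e(self.kg_emb(ent_ids))
        # Q^T = Q^e + t_T1 + t_T2 (time range from implicit expression parsing)
        q_t = q_e + self.time_emb(t1).unsqueeze(1) + self.time_emb(t2).unsqueeze(1)
        # Q = Transformer(Q^T); the CLS position is the final question representation q
        q = self.fusion(q_t)
        return q[:, 0]   # q = q_CLS
```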
Scoring function
The final score of an entity e_ε ∈ E as the answer is given by
max(φ(e_s, P_E q, e_ε, t_τ), φ(e_ε, P_E q, e_o, t_τ)),
where s, o and τ are the subject, object and timestamp annotated in the question, and P_E is a D × D learnable matrix dedicated to entity prediction. The annotated subject and object are treated interchangeably, and the max(·) function ensures that the score is ignored when s or o is a dummy entity.
Similarly, the final score of a timestamp τ ∈ T as the answer is given by
max(φ(e_s, P_T q, e_o, t_τ), f_sensitivity(P_T q, t_τ)),
where s and o are the annotated entities in the question and P_T is a D × D learnable matrix dedicated to time prediction. f_sensitivity is a function measuring the sensitivity of the question representation to the timestamp:
f_sensitivity(P_T q, t_τ) = P_T q · t_τ.
During training, the scores of all entities and timestamps are concatenated and converted into probabilities by a softmax function. The parameters of the model are continuously updated to assign higher probability to the correct answers.
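The sketch below shows one way this scoring could be realized on top of TComplEx-style complex-valued embeddings; the φ implementation, the dot-product time-sensitivity term and all names are assumptions made for illustration rather than the exact patented formulas.

```python
import torch

def tcomplex_phi(e_s, v_r, e_o, t):
    """Assumed TComplEx-style score Re(<e_s, v_r * t, conj(e_o)>), real/imag halves stacked."""
    d = e_s.size(-1) // 2
    sr, si = e_s[..., :d], e_s[..., d:]
    rr, ri = v_r[..., :d], v_r[..., d:]
    or_, oi = e_o[..., :d], e_o[..., d:]
    tr, ti = t[..., :d], t[..., d:]
    rtr, rti = rr * tr - ri * ti, rr * ti + ri * tr        # relation rotated by time
    return ((sr * rtr - si * rti) * or_ + (sr * rti + si * rtr) * oi).sum(-1)

def semantic_scores(q, e_s, e_o, t_tau, ent_emb, time_emb, P_E, P_T):
    """Score_semantic: concatenated entity and timestamp scores (before softmax)."""
    rel_e = q @ P_E                                         # question projected for entity prediction
    ent_score = torch.maximum(
        tcomplex_phi(e_s.unsqueeze(0), rel_e, ent_emb, t_tau.unsqueeze(0)),
        tcomplex_phi(ent_emb, rel_e, e_o.unsqueeze(0), t_tau.unsqueeze(0)))
    rel_t = q @ P_T                                         # question projected for time prediction
    time_score = torch.maximum(
        tcomplex_phi(e_s.unsqueeze(0), rel_t, e_o.unsqueeze(0), time_emb),
        (rel_t * time_emb).sum(-1))                         # f_sensitivity = P_T q · t_tau
    return torch.cat([ent_score, time_score], dim=-1)
```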
In the subgraph temporal reasoning module, a temporal neighbor subgraph is first constructed for each question, then the subgraph is pruned according to the time constraints, and finally a time activation function is used to quantitatively apply the time constraints and obtain the final answer. The module comprises four steps: subgraph extraction, subgraph pruning, entity reasoning scoring, and score fusion.
Subgraph extraction: the neighbors of the entity annotated in the question contain much information that helps answer the question. To obtain the information of neighbor entities, a temporal neighbor subgraph is first constructed. The temporal neighbor subgraph comprises a central entity e_h (the teacher in Fig. 4) and its neighbor entities within L hops, connected by edges that carry time information. More precisely, for the tagged entity e_h in the question, all relevant facts are queried in the temporal knowledge graph, and the related entities and time information are extracted to construct the temporal subgraph. N_L(e_h) is the set of nodes in the L-hop neighborhood of e_h in the KG and E_L(e_h) are the corresponding edges, so the temporal neighbor subgraph can be represented as
G_L(e_h) = (N_L(e_h), E_L(e_h)),
where G_L(e_h) is the L-hop graph corresponding to e_h, and each edge in E_L(e_h) carries its corresponding time information t_r.
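A small sketch of subgraph extraction over a plain list of temporal facts is shown below; the breadth-first expansion and the data layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# An edge of the temporal neighbor subgraph: (head, relation, tail, start, end).
Edge = Tuple[str, str, str, int, int]

def extract_temporal_subgraph(facts: List[Edge], e_h: str, hops: int = 1) -> Tuple[Set[str], List[Edge]]:
    """Collect the L-hop temporal neighbor subgraph G_L(e_h) = (N_L(e_h), E_L(e_h))."""
    adj: Dict[str, List[Edge]] = defaultdict(list)
    for fact in facts:
        s, _, o, _, _ = fact
        adj[s].append(fact)
        adj[o].append(fact)          # treat the graph as undirected for neighborhood search
    nodes, edges, frontier = {e_h}, set(), {e_h}
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for s, r, o, t1, t2 in adj[node]:
                edges.add((s, r, o, t1, t2))
                nxt.update((s, o))
        frontier = nxt - nodes
        nodes |= nxt
    return nodes, sorted(edges)

if __name__ == "__main__":
    facts = [("Einstein", "contributed_to", "Quantum_mechanics", 1905, 1925),
             ("Planck", "contributed_to", "Quantum_mechanics", 1900, 1918)]
    print(extract_temporal_subgraph(facts, "Quantum_mechanics", hops=1))
```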
Subgraph pruning: since the temporal neighbor subgraph contains a large amount of useless neighbor information, the time constraints in the question are used to filter it. The implicit expression parsing module has already produced the time range [T1, T2] of each question, where T1 is the start time and T2 is the end time, and all paths on the subgraph are checked against this constraint. Referring to Fig. 5, specifically, all paths p1, p2, p3 from the neighbor nodes to the central node are first extracted; the edges that do not satisfy the time constraint are removed; and all isolated nodes are deleted. The remainder G_f is the subgraph that satisfies the time constraint.
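The pruning step could be implemented as in the following sketch, which reuses the edge layout of the extraction sketch above; the interval-overlap test used for the time constraint is an assumption.

```python
from typing import List, Optional, Set, Tuple

Edge = Tuple[str, str, str, int, int]   # (head, relation, tail, start, end)

def prune_subgraph(edges: List[Edge], center: str,
                   t1: Optional[int], t2: Optional[int]) -> Tuple[Set[str], List[Edge]]:
    """Drop edges whose time span violates [T1, T2], then drop nodes left isolated; returns G_f."""
    lo = t1 if t1 is not None else float("-inf")
    hi = t2 if t2 is not None else float("inf")
    kept = [(s, r, o, a, b) for s, r, o, a, b in edges
            if a <= hi and b >= lo]      # keep edges whose interval overlaps [T1, T2]
    nodes = {center}
    for s, _, o, _, _ in kept:
        nodes.update((s, o))             # only nodes still attached to a surviving edge remain
    return nodes, kept

if __name__ == "__main__":
    edges = [("Truman", "president_of", "USA", 1945, 1953),
             ("Roosevelt", "president_of", "USA", 1933, 1945)]
    # "after World War II" parsed here as T1 = 1946 with an open-ended T2
    print(prune_subgraph(edges, "USA", 1946, None))
```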
Entity reasoning scoring: referring to Fig. 6, to quantitatively apply the time constraints, four time activation functions are designed, as follows, to score the entities in the subgraphs of the different question types.
Equal: when the question type is "equal", entities whose time is the same as T are assigned the corresponding score c_1 and the rest are assigned 0:
f_equal(e_i) = c_1 if t_i = T, and 0 otherwise,
where c_1 is a positive number.
Before/After: when the question type is before/after, entities closer to T1/T2 are given a higher score, i.e., the score decreases monotonically with the distance |t_i - T1| (before) or |t_i - T2| (after).
First/Last: when the question type is first/last, entities earlier/later in time are assigned a higher score, i.e., the score decreases (first) or increases (last) monotonically with t_i.
Time range: when the question type is time range (time join), entities whose time lies between T1 and T2 are assigned the corresponding score c_2, and the other entities are assigned 0:
f_range(e_i) = c_2 if T1 ≤ t_i ≤ T2, and 0 otherwise,
where c_2 is a positive number.
Thus, combining the four time activation functions, the composite scoring function takes the form
Score_reasoning(e_i) = Σ_k r_k(q_i, e_i) · f_k(e_i),
where the sum runs over the four question types and r_k(q_i, e_i) is an indicator that equals 1 when the corresponding condition holds and 0 otherwise.
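The following sketch implements one possible concrete form of these activation functions; since the original formulas are reproduced only as figures, the exact expressions (in particular the inverse-distance forms for before/after and first/last) are assumptions consistent with the qualitative description above.

```python
from typing import Dict, List, Tuple

def f_equal(t_i: int, T: int, c1: float = 1.0) -> float:
    return c1 if t_i == T else 0.0

def f_before_after(t_i: int, T: int) -> float:
    # Higher score for entities whose time is closer to T (T = T1 for "before", T2 for "after").
    return 1.0 / (abs(t_i - T) + 1.0)

def f_first_last(t_i: int, times: List[int], last: bool = False) -> float:
    # Earlier times score higher for "first", later times score higher for "last".
    anchor = max(times) if last else min(times)
    return 1.0 / (abs(t_i - anchor) + 1.0)

def f_time_range(t_i: int, T1: int, T2: int, c2: float = 1.0) -> float:
    return c2 if T1 <= t_i <= T2 else 0.0

def score_reasoning(entities: List[Tuple[str, int]], qtype: str, T1: int, T2: int) -> Dict[str, float]:
    """Apply the activation function matching the question type to every entity in the pruned subgraph."""
    times = [t for _, t in entities]
    out: Dict[str, float] = {}
    for e, t in entities:
        if qtype == "equal":
            out[e] = f_equal(t, T1)
        elif qtype in ("before", "after"):
            out[e] = f_before_after(t, T1 if qtype == "before" else T2)
        elif qtype in ("first", "last"):
            out[e] = f_first_last(t, times, last=(qtype == "last"))
        else:  # time range
            out[e] = f_time_range(t, T1, T2)
    return out

if __name__ == "__main__":
    ents = [("Truman", 1945), ("Eisenhower", 1953)]
    print(score_reasoning(ents, "first", 1945, 9999))   # Truman receives the highest score
```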
Score fusion: the score obtained by temporal subgraph reasoning and the score obtained by relevant fact search are fused as the final entity score:
Score_fusing = μ · Score_semantic + (1 - μ) · Score_reasoning,
where μ ∈ (0,1) balances the weights of the semantic score and the subgraph reasoning score.
The scores of all entities and timestamps are concatenated, the answer probabilities are computed from this composite score vector using softmax, and the model is trained with cross-entropy loss.
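A minimal PyTorch sketch of the fusion and training objective is given below, under the assumption that the reasoning scores have already been aligned to the same candidate vocabulary as the semantic scores.

```python
import torch
import torch.nn.functional as F

def fuse_and_loss(score_semantic: torch.Tensor,
                  score_reasoning: torch.Tensor,
                  answer_idx: torch.Tensor,
                  mu: float = 0.5) -> torch.Tensor:
    """Score_fusing = mu * Score_semantic + (1 - mu) * Score_reasoning, trained with cross-entropy."""
    score_fusing = mu * score_semantic + (1.0 - mu) * score_reasoning
    # cross_entropy applies softmax over the concatenated entity/timestamp scores internally
    return F.cross_entropy(score_fusing, answer_idx)

if __name__ == "__main__":
    sem = torch.randn(2, 10)     # batch of 2 questions, 10 candidate entities/timestamps
    rea = torch.zeros(2, 10)
    rea[:, 3] = 1.0              # subgraph reasoning favors candidate 3
    gold = torch.tensor([3, 3])
    print(fuse_and_loss(sem, rea, gold).item())
```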
The following is a part of the experimental results; the invention was implemented in PyTorch. TComplEx is used as the TKG embedding, with embedding size D = 512. The pre-trained language model used in semantic search is RoBERTa. During training, the parameters of the language model and of the TKG embeddings are fixed and not updated. The number of layers of the encoder Transformer(·) is set to 6. The number of hops of the subgraph is set to 1 and the score fusion weight μ is set to 0.5. The parameters of the model are updated with Adam with a learning rate of 0.0002. The model is trained for at most 20 epochs, with the final parameters chosen by best validation performance.
CronQuestions is a dataset for temporal knowledge graph question answering, comprising a temporal KG with 125k entities and 328k facts and a set of 410k natural language questions requiring temporal reasoning. In this temporal knowledge graph, facts are represented as (subject, relation, object, [start time, end time]). The CronQuestions dataset has 410,000 question-answer pairs, of which 350,000 are used for training and 30,000 each for validation and testing. In addition, the entities and times appearing in the questions are annotated. CronQuestions contains four types of questions, including simple and complex temporal questions. However, CronQuestions contains many pseudo-temporal questions, in which the time constraint plays no role because of the lack of related facts in the TKG. To remedy this deficiency, the invention constructs Complex-CronQuestions, which eliminates the simple and pseudo-temporal questions in CronQuestions. Specifically, for each question, its related facts in the temporal knowledge graph are retrieved based on the entity annotations in the question, and if their number is less than 5 the question is removed. Table 1 shows the dataset partitioning of CronQuestions and Complex-CronQuestions.
TABLE 1
Data set CronQuestions Complex-CronQuestions
train 350000 35795
valid 30000 5020
test 30000 5006
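A sketch of the filtering rule used to build Complex-CronQuestions (questions with fewer than 5 related facts in the TKG are dropped) is given below; the data-access helpers and the toy fact are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str, str, int, int]   # (subject, relation, object, start, end)

def related_fact_count(facts: List[Fact], annotated_entities: Set[str]) -> int:
    """Number of TKG facts that mention any entity annotated in the question."""
    return sum(1 for s, _, o, _, _ in facts if s in annotated_entities or o in annotated_entities)

def build_complex_cronquestions(questions: List[Dict], facts: List[Fact],
                                min_facts: int = 5) -> List[Dict]:
    """Keep only questions whose annotated entities have at least `min_facts` related facts."""
    return [q for q in questions
            if related_fact_count(facts, set(q["entity_annotations"])) >= min_facts]

if __name__ == "__main__":
    facts = [("Carlo_Taverna", "award_received", "Some_Award", 1900, 1900)]
    questions = [{"text": "What is the first award obtained by Carlo Taverna?",
                  "entity_annotations": ["Carlo_Taverna"]}]
    # Only one related fact exists, so this pseudo-temporal question is filtered out.
    print(build_complex_cronquestions(questions, facts))   # []
```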
The model of the invention is compared with the following representative works.
BERT: experiments are conducted with BERT and RoBERTa. To evaluate these models, their LM-based question embedding is generated, concatenated with the annotated entity and time embeddings, and passed through a learnable projection. The resulting embedding scores all entities and timestamps by dot product, and softmax is applied to predict the answer probabilities.
EmbedKGQA: EmbedKGQA is designed for conventional KGs. For TKGQA it is adapted as follows: timestamps are ignored during pre-training and random time embeddings are used in the QA task.
CronKGQA: CronKGQA is a TKG-embedding-based method that first uses a language model to obtain the question embedding and then uses a TKG-embedding-based scoring function to predict the answer.
EaE: Entities as Experts (EaE) is an entity-aware method. In the experiments, TKG embeddings are used to enhance the question representation, and the answer probability is then predicted by dot product.
EntityQR: based on EaE, EntityQR uses a TKG-embedding-based scoring function for answer prediction.
TempoQR: similar to EaE, TempoQR predicts the answer using a TKG-embedding-based scoring function and fuses additional temporal information.
Results on the CronQuestions dataset: Table 2 shows the comparison of the invention with the other baselines on CronQuestions. First, comparing TempoQR with CronKGQA shows that integrating the entities and times contained in the question into the question representation is a great help in answering complex questions: on complex questions, Hits@1 and Hits@10 improve by 27% and 9%, respectively. Furthermore, comparing SubGTR and TempoQR shows how much subgraph reasoning contributes: relative to TempoQR, SubGTR improves overall Hits@1 by 5% and the Hits@1 of complex questions by 8%. The model is a clear improvement over the current baselines and achieves the best results on CronQuestions.
TABLE 2 results on the CronQuestions dataset (reproduced as a figure in the original)
Results on the Complex-CronQuestions dataset: to verify the validity of the model on genuinely complex questions, experiments were performed on the Complex-CronQuestions dataset. Relative to their results on CronQuestions, CronKGQA, EntityQR and TempoQR all drop significantly on Complex-CronQuestions, by 0.381, 0.320 and 0.126 at Hits@1, respectively. SubGTR shows the best results, with 0.920 at Hits@1, a drop of only 0.04 compared with its result on CronQuestions, highlighting the model's ability to handle complex questions.
In CronQuestions and Complex-CronQuestions, the entities involved in the training set and the test set do not overlap. However, the temporal knowledge graph embedding used for training covers all entities, so the model can access information about the test-set entities during training. To verify the effectiveness of the model on the entity cold-start problem, the facts of the related test-set entities are deleted from the temporal knowledge graph Wikidata, and the temporal knowledge graph embedding is retrained on the training-set entities only, so that information about test-set entities never appears in training.
In this way, CronQuestions (zero) and Complex-CronQuestions (zero) are constructed, in which the entities involved in prediction do not appear in training and have no corresponding pre-trained TKG embedding, simulating the extreme entity cold-start problem caused by newly added entities. The experimental results are shown in Table 4. TempoQR relies mainly on TKG embeddings for reasoning; when the test set has no TKG embeddings, random vectors drawn from a normal distribution are used as the entity embeddings. The results show that TempoQR performs poorly on entities and retains only some effect on predicting times. In contrast, the accuracy of SubGTR on the CronQuestions (zero) and Complex-CronQuestions (zero) datasets reaches 92.0% and 82.0%, respectively, which shows that SubGTR can fully utilize the structural information of the temporal subgraph to reason and answer on entities it has never seen, without being tied to specific entities. The model of the invention thus effectively avoids the entity cold-start problem.
Table 4 experiments on unseen entities (results reproduced as a figure in the original)
To validate the effectiveness of the various modules of SubGTR, ablation experiments were performed. First, the subgraph reasoning module is removed and the score of the relevant fact search module is used directly as the final score; the Hits@1 value of the model drops by 7%, highlighting the effectiveness of the subgraph reasoning module.
Second, implicit expression parsing recovers the time information hidden in the question. Replacing this module with the Hard-Supervision method from TempoQR reduces the model by 2% on Hits@1, showing that the implicit expression parsing module accurately extracts hidden time information, simplifies the natural language question, and improves the quality of the question representation.
Finally, the time-sensitivity score improves the model's perception of timestamps. Replacing it with the time scoring function of the temporal knowledge graph embedding method reduces performance by 3%, demonstrating the effectiveness of the time-sensitivity score proposed by the invention.
TABLE 5 ablation test results (reproduced as a figure in the original)
A case study is conducted in this section to demonstrate the unique advantages of SubGTR over the current best method, TempoQR. Table 6 shows typical examples in which TempoQR answers incorrectly, highlighting its design flaws. In contrast, SubGTR can effectively use the time constraints to prune and score entities through temporal subgraph reasoning and obtain the correct answers.
TABLE 6 typical examples of TempoQR mispredictions
Question type | Typical question | TempoQR | The invention
Before/After | Who was the first president of the United States after World War II? | Wrong | Correct
First/Last | Who was the Norwegian Minister of Finance? | Wrong | Correct
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to include any of the permutations: if X employs A, X employs B, or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing examples.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes," "has," "contains," or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be realized in hardware or as a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, etc. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims (9)

1. A temporal knowledge graph question-answering method based on subgraphs, characterized by comprising the following steps:
extracting and parsing the time information hidden in the question using knowledge in the temporal knowledge graph, replacing the natural-language question with a simplified question by means of regular expressions, and attaching the time range T1 and T2 of the question to the question for subsequent reasoning and querying;
converting the time-constrained question into a vector representation containing question semantics, temporal knowledge graph semantics and time semantics, and then obtaining candidate entities and their semantic score using a temporal knowledge graph scoring function φ(·) and a time-sensitivity function f(·);
constructing a temporal neighbor subgraph for each question; pruning the temporal neighbor subgraph according to the time constraints; quantitatively scoring each entity in the temporal neighbor subgraph with a time activation function to obtain a subgraph reasoning score;
and fusing the semantic score and the subgraph reasoning score to obtain the final answer.
2. The subgraph-based temporal knowledge graph question-answering method according to claim 1, wherein converting the time-constrained question into a vector representation containing question semantics, temporal knowledge graph semantics and time semantics comprises:
obtaining the semantic information of the natural language question through a pre-trained language model, i.e., the natural language form of the question q_text is converted into a semantic matrix Q^n by pre-trained RoBERTa:
Q^n = W_n RoBERTa(q_text),
where q_i^n ∈ R^D is the i-th row of Q^n, N is the number of tokens, D is the size of the temporal knowledge graph embedding, W_n is a D × D_roberta projection matrix, and D_roberta is the dimension of the RoBERTa embedding;
substituting the pre-trained temporal knowledge graph embeddings into the semantic matrix Q^n at the positions of entity and timestamp tokens:
q_i^e = W_E e_ε if token i is annotated with entity ε, q_i^e = W_E t_τ if token i is annotated with timestamp τ, and q_i^e = q_i^n otherwise,
where W_E is a D × D embedding matrix, and e_ε and t_τ are the pre-trained temporal knowledge graph embeddings;
integrating the time embeddings extracted from the temporal knowledge graph into the question representation:
q_i^T = q_i^e + t_T1 + t_T2,
where Q^T is the matrix formed by all q_i^T, and T1 and T2 are the time range of the question;
fusing the information into a single question representation using an information fusion layer consisting of a learnable encoder Transformer(·), which fuses together the question, entity and time-aware information; the final token embedding matrix Q is computed as
Q = Transformer(Q^T),
where Q = [q_CLS, q_2, ..., q_N]; the final question representation q selects the final output, i.e., q = q_CLS, where CLS is a special output of the Transformer that aggregates all input information.
3. The subgraph-based temporal knowledge graph question-answering method according to claim 1, wherein the candidate entities and their semantic score Score_semantic are obtained using the temporal knowledge graph scoring function φ(·) and the time-sensitivity function f(·) as follows:
the final score of an entity e_ε ∈ E as the answer is given by the following formula:
max(φ(e_s, P_E q, e_ε, t_τ), φ(e_ε, P_E q, e_o, t_τ)),
where s, o and τ are the subject, object and timestamp annotated in the question, and P_E is a D × D learnable matrix used for entity prediction; the subject and object annotated in the question are used interchangeably, and the max(·) function ensures that the score is ignored when s or o is a dummy entity;
furthermore, the final score of a timestamp τ ∈ T as the answer is given by the following formula:
max(φ(e_s, P_T q, e_o, t_τ), f_sensitivity(P_T q, t_τ)),
where s and o are the subject and object annotated in the question, P_T is a D × D learnable matrix used for time prediction, and f_sensitivity is a function measuring the sensitivity of the question representation to the timestamp:
f_sensitivity(P_T q, t_τ) = P_T q · t_τ;
during training, the scores of the entities and timestamps are concatenated to obtain Score_semantic and converted into probabilities through a softmax function; the parameters of the model are continuously updated to assign higher probability to the correct answers;
Score_semantic is the concatenation of the scores of all entities and all timestamps.
4. the subgraph-based temporal knowledge-graph question-answering method of claim 1, wherein the temporal neighbor subgraphs comprise a central entity ehAnd its L-th order neighbour entities, which are connected by edges containing time information.
5. The subgraph-based temporal knowledge graph question-answering method according to claim 4, wherein, for a tagged entity e_h in the question, all relevant facts are queried in the temporal knowledge graph, and the related entities and time information are extracted to construct the temporal subgraph; the temporal neighbor subgraph can then be expressed as:
G_L(e_h) = (N_L(e_h), E_L(e_h)),
where G_L(e_h) is the L-hop graph corresponding to e_h; each edge in E_L(e_h) carries its corresponding time information t_r; N_L(e_h) is the set of nodes in the L-hop neighborhood of e_h in the knowledge graph, and E_L(e_h) are the corresponding edges.
6. The subgraph-based temporal knowledge graph question-answering method according to claim 1, wherein pruning the subgraph according to the time constraints comprises:
extracting all paths p1, p2, p3, ... from the neighbor nodes to the central node;
removing the edges in the paths that do not satisfy the time constraint;
deleting all isolated nodes; the remainder G_f is the subgraph that satisfies the time constraint.
7. The subgraph-based temporal knowledge graph question-answering method according to claim 1, wherein each entity in the subgraph is quantitatively scored using a time activation function to obtain Score_reasoning as follows:
entities in the subgraphs of different question types are scored using the following four time activation functions:
equal function: when the question type is "equal", entities whose time equals T are assigned the corresponding score and the rest are assigned 0:
f_equal(e_i) = c_1 if t_i = T, and 0 otherwise,
where c_1 is a positive number;
before/after function: when the question type is before/after, entities closer to T1/T2 are given a higher score, the score decreasing monotonically with the distance |t_i - T1| (before) or |t_i - T2| (after);
first/last function: when the question type is first/last, entities earlier/later in time are assigned a higher score, the score decreasing (first) or increasing (last) monotonically with t_i;
time range function: when the question type is time range, entities whose time lies between T1 and T2 are assigned the corresponding score, and the other entities are assigned 0:
f_range(e_i) = c_2 if T1 ≤ t_i ≤ T2, and 0 otherwise,
where c_2 is a positive number;
combining the four time activation functions, the composite scoring function takes the form:
Score_reasoning(e_i) = Σ_k r_k(q_i, e_i) · f_k(e_i),
where the sum runs over the four question types and r_k(q_i, e_i) is an indicator that equals 1 when the corresponding condition holds and 0 otherwise.
8. The subgraph-based temporal knowledge graph question-answering method according to claim 1, wherein the score obtained by temporal subgraph reasoning and the score obtained by relevant fact search are fused as the final entity score:
Score_fusing = μ · Score_semantic + (1 - μ) · Score_reasoning,
where μ ∈ (0,1) balances the weights of the semantic score Score_semantic and the subgraph reasoning score Score_reasoning;
the scores of all entities and timestamps are concatenated, the answer probabilities are computed from this composite score vector using softmax, and the model is trained with cross-entropy loss.
9. The subgraph-based temporal knowledge graph question-answering method according to claim 1, wherein, for each question in the CronQuestions dataset, the relevant facts in the temporal knowledge graph are retrieved according to the entity annotations in the question, and if their number is less than 5 the question is removed, yielding the Complex-CronQuestions dataset.
CN202210347571.7A 2022-04-01 2022-04-01 Temporal knowledge graph question-answering method based on subgraph Pending CN114706951A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210347571.7A CN114706951A (en) 2022-04-01 2022-04-01 Temporal knowledge graph question-answering method based on subgraph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210347571.7A CN114706951A (en) 2022-04-01 2022-04-01 Temporal knowledge graph question-answering method based on subgraph

Publications (1)

Publication Number Publication Date
CN114706951A true CN114706951A (en) 2022-07-05

Family

ID=82172778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210347571.7A Pending CN114706951A (en) 2022-04-01 2022-04-01 Temporal knowledge graph question-answering method based on subgraph

Country Status (1)

Country Link
CN (1) CN114706951A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115374296A (en) * 2022-10-25 2022-11-22 科大讯飞(苏州)科技有限公司 Question-answering method based on time sequence knowledge graph, entity representation method and related device
CN118095450A (en) * 2024-04-26 2024-05-28 支付宝(杭州)信息技术有限公司 Knowledge-graph-based medical LLM model reasoning method and related equipment


Similar Documents

Publication Publication Date Title
Yin et al. Answering questions with complex semantic constraints on open knowledge bases
CN113806563B (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN112002411A (en) Cardiovascular and cerebrovascular disease knowledge map question-answering method based on electronic medical record
Bansal et al. Structured learning for taxonomy induction with belief propagation
Ru et al. Using semantic similarity to reduce wrong labels in distant supervision for relation extraction
CN114706951A (en) Temporal knowledge graph question-answering method based on subgraph
Sovrano et al. Legal knowledge extraction for knowledge graph based question-answering
US20170169355A1 (en) Ground Truth Improvement Via Machine Learned Similar Passage Detection
CN110532328A (en) A kind of text concept figure building method
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
Efremova et al. Multi-source entity resolution for genealogical data
McNamee et al. HLTCOE participation at TAC 2012: Entity linking and cold start knowledge base construction
Li et al. Neural factoid geospatial question answering
Kilias et al. Idel: In-database entity linking with neural embeddings
Chemmengath et al. Topic transferable table question answering
Li et al. Approach of intelligence question-answering system based on physical fitness knowledge graph
Hathout Acquisition of morphological families and derivational series from a machine readable dictionary
Sheikh et al. On semi-automated extraction of causal networks from raw text
Augenstein Towards Explainable Fact Checking
Langenecker et al. Sportstables: A new corpus for semantic type detection
CN115391548A (en) Retrieval knowledge graph library generation method based on combination of scene graph and concept network
Çelebi et al. Automatic question answering for Turkish with pattern parsing
CN114519092A (en) Large-scale complex relation data set construction framework oriented to Chinese field
Alali A novel stacking method for multi-label classification
DeVille et al. Text as Data: Computational Methods of Understanding Written Expression Using SAS

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination