CN113742489A - Comprehensive influence compensation method based on time sequence knowledge graph - Google Patents

Comprehensive influence compensation method based on time sequence knowledge graph

Info

Publication number
CN113742489A
Authority
CN
China
Prior art keywords
time
event
influence
entity
historical
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110894317.4A
Other languages
Chinese (zh)
Inventor
王彬
李哲辉
王炜智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Application filed by Kunming University of Science and Technology
Priority to CN202110894317.4A
Publication of CN113742489A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a comprehensive influence compensation method based on a time sequence knowledge graph. The acquired triple knowledge is represented in the knowledge graph as connections between nodes, so that the whole data set can be described as a graph network, which makes historical events easier to analyze. In step S2, the history on the constructed knowledge graph is divided into segments by a time slice division method, and an adjacency matrix is constructed for the event subnet in each time slice; the event occurrence time is used to propose a quadruple event representation, and, because the event network in the knowledge graph changes dynamically over time, a time decay function is proposed to fit the decay trend of the correlation influence of historical events. In step S3, time span intervals are divided so as to compensate the total historical influence and obtain a more accurate historical comprehensive influence.

Description

Comprehensive influence compensation method based on time sequence knowledge graph
Technical Field
The invention relates to a comprehensive influence compensation method based on a time sequence knowledge graph, and belongs to the technical field of time sequence knowledge graphs.
Background
A Knowledge Graph (KG) is a knowledge system that stores knowledge in structured form in a graph database; it is essentially a semantic network. Because knowledge graphs offer strong expressive power, logical rules, and flexible modeling, they have attracted the attention of researchers and are widely applied in industry, for example in information retrieval, intelligent question-answering systems, and recommendation systems.
Representation learning, applied to knowledge graphs, expresses the objects to be described as low-dimensional dense vectors. Such a distributed representation effectively alleviates data sparsity and makes computation in a low-dimensional semantic space convenient.
The entity- and relation-based distributed vector representation model TransE, and the knowledge translation models TransH, TransR and TransD that improve on it for multi-relational data, describe static knowledge by vector translation and space mapping. In the real world, however, knowledge often carries a time label and changes over time. Time sequence knowledge graphs that take the time factor into account have therefore begun to attract researchers' attention, and a quadruple knowledge representation (head entity, time, relation, tail entity) has been proposed.
Prior art [1] (Liu J, Zhang Q, Fu L, et al. Evolving Knowledge Graphs [C] // IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. IEEE, 2019.) proposes EvolveKG, a time influence model based on a time decay function that describes the influence of a historical event on the current event. On this basis, prior art [2] (Zhan Weiwei. Research of improved event KG Method Based on Comprehensive information Model [J]. Application Research of Computers, 2020, 37(S1): 159-162.) takes the different influence weights of events into account and provides a comprehensive influence evaluation method for entity prediction tasks.
However, current time sequence knowledge graph reasoning algorithms ignore the influence of time span. The larger the time span back from the current event, the more historical events related to the current event it contains, and the larger their accumulated share of the comprehensive influence. Events that occurred recently and are more strongly related to the current event are few in number, so their influence is weakened, which distorts the evaluation of the comprehensive influence of historical events.
Disclosure of Invention
The invention provides a comprehensive influence compensation method based on a time sequence knowledge graph. It obtains the compensated comprehensive influence of historical events, which can then be combined with a training model to perform the link prediction task on the time sequence knowledge graph.
The technical scheme of the invention is as follows: a comprehensive influence compensation method based on a time sequence knowledge graph comprises the following steps:
S1, cleaning the data set; extracting the triple knowledge (h, r, t) in the cleaned data set together with the occurrence time T of the event that each triple represents; dividing the extracted data, formed by the triple knowledge (h, r, t) and the event time T, into a training set and a test set; and constructing a knowledge graph from the training set, in which the relation r of each triple serves as the relation between nodes. All head entities and tail entities in the training set and the test set are counted and, after de-duplication, represented as the entity set $E = \{E_1, E_2, E_3, \ldots, E_N\}$; all relations in the training set and the test set are counted and, after de-duplication, represented as $R = \{R_1, R_2, R_3, \ldots, R_M\}$; wherein $E_N$ denotes the N-th entity (a head entity or a tail entity), N is the total number of entities, $R_M$ denotes the M-th relation, and M is the total number of relations;
S2, on the knowledge graph constructed in step S1, dividing the historical time axis into time slices of fixed length d, so that the events on the time axis are divided into $\{G_1, G_2, \ldots, G_n\}$, where $G_n$ denotes the n-th event subnet; constructing the adjacency matrices $\{A(G_1), A(G_2), \ldots, A(G_n)\}$ of the subnets; calculating, from the adjacency matrix and a similarity index, the correlation between node pairs that have common neighbor nodes; and then fusing the time factor to obtain the time-fused correlation influence, which is taken as the historical correlation comprehensive influence of the time slice on the current event once the current event is determined;
S3, for the divided time slices, dividing time span intervals according to the span between each time slice and the current time node, and assigning different span factors to obtain the compensated comprehensive influence of the historical events.
The compensated comprehensive influence of the historical events is integrated into a knowledge representation model as a weight, and the vector representations of the entities and relations fused with the time factor are obtained iteratively; with the trained vector representations, a link prediction task is performed on the test set and evaluated by score ranking and performance indexes.
Step S2 specifically includes:
S2.1, on the knowledge graph constructed in step S1, dividing the knowledge graph into time slices of fixed length d along the historical time axis, so that the events on the time axis are divided into $\{G_1, G_2, \ldots, G_n\}$, where $G_n$ denotes the n-th event subnet;
S2.2, constructing the adjacency matrices $\{A(G_1), A(G_2), \ldots, A(G_n)\}$ of the event subnets, where $A(G_n)$ denotes the adjacency matrix of event subnet $G_n$;
S2.3, for each adjacency matrix, counting the common neighbor nodes of all node pairs;
S2.4, counting the node degree of each common neighbor node of each node pair, taking the node degree as the importance contribution of that neighbor node to the indirect connection, and calculating the correlation $S_{AB}$ between the node pair from the importance of all its common neighbor nodes by means of the Adamic-Adar index;
S2.5, adding time as a fourth element to the triple knowledge representation, so that an event is represented as a positive quadruple (h, r, t, T); for the current event (A, r, B, T2) occurring at the current time point T2, traversing the node-pair correlations obtained in step S2.4 according to the head entity and tail entity of the current event, and fusing the $S_{AB}$ whose head entity is A and whose tail entity is B with the time decay function to obtain the time-fused correlation influence SIM(A, B), which is taken as the historical correlation comprehensive influence of the time slice on the current event that occurs at the current time point with head entity A and tail entity B.
The time decay function is $f(T1) = e^{-\lambda (T2 - T1)}$, wherein T1 denotes the time point of the historical event involving node A and node B in the knowledge graph, T2 denotes the time point of the current event whose head entity is A and whose tail entity is B, λ is the decay factor, and f(T1) expresses the degree to which the influence of the historical event occurring at time point T1 on the current event has decayed.
Step S3 specifically includes:
S3.1, dividing the historical time axis of the data set of step S1 into time span intervals according to the equal-area division method of the normal distribution; each time span interval contains one or more time slices, and each time slice belongs to exactly one time span interval;
S3.2, once the current event is determined, collecting the historical correlation comprehensive influence on the current event of the time slices contained in each time span interval, and calculating the comprehensive influence of that time span interval on the current event;
S3.3, assigning different span factors to different time span intervals, and calculating the compensated comprehensive influence of the historical events on the current event;
S3.4, integrating the compensated comprehensive influence of the historical events on the current event into a knowledge representation model as a weight, constructing an equal number of negative quadruples from the positive quadruples and using them as model input for training, to obtain the vector representations $\{E_1, E_2, E_3, \ldots, E_N\}$ and $\{R_1, R_2, R_3, \ldots, R_M\}$ of the entities and relations fused with the time factor; wherein $E_N$ is the vector representation of entity $E_N$ after the time factor is fused, and $R_M$ is the vector representation of relation $R_M$ after the time factor is fused;
S3.5, performing head entity / tail entity replacement on all quadruples in the test set; the two replacements work the same way and are described here for head entity replacement: the head entity of the quadruple of each event in the test set is replaced in turn by each of the N counted entities, giving N candidate quadruples; the score function is evaluated on the N candidate quadruples of each event, and the score rank of the candidate identical to the original test event is determined; from the score ranks of all events in the test set, the effect of the link prediction task is judged by the indexes MeanRank and Hits@k.
The score function is $f_r(h, t) = \lVert E_h + R_r - E_t \rVert_{L2}$, where $E_h$ is the vector representation of the head entity in the entity set E after the time factor is fused, $R_r$ is the vector representation of the relation after the time factor is fused, $E_t$ is the vector representation of the tail entity in the entity set E after the time factor is fused, and L2 denotes the L2 norm.
The comprehensive influence of a time span interval on the current event in step S3.2 is:
$$l_w = \sum_{i=1}^{q_w} \mathrm{SIM}_i(A, B)$$
wherein $l_w$ is the comprehensive influence of the w-th time span interval on the current event, $q_w$ is the number of time slices contained in the w-th time span interval, and $\mathrm{SIM}_i(A, B)$ is the historical correlation comprehensive influence of the i-th time slice in that time span interval.
In said step S3.3, the W time span intervals are given different span factors
Figure BDA0003197234010000041
and the resulting $l_w$ are accumulated as the compensated comprehensive influence of the historical events on the current event:
Figure BDA0003197234010000042
where w = 1, 2, ..., W; W is the total number of time span intervals, $q_w$ is the number of time slices in the w-th time span interval, $l_w$ is the comprehensive influence of the w-th time span interval on the current event, and l is the compensated comprehensive influence of the historical events on the current event with head entity A and tail entity B.
The model training process in step S3.4 is represented as:
Figure BDA0003197234010000043
wherein S is the set of positive quadruples, S' is the set of negative quadruples, $l_{pos}$ is the comprehensive influence of the positive quadruple calculated in S3.3, $f_r(h, t)$ is the score of the positive quadruple, $l_{neg}$ is the comprehensive influence of the negative quadruple calculated in S3.3, $f_r(h', t')$ is the score of the negative quadruple, and γ is a normalization term. Training minimizes the loss function L, and its output is the vector representations of all entities and relations.
The invention has the following beneficial effects. A comprehensive influence compensation model for historical events is designed on the basis of the time sequence knowledge graph; the model effectively mines and captures the influence of historical events on the current situation, so that a more accurate knowledge representation is obtained. On the basis of dividing the historical events into time slices, the invention considers not only the decay of event influence over time but also the influence of neighborhood network information within an event subnet on the future, and on this basis divides time span intervals to compensate the comprehensive influence, which helps to obtain a more accurate knowledge representation. Tests on several data sets show that the method generalizes well and can be combined with a static vector training model to perform the link prediction task on a time sequence knowledge graph.
Specifically: the acquired triple knowledge is represented in the knowledge graph as connections between nodes, so the data as a whole can be depicted as a graph network and historical events are easier to analyze. In step S2, the history on the constructed knowledge graph is divided into segments by time slice division and an adjacency matrix is constructed for the event subnet of each time slice, which facilitates calculating the correlation influence of historical events and analyzing the different influences arising in different time slices. The event occurrence time is used to propose the quadruple event representation; and, because the event network in the knowledge graph changes dynamically over time, a time decay function is proposed to fit the decay trend of the correlation influence of historical events, which better matches the way events develop. Events that occurred recently have a large influence, but because they are few, that influence is weakened when the historical influence is accumulated; step S3 therefore divides time span intervals to compensate the total historical influence and obtain a more accurate historical comprehensive influence. The compensated historical comprehensive influence is integrated into a knowledge representation model, entity and relation vector representations fused with time information are obtained by training, link prediction experiments are carried out on the test set data, and the evaluation indexes show an improved link prediction effect in subsequent tasks.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flowchart of a method of compensating for integrated influence;
FIG. 3 is a flow chart of a training experiment;
FIG. 4 is a flow chart of a prediction experiment.
Detailed Description
The invention will be further described with reference to the following figures and examples, without however restricting the scope of the invention thereto.
Example 1: as shown in fig. 1 to 4, a comprehensive influence compensation method based on a time series knowledge graph includes:
S1, cleaning the time-sequence structured event data set (eliminating data with missing information); extracting the triple knowledge (h, r, t) in the cleaned data set together with the occurrence time T of the event that each triple represents; dividing the extracted data, formed by the triple knowledge (h, r, t) and the event time T, into a training set and a test set; and constructing a knowledge graph from the training set, in which the relation r of each triple serves as the relation between nodes. All head entities and tail entities in the training set and the test set are counted and, after de-duplication, represented as the set $E = \{E_1, E_2, E_3, \ldots, E_N\}$; all relations in the training set and the test set are counted and, after de-duplication, represented as the set $R = \{R_1, R_2, R_3, \ldots, R_M\}$; wherein $E_N$ denotes the N-th entity (a head entity or a tail entity), N is the total number of entities, $R_M$ denotes the M-th relation, and M is the total number of relations;
S2, on the knowledge graph constructed in step S1, dividing the historical time axis into time slices of fixed length d (for example a month or a week), so that the events on the time axis are divided into $\{G_1, G_2, \ldots, G_n\}$, where $G_n$ denotes the n-th event subnet; constructing the adjacency matrices $\{A(G_1), A(G_2), \ldots, A(G_n)\}$ of the subnets; calculating, from the adjacency matrix and a similarity index (such as the Adamic-Adar index), the correlation between node pairs that have common neighbor nodes; and then fusing the time factor to obtain the time-fused correlation influence, which is taken as the historical correlation comprehensive influence of the time slice on the current event once the current event is determined;
S3, for the divided time slices, dividing time span intervals according to the span from the current time node, and assigning different span factors to obtain the compensated comprehensive influence of the historical events, thereby compensating the comprehensive influence of events that occurred recently in the history and weakening the influence of distant events.
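As a rough illustration of step S1, the following Python sketch cleans a list of records, builds the de-duplicated entity and relation sets, and splits the data. The record layout `(h, r, t, T)`, the function names, the chronological split and the default 80/20 ratio are assumptions made for this sketch; the text above does not specify them.

```python
from typing import List, Optional, Tuple

Quad = Tuple[str, str, str, int]  # (head, relation, tail, time); layout is an assumption


def clean(rows: List[Tuple[Optional[object], ...]]) -> List[Quad]:
    """Drop records that have a missing field (None or empty string)."""
    return [r for r in rows
            if len(r) == 4 and all(x is not None and x != "" for x in r)]


def entity_and_relation_sets(quads: List[Quad]) -> Tuple[List[str], List[str]]:
    """De-duplicated entity set E (heads and tails) and relation set R."""
    entities = sorted({h for h, _, _, _ in quads} | {t for _, _, t, _ in quads})
    relations = sorted({r for _, r, _, _ in quads})
    return entities, relations


def split_by_time(quads: List[Quad], train_ratio: float = 0.8):
    """Chronological split (an assumption): earliest events form the training set."""
    ordered = sorted(quads, key=lambda q: q[3])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]
```

Here the entity and relation sets are collected over the whole cleaned data set, matching the statement that they cover both the training set and the test set.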
Further, the compensated comprehensive influence of the historical events may be integrated into the knowledge representation model as a weight, and the vector representations $\{E_1, E_2, E_3, \ldots, E_N\}$ and $\{R_1, R_2, R_3, \ldots, R_M\}$ of the entities and relations fused with the time factor are obtained iteratively. With the trained vector representations, a link prediction task is performed on the test set and evaluated by performance indexes such as score ranking and hit rate. The link prediction task predicts which relation may occur between two nodes at the current time by analyzing the historical correlation of the two nodes.
Further, S2 may specifically be:
S2.1, on the knowledge graph constructed in step S1, dividing the knowledge graph into time slices of fixed length d along the historical time axis, so that the events on the time axis are divided into $\{G_1, G_2, \ldots, G_n\}$, where $G_n$ denotes the n-th event subnet;
S2.2, constructing the adjacency matrices $\{A(G_1), A(G_2), \ldots, A(G_n)\}$ of the event subnets, where $A(G_n)$ denotes the adjacency matrix of event subnet $G_n$;
S2.3, for each adjacency matrix, counting the common neighbor nodes of all node pairs (a node is the form in which an entity exists in the knowledge graph, and any two nodes can form a node pair);
S2.4, counting the node degree of each common neighbor node of each node pair, taking the node degree as the importance contribution of that neighbor node to the indirect connection, and calculating the correlation $S_{AB}$ between the node pair from the importance of all its common neighbor nodes by means of the Adamic-Adar index;
S2.5, since the correlation $S_{AB}$ between node pairs calculated in step S2.4 decays over time, adding time as a fourth element to the triple knowledge representation, so that an event is represented as a positive quadruple (h, r, t, T); for the current event (A, r, B, T2) occurring at the current time point T2, traversing the node-pair correlations obtained in step S2.4 according to the head entity and tail entity of the current event, and fusing (i.e. multiplying) the $S_{AB}$ whose head entity is A and whose tail entity is B with the time decay function to obtain the time-fused correlation influence SIM(A, B), which is taken as the historical correlation comprehensive influence of the time slice on the current event that occurs at the current time point with head entity A and tail entity B.
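Steps S2.1 to S2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: events are `(h, r, t, T)` tuples with a numeric time T, each subnet is treated as an undirected graph, and the adjacency matrix is stored sparsely as a dictionary of neighbor sets. The function names and the origin `t0` are hypothetical.

```python
from collections import defaultdict


def slice_events(quads, d, t0=0):
    """S2.1: group events into time slices of fixed length d (slice index = (T - t0) // d)."""
    slices = defaultdict(list)
    for h, r, t, T in quads:
        slices[(T - t0) // d].append((h, r, t, T))
    return dict(slices)


def adjacency(events):
    """S2.2: adjacency of one event subnet, as node -> set of neighbours.

    Assumption: edges are undirected and repeated events collapse to one edge.
    """
    adj = defaultdict(set)
    for h, _, t, _ in events:
        if h != t:
            adj[h].add(t)
            adj[t].add(h)
    return dict(adj)


def common_neighbours(adj, a, b):
    """S2.3: common neighbour nodes of the node pair (a, b)."""
    return adj.get(a, set()) & adj.get(b, set())
```

A dense N x N matrix would work equally well; the set-based form is chosen here only because event subnets are usually sparse.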
In step S2.4, the correlation $S_{AB}$ between a node pair is calculated by the Adamic-Adar index:
$$S_{AB} = \sum_{z \in \Gamma(A) \cap \Gamma(B)} \frac{1}{\log k(z)}$$
wherein Γ(A) is the set of neighbor nodes of node A, Γ(B) is the set of neighbor nodes of node B, z is a common neighbor node of A and B, and k(z) is the node degree of the common neighbor node z. The correlation $S_{AB}$ between the two nodes A and B takes the logarithm of the node degree of each common neighbor node and sums the reciprocals; $S_{AB}$ characterizes the contribution of the common neighbor nodes to the correlation between the two nodes A and B.
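The Adamic-Adar correlation described above might be computed as in the sketch below. It assumes an undirected subnet stored as a dictionary from node to neighbor set (a hypothetical layout) and uses the natural logarithm, since the text does not state the base.

```python
import math


def adamic_adar(adj, a, b):
    """Sum of 1 / log(k(z)) over the common neighbours z of nodes a and b.

    adj maps each node to the set of its neighbours in one event subnet.
    """
    score = 0.0
    for z in adj.get(a, set()) & adj.get(b, set()):
        k = len(adj[z])  # node degree of the common neighbour
        if k > 1:        # guard: log(1) == 0 would divide by zero
            score += 1.0 / math.log(k)
    return score
```

A common neighbor with a low degree contributes more than a hub node, which is why the reciprocal of the logarithm of the degree is used as its importance.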
Further, the time decay function may be set to $f(T1) = e^{-\lambda (T2 - T1)}$; wherein T1 denotes the time point of the historical event involving node A and node B in the knowledge graph, T2 denotes the time point of the current event whose head entity is A and whose tail entity is B, λ is the decay factor, and f(T1) expresses the degree to which the influence of the historical event occurring at time point T1 on the current event has decayed; the time decay function in step S2.5 is a negative exponential function intended to fit the decay trend of event influence.
If P{(h, r, t, T2)} denotes the probability that the event (h, r, t, T2) occurs, the following conditions are satisfied:
If no new related event of entity h occurs within a period of time, the probability of the event remains unchanged at the end of that period. If related historical events of entity h have occurred within a certain time range, the probability of the event at the end of that range is greater than when no related event occurs (R1). If related historical events of entity h have occurred within a certain time range, the closer the date of a historical event is to the current event, the higher the probability at the end of that range. If related historical events of entity h have occurred within a certain time range, the greater the number of related historical events, the greater the probability at the end of that range.
Wherein: if nothing happens in the interval from T1 to T2 (T1 = T2 - ΔT), the probability remains unchanged:
P{(h, r, t, T2)} = P{(h, r, t, T1)}
if some are aggregated
Figure BDA0003197234010000071
The fact represented occurs in the interval from time T1 to T2, the probability satisfies:
R1:
Figure BDA0003197234010000072
wherein
Figure BDA0003197234010000073
and
Figure BDA0003197234010000074
Figure BDA0003197234010000075
And
Figure BDA0003197234010000076
is that
Figure BDA0003197234010000077
Two possibilities of (3).
R2:
Figure BDA0003197234010000078
Wherein
Figure BDA0003197234010000079
T2≥T3≥T4≥T1。
R3:
Figure BDA00031972340100000710
Wherein the content of the first and second substances,
Figure BDA00031972340100000711
Therefore, the occurrence of a historical event has a certain effect on the current event, but its influence decreases continuously as time passes after it occurs. In general, the change over time of the influence of a historical event on the current event can be expressed by the time decay function $f(T1) = e^{-\lambda (T2 - T1)}$; wherein T1 denotes the time point of the historical event involving node A and node B in the knowledge graph, T2 denotes the time point of the current event whose head entity is A and whose tail entity is B, λ is the decay factor, here taken as 0.01, and f(T1) expresses the degree to which the influence of the historical event occurring at time point T1 on the current event has decayed once the target of the current event is determined (i.e. once it is determined that the head entity of the current event is A and the tail entity is B).
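The decay function and its fusion with the node-pair correlation can be sketched as below. The default λ = 0.01 is the value given in the text; the function names are hypothetical, and using a single representative time point T1 per time slice is an assumption of this sketch.

```python
import math


def time_decay(t1, t2, lam=0.01):
    """f(T1) = exp(-lam * (T2 - T1)) for a historical time t1 <= current time t2."""
    if t1 > t2:
        raise ValueError("historical time t1 must not be later than current time t2")
    return math.exp(-lam * (t2 - t1))


def sim(s_ab, t1, t2, lam=0.01):
    """Time-fused correlation influence SIM(A, B): S_AB multiplied by the decay."""
    return s_ab * time_decay(t1, t2, lam)
```

With these definitions an event at the current time keeps its full correlation (decay 1.0), and with λ = 0.01 an event 100 time units in the past is scaled by e^{-1}.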
Further, S3 may specifically be:
s3.1, dividing a time span interval of a historical time axis in which the time sequence structured event data set is positioned in the step S1 according to an equal-area division method of normal distribution; one or more time slices exist in each time span interval, and each time slice only belongs to any time span interval;
s3.2, under the condition that the current event is determined, counting the historical correlation comprehensive influence of the time slices contained in the time span interval on the current event, and calculating to obtain the comprehensive influence of the time span interval on the current event;
s3.3, endowing different time span intervals with different span factors, and calculating to obtain the comprehensive influence of the compensated historical events;
s3.4, taking the comprehensive influence of the compensated historical events as weight, and integrating the weight into a knowledge representation model (such as a knowledge table)The representation model can be a TransE model), and meanwhile, equal-number negative quadruplets are constructed through positive quadruplets and are used as model input for training, so that vector representation { E ] after the entity and the relation are fused with time factors is obtained1,E2,E3....EN},{R1,R2,R3....RM}; wherein E isNAs entity ENVector representation after fusion of time factors, E1As entity E1Vector representation after fusion of time factors, RMAs a relation RMVector representation after fusion of time factors;
the true data set obtained in the foregoing step S1 is used to construct a positive quadruple, and the negative quadruple constructed in this step is a non-true data set.
S3.5, performing head entity/tail entity replacement on all four tuples in the test set, wherein the replacement modes are the same, and the description is given by the head entity replacement, specifically: replacing a head entity of a quadruple represented by each event in the test set by the statistical N entities to construct N candidate quadruple data, calculating a score in the N candidate quadruple constructed by each event in the test set by a score function, and determining the score ranking of the quadruple in the N candidate quadruple which is the same as the original event in the test set; and judging the effect of the link prediction task through indexes Meanrank and Hits @ according to the score ranking of all events in the statistical test set. For example, when N is counted to have 10000 (different entity numbers are counted), each event in the test set data is replaced, and then becomes 10000 candidate quadruple data, where the 10000 quadruple data includes a quadruple data that is the same as the replaced event in the test set data. Each event in the test set data does so.
Further, the score function may be set as f_r(h, t) = ||E_h + R_r - E_t||_{L2}, wherein E_h is the vector representation of the head entity in the entity set E after fusion of time factors, R_r is the vector representation of the relation after fusion of time factors, and E_t is the vector representation of the tail entity in the entity set E after fusion of time factors; L2 denotes the L2 norm.
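The score function above can be sketched in a few lines. This is a minimal illustration only: the function name `transe_score` and the toy three-dimensional vectors are assumptions introduced here, not values from the embodiment.

```python
import math

def transe_score(e_h, r_r, e_t):
    """Return ||E_h + R_r - E_t||_2 for three vectors of equal dimension."""
    if not (len(e_h) == len(r_r) == len(e_t)):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(e_h, r_r, e_t)))

# Toy embeddings (illustrative values only).
e_h = [1.0, 0.0, 2.0]
r_r = [0.5, 1.0, -1.0]

# Tail equal to head + relation: the translation holds exactly.
score_exact = transe_score(e_h, r_r, [1.5, 1.0, 1.0])
# Tail offset by (0, 3, 4) from head + relation.
score_off = transe_score(e_h, r_r, [1.5, 4.0, 5.0])
```

A lower score means the quadruple fits the learned translation better, which is why candidates are later ranked in ascending order of score.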
Further, the dividing in step S3.1 may be performed according to an equal area method with normal distribution, according to an area integral formula, such as:
Figure BDA0003197234010000081
dividing the historical time axis into a plurality of time span intervals; wherein t_1 is the time starting point of a given time span interval, and t_2 is the time end point of that time span interval.
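A minimal sketch of an equal-area division is given below. It assumes a normal distribution with hypothetical parameters `mu` and `sigma` placed over the time axis (here peaked at the end of the axis) and splits the probability mass between the two ends of the axis into equal parts; the way the distribution is positioned on the axis in the embodiment is not stated in the text, so these choices are assumptions.

```python
from statistics import NormalDist

def equal_area_boundaries(t_start, t_end, num_intervals, mu, sigma):
    """Boundaries t_1 < ... < t_{W+1} such that every interval carries the
    same normal probability mass within [t_start, t_end]."""
    dist = NormalDist(mu, sigma)
    p_lo, p_hi = dist.cdf(t_start), dist.cdf(t_end)
    step = (p_hi - p_lo) / num_intervals
    inner = [dist.inv_cdf(p_lo + k * step) for k in range(1, num_intervals)]
    return [t_start] + inner + [t_end]

# Hypothetical setting: a one-year axis measured in days, W = 3 intervals,
# density centred on the last day of the axis.
dist = NormalDist(365.0, 120.0)
bounds = equal_area_boundaries(0.0, 365.0, 3, mu=365.0, sigma=120.0)
masses = [dist.cdf(b) - dist.cdf(a) for a, b in zip(bounds, bounds[1:])]
```

Under these assumptions the intervals nearest the peak of the density are the shortest in time, while each interval carries the same area under the curve.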
Further, the statistical process of the integrated influence of the time span interval on the current event in the step S3.2 may be set as:
Figure BDA0003197234010000082
wherein l_w is the comprehensive influence of the w-th time span interval on the current event (that is, the accumulated historical correlation comprehensive influence of the time slices in the w-th time span interval), q_w is the number of time slices contained in the w-th time span interval, and SIM_i(A, B) represents the historical correlation comprehensive influence of the i-th time slice.
Further, it may be arranged that in said step S3.3, the W time span intervals are given different span factors
Figure BDA0003197234010000091
and the obtained l_w are accumulated as the compensated comprehensive influence of the historical events on the current event;
Figure BDA0003197234010000092
wherein w = 1, 2, ..., W; W is the total number of time span intervals, q_w is the number of time slices in the w-th time span interval, and l is the compensated comprehensive influence of the historical events on the current event whose head entity is A and tail entity is B.
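The two accumulation steps can be sketched as follows. The per-slice values and the span factors are hypothetical inputs: the span factor formula itself is given only in the figure above, so this sketch takes the factors as arguments and only shows the weighted accumulation, assuming each factor multiplies its interval's l_w.

```python
def interval_influence(slice_influences):
    """l_w: accumulate SIM_i(A, B) over the q_w time slices of one interval."""
    return sum(slice_influences)

def compensated_influence(intervals, span_factors):
    """l: accumulate span_factor_w * l_w over the W time span intervals."""
    if len(intervals) != len(span_factors):
        raise ValueError("one span factor is needed per interval")
    return sum(f * interval_influence(s) for f, s in zip(span_factors, intervals))

# Hypothetical SIM_i(A, B) values for W = 3 intervals (oldest first) and
# hypothetical span factors that favour the most recent interval.
intervals = [[0.1, 0.2], [0.4], [0.3, 0.5, 0.2]]
span_factors = [0.5, 1.0, 2.0]

l_w = [interval_influence(s) for s in intervals]
l_total = compensated_influence(intervals, span_factors)
```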
Further, it may be set that the model training process in step S3.4 may be expressed as:
Figure BDA0003197234010000093
wherein S is the set of positive quadruples, S' is the set of negative quadruples, l_pos is the comprehensive influence of the positive quadruple calculated in S3.3, f_r(h, t) = ||E_h + R_r - E_t||_{L2} is the score calculation formula of the positive quadruple, l_neg is the comprehensive influence of the negative quadruple calculated in S3.3, f_r(h', t') = ||E_h' + R_r - E_t'||_{L2} is the score calculation formula of the negative quadruple, and γ is a standardized term, taken as 1.0; the training process is the process of minimizing the loss function L, and the output of training is the vector representation of all entities and relations; E_h and E_h' are the vector representations of the head entities h and h' in the entity set E after fusion of time factors, R_r is the vector representation of the relation after fusion of time factors, E_t and E_t' are the vector representations of the tail entities t and t' in the entity set E after fusion of time factors, and the subscript + indicates that the maximum of the value inside the brackets and 0 is taken;
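A hedged sketch of such a weighted hinge loss follows. The exact formula is given only in the figure above, so this is one plausible reading of the symbol definitions: it assumes that l_pos and l_neg multiply the positive and negative scores inside the brackets. The function name and the sample values are illustrative.

```python
def weighted_margin_loss(pairs, gamma=1.0):
    """Sum of [gamma + l_pos * f_pos - l_neg * f_neg]_+ over paired samples.

    pairs: iterable of (l_pos, f_pos, l_neg, f_neg), where f_pos and f_neg are
    the scores f_r(h, t) and f_r(h', t') of a positive quadruple and of the
    negative quadruple built from it. The placement of the weights is an
    assumption, not confirmed by the text.
    """
    return sum(
        max(0.0, gamma + l_pos * f_pos - l_neg * f_neg)
        for l_pos, f_pos, l_neg, f_neg in pairs
    )

# Two hypothetical (positive, negative) pairs.
pairs = [
    (2.0, 0.5, 1.0, 3.0),  # 1.0 + 1.0 - 3.0 = -1.0, clipped to 0
    (1.0, 2.0, 1.0, 1.5),  # 1.0 + 2.0 - 1.5 = 1.5
]
loss = weighted_margin_loss(pairs, gamma=1.0)
```

Under this reading, a positive quadruple with a large compensated influence is penalised more strongly for a high score, so recent, strongly correlated history pulls the embeddings harder.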
Example 2: taking the data sets ICEWS2014 and ICEWS2017 of the comprehensive crisis early warning system as examples, link prediction is carried out on the time sequence knowledge graph; the attribute statistics of the experimental data are shown in Table one.
Table one: attribute statistics of the ICEWS experimental data
Figure BDA0003197234010000094
A comprehensive influence compensation method based on a time sequence knowledge graph comprises the following steps:
S1, cleaning the data set of the comprehensive crisis early warning system, extracting the triple knowledge (h, r, t) in the cleaned data set and the time T at which the event represented by each triple occurs (the number of triples and occurrence times extracted is selected according to actual needs), dividing the data formed by the extracted triple knowledge (h, r, t) and the occurrence times T into a training set and a test set, and constructing a knowledge graph on the training set; the relation r in the triple knowledge is used as the relation between nodes in the knowledge graph; all head entities and tail entities in the training set and the test set are counted and, after deduplication, represented as the set E = {E_1, E_2, E_3, ..., E_N}; all relations in the training set and the test set are counted and, after deduplication, represented as the set R = {R_1, R_2, R_3, ..., R_M}; wherein E_N represents the N-th entity, an entity being a head entity or a tail entity, and the total number of entities is N; R_M represents the M-th relation, and the total number of relations is M;
S2, on the historical time axes 2014-1-1 to 2014-12-31 and 2017-1-1 to 2017-12-31, dividing the constructed knowledge graph into time slices of fixed length d (one month in this example), so that the tuple events on each time axis are divided into n = 12 event subnets {G_1, G_2, ..., G_n}; constructing the adjacency matrices {A(G_1), A(G_2), ..., A(G_n)} corresponding to the subnets; calculating the correlation between node pairs having common neighbor nodes through the adjacency matrices and the Adamic-Adar index, and fusing time attenuation to obtain the influence of the historical events after time attenuation, which is taken as the comprehensive influence of the time slice on the event (A, B) at the current time point;
S3, for the divided time slices, dividing time span intervals according to the span between each time slice and the current time node and assigning different span factors, so as to compensate the comprehensive influence of recent historical events; integrating the compensated influence as a weight into a vector representation model, and iteratively obtaining the vector representations {E_1, E_2, E_3, ..., E_N} and {R_1, R_2, R_3, ..., R_M} of the entities and relations after fusion of time factors. Using the vector representations obtained by training, the link prediction task is performed on the test set data according to score ranking and hit rate.
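The bookkeeping in S1 and the monthly slicing in S2 can be sketched as below. The entity names, relation names and dates are invented for illustration, and the fixed length d is taken as one calendar month as in this example.

```python
from collections import defaultdict
from datetime import date

def build_vocab(quads):
    """Deduplicated, sorted entity set E and relation set R."""
    entities = sorted({h for h, _, _, _ in quads} | {t for _, _, t, _ in quads})
    relations = sorted({r for _, r, _, _ in quads})
    return entities, relations

def split_by_month(quads):
    """Group the triples (h, r, t) by the (year, month) of their time T,
    giving one event subnet per monthly time slice."""
    subnets = defaultdict(list)
    for h, r, t, when in quads:
        subnets[(when.year, when.month)].append((h, r, t))
    return dict(sorted(subnets.items()))

# Hypothetical quadruples (h, r, t, T).
quads = [
    ("A", "consult", "B", date(2014, 1, 5)),
    ("B", "visit", "C", date(2014, 1, 20)),
    ("A", "consult", "C", date(2014, 2, 3)),
]
entities, relations = build_vocab(quads)
subnets = split_by_month(quads)
```

With a full year of data this grouping yields the n = 12 subnets G_1 to G_12 described above.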
The specific method for acquiring the comprehensive influence in the step S2 is as follows:
S2.1, on the knowledge graph constructed in step S1, dividing the historical time axis into time slices of fixed length d, so that the tuple events on the time axis are divided into n event subnets {G_1, G_2, ..., G_n};
S2.2, constructing the adjacency matrices {A(G_1), A(G_2), ..., A(G_n)} corresponding to the event subnets;
S2.3, traversing all entities corresponding to each adjacency matrix, and counting all nodes which have common neighbors with the entities to obtain a common neighbor set of each node pair;
S2.4, counting the node degree of each common neighbor of each node pair, taking the node degree as the importance contribution of the neighbor node in the indirect connection, and calculating the correlation S_AB between the two nodes of a pair from the importance of all common neighbor nodes between them through the Adamic-Adar index;
S2.5, considering that the correlation S_AB between node pairs calculated in step S2.4 attenuates with time, time is added as a fourth element of knowledge information to the triple knowledge representation, and an event is represented as a positive quadruple (h, r, t, T); for a current event (A, r, B, T2) occurring at the current time point T2, the correlations between node pairs obtained in step S2.4 are traversed according to the head entity and tail entity of the current event, and the S_AB whose head entity is A and tail entity is B is fused with (that is, multiplied by) the time attenuation function to obtain the correlation influence SIM(A, B) fusing the time factor, which is taken as the historical correlation comprehensive influence of the time slice on the current event that occurs at the current time point with head entity A and tail entity B.
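Steps S2.3 to S2.5 can be sketched for a single time slice as follows. The sketch uses the standard Adamic-Adar index (the sum of 1/log(degree) over common neighbors) and the exponential time attenuation function f(T1) = e^{-λ(T2-T1)} stated in claim 4. Treating the subnet as an undirected graph, the sample edges and the value of λ are assumptions made for illustration.

```python
import math
from collections import defaultdict

def build_neighbors(edges):
    """Undirected neighbor sets of one event subnet (self-loops ignored)."""
    nbrs = defaultdict(set)
    for a, b in edges:
        if a != b:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return nbrs

def adamic_adar(nbrs, a, b):
    """S_AB: sum of 1 / log(degree) over the common neighbors of a and b."""
    common = nbrs.get(a, set()) & nbrs.get(b, set())
    # Degree-1 nodes are skipped because log(1) = 0; for a != b every common
    # neighbor already has degree >= 2.
    return sum(1.0 / math.log(len(nbrs[z])) for z in common if len(nbrs[z]) > 1)

def time_decay(t1, t2, lam):
    """f(T1) = exp(-lambda * (T2 - T1))."""
    return math.exp(-lam * (t2 - t1))

def slice_influence(edges, a, b, t1, t2, lam):
    """SIM(A, B): correlation S_AB multiplied by the time attenuation."""
    return adamic_adar(build_neighbors(edges), a, b) * time_decay(t1, t2, lam)

# Hypothetical subnet: C (degree 2) and D (degree 3) are common neighbors
# of A and B.
edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E")]
s_ab = adamic_adar(build_neighbors(edges), "A", "B")
sim_ab = slice_influence(edges, "A", "B", t1=3, t2=5, lam=0.1)
```

A common neighbor with few connections contributes more than a highly connected hub, and the same S_AB counts for less the further the time slice lies from the current time point.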
The step S3 specifically includes:
S3.1, dividing the historical time axis into time span intervals according to the equal-area division method of the normal distribution, according to the area integral formula:
Figure BDA0003197234010000111
the historical time axis is divided into a plurality of time span intervals (3 in this embodiment). The intervals are divided so that the total influence of events occurring at different historical times on the current event is relatively balanced in quantity and time, that is, P(t_1 ≤ T1 ≤ t_2) ≈ P(t_2 ≤ T1 ≤ t_3) ≈ … ≈ P(t_W ≤ T1 ≤ t_{W+1}), where each probability is obtained from the probability density formula of the normal distribution; equal-area division is carried out according to the determined number of time span intervals, and the divided intervals are mapped onto the time axis of the data;
S3.2, assigning different span factors to the different time span intervals, counting the time slices contained in each time span interval, and calculating the comprehensive influence of each time span interval on the current event;
S3.3, applying the different time span factors to the comprehensive influences obtained in the different time span intervals and accumulating them to obtain the compensated comprehensive influence of the historical events;
S3.4, taking the compensated comprehensive influence of the historical events as a weight and integrating it into a knowledge representation model (for example, the knowledge representation model can be a TransE model); meanwhile, an equal number of negative sample quadruples are constructed from the positive sample quadruples, and both are used as model input for training, so as to obtain the vector representations {E_1, E_2, E_3, ..., E_N} and {R_1, R_2, R_3, ..., R_M} of the entities and relations after fusion of time factors; wherein E_N is the vector representation of entity E_N after fusion of time factors, E_1 is the vector representation of entity E_1 after fusion of time factors, and R_M is the vector representation of relation R_M after fusion of time factors;
The positive quadruples are constructed from the true data set obtained in step S1, and the negative quadruples constructed in this step form a non-true data set. The loss function is defined as:
Figure BDA0003197234010000112
S3.5, replacing the head entity or the tail entity of all quadruples in the test set data to construct a plurality of candidate quadruples, ranking them by score, and judging the effect of the link prediction task through the indexes MeanRank and Hits@. The candidate quadruples are constructed by replacing the head entity or tail entity of each quadruple in the test set data one by one with every entity in the entity set E, so the constructed candidate quadruples include the original quadruple; each candidate quadruple is passed in turn through the score function f_r(h, t) = ||E_h + R_r - E_t||_{L2} to obtain its error value, the error values of all candidate quadruples are ranked, and the rank of the original quadruple is recorded; the ranks of all data in the test set are averaged to obtain the value of the index MeanRank, and the proportions of data ranked first, in the top ten and in the top fifty are counted to obtain the values of the index Hits@, by which the effect of the link prediction task is judged.
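The ranking and the two indexes can be sketched as follows. The scores and ranks are made-up values; counting only strictly smaller scores (so that ties receive the better rank) is an assumption, since the text does not say how ties are handled.

```python
def rank_of_original(scores, original_index):
    """1-based rank of the original quadruple among its candidates.
    A lower score (error value) is better."""
    target = scores[original_index]
    return 1 + sum(1 for s in scores if s < target)

def mean_rank(ranks):
    """MeanRank: average rank of the original quadruples."""
    return sum(ranks) / len(ranks)

def hits_at(ranks, k):
    """Hits@k: fraction of original quadruples ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical scores of four candidates; the original quadruple is index 2.
example_rank = rank_of_original([0.9, 0.2, 0.5, 0.1], original_index=2)

# Hypothetical ranks collected over a four-event test set.
ranks = [1, 3, 12, 60]
mr = mean_rank(ranks)
h1, h10, h50 = hits_at(ranks, 1), hits_at(ranks, 10), hits_at(ranks, 50)
```

A lower MeanRank and a higher Hits@k both indicate better link prediction.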
Tables two to seven show the link prediction effect of the invention on the real-world comprehensive crisis early warning data sets ICEWS2014 and ICEWS2017. The Trans series are traditional methods that do not consider time attenuation. Because the data sets used contain many multiple relations, and the TransD algorithm optimizes the spatial calculation for multiple relations, the Hits@50 result of TransD on the ICEWS2014 data set (head entity prediction, Table four) is slightly higher than that of the invention, but the other index results of TransD are not as good as those of the invention. For the MeanRank index, the results of the invention on both ICEWS2014 and ICEWS2017 are better than those of the traditional Trans algorithms. For the Hits@ indexes, the results of the invention on ICEWS2014 and ICEWS2017 differ little from each other (whereas, for example, TransH differs greatly between the two data sets); the invention improves the results on both data sets and outperforms the other methods, which indicates better generalization ability across different data sets.
Among methods that consider time attenuation, the invention is compared with EvlovingKG (prior art 1 referred to in the background) and EvlovingKG_weight (prior art 2 referred to in the background). In the experimental results on the two data sets ICEWS2014 and ICEWS2017, for the MeanRank index, the mean values of the head entity prediction results are reduced by 73.6% and 59.2% relative to EvlovingKG and EvlovingKG_weight respectively (taking EvlovingKG as an example, the mean value refers to
Figure BDA0003197234010000121
and the others are calculated similarly). The mean values of the tail entity prediction results are reduced by 74.8% and 60.2%. For the indexes Hits@1, Hits@10 and Hits@50, compared with EvlovingKG_weight, the mean values of the head entity prediction results on the two data sets are improved by 113.7% (that is,
Figure BDA0003197234010000122
), 51.2% and 33.2%, and the mean values of the tail entity prediction results are improved by 57.6%, 23.7% and 44.4%.
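As a worked check of the first percentage, the sketch below averages the relative MeanRank reduction over the two data sets, using the head entity MeanRank values of EvlovingKG and of the invention from Tables two and three. Reading the "mean value" as this average of relative reductions is an assumption, because the formula itself appears only in the figure.

```python
def mean_relative_reduction(baseline, ours):
    """Average of (baseline - ours) / baseline over the data sets."""
    return sum((b - o) / b for b, o in zip(baseline, ours)) / len(baseline)

# Head entity MeanRank on ICEWS2014 and ICEWS2017 (Tables two and three).
evlovingkg_head = [6154, 6361]
invention_head = [1347, 1951]

reduction = mean_relative_reduction(evlovingkg_head, invention_head)
print(f"mean relative reduction: {reduction:.1%}")
```

Under this reading the result is about 73.7%, which is close to the 73.6% reported above; the small difference may come from rounding in the original calculation.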
Table two: ICEWS2014 data set head and tail entity link prediction MeanRank result comparison
Method               Head entity MeanRank   Tail entity MeanRank
TransE               6583                   6144
TransH               4527                   5386
TransD               1434                   1397
TransR               8127                   7847
EvlovingKG           6154                   6397
EvlovingKG_weight    4123                   4104
The invention        1347                   1325
Table three: ICEWS2017 data set head and tail entity link prediction MeanRank result comparison
Method               Head entity MeanRank   Tail entity MeanRank
TransE               6199                   6324
TransH               7919                   8314
TransD               2206                   2107
TransR               9815                   8173
EvlovingKG           6361                   6319
EvlovingKG_weight    3971                   3947
The invention        1951                   1878
Table four: ICEWS2014 data set head entity link prediction Hits@1, Hits@10 and Hits@50 result comparison
Method               Hits@1   Hits@10   Hits@50
TransE               0.97     5.56      13.96
TransH               2.08     12.62     23.11
TransD               0.2      18.62     38.55
TransR               0.67     1.21      2.78
EvlovingKG           1.16     2.8       4.45
EvlovingKG_weight    1.33     12.49     27.72
The invention        3.53     19.42     35.39
Table five: ICEWS2014 data set tail entity link prediction Hits@1, Hits@10 and Hits@50 result comparison
Method               Hits@1   Hits@10   Hits@50
TransE               0.79     4.47      13.58
TransH               2.31     13.2      21.3
TransD               0.45     16.37     32.42
TransR               0.78     1.4       3.32
EvlovingKG           1.54     5.61      7.67
EvlovingKG_weight    1.83     13.9      24.51
The invention        2.86     17.1      33.38
Table six: ICEWS2017 data set head entity link prediction Hits@1, Hits@10 and Hits@50 result comparison
Method               Hits@1   Hits@10   Hits@50
TransE               1.07     6.64      14.5
TransH               0.14     0.39      0.97
TransD               1.31     11.62     25.45
TransR               0.15     0.3       0.93
EvlovingKG           0.16     0.89      1.5
EvlovingKG_weight    1.44     9.83      21.5
The invention        2.39     14.33     30.17
Table seven: ICEWS2017 data set tail entity link prediction Hits@1, Hits@10 and Hits@50 result comparison
Method               Hits@1   Hits@10   Hits@50
TransE               1.28     5.49      13.19
TransH               0.37     0.7       1.82
TransD               1.6      12.46     26.63
TransR               0.14     0.51      1.37
EvlovingKG           0.67     1.17      2.8
EvlovingKG_weight    1.5      10.9      20.69
The invention        2.39     15.4      31.9
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (9)

1. A comprehensive influence compensation method based on a time sequence knowledge graph, characterized by comprising the following steps:
S1, cleaning the data set, extracting the triple knowledge (h, r, t) in the cleaned data set and the time T at which the event represented by each triple occurs, dividing the data formed by the extracted triple knowledge (h, r, t) and the occurrence times T into a training set and a test set, and constructing a knowledge graph on the training set; the relation r in the triple knowledge is used as the relation between nodes in the knowledge graph; all head entities and tail entities in the training set and the test set are counted and, after deduplication, represented as the entity set E = {E_1, E_2, E_3, ..., E_N}; all relations in the training set and the test set are counted and, after deduplication, represented as the set R = {R_1, R_2, R_3, ..., R_M}; wherein E_N represents the N-th entity, an entity being a head entity or a tail entity, and the total number of entities is N; R_M represents the M-th relation, and the total number of relations is M;
S2, on the knowledge graph constructed in step S1, dividing the historical time axis into time slices of fixed length d, so that the tuple events on the time axis are divided into {G_1, G_2, ..., G_n}, where G_n represents the n-th event subnet; constructing the adjacency matrices {A(G_1), A(G_2), ..., A(G_n)} corresponding to the subnets; calculating the correlation between node pairs having common neighbor nodes through the adjacency matrices and a similarity index, and then fusing a time factor to obtain the correlation influence fusing the time factor; taking the obtained influence, for a determined current event, as the historical correlation comprehensive influence of the time slice on the current event;
S3, for the divided time slices, dividing time span intervals according to the span between each time slice and the current time node, and assigning different span factors to calculate the compensated comprehensive influence of the historical events.
2. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 1, wherein: the compensated comprehensive influence of the historical events is integrated into a knowledge representation model as a weight, and the vector representations of the entities and relations after fusion of time factors are obtained iteratively; using the vector representations obtained by training, a link prediction task is performed on the test set according to score ranking and performance indexes.
3. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 1 or 2, wherein step S2 specifically includes:
S2.1, on the knowledge graph constructed in step S1, dividing the historical time axis into time slices of fixed length d, so that the events on the time axis are divided into {G_1, G_2, ..., G_n}, where G_n represents the n-th event subnet;
S2.2, constructing the adjacency matrices {A(G_1), A(G_2), ..., A(G_n)} corresponding to the event subnets, where A(G_n) represents the adjacency matrix of the event subnet G_n;
S2.3, for each adjacency matrix, counting the common neighbor nodes of all node pairs;
S2.4, counting the node degree of each common neighbor node of each node pair, taking the node degree as the importance contribution of the neighbor node in the indirect connection, and calculating the correlation S_AB between the nodes of each pair from the importance of all common neighbor nodes between them through the Adamic-Adar index;
S2.5, adding time as a fourth element of knowledge information to the triple knowledge representation and representing an event as a positive quadruple (h, r, t, T); for a current event (A, r, B, T2) occurring at the current time point, traversing the correlations between node pairs obtained in step S2.4 according to the head entity and tail entity of the current event, and fusing the S_AB whose head entity is A and tail entity is B with a time attenuation function to obtain the correlation influence SIM(A, B) fusing the time factor, which is taken as the historical correlation comprehensive influence of the time slice on the current event that occurs at the current time point with head entity A and tail entity B.
4. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 3, wherein: the time attenuation function is f(T1) = e^{-λ(T2-T1)}; wherein T1 represents the time point at which the historical event between node A and node B occurs in the knowledge graph, T2 represents the time point of the current event whose head entity is A and tail entity is B, λ is a decay factor, and the time attenuation function f(T1) represents the degree to which the influence of the historical event occurring at time point T1 on the current event has attenuated.
5. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 1 or 2, wherein step S3 specifically includes:
S3.1, dividing the historical time axis on which the data set of step S1 lies into time span intervals according to the equal-area division method of the normal distribution; each time span interval contains one or more time slices, and each time slice belongs to exactly one time span interval;
S3.2, for a determined current event, counting the historical correlation comprehensive influence on the current event of the time slices contained in each time span interval, and calculating the comprehensive influence of the time span interval on the current event;
S3.3, assigning different span factors to the different time span intervals, and calculating the compensated comprehensive influence of the historical events on the current event;
S3.4, integrating the compensated comprehensive influence of the historical events on the current event as a weight into a knowledge representation model, constructing an equal number of negative quadruples from the positive quadruples, and using both as model input for training, so as to obtain the vector representations {E_1, E_2, E_3, ..., E_N} and {R_1, R_2, R_3, ..., R_M} of the entities and relations after fusion of time factors; wherein E_N is the vector representation of entity E_N after fusion of time factors, and R_M is the vector representation of relation R_M after fusion of time factors;
S3.5, performing head entity or tail entity replacement on all quadruples in the test set; the two replacement modes are the same, and head entity replacement is taken as an example, specifically: the head entity of the quadruple representing each event in the test set is replaced by each of the N counted entities to construct N candidate quadruples; a score is calculated for each of the N candidate quadruples by a score function, and the score ranking of the candidate quadruple that is identical to the original event in the test set is determined; the score rankings of all events in the test set are collected, and the effect of the link prediction task is judged through the indexes MeanRank and Hits@.
6. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 5, wherein the score function is
f_r(h, t) = ||E_h + R_r - E_t||_{L2}
wherein E_h is the vector representation of the head entity in the entity set E after fusion of time factors, R_r is the vector representation of the relation after fusion of time factors, and E_t is the vector representation of the tail entity in the entity set E after fusion of time factors; L2 denotes the L2 norm.
7. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 5, wherein: the comprehensive influence of the time span interval on the current event in step S3.2 is:
Figure FDA0003197231000000032
wherein l_w is the comprehensive influence of the w-th time span interval on the current event, q_w is the number of time slices contained in the w-th time span interval, and SIM_i(A, B) represents the historical correlation comprehensive influence of the i-th time slice in the time span interval.
8. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 5, wherein: in step S3.3, the W time span intervals are given different span factors
Figure FDA0003197231000000033
and the obtained l_w are accumulated as the compensated comprehensive influence of the historical events on the current event;
Figure FDA0003197231000000034
wherein w = 1, 2, ..., W; W is the total number of time span intervals, q_w is the number of time slices in the w-th time span interval, l_w is the comprehensive influence of the w-th time span interval on the current event, and l is the compensated comprehensive influence of the historical events on the current event whose head entity is A and tail entity is B.
9. The comprehensive influence compensation method based on a time sequence knowledge graph according to claim 5, wherein: the model training process in step S3.4 is represented as:
Figure FDA0003197231000000035
wherein S is the set of positive quadruples, S' is the set of negative quadruples, l_pos is the comprehensive influence of the positive quadruple calculated in S3.3, f_r(h, t) is the score calculation formula of the positive quadruple, l_neg is the comprehensive influence of the negative quadruple calculated in S3.3, f_r(h', t') is the score calculation formula of the negative quadruple, and γ is a standardized term; the training process is the process of minimizing the loss function L, and the output of training is the vector representation of all entities and relations.
CN202110894317.4A 2021-08-05 2021-08-05 Comprehensive influence compensation method based on time sequence knowledge graph Pending CN113742489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110894317.4A CN113742489A (en) 2021-08-05 2021-08-05 Comprehensive influence compensation method based on time sequence knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110894317.4A CN113742489A (en) 2021-08-05 2021-08-05 Comprehensive influence compensation method based on time sequence knowledge graph

Publications (1)

Publication Number Publication Date
CN113742489A true CN113742489A (en) 2021-12-03

Family

ID=78730131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110894317.4A Pending CN113742489A (en) 2021-08-05 2021-08-05 Comprehensive influence compensation method based on time sequence knowledge graph

Country Status (1)

Country Link
CN (1) CN113742489A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858822A (en) * 2023-02-21 2023-03-28 北京网智天元大数据科技有限公司 Time sequence knowledge graph construction method and system
CN115907144A (en) * 2022-11-21 2023-04-04 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Event prediction method and device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination