CN114662693A - Reinforced learning knowledge graph reasoning method based on action sampling - Google Patents
Reinforced learning knowledge graph reasoning method based on action sampling
- Publication number
- CN114662693A (application CN202210244316.XA)
- Authority
- CN
- China
- Prior art keywords
- action
- sampler
- agent
- motion
- lstm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 47
- 230000009471 action Effects 0.000 title claims abstract description 44
- 238000005070 sampling Methods 0.000 title claims abstract description 9
- 238000012549 training Methods 0.000 claims abstract description 22
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 20
- 230000002787 reinforcement Effects 0.000 claims abstract description 18
- 230000008569 process Effects 0.000 claims abstract description 9
- 230000000694 effects Effects 0.000 claims abstract description 4
- 239000003795 chemical substances by application Substances 0.000 claims description 34
- 238000002474 experimental method Methods 0.000 claims description 33
- 230000006870 function Effects 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 7
- 238000004364 calculation method Methods 0.000 claims description 4
- 238000013528 artificial neural network Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 238000005295 random walk Methods 0.000 claims description 2
- 230000008034 disappearance Effects 0.000 claims 1
- 238000004880 explosion Methods 0.000 claims 1
- 230000006386 memory function Effects 0.000 claims 1
- 238000011156 evaluation Methods 0.000 description 4
- 230000004913 activation Effects 0.000 description 3
- 230000003993 interaction Effects 0.000 description 3
- 238000003058 natural language processing Methods 0.000 description 2
- 238000007781 pre-processing Methods 0.000 description 2
- 230000006978 adaptation Effects 0.000 description 1
- 230000002146 bilateral effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Animal Behavior & Ethology (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a reinforcement learning knowledge graph reasoning method based on action sampling. To address three problems of conventional reinforcement-learning inference algorithms for knowledge graphs (insufficient representation capability, selection of invalid redundant actions, and the lack of a memory component), the method selects, according to each representation learning method's raw fact-prediction score on a dataset, the representation method best suited to that dataset to represent the reinforcement learning environment and thereby strengthen the algorithm's representation capability; designs an action sampler to reduce the agent's invalid redundant action selections during walks; and uses an LSTM as a memory component to encode historical information and increase model precision, so that even without pre-training the algorithm outperforms path-based reasoning algorithms. The method maps the path obtained by the agent walking in the environment into a three-layer LSTM policy network, uses action sampling to steer the agent toward more meaningful paths, and finally realizes more accurate learning of entity-relation paths.
Description
Technical Field
The invention belongs to the field of natural language processing.
Background
In recent years, deep learning techniques have achieved state-of-the-art results on a variety of classification and recognition problems. However, complex natural language processing problems often require multiple interrelated decisions, and enabling deep learning models to learn to reason remains a challenge. To process complex queries without obvious answers, an intelligent machine must be able to reason over existing resources and learn to infer unknown answers.
With the continued development of knowledge graph reasoning technology, reinforcement learning has been shown to achieve good results on knowledge reasoning tasks. DeepPath, published at EMNLP 2017, first introduced reinforcement learning into knowledge graph reasoning; it simply samples the knowledge graph and feeds the samples into a policy network for training. The main task is, given an entity pair (entity1, entity2) in a knowledge graph, to have the model reason out the path from the head entity to the tail entity; its subtasks include Link Prediction and Fact Prediction. However, DeepPath suffers from the following problems:
(1) states in the environment are represented simply by TransE, whose representation capability is insufficient;
(2) the random action sampling scheme may cause the agent to take many invalid redundant actions, wasting computation and producing spurious paths;
(3) the state vector is fed directly into the policy network, losing the rich correlations and semantic information among the original states.
To address these problems, the invention proposes a Reinforcement Learning Knowledge Graph Reasoning method based on Action Sampling and an LSTM memory component (RLKGR-ASM).
Disclosure of Invention
The invention provides a reinforcement learning knowledge graph reasoning method based on action sampling, which aims to solve the problems of existing reinforcement learning reasoning methods: insufficient representation capability, invalid action selection, and the lack of a memory component. The method comprises the following steps:
(1) At the data processing layer, select the optimal representation method for each dataset and represent the triples and inference relations in the data as feature vectors.
(2) At the pre-training layer, pre-train the model using a randomized breadth-first search (BFS) strategy and expert data to improve model convergence.
(3) At the retraining layer, retrain with a reward function added, and add the action sampler and the LSTM memory component to the RL model.
(4) The output layer uses the policy network for output.
Drawings
FIG. 1 Flow chart of the RLKGR-ASM algorithm
FIG. 2 Schematic diagram of the LSTM memory component
FIG. 3 Schematic diagram of the action sampler
FIG. 4 MAP scores of the Trans-series embedding methods on the fact prediction task
FIG. 5 MAP comparison on the link prediction task, NELL-995 dataset
FIG. 6 MAP comparison on the link prediction task, FB15K-237 dataset
FIG. 7 Hits@1, Hits@3, MRR and MAP values of this experiment and DeepPath on the link prediction task, NELL-995 and FB15K-237 datasets
FIG. 8 MAP values of TransE, TransR, TransH, TransD, DeepPath and RLKGR-ASM (this experiment) on the fact prediction task
FIG. 9 Number of inference paths used by PRA and this experiment
FIG. 10 Per-round iteration time of DeepPath, RLKGR-ASM (without the action sampler) and RLKGR-ASM (this experiment) on the two datasets (unit: seconds)
Detailed Description
The technical solution in the embodiments of the present invention is described clearly and completely below with reference to FIG. 1.
As shown in FIG. 1, the invention is based on action sampling and an LSTM memory component; the reasoning algorithm mainly comprises four parts: data preprocessing, pre-training, reward retraining, and output. The specific implementation is as follows.
Step one: data processing layer
After basic preprocessing of the two datasets used in the experiments, NELL-995 and FB15K-237, the four embedding-based methods TransE, TransH, TransR, and TransD are applied directly to the fact prediction task, using the same evaluation metric as the final experiments: mean average precision (MAP). The results are shown in FIG. 4: TransD performs best on NELL-995, while TransH performs best on FB15K-237.
The raw inference score of an embedding method on a dataset directly reflects how well that representation method fits the dataset: the higher the score, the better the inference effect, i.e., the more completely the method captures the original semantic information of the data and the stronger the representation capability of the algorithm's environment. On this basis, the invention selects TransD as the representation method for NELL-995 and TransH as the representation method for FB15K-237.
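The selection rule above amounts to picking, per dataset, the embedding method with the highest raw fact-prediction MAP. A minimal sketch follows; the MAP numbers are illustrative placeholders, not the patent's reported values.

```python
# Hypothetical sketch of the data-processing-layer selection rule:
# choose the embedding model with the highest raw fact-prediction MAP.
def select_representation(map_scores: dict) -> str:
    """Return the name of the embedding method with the highest MAP."""
    return max(map_scores, key=map_scores.get)

# Placeholder scores (not the patent's numbers), shaped so that the
# outcome matches the description: TransD for NELL-995, TransH for FB15K-237.
nell_scores = {"TransE": 0.74, "TransH": 0.76, "TransR": 0.75, "TransD": 0.79}
fb_scores = {"TransE": 0.66, "TransH": 0.72, "TransR": 0.69, "TransD": 0.70}

print(select_representation(nell_scores))  # TransD
print(select_representation(fb_scores))    # TransH
```

The chosen method then supplies the entity and relation embeddings that define the reinforcement learning environment for that dataset.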
Step two: pre-training layer
The model is pre-trained using a randomized breadth-first search (BFS) strategy with expert data to improve model convergence.
For each relation, the algorithm learns a supervised policy using a subset of all positive samples (entity pairs). For each positive sample (e_s, e_t), a bidirectional BFS is used during pre-training to find correct paths between the entities. For each path's relation sequence (r_1, r_2, ..., r_n), the parameters θ are updated to maximize the expected reward J(θ), as shown in equation (1).
For this supervised learning, the algorithm rewards each successful walk with +1, and, as shown in equation (2), the gradient of the policy network is updated using the correct paths found by BFS.
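The pre-training update is essentially a REINFORCE-style gradient step on BFS-found expert paths with a fixed +1 reward. The sketch below is a minimal illustration under assumptions: a linear per-state softmax policy and made-up state/action indices stand in for the patent's actual network and knowledge graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, lr = 4, 5, 0.1
theta = rng.normal(size=(n_states, n_actions))  # per-state policy logits (toy)

def policy(state):
    """Softmax policy over actions for one state."""
    logits = theta[state]
    e = np.exp(logits - logits.max())
    return e / e.sum()

# A BFS-found "expert" path as (state, correct action) pairs; each
# successful step earns reward +1, so the update is a gradient ascent
# step on log pi(a|s) scaled by that reward.
expert_path = [(0, 2), (1, 4), (3, 1)]

before = [policy(s)[a] for s, a in expert_path]
for s, a in expert_path:
    p = policy(s)
    grad = -p                       # d log softmax / d logits, off-target part
    grad[a] += 1.0                  # on-target part
    theta[s] = theta[s] + lr * 1.0 * grad  # reward = +1

after = [policy(s)[a] for s, a in expert_path]
```

After the updates, each demonstrated action becomes more probable under the policy, which is the intended effect of the supervised pre-training step.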
Step three: reward retraining layer
The RL agent and the external reinforcement learning environment are defined, and the environment is initialized according to the definition of the global reward function.
The reinforcement learning system consists of two parts. The first part is the external environment E, which specifies the dynamics of the interaction between the knowledge graph (KG) and the agent. The environment is modeled as a Markov Decision Process (MDP), defined as a tuple <S, A, P, R>, where S is the continuous state space, A = {a_1, a_2, ..., a_n} is the set of all available actions, P is the transition probability matrix, and R(s, a) is the reward function for each pair (s, a).
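The <S, A, P, R> tuple over a knowledge graph can be sketched with the KG reduced to an adjacency map. All names and the tiny two-edge graph below are illustrative assumptions, not the patent's environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class KGEnvironment:
    """Minimal sketch of the MDP environment over a knowledge graph."""
    adjacency: Dict[str, List[Tuple[str, str]]]  # entity -> [(relation, next entity)]
    reward: Callable[[str, str], float]          # R(s, a)

    def actions(self, state: str) -> List[str]:
        # The action space at a state is the set of relations leaving it.
        return [r for r, _ in self.adjacency.get(state, [])]

    def step(self, state: str, relation: str) -> str:
        # Deterministic transition: follow the relation if it exists,
        # otherwise stay put (a failed step).
        for r, nxt in self.adjacency.get(state, []):
            if r == relation:
                return nxt
        return state

env = KGEnvironment(
    adjacency={"Rome": [("capital_of", "Italy")],
               "Italy": [("in_continent", "Europe")]},
    reward=lambda s, a: 1.0,  # placeholder; the real R is defined by eqs. (3)-(5)
)
print(env.step("Rome", "capital_of"))  # Italy
```

In the patent's setting the transition matrix P and reward R come from the KG structure and the global/efficiency/diversity rewards; here both are stubbed for clarity.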
The second part of the system is the agent, represented as a policy network π_θ(s, a) = p(a | s; θ). It maps states to a stochastic policy, and the network parameters θ are updated by stochastic gradient descent.
The components of the system are respectively as follows:
action (Action): give the entity pair with relation r (e)s,et) Reinforcement learning agents are expected to find the most informative paths connecting these two entities. Starting from the head entity es, the agent uses the policy network to select the most likely relationship, expanding the path in each step until it reaches the target entity et. We defineThe output dimension of the strategy is equal to the related coefficient in a large-scale Knowledge Graph (KG), namely, the action space is defined as all relations in KG.
Status (States): the entities and relationships in the KG are discrete symbols, and to obtain semantic information for these symbols, the present invention uses TransD and TransH to map NELL-995 and FB15K-237, respectively, into a low-dimensional space. The state vector of the t step is
ωt=(et⊥,etarget⊥-et⊥). Where et is the embedding of the current entity, etargetIs the embedded vector of the target entity.
Reward (Reward): in the rewarding retraining process, the intelligent agent needs to obtain the rewarding feedback to judge the quality of the walk, so as to update the network parameters, and the global rewarding function defined by the invention is shown as the following formula (3):
if the agent can reach the target through a series of actions, a +1 global reward is obtained.
For the relational reasoning task, it is observed that short paths tend to provide more reliable inference evidence than long paths, and shorter relation chains also improve reasoning efficiency by limiting the length of the agent's interaction with the environment. The efficiency reward is defined as shown in equation (4), where a path p is defined as a sequence of relations r_1 → r_2 → ... → r_n.
Training samples (entity1, entity2) have similar state representations in the vector space, so the agent tends to find paths with similar syntax and semantics. Such paths usually contain redundant information; to encourage the agent to find diverse paths, a diversity reward function is defined using the cosine similarity between the current path and existing paths, as shown in equation (5).
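The three reward terms can be sketched as follows. Since equations (3)-(5) are not reproduced in this text, the exact forms below are assumptions: a −1 failure penalty for the global reward, 1/len(p) for the efficiency reward, and negative mean cosine similarity for the diversity reward, which are the common choices in this line of work.

```python
import numpy as np

def global_reward(reached_target: bool) -> float:
    # Sketch of eq. (3): +1 for reaching the target entity.
    # The failure penalty of -1 is an assumption.
    return 1.0 if reached_target else -1.0

def efficiency_reward(path_length: int) -> float:
    # Sketch of eq. (4): shorter paths score higher; 1/len(p) is assumed.
    return 1.0 / path_length

def diversity_reward(path_vec, existing_path_vecs) -> float:
    # Sketch of eq. (5): penalize similarity to already-found paths
    # via negative mean cosine similarity (assumed sign convention).
    cos = [np.dot(path_vec, q) / (np.linalg.norm(path_vec) * np.linalg.norm(q))
           for q in existing_path_vecs]
    return -float(np.mean(cos))

print(efficiency_reward(4))  # 0.25: a 4-hop path earns a smaller bonus
```

A total reward would combine these terms, typically as a weighted sum; the weights are not specified in this text.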
An LSTM memory component is then built and added to the state representation, and the dynamic-environment path planning algorithm is optimized with the LSTM, as shown in FIG. 2.
The policy network of the ordinary DeepPath reinforcement learning algorithm receives only the state representation at the current time, yet the search policy depends on historical information. So that the algorithm can fully exploit the correlation between history and state, the invention uses a three-layer LSTM network to encode the historical search information.
The cell state s_t and output h_t of the LSTM layer at time t are computed as follows.
First, the current input x_t, the previous output h_{t−1}, and the forget-gate bias b_f are combined to obtain the forget-gate activation f_t, which deletes invalid information from the state at time t−1; sigmoid is chosen as the activation function to normalize the output, as shown in equation (6):
f_t = sigmoid(W_{f,x} x_t + W_{f,h} h_{t−1} + b_f)    (6)
Next, the LSTM layer selects the more useful information to store in the cell state s_t. The candidate values s̃_t that may be added to the cell state and the input-gate activation i_t are computed as shown in equations (7) and (8):
s̃_t = tanh(W_{c,x} x_t + W_{c,h} h_{t−1} + b_c)    (7)
i_t = sigmoid(W_{i,x} x_t + W_{i,h} h_{t−1} + b_i)    (8)
The current cell state s_t is then updated from these results, as shown in equation (9), where ⊙ denotes the Hadamard (element-wise) product:
s_t = f_t ⊙ s_{t−1} + i_t ⊙ s̃_t    (9)
Finally, the output h_t of the LSTM network is expressed by equations (10) and (11):
o_t = sigmoid(W_{o,x} x_t + W_{o,h} h_{t−1} + b_o)    (10)
h_t = o_t ⊙ tanh(s_t)    (11)
After three layers of encoding, the history can be expressed as h_t = LSTM(h_{t−1}, ω_t), with h_{t−1} = 0 when t = 0. After encoding, the RL state at this time is denoted s_t = (h_t, ω_t); this state is input into the policy network and trained through a fully connected neural network consisting of two ReLU layers and one Softmax layer to obtain the action probability matrix.
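Equations (6) through (11) can be checked with a minimal NumPy implementation of a single LSTM cell step. This is a sketch under assumptions: the dimensions, random weights, and the 3-step toy history below are illustrative, not the patent's trained three-layer network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, W, b):
    """One LSTM cell update following equations (6)-(11).
    W maps gate name -> (W_x, W_h) weight pair; b maps gate name -> bias."""
    f_t = sigmoid(W["f"][0] @ x_t + W["f"][1] @ h_prev + b["f"])      # forget gate, eq. (6)
    s_cand = np.tanh(W["c"][0] @ x_t + W["c"][1] @ h_prev + b["c"])   # candidate state, eq. (7)
    i_t = sigmoid(W["i"][0] @ x_t + W["i"][1] @ h_prev + b["i"])      # input gate, eq. (8)
    s_t = f_t * s_prev + i_t * s_cand                                 # cell state, eq. (9)
    o_t = sigmoid(W["o"][0] @ x_t + W["o"][1] @ h_prev + b["o"])      # output gate, eq. (10)
    h_t = o_t * np.tanh(s_t)                                          # output, eq. (11)
    return h_t, s_t

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension
W = {g: (rng.normal(size=(d, d)), rng.normal(size=(d, d))) for g in "fico"}
b = {g: np.zeros(d) for g in "fico"}

h, s = np.zeros(d), np.zeros(d)  # h_{t-1} = 0 when t = 0, as in the text
for x_t in rng.normal(size=(3, d)):  # encode a 3-step walk history
    h, s = lstm_step(x_t, h, s, W, b)
```

The final h plays the role of the encoded history in the joint state (h_t, ω_t) passed to the policy network; note that each component of h is bounded in (−1, 1) because it is an output gate times a tanh.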
When the agent selects an action, the invention sets an action sampler, as shown in fig. 3.
In the reinforcement learning algorithm, the agent continuously extends its path through interaction with the environment: after receiving the joint state formed by the current state and the historical information, the policy network outputs an action probability matrix, and the agent selects its next action from this matrix to extend the path.
To prevent the agent from repeatedly selecting invalid paths, an action sampler is added to the action-selection step: whenever the agent's random walk terminates at a dead end, the terminating node e_d and the selected action (relation) r_d are recorded in the action sampler's memory as an invalid entity-relation pair (e_d, r_d). In a subsequent walk, suppose the agent arrives at e_t; if e_t exists in the sampler's entity memory set, the sampler removes r_d from the action space before the next action is selected. The agent's next action can therefore never be an invalid action that has already occurred, which encourages the agent, with higher probability, to complete a full walk and search a more informative path set, while also saving computation.
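The mechanism above reduces to a memory of dead-end (entity, relation) pairs plus a mask over the action space. A minimal sketch, with illustrative entity and relation names:

```python
class ActionSampler:
    """Sketch of the action sampler: remember (e_d, r_d) pairs that ended a
    walk at a dead end, and mask them out of the action space later."""

    def __init__(self):
        self.invalid = set()  # memory of invalid (entity, relation) pairs

    def record_dead_end(self, entity: str, relation: str) -> None:
        # Called when a walk terminates without reaching the target.
        self.invalid.add((entity, relation))

    def filter_actions(self, entity: str, candidate_relations: list) -> list:
        # Remove relations already known to dead-end from this entity.
        return [r for r in candidate_relations
                if (entity, r) not in self.invalid]

sampler = ActionSampler()
sampler.record_dead_end("e_t", "worksFor")  # this edge previously dead-ended
print(sampler.filter_actions("e_t", ["worksFor", "bornIn"]))  # ['bornIn']
```

The agent then samples its next action from the filtered list, so known-invalid choices never recur, which is the source of the reported per-round time savings.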
Step four: output layer
To find inference paths governed by the reward function, the supervised policy network is retrained using the reward function. The training process is similar to pre-training except that the reward-function part is added; the parameter gradient is updated as shown in equation (12).
Starting from the source node, the agent selects a relation according to the stochastic policy π(a | s) to extend its inference path. A chosen relation link may lead the agent to a new entity or may produce no result (the action sampler reduces such occurrences); a failed step earns the agent a negative reward, while successfully reaching the target entity earns a positive reward of +1.
The purpose of Link Prediction is to predict the target entity. For each entity-relation pair (e, r) there is one true target entity and about 10 generated false ones. The results of PRA, DeepPath, TransE, TransR, RLKGR-ASM (this experiment), and RLKGR-ASM (no pre-training) on the 10 most representative relations of NELL-995 and FB15K-237 are listed in FIGS. 5 and 6. As the tables show, the reinforcement-learning-based inference methods outperform the embedding-based methods (TransE, TransR) and the path-based method (PRA) in most cases, and RLKGR-ASM (this experiment) achieves the best overall results on each link prediction task on NELL-995. The MAP of the invention is higher than that of the other algorithms, with particularly strong results on FB15K-237. On NELL-995, the MAP of this experiment is 7.8%, 2.7%, 13.9%, and 1.9% higher than TransE, TransR, PRA, and DeepPath, respectively; on FB15K-237, it is 8.1%, 7.4%, 7.2%, and 4.1% higher, respectively, demonstrating the effectiveness of this experiment.
In addition, even without supervised pre-training on expert data, the link prediction MAP of RLKGR-ASM (without pre-training) on both datasets exceeds the embedding-based methods TransE and TransR and the path-based method PRA, although it remains below DeepPath and this experiment. On NELL-995, the overall MAP of RLKGR-ASM (without pre-training) improves on those conventional algorithms by 5.4%, 0.3%, and 11.5%, respectively; on FB15K-237, by 2.7%, 2.0%, and 1.8%, respectively.
FIG. 7 gives a detailed comparison of this algorithm and DeepPath, listing Hits@1, Hits@3, MRR, and MAP for both algorithms on the link prediction task on the NELL-995 and FB15K-237 datasets. On NELL-995, this experiment improves on DeepPath in Hits@1, Hits@3, MRR, and MAP by 1.3%, 2.5%, 2.1%, and 1.9%, respectively; on FB15K-237, by 4.6%, 6.3%, 4.9%, and 4.1%, respectively.
Fact Prediction aims at predicting the truth of an unknown fact; the ratio of positive to negative triples in the dataset is about 1:10. Unlike link prediction, which ranks target entities, this task directly ranks all positive and negative samples of a particular relation. For the fact prediction task, this section compares TransE, TransR, TransD, TransH, and DeepPath with RLKGR-ASM (this experiment). MAP is used as the evaluation metric for the comparison, with Hits@N and MRR as auxiliary metrics for the fine-grained comparison with DeepPath.
FIG. 8 lists the MAP scores of the embedding-based Trans-series models, DeepPath, and this experiment on the fact prediction task. The experiment outperforms both the embedding-based methods and the conventional reinforcement-learning-based DeepPath: on NELL-995, the MAP of this experiment is 14.4%, 13.8%, 12.1%, 11.4%, and 3.4% higher than TransE, TransR, TransD, TransH, and DeepPath, respectively; on FB15K-237, it is 4.2%, 1.0%, 1.7%, 1.6%, and 0.8% higher, respectively. Performance on NELL-995 is excellent, while the improvement on FB15K-237 is slight.
In addition, taking the NELL-995 link prediction relations "athleteHomeStadium", "worksFor", and "organizationHiredPerson" as examples, FIG. 9 lists the number of inference paths used by PRA and by RLKGR-ASM (this experiment). The number of inference paths used in this experiment is far smaller than that used by PRA, showing that, compared with path-based inference, the reinforcement learning method can achieve a better MAP with a more compact set of learned paths.
As for time overhead (FIG. 10), the added LSTM memory component increases the computational burden: compared with DeepPath, the average per-round iteration time of this experiment is 13.19862 seconds on NELL-995, an increase of 31.88%, and 18.01331 seconds on FB15K-237, an increase of 16.31%.
Without the action sampler, the agent makes invalid action selections during its walks and wastes computation. On NELL-995, the per-round iteration time without the action sampler is 14.25433 seconds, so the action sampler reduces the experiment's time overhead by 7.42%; on FB15K-237, the per-round time without the sampler is 19.23654 seconds, and the action sampler reduces the overhead by 6.34%.
Although illustrative embodiments of the invention have been described above to help those skilled in the art understand it, the invention is not limited in scope to these specific embodiments. Obvious variations that utilize the concepts of the present invention are all intended to fall within its protection.
Claims (4)
1. A reinforcement learning knowledge graph reasoning method based on action sampling, comprising the following steps:
Step 1: at the data processing layer, select the optimal representation method for each dataset and represent the triples and inference relations in the data as feature vectors;
Step 2: at the pre-training layer, pre-train the model using a randomized breadth-first search (BFS) strategy and expert data to improve model convergence;
Step 3 (the core of the patent): add a reward function for retraining, and add the action sampler and LSTM memory component to the RL model; the invention uses a three-layer LSTM network to encode the historical search information, as shown in the following formula:
h_t = LSTM(h_{t−1}, ω_t), with h_{t−1} = 0 when t = 0
The three-layer LSTM receives the entity embedding vector at the current time; the three gating modules added in the LSTM recurrent cell give it a memory function while alleviating the gradient vanishing and gradient explosion problems that conventional recurrent neural networks may suffer. After encoding, the RL state at this time is denoted s_t = (h_t, ω_t); this state is input into the policy network and trained through a fully connected neural network consisting of two ReLU layers and one Softmax layer to obtain the action probability matrix, from which the agent selects its next action to continue extending the path;
The policy network's output action probability matrix is given by the following formula:
π_θ(a_t | s_t) = σ(A_t × W_2 ReLU(W_1 [h_t; s_t]))
To prevent the agent from repeatedly selecting invalid paths, an action sampler is added to the action-selection step: whenever the agent's random walk terminates at a dead end, the terminating node e_d and the selected action (relation) r_d are recorded in the action sampler's memory as an invalid entity-relation pair (e_d, r_d); in a subsequent walk, suppose the agent arrives at e_t: if e_t exists in the sampler's entity memory set, the sampler removes r_d from the action space before the next action is selected, so the agent's next action can never be an invalid action that has already occurred, encouraging the agent, with higher probability, to complete a full walk and search a more informative path set while also saving computation;
Step 4: the output layer uses the policy network for output.
2. The method as claimed in claim 1, wherein step 1 selects the better-performing representation learning method according to the representation ability of different representation learning methods on a specific dataset, improving the representation capability of the reinforcement learning environment from the bottom layer.
3. The method of claim 1, wherein step 3 adds an LSTM memory component to encode historical information and help the agent find inference paths more efficiently, so that the algorithm can dispense with pre-training while achieving accuracy better than path-based and embedding-based inference methods; with pre-training, the method further improves result precision: on NELL-995, the MAP of this experiment is 7.8%, 2.7%, 13.9%, and 1.9% higher than TransE, TransR, PRA, and DeepPath, respectively; on FB15K-237, it is 8.1%, 7.4%, 7.2%, and 4.1% higher, respectively; for the fact prediction task, on NELL-995 the MAP of this experiment is 14.4%, 13.8%, 12.1%, 11.4%, and 3.4% higher than TransE, TransR, TransD, TransH, and DeepPath, respectively, and on FB15K-237 it is 4.2%, 1.0%, 1.7%, 1.6%, and 0.8% higher, respectively.
4. The method according to claim 1, wherein step 3 provides an action sampler that reduces the agent's invalid redundant action selections during walks, promotes the selection of more meaningful paths, and effectively saves time: on NELL-995, the per-round iteration time without the action sampler is 14.25433 seconds, and the action sampler reduces the experiment's time overhead by 7.42%; on FB15K-237, the per-round time without the sampler is 19.23654 seconds, and the action sampler reduces the overhead by 6.34%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210244316.XA CN114662693A (en) | 2022-03-14 | 2022-03-14 | Reinforced learning knowledge graph reasoning method based on action sampling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210244316.XA CN114662693A (en) | 2022-03-14 | 2022-03-14 | Reinforced learning knowledge graph reasoning method based on action sampling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114662693A true CN114662693A (en) | 2022-06-24 |
Family
ID=82029373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210244316.XA Pending CN114662693A (en) | 2022-03-14 | 2022-03-14 | Reinforced learning knowledge graph reasoning method based on action sampling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114662693A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116081797A (en) * | 2022-08-25 | 2023-05-09 | 北控水务(中国)投资有限公司 | Dynamic optimization method, device and equipment for full-flow control quantity of sewage treatment plant |
2022-03-14: CN application CN202210244316.XA filed (published as CN114662693A); status: active, pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Neill | An overview of neural network compression | |
Tamaazousti et al. | Learning more universal representations for transfer-learning | |
US20210034968A1 (en) | Neural network learning apparatus for deep learning and method thereof | |
CN111581343A (en) | Reinforced learning knowledge graph reasoning method and device based on graph convolution neural network | |
CN117435715B (en) | Question answering method for improving time sequence knowledge graph based on auxiliary supervision signals | |
CN108876044B (en) | Online content popularity prediction method based on knowledge-enhanced neural network | |
Schilling et al. | Hyperparameter optimization with factorized multilayer perceptrons | |
Bellinger et al. | Active Measure Reinforcement Learning for Observation Cost Minimization. | |
CN115526317A (en) | Multi-agent knowledge inference method and system based on deep reinforcement learning | |
CN115964459A (en) | Multi-hop inference question-answering method and system based on food safety cognitive map | |
CN116561302A (en) | Fault diagnosis method, device and storage medium based on mixed knowledge graph reasoning | |
CN115526321A (en) | Knowledge reasoning method and system based on intelligent agent dynamic path completion strategy | |
Putra et al. | lpspikecon: Enabling low-precision spiking neural network processing for efficient unsupervised continual learning on autonomous agents | |
CN114662693A (en) | Reinforced learning knowledge graph reasoning method based on action sampling | |
CN118155860A (en) | Method, equipment and medium for aligning traditional Chinese medicine large model preference | |
Rowe | Algorithms for artificial intelligence | |
CN117150041A (en) | Small sample knowledge graph completion method based on reinforcement learning | |
CN116719947A (en) | Knowledge processing method and device for detecting power inspection defects | |
Gupta et al. | A Roadmap to Domain Knowledge Integration in Machine Learning | |
CN114626530A (en) | Reinforced learning knowledge graph reasoning method based on bilateral path quality assessment | |
Peng | A Brief Summary of Interactions Between Meta-Learning and Self-Supervised Learning | |
CN114722212A (en) | Automatic meta-path mining method oriented to character relation network | |
CN113051353A (en) | Attention mechanism-based knowledge graph path reachability prediction method | |
Wang et al. | Research on knowledge graph completion model combining temporal convolutional network and Monte Carlo tree search | |
CN114491080B (en) | Unknown entity relationship inference method oriented to character relationship network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20220624 ||