CN109241291B - Knowledge graph optimal path query system and method based on deep reinforcement learning - Google Patents


Info

Publication number
CN109241291B
CN109241291B (Application CN201810791353.6A)
Authority
CN
China
Prior art keywords
layer
entity
network
reinforcement learning
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810791353.6A
Other languages
Chinese (zh)
Other versions
CN109241291A (en)
Inventor
黄震华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University filed Critical South China Normal University
Priority to CN201810791353.6A priority Critical patent/CN109241291B/en
Publication of CN109241291A publication Critical patent/CN109241291A/en
Application granted granted Critical
Publication of CN109241291B publication Critical patent/CN109241291B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method for querying an optimal path of a knowledge graph based on deep reinforcement learning. The method comprises two modules: the first module is an offline training module for the knowledge graph optimal path model, and the second module is an online application module for that model. The offline training module is provided with a deep reinforcement learning component that subjects the current entity to deep reinforcement training and learning to obtain the next entity; repeating this training and learning yields an optimal path model. The starting entity and the target entity are then input into the optimal path model obtained by the first module to finally obtain the optimal path, which improves operating efficiency.

Description

Knowledge graph optimal path query system and method based on deep reinforcement learning
Technical Field
The invention relates to the field of computers, and in particular to a knowledge graph optimal path query system and method based on deep reinforcement learning.
Background
A Knowledge Graph aims to describe the various entities (Entity) that exist in the real world and the relationships (Relation) among them. It is usually organized and represented as a directed graph: nodes represent entities and edges are formed by relationships, where a relationship connects two entities and indicates whether the association it describes holds between them. If an edge exists between two entities they are associated; otherwise they are not. In practical applications, a value between 0 and 1 is attached to each entity relationship (i.e., each edge of the graph) to reflect the degree of association between entities; depending on the application, this value may represent confidence, closeness, distance, cost, and so on. Such a knowledge graph is called a probabilistic knowledge graph.
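For illustration only, a minimal Python sketch of such a probabilistic knowledge graph as a confidence-weighted directed adjacency structure (the entity and relation names below are hypothetical, not taken from the patent):

```python
from collections import defaultdict

class ProbabilisticKG:
    """Directed graph whose edges carry a relation label and a confidence in (0, 1)."""

    def __init__(self):
        # adjacency: head entity -> list of (relation, tail entity, confidence)
        self.adj = defaultdict(list)

    def add_edge(self, head, relation, tail, confidence):
        if not 0.0 < confidence < 1.0:
            raise ValueError("confidence must lie strictly between 0 and 1")
        self.adj[head].append((relation, tail, confidence))

    def neighbors(self, entity):
        return self.adj[entity]

# hypothetical example entities and relations
kg = ProbabilisticKG()
kg.add_edge("Alice", "works_at", "SCNU", 0.9)
kg.add_edge("SCNU", "located_in", "Guangzhou", 0.95)
```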
Optimal path query between entities of a probabilistic knowledge graph is extremely important for discovering the relationship between two entities, and it is one of the core technologies used in knowledge extraction, entity search, knowledge graph network optimization, entity relationship analysis, and similar applications. For such complex query and retrieval workloads, an effective data organization method and an efficient query processing method are needed to compute the results required by the user accurately and efficiently; improving query efficiency while reducing processing cost is therefore both necessary and challenging. Topologically, a probabilistic knowledge graph is a weighted directed graph.
At present, the mainstream optimal path query methods for graphs include the Dijkstra, Floyd, and Bellman-Ford algorithms. With the advent of the big data era, however, their query efficiency can no longer meet acceptable time limits or fit within the storage space a machine can provide, so they cannot solve the optimal path query problem at massive data scale.
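For reference, a classic Dijkstra-style query over the ProbabilisticKG sketch above might look as follows; taking 1 − confidence as the edge cost is an assumption made only for this illustration:

```python
import heapq

def shortest_path(kg, start, goal):
    """Dijkstra over ProbabilisticKG; edge cost assumed to be 1 - confidence."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue
        for relation, nxt, conf in kg.neighbors(node):
            nd = d + (1.0 - conf)            # assumed cost transform
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = (node, relation)
                heapq.heappush(heap, (nd, nxt))
    # reconstruct the path of (entity, relation, entity) hops; empty if goal unreachable
    path, node = [], goal
    while node in prev:
        parent, relation = prev[node]
        path.append((parent, relation, node))
        node = parent
    return list(reversed(path))
```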
It has been found that for a large-scale data network such as a probabilistic knowledge graph, reducing query time usually means trading space for time by storing the results of frequently issued queries. The Landmarks-BFS method, for example, sorts the entities of the probabilistic knowledge graph by user query frequency, prunes the optimal paths among commonly used entities, and stores those paths in a set. Other approaches accelerate query preprocessing, such as parallel querying based on bidirectional search, target-guided querying, and hierarchy-based querying. These techniques satisfy efficiency requirements, but because pruning discards some intermediate nodes, query accuracy decreases; with incorrect pruning the shortest path may not be found at all, while with too little pruning the search easily degenerates into breadth-first search, which is slow and scales poorly. Querying the exact shortest path of a probabilistic knowledge graph is therefore difficult: a balance must be struck between time and space, and it is hard to guarantee query quality while keeping query time acceptable to users.
Disclosure of Invention
To overcome at least one deficiency of the prior art, the invention provides an optimal path query method between probabilistic knowledge graph entities that offers high accuracy, strong generalization capability, high speed, and easy extensibility.
In order to solve the technical problems, the technical scheme of the invention is as follows:
The system comprises two modules, namely a first module and a second module. The first module is an offline training module for the knowledge graph optimal path model; the second is an online application module for that model. The offline training module is equipped with a deep reinforcement learning component: the current entity undergoes deep reinforcement training and learning to obtain the next entity, and repeating this training and learning over successive entities yields an optimal path model. A starting entity and a target entity are then input into the optimal path model produced by the first module to finally obtain the optimal path. Through the cooperation of the two modules, the goals of high accuracy, strong generalization capability, high speed, and easy extensibility are achieved.
Further, the deep reinforcement learning component consists of an encoder, a network component, and a logistic regression component. The network component comprises a conversion component and a training component; the conversion component comprises a CNN neural network and an FC neural network, and the training component comprises a reinforcement learning Policy network and a reinforcement learning Value network.
Further, the reinforcement learning Policy network consists of five fully connected layers. The number of nodes decreases layer by layer over the first four layers, and the fifth layer has k neurons. Dropout is applied between the first and second layers and between the second and third layers to prevent overfitting, with tanh as the activation function. Batch normalization is applied between the third and fourth layers to enhance the generalization capability of the model, with the sigmoid function as the activation function. The fourth and fifth layers are fully connected, yielding the probabilities of the k candidate relations to be predicted, which serve as the action selection for the next entity;
The reinforcement learning Value network also consists of five fully connected layers, whose widths decrease from the first layer to the fourth layer, and the fifth layer has only one neuron. Dropout is applied between the first and second layers and between the second and third layers to prevent overfitting; the first and second layers use tanh activations and the third layer uses a sigmoid activation. Batch normalization is applied between the third and fourth layers to enhance the generalization capability of the model, with ReLU as the activation function. The fourth and fifth layers are fully connected, and the output is the cumulative return, predicted by the Value network, from the current state to the target state.
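A PyTorch sketch of one possible realization of these two five-layer fully connected networks is given below; the layer widths follow those listed in the detailed description (Policy: 256/64/32/16/10 on a length-512 input, Value: 256/128/64/32/1), while the dropout rate and the exact placement of activations are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Five fully connected layers: 512 -> 256 -> 64 -> 32 -> 16 -> k relation probabilities."""

    def __init__(self, in_dim=512, k=10, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        self.fc5 = nn.Linear(16, k)
        self.drop = nn.Dropout(p_drop)          # dropout between the early layers
        self.bn = nn.BatchNorm1d(32)            # batch normalization between layers 3 and 4

    def forward(self, x):
        x = self.drop(torch.tanh(self.fc1(x)))
        x = self.drop(torch.tanh(self.fc2(x)))
        x = torch.tanh(self.fc3(x))
        x = torch.sigmoid(self.fc4(self.bn(x)))
        return torch.softmax(self.fc5(x), dim=-1)   # probabilities over k candidate relations

class ValueNetwork(nn.Module):
    """Five fully connected layers: 512 -> 256 -> 128 -> 64 -> 32 -> 1 predicted return."""

    def __init__(self, in_dim=512, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, 32)
        self.fc5 = nn.Linear(32, 1)
        self.drop = nn.Dropout(p_drop)
        self.bn = nn.BatchNorm1d(64)

    def forward(self, x):
        x = self.drop(torch.tanh(self.fc1(x)))
        x = self.drop(torch.tanh(self.fc2(x)))
        x = torch.sigmoid(self.fc3(x))
        x = torch.relu(self.fc4(self.bn(x)))
        return self.fc5(x)                       # cumulative return to the target state
```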
The invention provides a knowledge graph optimal path query method based on deep reinforcement learning, which specifically comprises the following steps:
S1, sort the entity relations in the probabilistic knowledge graph from largest to smallest by user access frequency per unit time, select n relations, and generate the required data sample set;
S2, input the data sample set into the deep reinforcement learning component for training and learning;
S3, carry out training and learning in three stages, namely stage 1, stage 2, and stage 3, within the deep reinforcement learning component;
stage 1: an encoder converts each entity into an initial word vector, and a CNN convolutional neural network of 1 to 10 layers further processes the encoded initial word vector into the word vector required by the deep reinforcement learning component;
stage 2: predict the next relationship the current entity passes through, based on the reinforcement learning Policy network;
stage 3: perform value calculation on the selected strategy, based on the reinforcement learning Value network;
S4, obtain the queried optimal path model after the training and learning of step S3;
S5, input a starting entity and a target entity, convert each into a word vector in turn, fuse the two word vectors, and input them into the optimal path model of step S4 until the target entity is found, finally obtaining an optimal query path whose starting point is the starting entity and whose end point is the target entity.
Further, in step S1, n relations are selected, where n is not less than 1/10 of the total number of entity relations in the probabilistic knowledge graph; γ = n/2 relations are randomly selected from the n relations, and these γ relations of the probabilistic knowledge graph, together with the two entities connected by each relation, constitute the data sample set required for model training.
Further, the entities e_1 and e_2 input in stage 1 of step S3 are converted by the encoder and the network component into two word vectors G_θ(e_1) and G_θ(e_2), where θ is the set of network parameters to be optimized. A similarity calculation is performed on the two word vectors obtained in stage 1 to find their cosine distance, as shown in the following formula:
D_θ(e_1, e_2) = ||G_θ(e_1) − G_θ(e_2)||_cos
During training, each received data sample may be denoted as {(F, e_1, e_2)}, where F is the label of the data sample, from which the training loss function L(θ) is constructed as shown in the following formula:
[The loss function appears as an image in the original document (Figure BDA0001734989890000051).]
where n is the total number of training samples.
Further, the loss function L(θ) needs to be minimized, and it can be refined as:
[The refined form of L(θ) appears as an image in the original document (Figure BDA0001734989890000052); it is expressed in terms of L_s and L_u.]
Here L_s denotes the loss between identical entities and L_u the loss between different entities; L_u needs to be made as small as possible so that L_s is as large as possible.
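As an illustration of the similarity step only, the cosine distance D_θ and a generic label-weighted combination of a same-entity term and a different-entity term might be computed as below; the margin and the exact combination are assumptions, since the patent's refined loss is shown only as an image:

```python
import torch
import torch.nn.functional as F

def cosine_distance(g1: torch.Tensor, g2: torch.Tensor) -> torch.Tensor:
    """D_theta(e1, e2): cosine distance between the two encoded word vectors."""
    return 1.0 - F.cosine_similarity(g1, g2, dim=-1)

def pairwise_loss(g1, g2, labels, margin=1.0):
    """Assumed contrastive-style combination: labels == 1 for identical entities,
    labels == 0 for different entities (the patent's exact refined loss is an image)."""
    d = cosine_distance(g1, g2)
    same_term = labels * d.pow(2)                            # pull identical entities together
    diff_term = (1 - labels) * F.relu(margin - d).pow(2)     # push different entities apart
    return (same_term + diff_term).mean()
```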
Further, stages 2 and 3 of step S3 are carried out in the training component of the deep reinforcement learning component, which comprises a policy network and a value network. Stage 2 performs policy training and stage 3 performs value training, optimizing the parameter sets of the two networks, namely the Policy network parameters θ_p and the Value network parameters θ_v. Both trainings use the quadruple <state, reward, action, model>, where states are represented by entities in the probabilistic knowledge graph.
Further, a policy function and a value function based on target-driven deep reinforcement learning are obtained in the policy network and the value network: the policy function is fitted by a neural network acting as a nonlinear function estimator, giving the policy function f(e_t, g|θ_p); the value function, which estimates the return from the current node to the target node, is likewise fitted by a neural network acting as a nonlinear function estimator, giving the value function h(e_t, g|θ_v).
Further, the return obtained from the value function is multiplied by the policy estimate given by the policy function to represent the loss function of the policy network, as shown in the following formula:
L_f = log f(e_t, g|θ_p) × (r_t + γ h(e_{t+1}, g|θ_v) − h(e_t, g|θ_v)),
where γ ∈ (0,1) is a discount factor. L_f is differentiated with respect to the parameters θ_p, and the Policy network parameters θ_p are updated by gradient ascent according to the following formula:
[The update formula appears as an image in the original document (Figure BDA0001734989890000061); in it, ∇ denotes the derivation operation, H(f(e_t, g|θ_p)) denotes the entropy term of the policy function f(e_t, g|θ_p), and the remaining coefficient is the learning rate.]
If the product of the current policy and the return brought by selecting that policy is positive, the Policy network parameters θ_p are updated in the positive direction so that the likelihood of predicting that state next time increases; if the product is negative, θ_p is updated in the reverse direction so that the probability of predicting that state next time becomes as small as possible, until the policy predicted by the current network no longer fluctuates.
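A sketch of how this gradient-ascent actor update could be written in PyTorch; since the exact update formula in the patent is an image, the entropy coefficient β and the use of a generic optimizer are assumptions:

```python
import torch

def policy_update(policy_net, value_net, optimizer, states, next_states,
                  actions, rewards, gamma=0.9, beta=0.01):
    """One gradient-ascent step on L_f = log f(e_t,g|theta_p) * advantage plus an
    assumed entropy bonus. states/next_states: (batch, 512); actions: (batch,) long;
    rewards: (batch,) float. The optimizer minimizes, so the objective is negated."""
    probs = policy_net(states)                                        # f(e_t, g | theta_p)
    log_prob = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    with torch.no_grad():                                             # critic estimates the advantage
        advantage = rewards + gamma * value_net(next_states).squeeze(1) \
                    - value_net(states).squeeze(1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    loss = -(log_prob * advantage + beta * entropy).mean()            # negate for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```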
Further, the absolute value of the difference between the obtained value function h(e_t, g|θ_v) and the actual return r_t + γ h(e_{t+1}, g|θ_v) of the current entity is calculated to obtain the loss function of the value network, as shown in the following formula:
L_h = |(r_t + γ × h(e_{t+1}, g|θ_v)) − h(e_t, g|θ_v)|,
where γ ∈ (0,1) is a discount factor. L_h is differentiated with respect to the parameters θ_v, and the Value network parameters θ_v are updated by gradient descent according to the following formula:
[The update formula appears as an image in the original document (Figure BDA0001734989890000071); ∇ denotes the derivation operation.]
If the error between the predicted return h(e_t, g|θ_v) and the computed return r_t + γ h(e_{t+1}, g|θ_v) is larger than a user-given threshold l, the Value network parameters θ_v are updated to make the prediction error as small as possible, until the error between the predicted return h(e_t, g|θ_v) and the computed return r_t + γ h(e_{t+1}, g|θ_v) no longer fluctuates outside the user-given range [−l, l].
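A corresponding critic-update sketch, minimizing the absolute-error loss L_h above by gradient descent (the choice of optimizer and batching are assumptions):

```python
import torch

def value_update(value_net, optimizer, states, next_states, rewards, gamma=0.9):
    """One gradient-descent step on L_h = |r_t + gamma*h(e_{t+1},g) - h(e_t,g)|."""
    with torch.no_grad():                                 # bootstrap target is not differentiated
        target = rewards + gamma * value_net(next_states).squeeze(1)
    predicted = value_net(states).squeeze(1)
    loss = (target - predicted).abs().mean()              # L_h, averaged over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```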
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
(1) The invention provides a probabilistic knowledge graph and assigns each entity relationship a probability between 0 and 1, so that optimal path queries on the knowledge graph better match practical application requirements.
(2) Because the invention trains in a reinforcement learning manner, it reduces, on the one hand, the poor final results caused by unreasonable label design in existing deep learning methods; on the other hand, it shrinks the search space by saving the shortest path from the current entity to a given entity in each iteration, so the model is more adaptable and more accurate.
(3) The method is based on deep learning: the initial word vector and the target word vector are fused by two pre-trained convolutional neural networks with identical structures and shared weights, which avoids restarting training whenever the target entity changes, improves the generalization capability of the model, and improves calculation accuracy.
(4) Each module of the invention has a clear logical structure, a flexible calculation mode, and good loose coupling. The network structure can be set flexibly to meet calculation requirements, the method is not tied to specific development tools or programming software, and it can be quickly extended to distributed and parallel development environments; reinforcement learning and deep learning in particular can be computed in a distributed manner, improving operating efficiency.
Drawings
Fig. 1 is a technical framework diagram of a knowledge graph optimal path query method based on deep reinforcement learning.
FIG. 2 is a logical block diagram of a deep reinforcement learning component.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
The invention provides a knowledge graph optimal path query system based on deep reinforcement learning, which comprises two modules, namely a first module and a second module. The first module is an offline training module for the knowledge graph optimal path model; the second is an online application module for that model. The offline training module is equipped with a deep reinforcement learning component: the current entity undergoes deep reinforcement training, and through loading and transformation of the data in the first module, the next entity on the way from the current entity to the target entity is learned; repeating this training and learning yields a trained optimal path model. In the second module, the target entity and the starting entity are converted and input into the optimal path model generated by the first module, and an optimal query path is finally obtained. Through the cooperation of the two modules, the goals of high accuracy, strong generalization capability, high speed, and easy extensibility are achieved.
The first module first constructs a data sample set for offline training of the optimal path model, as follows: the entity relations in the probabilistic knowledge graph are sorted from largest to smallest by user access frequency over the most recent m unit-time intervals, and the first n relations are selected, where n is not less than 1/8 of the total number of entity relations in the probabilistic knowledge graph; γ = n/2 relations are then randomly selected from these n relations, so that the γ relations of the probabilistic knowledge graph, together with the two entities connected by each relation, constitute the data sample set required for model training.
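A Python sketch of this sample-set construction, building on the ProbabilisticKG sketch above; the access-frequency bookkeeping (access_counts) is an assumed structure:

```python
import random

def build_sample_set(kg, access_counts, fraction=1/8):
    """Take the top-n most frequently accessed relation instances (n >= 1/8 of all
    edges), then randomly keep gamma = n // 2 of them together with the two entities
    each relation connects. access_counts maps (head, relation, tail) to the access
    frequency over the last m unit-time intervals (an assumed bookkeeping structure)."""
    edges = [(h, r, t, c) for h, triples in kg.adj.items() for (r, t, c) in triples]
    n = max(1, int(len(edges) * fraction))
    ranked = sorted(edges, key=lambda e: access_counts.get((e[0], e[1], e[2]), 0),
                    reverse=True)[:n]
    gamma = n // 2
    sampled = random.sample(ranked, min(gamma, len(ranked)))
    # each sample keeps the relation, its confidence, and the two connected entities
    return [{"head": h, "relation": r, "tail": t, "confidence": c}
            for (h, r, t, c) in sampled]
```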
On this basis, the first module inputs each constructed data sample into the deep reinforcement learning component shown in Fig. 2 for training and learning, and searches for the relationship with the highest next-step probability associated with the current entity; once obtained, the return value of the next entity reached through the selected relationship is fused in to update the parameters of the deep reinforcement learning component. The first module iterates this process, continuously updating the parameters of the deep reinforcement learning component, until the current entity is the target entity or the number of iterations exceeds the maximum iteration threshold given by the user; at that point a candidate path from the starting entity to the target entity is obtained. The first module then calculates the total return of the current candidate path and compares it with the total return of the complete path queried previously; if the return of the current path is higher, the current path is taken as the optimal path queried so far to obtain the optimal path model. This process is repeated until the parameters of the deep reinforcement learning component converge.
As shown in Fig. 2, the deep reinforcement learning component of module one consists of a word2vec encoder, a CNN (Convolutional Neural Network), an FC (Fully Connected) neural network, a reinforcement learning Policy network, a reinforcement learning Value network, and a logistic regression component. The training process of the deep reinforcement learning component is divided into three stages. In stage 1, the word2vec encoder converts an entity into an initial word vector, and a multi-layer CNN convolutional neural network further processes the encoded initial word vector into the word vector required by the deep reinforcement learning component; stage 2 predicts the next relationship the current entity passes through, based on the reinforcement learning Policy network; stage 3 performs value calculation for the selected strategy, based on the reinforcement learning Value network.
In stage 1, the invention first inputs c entities and converts them through a word2vec word-embedding encoder into c corresponding word vectors of identical dimension. Two of these c entity word vectors are then selected at random and fed into a multi-layer CNN convolutional neural network with eight layers in total: the first layer convolves each of the 2 input entity word vectors; the second layer applies max pooling to the first layer's convolutions; the third and fourth layers continue convolving the data produced by the second (pooling) layer; after the max pooling of the fifth layer, the sixth and seventh layers are applied in turn for further convolution; and finally the eighth layer, an average pooling layer, produces the two final word vectors. In particular, after the second and fifth layers complete their max pooling, the outputs are batch-normalized. The word vectors obtained at the eighth layer are the output of stage 1. The training task of the multi-layer CNN convolutional neural network is to compute the distance between the two word vectors obtained at the eighth layer, so that the distance for a positive sample is as small as possible and the distance for a negative sample is as large as possible. In addition, the two multi-layer convolutional neural networks have identical structures and share their network weights.
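A simplified sketch of such a shared-weight (siamese) 1-D CNN encoder in PyTorch; the channel counts, kernel sizes, and paddings here are assumptions rather than the exact dimensions listed later in the detailed description:

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Eight-stage 1-D CNN: conv, max-pool(+BN), conv, conv, max-pool(+BN), conv, conv, avg-pool."""

    def __init__(self, out_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 4, kernel_size=2, stride=2),                   # layer 1: convolution
            nn.MaxPool1d(2), nn.BatchNorm1d(4),                         # layer 2: max pooling + batch norm
            nn.Conv1d(4, 8, kernel_size=4, stride=2, padding=1),        # layer 3: convolution
            nn.Conv1d(8, 8, kernel_size=4, stride=1, padding=2),        # layer 4: convolution
            nn.MaxPool1d(2), nn.BatchNorm1d(8),                         # layer 5: max pooling + batch norm
            nn.Conv1d(8, 16, kernel_size=4, stride=2, padding=1),       # layer 6: convolution
            nn.Conv1d(16, 16, kernel_size=4, stride=1, padding=2),      # layer 7: convolution
            nn.AdaptiveAvgPool1d(8),                                    # layer 8: average pooling
        )
        self.fc = nn.Linear(16 * 8, out_dim)   # project back to a length-512 vector

    def forward(self, x):                      # x: (batch, 512) word2vec vectors
        x = self.features(x.unsqueeze(1))      # add a channel dimension
        return self.fc(x.flatten(1))

# the same encoder instance is applied to both entities, so the weights are shared
encoder = CNNEncoder()
```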
The reinforcement learning Policy network is trained mainly in stage 2. The invention first takes the word vector of the current entity and the word vector of the target entity as input, and the output vector produced by a fully connected layer serves as the input word vector of the Policy network. The Policy network consists of five fully connected layers; the number of nodes decreases layer by layer over the first four layers, and the fifth layer has k neurons. Dropout is applied between the first and second layers and between the second and third layers to prevent overfitting, with tanh as the activation function. Batch normalization is applied between the third and fourth layers to enhance the generalization capability of the model, with the sigmoid function as the activation function. The fourth and fifth layers are fully connected, yielding the probabilities of the k candidate relations to be predicted, which serve as the action selection for the next entity. The output of the Policy network is the relation with the highest probability, which is treated as the action (Action) obtained by the Policy network. The k relations are chosen as follows: first the k_1 relations with the highest confidence are selected, then k − k_1 relations are randomly selected from the remaining ones, and the k relations are sorted by confidence from high to low, giving the k highest-confidence relations output by the Policy network. The training task of the Policy network is to select the best possible strategy, maximizing the return generated by the next entity reached through the selected relation.
Stage 3 mainly trains the reinforcement learning Value network. The input of the Value network is the same as that of the Policy network: the word vector of the current entity and the word vector of the target entity are taken as input, and the output vector is obtained through a fully connected layer. The Value network consists of five fully connected layers whose widths decrease from the first to the fourth layer, and the fifth layer has only one neuron. Dropout is applied between the first and second layers and between the second and third layers to prevent overfitting; the first and second layers use tanh activations and the third layer uses a sigmoid activation. Batch normalization is applied between the third and fourth layers to enhance the generalization capability of the model, with ReLU as the activation function. The fourth and fifth layers are fully connected, and the output is the cumulative return, predicted by the Value network, from the current state to the target state. The training task of the Value network is to minimize the error between the predicted return of the current state and the sum of the confidence of the relation given by the Policy network and the predicted return of the next state.
The second module takes a starting entity and a target entity in the probabilistic knowledge graph as input, converts each of them into a one-dimensional word vector through the word2vec word-embedding encoder followed by the 8-layer CNN convolutional neural network, and then fuses the two one-dimensional word vectors as the input of the reinforcement learning Policy network and Value network. The Policy network and the Value network alternate: starting from the starting entity, each step moves from the current entity to the next entity that is optimal with respect to the target entity, until the target entity is found. Finally, an optimal query path whose starting point is the starting entity and whose end point is the target entity is obtained.
The invention also provides a knowledge graph optimal path query method based on deep reinforcement learning, which specifically comprises the following steps:
S1, first sort the entity relations in the probabilistic knowledge graph from largest to smallest by user access frequency over the most recent m unit-time intervals, and select the first n relations, where n is not less than 1/8 of the total number of entity relations in the probabilistic knowledge graph; then randomly select γ = n/2 relations from the n relations, so that the γ relations of the probabilistic knowledge graph, together with the two entities connected by each relation, constitute the data sample set required for model training.
S2, convert the input current entity and target entity into two one-dimensional word vectors of length 512 using Google's word2vec word-embedding encoder.
S3, then carry out the training and learning of the three stages, stage 1, stage 2, and stage 3, in the deep reinforcement learning component.
Stage 1: two CNN convolutional neural networks with identical structures and shared weights are constructed, as follows:
The first layer of the CNN convolutional neural network contains 512 neurons and uses 2 convolution kernels of size 2×1 with a fixed stride of 2; this layer convolves the one-dimensional word vectors (of length 512) produced by the word2vec word-embedding encoder, yielding 2 one-dimensional vectors of length 256. The second layer applies max pooling to the 2 one-dimensional word vectors output by the first layer, using 2 convolution kernels of size 2×1 with stride 1, again yielding 2 one-dimensional vectors of length 256; a batch normalization operation is then applied to these 2 vectors. The third layer convolves the 2 batch-normalized one-dimensional vectors output by the second layer with 4×1 convolution kernels at a fixed stride of 4, obtaining 8 one-dimensional vectors of length 64. The fourth layer uses 1 convolution kernel of size 4×1 with stride 1 to convolve the 8 one-dimensional vectors output by the third layer again, obtaining 8 one-dimensional vectors of length 64. The fifth layer applies max pooling to the 8 one-dimensional vectors of the fourth layer, with kernel size 2×1, 4 kernels, and stride 2, obtaining 32 one-dimensional vectors of length 32; a batch normalization operation is then applied to these 32 vectors. The sixth layer convolves the 32 batch-normalized one-dimensional vectors output by the fifth layer with 2 convolution kernels of size 4×1 at a fixed stride of 2, obtaining 64 one-dimensional vectors of length 16. The seventh layer convolves the 64 one-dimensional vectors output by the sixth layer with 4×1 convolution kernels at stride 4, obtaining 40 one-dimensional vectors of length 512. Finally, the eighth layer applies average pooling, obtaining 256 one-dimensional vectors of length 4, which are then fully connected to 512 neurons, yielding a one-dimensional vector of length 512.
After the two CNN convolutional neural networks with identical structures and shared weights are constructed, the invention trains and optimizes them using the entities and relations in the probabilistic knowledge graph, as follows:
the inputs of the two CNN convolutional neural networks are respectively two entities e1And e2And the output is two one-dimensional vectors G of length 512θ(e1) And Gθ(e2) And theta is a network parameter set to be optimized. Then, similarity calculation is performed on the two one-dimensional vectors, namely, the cosine distance of the two one-dimensional vectors is calculated: dθ(e1,e2)=||Gθ(e1)-Gθ(e2)||cosIf e is1And e2The two entities differ significantly, then Dθ(e1,e2) Is larger, and if e1And are the same or similar, then Dθ(e1,e2) Is smaller.
Thus, during training, the data samples received by the two CNN convolutional neural networks can be expressed as {(F, e_1, e_2)}, where F is the label of each data sample: F is 1 if e_1 and e_2 represent the same entity, and 0 otherwise. The training loss function is then derived as:
[The loss function appears as an image in the original document (Figure BDA0001734989890000151).]
where n is the total number of training samples.
On this basis, let L_s denote the loss between identical entities and L_u the loss between different entities. To minimize the loss function L(θ), L_u needs to be as small as possible so that L_s is as large as possible. The training loss function L(θ) can thus be refined as:
[The refined loss function appears as an image in the original document (Figure BDA0001734989890000152).]
During training, minimizing the loss function L(θ) ultimately makes the distance between identical entities as small as possible and the distance between different entities as large as possible, which increases the discriminability of the samples. In addition, 1,000,000 sample entities are selected during training; 250,000 pairs of identical entities are randomly selected from them as positive samples and 250,000 pairs of different entities as negative samples, and the mixed samples are fed into the network for training.
After the two CNN convolutional neural networks have been computed, one-dimensional vectors of length 512 corresponding to the current entity and the target entity are obtained. The two vectors are then fused by a further full connection: the two length-512 vectors are concatenated into a one-dimensional vector of length 1024, which is fed into a fully connected layer with 512 neurons, finally producing a one-dimensional vector of length 512 that represents the fused current entity and target entity;
Stages 2 and 3 are mainly used to train the Policy network and the Value network in the deep reinforcement learning component and to optimize their parameter sets, namely the Policy network parameters θ_p and the Value network parameters θ_v. The two stages are trained iteratively, searching for the next optimal strategy and dynamically updating θ_p and θ_v until the globally optimal strategy is obtained. Each iteration finds the target entity within a limited number of steps and updates θ_p and θ_v. In particular, module one sets a maximum number of iterations c_max; if the current iteration count exceeds it, iteration stops.
To this end, the invention first defines, on the basis of the probabilistic knowledge graph, the quadruple <state, reward, action, model> needed in the training of the two networks. States are represented by entities in the probabilistic knowledge graph, e.g., the current entity e_t, the target entity g, and the starting entity s. The reward for moving from the current entity e_t to the next entity e_{t+1} is denoted r_t, which equals the confidence of the relation between e_t and e_{t+1}. The action m is the action selected as the agent's behavior and corresponds to the relation between the current entity and the next entity in the probabilistic knowledge graph. Finally, the model is the target-driven deep reinforcement learning policy function or value function in the Policy network or the Value network: the policy function is fitted by a neural network acting as a nonlinear function estimator, i.e., the policy function is f(e_t, g|θ_p); the value function, which estimates the return from the current node to the target node, is likewise fitted by a neural network acting as a nonlinear function estimator, i.e., the value function is h(e_t, g|θ_v).
Stage 2: first, the parameter set θ_p of the Policy network is initialized randomly. The Policy network then receives as input the one-dimensional vectors corresponding to the current entity and the target entity. The first layer of the Policy network has 256 neurons, fully connected to the one-dimensional vector (of length 512) corresponding to the current and target entities; the second layer has 64 neurons; the third layer has 32 neurons; the fourth layer has 16 neurons; and the fifth layer has 10 neurons, representing the values of 10 output entities and the probabilities of selecting them. These 10 entities consist of the 7 next-layer entities with the highest confidence from the current entity plus 3 entities selected at random from the remaining ones; if the number of next-layer entities is less than 10, the surplus entity slots are padded with 0. The first, second, and third layers use the tanh activation function, while the fourth and fifth layers use the sigmoid activation function. Meanwhile, dropout and batch normalization are applied between layers to improve prediction accuracy. Finally, the 10 neurons of the fifth layer output the probabilities of the 10 relations selected by the Policy network, and the relation with the highest probability, obtained through a softmax function, is taken as the chosen action.
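A sketch of this candidate construction and action choice; the helper select_relation below is hypothetical (it is reused by the training-loop and query sketches given after steps S4 and S5), and the masking of padded slots is an assumption:

```python
import random
import torch

def select_relation(policy_net, kg, current, state, k=10, k_top=7):
    """Build k = 10 candidate edges (the 7 highest-confidence ones plus 3 drawn at
    random from the rest), score them with the Policy network, and return the slot
    index and edge of the highest-probability candidate."""
    edges = sorted(kg.neighbors(current), key=lambda e: e[2], reverse=True)
    rest = edges[k_top:]
    candidates = edges[:k_top] + random.sample(rest, min(k - k_top, len(rest)))
    candidates.sort(key=lambda e: e[2], reverse=True)       # order slots by confidence
    with torch.no_grad():
        probs = policy_net(state)                           # (1, k) probabilities over the slots
    mask = torch.zeros_like(probs)
    mask[0, :len(candidates)] = 1.0                         # ignore padded (empty) slots
    best = int(torch.argmax(probs * mask, dim=1))
    relation, nxt, conf = candidates[best]
    return best, relation, nxt, conf
```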
During stage 2 training, the loss function of the Policy network, expressed as the return obtained from the value function multiplied by the policy estimate given by the current policy function, is as follows:
L_f = log f(e_t, g|θ_p) × (r_t + γ h(e_{t+1}, g|θ_v) − h(e_t, g|θ_v)),
where γ ∈ (0,1) is the discount factor. L_f is then differentiated with respect to the parameters θ_p, and θ_p is updated by gradient ascent, obtaining:
[The update formula appears as an image in the original document (Figure BDA0001734989890000171); in it, ∇ denotes the derivation operation, H(f(e_t, g|θ_p)) denotes the entropy term of the policy function f(e_t, g|θ_p), and the remaining coefficient is the learning rate. The entropy term is added to prevent the Policy network from settling on a suboptimal policy too early and falling into a local optimum.]
If the product of the current policy and the return brought by selecting that policy is positive, the value of θ_p is updated in the positive direction so that the likelihood of predicting that state next time increases; if the product is negative, θ_p is updated in the reverse direction so that the probability of predicting that state next time becomes as small as possible, until the policy predicted by the current network no longer fluctuates;
and (3) stage: first, a parameter set theta of the Value networkvRandom initialization is performed. Then, as with Policy networks, the Value network receives as input one-dimensional vectors corresponding to the current entity and the target entity. The first layer of the Value network is provided with 256 neurons which are in full connection with one-dimensional vectors (with the length of 512) corresponding to the current entity and the target entity; the second layer has 128 neurons; the third layer has 64 neurons; the fourth layer has 32 neurons; the fifth layer has a neuron that represents the value of the current state. Dropout technology is adopted between the first layer and the second layer and between the second layer and the third layerAnd (4) stopping overfitting. The first layer and the second layer both adopt tanh activation functions, and the third layer and the fourth layer both adopt sigmod activation functions. And a batch standardization process is carried out between the third layer and the fourth layer to enhance the generalization capability of the model. And a fully-connected neural network is adopted between the fourth layer and the fifth layer to finally obtain the predicted value.
During stage 3 training, the absolute value of the difference between the actual return of the current entity, r_t + γ h(e_{t+1}, g|θ_v), and the predicted return h(e_t, g|θ_v) is computed and used as the loss function of the Value network, as shown below:
L_h = |(r_t + γ × h(e_{t+1}, g|θ_v)) − h(e_t, g|θ_v)|,
where γ ∈ (0,1) is the discount factor. L_h is then differentiated with respect to the parameters θ_v, and θ_v is updated by gradient descent, obtaining:
[The update formula appears as an image in the original document (Figure BDA0001734989890000181); ∇ denotes the derivation operation.]
If the error between the predicted return h(e_t, g|θ_v) and the computed return r_t + γ h(e_{t+1}, g|θ_v) is greater than a user-specified threshold l, θ_v is updated to make the prediction error as small as possible, until the error between the predicted return h(e_t, g|θ_v) and the computed return r_t + γ h(e_{t+1}, g|θ_v) no longer fluctuates outside the user-given range [−l, l];
and S4, continuously updating the parameters of the deep reinforcement learning component in the iteration process until the current entity is the target entity or the iteration times exceed the maximum iteration threshold value given by the user, and obtaining a candidate path from the initial entity to the target entity. And then, calculating the total return of the current candidate path and comparing the total return with the total return of the complete path of the previous query, if the benefit of the current path is higher than that of the previous query path, taking the current path as the optimal path model of the query, and repeatedly executing the processes until the parameters of the deep reinforcement learning component are converged.
S5, two entities of the probabilistic knowledge graph, a starting entity s and a target entity g, are input and converted into one-dimensional vectors of length 512 by the trained word2vec word-embedding encoder. The two vectors are combined into a one-dimensional vector of length 1024 and used as input to the trained multi-layer CNN convolutional neural network, yielding one-dimensional vectors of length 512 corresponding to the starting entity and the target entity respectively. On this basis, the two one-dimensional vectors are passed through a fully connected layer to generate a new vector of length 1024, which serves as the input of the trained reinforcement learning Policy network and Value network. The Policy network and the Value network alternate: starting from the starting entity, each step moves from the current entity to the next entity that is optimal with respect to the target entity, until the target entity is found. An optimal query path Path(s, g) with starting point s and end point g is thus finally obtained.
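Module two's online query could then be sketched as follows, reusing the hypothetical fuse and select_relation helpers; only the Policy network is used to pick each hop here, which simplifies the alternation of the two networks described above:

```python
def query_optimal_path(policy_net, encoder, kg, fuse, start, goal, c_max=50):
    """Online application (module two): starting from `start`, repeatedly pick the
    next relation with the trained Policy network until `goal` is reached."""
    policy_net.eval()
    path, current = [], start
    for _ in range(c_max):
        if current == goal or not kg.neighbors(current):
            break
        state = fuse(encoder, current, goal)
        _, relation, nxt, conf = select_relation(policy_net, kg, current, state)
        path.append((current, relation, nxt, conf))
        current = nxt
    return path if current == goal else None     # None if the target was not reached
```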
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such changes should be covered by the claims of the present invention.

Claims (9)

1. A knowledge graph optimal path query system based on deep reinforcement learning, characterized by comprising two modules, namely a first module and a second module, wherein the first module is an offline training module of the knowledge graph optimal path model and the second module is an online application module of the knowledge graph optimal path model; the offline training module of the knowledge graph optimal path model is provided with a deep reinforcement learning component; the current entity is subjected to deep reinforcement training and learning to obtain a next entity, and repeated training and learning over successive current entities yields an optimal path model; a starting entity and a target entity are input into the optimal path model obtained by the first module to finally obtain an optimal path;
the deep reinforcement learning component consists of an encoder, a network component, and a logistic regression component, wherein the network component comprises a conversion component and a training component, the conversion component comprises a CNN neural network and an FC neural network, and the training component comprises a reinforcement learning Policy network and a reinforcement learning Value network;
the reinforcement learning Policy network consists of five fully connected layers, the number of nodes decreasing layer by layer over the first four layers and the fifth layer having k neurons; dropout is applied between the first and second layers and between the second and third layers of the Policy network to prevent overfitting, with tanh as the activation function; batch normalization is applied between the third and fourth layers to enhance the generalization capability of the model, with the sigmoid function as the activation function; and the fourth and fifth layers are fully connected, yielding the probabilities of the k candidate relations to be predicted, which serve as the action selection for the next entity.
2. The system according to claim 1, characterized in that the reinforcement learning Value network consists of five fully connected layers whose widths decrease from the first layer to the fourth layer, the fifth layer having only one neuron; dropout is applied between the first and second layers and between the second and third layers of the Value network to prevent overfitting; the activation functions of the first and second layers are tanh and that of the third layer is sigmoid; batch normalization is applied between the third and fourth layers to enhance the generalization capability of the model, with ReLU as the activation function; the fourth and fifth layers are fully connected, and the output is the cumulative return, predicted by the Value network, from the current state to the target state.
3. A knowledge graph optimal path query method based on deep reinforcement learning is characterized by comprising the following steps:
S1, sort the entity relations in the probabilistic knowledge graph from largest to smallest by user access frequency per unit time, select n relations, and generate the required data sample set;
S2, input the data sample set into the deep reinforcement learning component for training and learning;
S3, carry out training and learning in three stages, namely stage 1, stage 2, and stage 3, within the deep reinforcement learning component;
stage 1: an encoder converts each entity into an initial word vector, and a CNN convolutional neural network of 1 to 10 layers further processes the encoded initial word vector into the word vector required by the deep reinforcement learning component;
stage 2: predict the next relationship the current entity passes through, based on the reinforcement learning Policy network;
stage 3: perform value calculation on the selected strategy, based on the reinforcement learning Value network;
S4, obtain the queried optimal path model after the training and learning of step S3;
S5, input a starting entity and a target entity, convert each into a word vector in turn, fuse the two word vectors, and input them into the optimal path model of step S4 until the target entity is found, finally obtaining an optimal query path whose starting point is the starting entity and whose end point is the target entity.
4. The method for querying the optimal path of the knowledge graph based on deep reinforcement learning according to claim 3, characterized in that in step S1 n is not less than 1/10 of the total number of entity relations in the probabilistic knowledge graph; γ = n/2 relations are randomly selected from the n relations, and these γ relations of the probabilistic knowledge graph, together with the two entities connected by each relation, form the data sample set required for model training.
5. The method for querying the optimal path of the knowledge graph based on deep reinforcement learning according to claim 3, characterized in that the entities e_1 and e_2 input in stage 1 of step S3 are converted by the encoder and the network component into two word vectors G_θ(e_1) and G_θ(e_2), where θ is the set of network parameters to be optimized; a similarity calculation is performed on the two word vectors obtained in stage 1 to find their cosine distance, as shown in the following formula:
D_θ(e_1, e_2) = ||G_θ(e_1) − G_θ(e_2)||_cos
during training, each received data sample may be denoted as {(F, e_1, e_2)}, where F is the label of the data sample, from which the training loss function is constructed as shown in the following formula:
[The loss function appears as an image in the original document (Figure FDA0003264348400000021).]
where n is the total number of training samples;
stages 2 and 3 of step S3 are performed in the training component of the deep reinforcement learning component, stage 2 performing policy training and stage 3 performing value training; during training the parameter sets of the two networks, namely the Policy network parameters θ_p and the Value network parameters θ_v, are optimized, and a quadruple <state, reward, action, model> is provided, wherein states are represented by entities in the probabilistic knowledge graph.
6. The method for querying the optimal path of the knowledge graph based on deep reinforcement learning according to claim 5, characterized in that the loss function L(θ) needs to be minimized and can be refined as:
[The refined loss function appears as an image in the original document (Figure FDA0003264348400000022).]
where L_s denotes the loss between identical entities and L_u the loss between different entities; L_u needs to be made as small as possible so that L_s is as large as possible.
7. The method for querying the optimal path of the knowledge graph based on deep reinforcement learning according to claim 5, characterized in that stages 2 and 3 of step S3 are performed in the training component of the deep reinforcement learning component to obtain a policy function and a value function; the policy function is fitted by a neural network acting as a nonlinear function estimator, giving the policy function f(e_t, g|θ_p); the value function, which estimates the return from the current node to the target node, is likewise fitted by a neural network acting as a nonlinear function estimator, giving the value function h(e_t, g|θ_v).
8. The method for querying the optimal path of the knowledge graph based on deep reinforcement learning according to claim 7, characterized in that the return from the current node to the target node is multiplied by the policy estimate given by the policy function to represent the loss function of the policy network, as shown in the following formula:
L_f = log f(e_t, g|θ_p) × (r_t + γ h(e_{t+1}, g|θ_v) − h(e_t, g|θ_v)),
where γ ∈ (0,1) is a discount factor; L_f is differentiated with respect to the parameters θ_p, and the Policy network parameters θ_p are updated by gradient ascent according to the following formula:
[The update formula appears as an image in the original document (Figure FDA0003264348400000023); in it, ∇ denotes the derivation operation, H(f(e_t, g|θ_p)) denotes the entropy term of the policy function f(e_t, g|θ_p), and the remaining coefficient is the learning rate.]
if the product of the current policy and the return brought by selecting that policy is positive, the Policy network parameters θ_p are updated in the positive direction so that the likelihood of predicting that state next time increases; if the product is negative, θ_p is updated in the reverse direction so that the probability of predicting that state next time becomes as small as possible, until the policy predicted by the current network no longer fluctuates.
9. The method for querying the optimal path of the knowledge-graph based on the deep reinforcement learning as claimed in claim 7, wherein the obtained cost function h (e) ist,g|θv) Actual profit r from current entityt+γh(et+1,g|θv) And calculating the absolute value of the difference between the two to obtain a loss function of the value network, which is shown as the following formula:
Lh=|(rt+γ×h(et+1,g|θv))-h(et,g|θv)|,
wherein γ ∈ (0, 1) represents a discount factor; L_h is differentiated with respect to the parameter θ_v, and the parameter θ_v of the Value network is updated in a gradient-descending manner, giving the following formula:
θ_v ← θ_v − α ∇_{θ_v} L_h,
wherein ∇_{θ_v} denotes the derivation operation and α is the learning rate; if the error between the predicted profit h(e_t, g | θ_v) and the calculated profit r_t + γ h(e_{t+1}, g | θ_v) is larger than the threshold l given by the user, the parameter θ_v of the Value network is updated so that the prediction error becomes as small as possible, until the error between the predicted profit h(e_t, g | θ_v) and the calculated profit r_t + γ h(e_{t+1}, g | θ_v) no longer fluctuates outside the user-given range [−l, l].
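Likewise, under the same assumptions, a sketch of the value update of claim 9: the absolute difference L_h between the predicted profit and the calculated profit is minimized by gradient descent, and training of θ_v can stop once the error stays within the user-given threshold range [−l, l]; the default parameter values are placeholders.

```python
import torch

def value_update(value_net, e_t, e_next, g, r_t, gamma=0.9, lr=1e-3, threshold=0.05):
    """One gradient-descent step on L_h = |(r_t + gamma*h(e_{t+1}, g)) - h(e_t, g)|.

    Returns True once the prediction error lies inside the user-given
    threshold range [-threshold, threshold], i.e. theta_v no longer needs updating.
    """
    target = r_t + gamma * value_net(e_next, g).detach()  # calculated profit (TD target)
    predicted = value_net(e_t, g)                         # predicted profit h(e_t, g | theta_v)
    loss = (target - predicted).abs().mean()              # L_h

    optimizer = torch.optim.SGD(value_net.parameters(), lr=lr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item() <= threshold
```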
CN201810791353.6A 2018-07-18 2018-07-18 Knowledge graph optimal path query system and method based on deep reinforcement learning Active CN109241291B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810791353.6A CN109241291B (en) 2018-07-18 2018-07-18 Knowledge graph optimal path query system and method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN109241291A CN109241291A (en) 2019-01-18
CN109241291B true CN109241291B (en) 2022-02-15

Family

ID=65072112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810791353.6A Active CN109241291B (en) 2018-07-18 2018-07-18 Knowledge graph optimal path query system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109241291B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109818786B (en) * 2019-01-20 2021-11-26 北京工业大学 Method for optimally selecting distributed multi-resource combined path capable of sensing application of cloud data center
CN109829579B (en) * 2019-01-22 2023-01-24 平安科技(深圳)有限公司 Shortest route calculation method, shortest route calculation device, computer device, and storage medium
CN111563209B (en) * 2019-01-29 2023-06-30 株式会社理光 Method and device for identifying intention and computer readable storage medium
CN111611339A (en) * 2019-02-22 2020-09-01 北京搜狗科技发展有限公司 Recommendation method and device for inputting related users
CN109947098A (en) * 2019-03-06 2019-06-28 天津理工大学 A kind of distance priority optimal route selection method based on machine learning strategy
CN110347857B (en) * 2019-06-06 2020-12-01 武汉理工大学 Semantic annotation method of remote sensing image based on reinforcement learning
CN110391843B (en) * 2019-06-19 2021-01-05 北京邮电大学 Transmission quality prediction and path selection method and system for multi-domain optical network
CN110288878B (en) * 2019-07-01 2021-10-08 科大讯飞股份有限公司 Self-adaptive learning method and device
CN110825821B (en) * 2019-09-30 2022-11-22 深圳云天励飞技术有限公司 Personnel relationship query method and device, electronic equipment and storage medium
CN110956254B (en) * 2019-11-12 2022-04-05 浙江工业大学 Case reasoning method based on dynamic knowledge representation learning
CN110990548B (en) * 2019-11-29 2023-04-25 支付宝(杭州)信息技术有限公司 Method and device for updating reinforcement learning model
CN110825890A (en) * 2020-01-13 2020-02-21 成都四方伟业软件股份有限公司 Method and device for extracting knowledge graph entity relationship of pre-training model
CN113255347B (en) * 2020-02-10 2022-11-15 阿里巴巴集团控股有限公司 Method and equipment for realizing data fusion and method for realizing identification of unmanned equipment
CN111382359B (en) * 2020-03-09 2024-01-12 北京京东振世信息技术有限公司 Service policy recommendation method and device based on reinforcement learning, and electronic equipment
CN111581343B (en) * 2020-04-24 2022-08-30 北京航空航天大学 Reinforced learning knowledge graph reasoning method and device based on graph convolution neural network
CN111597209B (en) * 2020-04-30 2023-11-14 清华大学 Database materialized view construction system, method and system creation method
CN111401557B (en) * 2020-06-03 2020-09-18 超参数科技(深圳)有限公司 Agent decision making method, AI model training method, server and medium
CN114248265B (en) * 2020-09-25 2023-07-07 广州中国科学院先进技术研究所 Method and device for learning multi-task intelligent robot based on meta-simulation learning
CN112801731B (en) * 2021-01-06 2021-11-02 广东工业大学 Federal reinforcement learning method for order taking auxiliary decision
CN112966591B (en) * 2021-03-03 2023-01-20 河北工业职业技术学院 Knowledge map deep reinforcement learning migration system for mechanical arm grabbing task
CN114626530A (en) * 2022-03-14 2022-06-14 电子科技大学 Reinforced learning knowledge graph reasoning method based on bilateral path quality assessment
CN115099401B (en) * 2022-05-13 2024-04-26 清华大学 Learning method, device and equipment of continuous learning framework based on world modeling
CN115936091B (en) * 2022-11-24 2024-03-08 北京百度网讯科技有限公司 Training method and device for deep learning model, electronic equipment and storage medium
CN117009548B (en) * 2023-08-02 2023-12-26 广东立升科技有限公司 Knowledge graph supervision system based on secret equipment maintenance

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124497A1 (en) * 2015-10-28 2017-05-04 Fractal Industries, Inc. System for automated capture and analysis of business information for reliable business venture outcome prediction
CN106598856B (en) * 2016-12-14 2019-03-01 威创集团股份有限公司 A kind of path detection method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776729A (en) * 2016-11-18 2017-05-31 同济大学 A kind of extensive knowledge mapping path query fallout predictor building method
CN106934012A (en) * 2017-03-10 2017-07-07 上海数眼科技发展有限公司 A kind of question answering in natural language method and system of knowledge based collection of illustrative plates
CN107577805A (en) * 2017-09-26 2018-01-12 华南理工大学 A kind of business service system towards the analysis of daily record big data
CN107944025A (en) * 2017-12-12 2018-04-20 北京百度网讯科技有限公司 Information-pushing method and device
CN108073711A (en) * 2017-12-21 2018-05-25 北京大学深圳研究生院 A kind of Relation extraction method and system of knowledge based collection of illustrative plates

Also Published As

Publication number Publication date
CN109241291A (en) 2019-01-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant