CN113850414B - Logistics scheduling planning method based on graph neural network and reinforcement learning - Google Patents
Logistics scheduling planning method based on graph neural network and reinforcement learning
- Publication number
- CN113850414B · CN202110958524.1A · CN202110958524A
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a logistics scheduling planning method based on a graph neural network and reinforcement learning, comprising the following steps. Step 1: construct a complete feasible solution for a vehicle path planning problem instance. Step 2: a meta-controller selects either a disturbance controller or a lifting controller; once the lifting controller is selected, the set of lifting operators forms its action space, and a graph neural network is trained over this action space. Step 3: perform solution lifting. Step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly chooses a disturbance operator to shuffle and reconstruct a feasible solution, after which iterative lifting resumes in search of an optimal solution. Step 5: among all feasible solutions visited during the lifting and disturbance processes, the solution with the smallest total path length is selected as the optimal solution and the final output of the whole algorithm. Compared with the prior art, the method can efficiently search for better solutions to a given problem and has practical significance for planning problems such as logistics and order allocation.
Description
Technical Field
The invention relates to graph neural network and reinforcement learning technology, and in particular to a method that combines a graph neural network with a policy gradient reinforcement learning algorithm to control the selection of heuristic operators.
Background
NP-hard combinatorial optimization problems are integer-constrained optimization problems that are difficult to solve at large scale. Robust approximation algorithms for NP-hard combinatorial optimization problems have many practical applications and underpin modern industries such as transportation, supply chains, energy, finance and scheduling. A typical example is the Traveling Salesman Problem (TSP): given a graph, the goal is to search the permutation space for the node sequence that visits every node exactly once with the smallest total edge weight (tour length). The TSP and its variants have numerous applications in planning, manufacturing, genetics and other fields.
Although the most successful machine learning techniques fall within supervised learning, i.e., learning a mapping from training inputs to outputs, supervised learning is not applicable to most combinatorial optimization problems because optimal labels cannot be obtained. Meanwhile, when solving path planning problems, conventional reinforcement learning methods are limited to training and solving problems of a fixed node scale because the network parameter scale is fixed.
Disclosure of Invention
In order to overcome the problems in the prior art, the invention provides a logistics scheduling planning method based on graph neural network and reinforcement learning technology. For an input problem instance, it combines a graph neural network, a policy gradient reinforcement learning algorithm and heuristic operators to search effectively for better solutions to the logistics scheduling planning problem.
The technical scheme of the invention is as follows:
a logistics scheduling planning method based on a graph neural network and reinforcement learning comprises the following steps:
step 1: building a complete solution to an instance of a vehicle path planning problem
Step 2: transmitting the complete feasible solution to a meta-controller, and selecting a disturbance controller if the current solution is not lifted through L rounds; otherwise, selecting a lifting controller;
after the lifting controller is selected, selecting an optimal lifting operator for the problem instance and the feasible solution based on the graph neural network trained by reinforcement learning, wherein the set of all lifting operators forms the action space of the lifting controller; training the graph neural network over this action space specifically comprises the following operations:
the network is trained and updated by using a gradient update formula of a classical baseline-based gradient descent method, and the expression is as follows:
where s is the current solution, pi is the current policy (policy), L (pi|s) is the way of the new solution from the current solution and the current policyThe total length of the process, b(s), is a baseline function, the current solution is used to obtain a basic value function to help training, and the whole (L (pi|s) -b (s)) is a return value (reward) after the current solution s is selected to act according to a strategy, and the log θ (pi|s) is the log of the policy action probability;
establishing a solution model comprising the state design, action design, reward value design and policy network design;
generating a probability vector for the input state, the probability vector being the action probabilities generated for a given input solution, and then selecting a lifting operator according to the generated action probabilities to attempt to lift the current solution;
step 3: inputting the current problem instance and the current solution into the solution model trained in step 2; after the action probabilities are obtained, selecting a lifting operator according to the action probabilities to lift the current solution and obtain a new solution; if the new solution is better than the original solution, updating the original solution with the new solution and then performing iterative lifting; otherwise keeping the original solution for iterative lifting and performing the next iteration;
step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to shuffle and reconstruct a feasible solution, and iterative lifting is then performed in the next iterations to find the optimal solution;
step 5: when the total number of steps reaches T, ending the search for the optimal solution of the current problem; the solution with the smallest total path length among all feasible solutions visited during the lifting and disturbance processes is selected as the optimal solution and the final output solution of the whole algorithm.
The solution model comprising the state design, action design, reward value design and policy network design is established as follows:
(1) The state is composed of the demand of the current node, the position coordinates of the current node, the vehicle capacity remaining when the current node is visited on its path, the actions taken in the previous h steps, and the effects of those actions;
the expression of the state is as follows:
$X_v = [\,c_i,\ (x_i, y_i),\ C_i,\ a_{t-h},\ e_{t-h}\,]$
where $a_{t-h}$ is the action taken h steps before the current step t, $e_{t-h}$ is the effect of that action, $(x_i, y_i)$ are the position coordinates of the current node i, $c_i$ is the demand of the current node, and $C_i$ is the vehicle capacity remaining when the current node is visited on its path;
(2) The action space consists of the set of lifting operators and comprises intra-path operators and inter-path operators, wherein an intra-path operator attempts to reduce the total length of a single path, and an inter-path operator adjusts nodes across several paths to reduce the total path length of the solution and jump out of local optima;
(3) The reward value is the return $R^{(n)}$ obtained by the actions taken within one period, expressed as follows:
$R^{(n)} = r_{t+1} + r_{t+2} + \cdots + r_{t+n-1} + Q(S_{t+n}, a_{t+n})$
where $r_{t+1}$ is the reward at step t+1, $\gamma$ is the decay factor, n is the number of long-term steps considered, and $Q(S_{t+n}, a_{t+n})$ is the Q value corresponding to the state-action pair at step t+n;
(4) The policy network design comprises generating a probability distribution over the action space from the input state by using the graph neural network, and specifically comprises the following operations:
the k-th layer of the graph neural network is calculated by:
wherein ,is the eigenvector of node v at the kth layer,/->Is the eigenvector of node v at layer k-1 and will +.>Initialized to X v I.e. the initial feature vector in which the node is set, N (v) is the set of neighbor nodes of v, +.>Is the problem example high-dimensional information extracted by the kth layer of the graph neural network; the design of the graph neural network pair AGGREGATE and COMBINE functions is as follows:
aggregating the node features after the last layer of iteration with the READOUT function to obtain the graph characterization information $h_G$ of the whole graph:
the graph characterization information $h_G$ is calculated as follows:

$h_G = \mathrm{CONCAT}\big(\mathrm{READOUT}(\{\,h_v^{(k)} : v \in G\,\}) \mid k = 0, 1, \ldots, K\big)$
after obtaining the graph feature vector, mapping the graph feature vector to the action space by using an MLP function, obtaining an action probability vector by using a softmax layer, and selecting an action according to the action probabilities to lift the feasible solution, wherein the expression is as follows:
$p_\theta(\pi\mid s) = \mathrm{SOFTMAX}\big(\mathrm{MLP}(h_G)\big)$
where $p_\theta$ is the distribution of this prediction model.
If the lifting operator is applied to the current solution and the obtained new solution is better than the original solution, the new solution replaces the current solution for the next iteration, which specifically comprises the following operations:
after the lifting operator is applied to the current solution, a new solution is obtained; if the total path length of the new solution is smaller than that of the original solution, the current solution is updated to the new solution and the number of non-lifting rounds is reset to 0; if the total distance of the new solution is greater than that of the original solution, iterative lifting continues on the original solution and the number of non-lifting rounds is increased by one, so that the number of non-lifting rounds influences the selection made by the meta-controller.
Applying a disturbance operator to the current solution yields a new solution that replaces the current solution; subsequent iterative lifting is then performed, the number of non-lifting rounds is reset to 0, and the search jumps out of the local optimum.
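To make the flow above concrete, the following Python sketch outlines how the meta-controller, the lifting controller and the disturbance controller could interact over T steps. It is only an illustrative outline under stated assumptions: `select_action`, `improve`, `perturb` and `total_length` are hypothetical placeholder callables standing in for the trained policy network, the lifting operators, the disturbance operators and the route-length evaluation, and are not part of the original disclosure.

```python
def local_search(instance, initial_solution, total_length,
                 select_action, improve, perturb, L=6, T=40000):
    """Sketch of the meta-controller loop in steps 2-5 (assumed interfaces)."""
    current = initial_solution
    best = current
    no_lift_rounds = 0                      # rounds without lifting the current solution
    for _ in range(T):
        if no_lift_rounds >= L:             # meta-controller: choose the disturbance controller
            current = perturb(instance, current)
            no_lift_rounds = 0
        else:                               # meta-controller: choose the lifting controller
            action = select_action(instance, current)
            candidate = improve(instance, current, action)
            if total_length(candidate) < total_length(current):
                current = candidate         # accept the lifted solution
                no_lift_rounds = 0
            else:
                no_lift_rounds += 1         # keep the original solution, count a non-lifting round
        if total_length(current) < total_length(best):
            best = current                  # remember the best solution visited so far
    return best
```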
Compared with the prior art, the invention has the following beneficial effects:
1) For a specific problem instance and its current solution, the method selects an operator that lifts the current solution well, so it can efficiently search for better solutions to a given problem and has practical significance for planning problems such as logistics and order allocation;
2) By means of the graph neural network, the model generalization problem is well addressed: the trained model remains effective on problems of different node scales, which has very strong practical value;
3) Training the model with reinforcement learning avoids the training data set problem of supervised learning and eliminates the cost of collecting the training data that supervised learning would require, which can effectively reduce production cost.
Drawings
FIG. 1 is a schematic overall flow diagram of a logistics scheduling planning method based on a graph neural network and reinforcement learning technology;
FIG. 2 is a schematic diagram of the operation of the meta-controller;
FIG. 3 is a schematic diagram of the operation of the inter-path lifting operator;
FIG. 4 is a schematic diagram of the operation of the in-path lifting operator.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Fig. 1 is a schematic overall flow chart of the logistics scheduling planning method based on graph neural network and reinforcement learning technology. The whole flow of the invention is detailed as follows:
step 1: the rest non-access nodes are sorted in ascending order according to the distance from the last vehicle node adding, if the distances are equal, the ascending order is arranged according to the node requirement; selecting a node which is ranked as a first node, namely adding the node, which causes the least increase of distance and the least demand, generating a solution for the vehicle path planning problem example by using a greedy algorithm, and adding the solution into the current feasible solution set; performing the process iteration until all nodes are added into the solution set to form a complete solution;
step 2: the complete current solution is passed to the meta-controller, which is selected to be the meta-controller if there is no lifting of the current solution through the L (the present invention uses l=6) round, otherwise the lifting controller is selected. As shown in fig. 2, a schematic diagram of the operation process of the meta-controller is shown.
If the lifting controller is selected, an optimal lifting operator is chosen by the graph neural network trained with reinforcement learning and applied to the current solution; if the resulting new solution is better than the original solution, the new solution replaces the current solution and the next iteration proceeds. The lifting controller starts from an initial solution (the greedy solution in the first iteration period, and the perturbed solution in subsequent iteration periods) and then reduces the total path cost of the solution by selecting preferred lifting operators. The set of all lifting operators constitutes the action space of the lifting controller. Relying on the strong ability of graph neural networks to extract and classify graph features, the graph neural network is trained to generate, for an input state, a probability vector over the action space, i.e., the action probabilities for the given input solution. Finally, a lifting operator is selected according to the generated action probabilities to attempt to lift the current solution.
The solution model (comprising the state, action, reward value and policy network designs) is built and trained as described in detail below:
1. state design
The state model is made up of the problem instance, the current solution and the running history. Static states, i.e., problem features, are constant across different solutions of the same problem, such as the demand of the current node and its position coordinates. Solution features change with the current solution, such as the vehicle capacity remaining after visiting the node along the current solution's path. The running history comprises the actions taken previously and their effects: $a_{t-h}$ is the action taken h steps before the current step t, where $1 \le h \le H$ (H is the history length, and different H correspond to different policies), and $e_{t-h}$ is the effect of that action, equal to 1 if the action reduced the total distance and -1 otherwise. Table 1 lists the complete state features.
TABLE 1
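Since the content of Table 1 is not reproduced here, the following sketch only illustrates how the per-node state vector described above could be assembled; the function name, the (action_id, effect) encoding of the running history and the zero-padding of short histories are assumptions made for illustration.

```python
def node_state_vector(demand, coord, remaining_capacity, history, H):
    """Assemble a per-node state from the features above.  `history` is assumed to be
    a list of (action_id, effect) pairs for the most recent steps, where effect is +1
    if the action reduced the total distance and -1 otherwise."""
    x, y = coord
    features = [demand, x, y, remaining_capacity]
    padded = ([(0, 0)] * H + list(history))[-H:]   # keep exactly the last H entries
    for action_id, effect in padded:
        features.extend([action_id, effect])
    return features

# example: node with demand 3 at (0.2, 0.7), 12 units of capacity left, H = 2
print(node_state_vector(3, (0.2, 0.7), 12, [(5, 1), (2, -1)], H=2))
# -> [3, 0.2, 0.7, 12, 5, 1, 2, -1]
```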
2. Action design
The action space is composed of the set of lifting operators, including intra-path operators, which attempt to reduce the total length of a single path (route), and inter-path operators, which attempt to reduce the total path length of the solution by adjusting nodes across multiple paths. Table 2 shows a subset of the lifting operators.
TABLE 2
The same lifting operator with different parameters should be considered as different actions. For example, the operator Relocate(m), for m = 1, 2, 3, should be considered as 3 actions. Here the lengths m and n refer to numbers of nodes; for example, a node segment with m = 3 contains three consecutive nodes of a path. Different actions can affect different solutions to different degrees, so selecting the lifting operator with the greatest lifting effect for each solution to be lifted is a major challenge.
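As an illustration of one intra-path lifting operator, the sketch below implements a Relocate(m)-style move under the assumption that a route is a list of node indices starting and ending at the warehouse; the exact operators listed in Table 2 are not reproduced and may differ in detail.

```python
import math

def route_length(route, coords):
    """Total Euclidean length of one route (consecutive-node edges)."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))

def relocate_m(route, coords, m):
    """Try moving every segment of m consecutive customers to every other position
    in the same route; return the shortest route found (the input route if no
    relocation improves it)."""
    best, best_len = route, route_length(route, coords)
    for i in range(1, len(route) - m):                 # keep the depot endpoints fixed
        segment = route[i:i + m]
        rest = route[:i] + route[i + m:]
        for j in range(1, len(rest)):                  # reinsert between customers
            candidate = rest[:j] + segment + rest[j:]
            candidate_len = route_length(candidate, coords)
            if candidate_len < best_len:
                best, best_len = candidate, candidate_len
    return best
```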
3. Reward value design
The total path length of the best solution obtained in the first lifting period (i.e., the whole process of iteratively lifting one initial solution until L = 6 rounds pass without any lift) is taken as the reference value for the current problem instance; every lifting action taken in each subsequent lifting period then obtains a return equal to the difference between the total path length of the best solution obtained in that lifting period and the reference value. This is a baseline-based return design. In classical reinforcement learning with the n-step TD error, the return obtained by the actions taken in one period is expressed as:
$R^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-2} r_{t+n-1} + \gamma^{n-1} Q(S_{t+n}, a_{t+n})$   (1)
In the formula (1), $r_{t+1}$ is the reward at step t+1, $\gamma$ is the decay factor, n is the number of long-term steps considered, and $Q(S_{t+n}, a_{t+n})$ is the Q value corresponding to the state-action pair at step t+n; this comprehensive return combines each action with its long-term benefit for training.
However, experiments showed that the first action taken after one iteration of the lifting period normally receives a larger return. This is because the new solution obtained by the disturbance operation at the end of a lifting period is taken as the initial solution of the next period and is usually of poor quality (its total path length is high), so the first action applied to this poor initial solution achieves a large drop in total path length, while the amplitude and frequency of each subsequent lift gradually decrease (the solution becomes harder and harder to lift) as the solution improves over the iteration period. If the return were computed with the decay factor above, it would therefore be unfair to all of the actions that contribute to producing the locally optimal solution. The invention sets the decay factor $\gamma = 1$, i.e., every action that produces a lifting effect within one iteration period obtains the same return, and the return obtained by the actions taken in one period becomes:
$R^{(n)} = r_{t+1} + r_{t+2} + \cdots + r_{t+n-1} + Q(S_{t+n}, a_{t+n})$
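As a small illustration of this return with the decay factor set to 1, the sketch below assumes `rewards` holds $r_{t+1}, \ldots, r_{t+n-1}$ and `q_bootstrap` stands in for $Q(S_{t+n}, a_{t+n})$.

```python
def n_step_return(rewards, q_bootstrap, gamma=1.0):
    """n-step return; with gamma = 1 every improving action in the period gets equal credit."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret + (gamma ** len(rewards)) * q_bootstrap

# with gamma = 1 this reduces to sum(rewards) + q_bootstrap
print(n_step_return([0.4, 0.1, 0.0], q_bootstrap=0.2))   # -> 0.7
```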
4. Policy network design
The graph neural network generates a probability distribution over the action space from the input state, and an action is then selected according to the action probabilities; this is the policy decision process. The principle of the graph neural network is described in detail below:
problem instance high-dimensional information extracted by the kth layer of the graph neural network is calculated by
In the formulas (2) and (3),is the eigenvector of node v at the kth layer and will +.>Initialized to X v I.e., the initial feature vector for which the node is set, N (v) is the set of contiguous nodes of v. Wherein the design of the graph neural network pair AGGREGATE and COMBINE functions is as follows:
Both the AGGREGATE and COMBINE functions are summation functions; summation was selected as the aggregation function after experiments showed it to have the best effect.
The graph neural network uses the READOUT function to aggregate the node features after the last layer of iteration and obtain the graph characterization information $h_G$ of the whole graph:

$h_G = \mathrm{READOUT}\big(\{\,h_v^{(K)} : v \in G\,\}\big)$   (5)
In equation (5), the READOUT function can be any simple permutation-invariant function (i.e., a function whose output does not change with the order of its inputs), such as a summation or averaging function.
The graph characterization used in the graph neural network is calculated as follows:

$h_G = \mathrm{CONCAT}\big(\mathrm{READOUT}(\{\,h_v^{(k)} : v \in G\,\}) \mid k = 0, 1, \ldots, K\big)$   (6)

In order to preserve structural information as much as possible and to distinguish and classify each graph, the graph neural network applies the READOUT function to the node features of each layer to obtain a per-layer graph feature vector, and then concatenates the per-layer graph feature vectors into the final graph feature vector. Through experiments, the invention selects the summation function, which performs well, as the READOUT function.
After obtaining the graph feature vector, the invention maps it to the action space with an MLP (set to two layers with a hidden dimension of 64), obtains the action probability vector with a softmax layer, and selects an action according to the action probabilities to lift the solution.
$p_\theta(\pi\mid s) = \mathrm{SOFTMAX}\big(\mathrm{MLP}(h_G)\big)$   (7)
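A compact numpy sketch of this policy network is given below. It follows the spirit of equations (2)-(7) — sum aggregation over neighbors, per-layer readouts concatenated into $h_G$, then an MLP and softmax — but the per-layer projection matrices, the tanh nonlinearity and the dense adjacency representation are illustrative assumptions, not details taken from the original text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gnn_policy(node_feats, adjacency, weights, mlp, num_layers=5):
    """node_feats: [num_nodes, d]; adjacency: [num_nodes, num_nodes]; weights: one
    [d, d] matrix per layer (assumed); mlp: callable mapping h_G to action logits."""
    h = node_feats                                      # h^(0) = X_v
    readouts = [h.sum(axis=0)]                          # READOUT of layer 0 (sum over nodes)
    for k in range(num_layers):
        neighbour_sum = adjacency @ h                   # AGGREGATE: sum over N(v)
        h = np.tanh((h + neighbour_sum) @ weights[k])   # COMBINE: sum, then assumed projection
        readouts.append(h.sum(axis=0))                  # per-layer READOUT
    h_G = np.concatenate(readouts)                      # graph characterization h_G
    return softmax(mlp(h_G))                            # action probabilities, cf. equation (7)

# illustrative usage with random parameters (all dimensions are assumptions):
# d, n, a = 8, 20, 10
# X = np.random.rand(n, d); A = (np.random.rand(n, n) < 0.2).astype(float)
# Ws = [np.random.randn(d, d) * 0.1 for _ in range(5)]
# probs = gnn_policy(X, A, Ws, lambda v: v @ (np.random.randn(d * 6, a) * 0.1))
```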
Graph neural networks are usually trained with supervised training, i.e., the network is given a data set and labels to train on; but for a specific solution it is difficult to know in advance which action improves the solution most efficiently, that is, there are no existing labels for supervised learning. The invention therefore uses a reinforcement learning method to train the network parameters of the whole algorithm, using the gradient update formula of the baseline-based policy gradient (gradient descent) method to train and update the network:

$\nabla_\theta J = \mathbb{E}\big[(L(\pi\mid s) - b(s))\,\nabla_\theta \log p_\theta(\pi\mid s)\big]$   (8)
In equation (8), s is the current solution, $\pi$ is the current policy, $L(\pi\mid s)$ is the total path length of the new solution obtained from the current solution under the current policy, b(s) is a baseline function that provides a base value from the current solution to aid training, the whole term $(L(\pi\mid s) - b(s))$ is the return (reward) obtained after the current solution s selects an action according to the policy, and $\log p_\theta(\pi\mid s)$ is the log of the policy's action probability. The formula means that an action is selected from the current solution according to the current policy; if the probability of that action under the policy is small (probabilities lie between 0 and 1), the log of the probability is a negative value of large magnitude. Thus, if a large return is nevertheless obtained from this low-probability action, equation (8) yields a large gradient, because a large update is required to correct this situation.
As described for the network model above, the input state features, consisting of problem features and solution features, are fed in as the node features of each node in the network. The network is set to 5 layers, and the node feature vectors are computed and updated layer by layer according to the neighborhood aggregation and combination formulas of each layer. After the last layer is computed, the graph feature vector representing the current solution is obtained according to the graph characterization formula, and finally this graph feature vector is mapped to the action space to obtain the action probabilities. This is the workflow and principle of the whole policy network.
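The gradient update of equation (8) could be realized roughly as in the PyTorch sketch below; it assumes `policy_net` returns log-probabilities over the lifting operators for a batch of states and that tour lengths are minimized, so the advantage $(L(\pi\mid s) - b(s))$ scales the log-probability of the chosen action. This is a sketch of a baseline policy-gradient step, not the exact training code of the invention.

```python
import torch

def policy_gradient_step(policy_net, optimizer, states, actions, tour_lengths, baselines):
    """One baseline policy-gradient update.  actions: LongTensor of chosen operator
    indices; tour_lengths, baselines: float tensors of L(pi|s) and b(s)."""
    log_probs = policy_net(states)                          # assumed shape [batch, num_actions]
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = tour_lengths - baselines                    # L(pi|s) - b(s)
    loss = (advantage.detach() * chosen).mean()             # descending lowers expected length
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```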
Step 3: the current problem instance and the current solution are input into the solution model trained in step 2; after the action probabilities are obtained, a lifting operator is selected according to the action probabilities to lift the current solution and obtain a new solution. Fig. 3 is a schematic diagram of the operation of an inter-path lifting operator, and Fig. 4 is a schematic diagram of the operation of an intra-path lifting operator. If the new solution is better than the original solution, the original solution is updated with the new solution and iterative lifting continues; otherwise iterative lifting continues on the original solution.
Step 4: the meta-controller selects the disturbance controller when the solution reaches a local optimum (in the invention, when the solution has not been lifted for L rounds). The disturbance controller shuffles and reconstructs a feasible solution by randomly selecting a disturbance operator (perturbation operator), and iterative lifting is then resumed to find the optimal solution. Table 3 shows a subset of the disturbance operators.
TABLE 3
The invention sets a dynamic threshold mechanism to ensure that the new solution generated after the disturbance is not much worse than the current solution or the current optimal solution.
The dynamic threshold mechanism of the invention limits the quality of the new solution after disturbance to the total path length of the optimal solution plus 0.05. If no new solution satisfying this condition is found within 50 disturbance steps, the threshold is increased by 0.1, the step counter is reset to zero, and the previous steps are repeated until a solution meeting the condition is found. For problems of different node scales, the convergence speed and the convergence result can be adjusted by tuning the initial value of the threshold and its increment.
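A sketch of this dynamic threshold mechanism follows, with `perturb_once` and `total_length` as hypothetical placeholder callables for one disturbance application and the route-length evaluation.

```python
def perturb_with_threshold(current, best_length, perturb_once, total_length,
                           init_margin=0.05, margin_step=0.1, max_tries=50):
    """Keep sampling disturbed solutions until one is within `margin` of the best
    total path length; after `max_tries` failures widen the margin and reset the counter."""
    margin, tries = init_margin, 0
    while True:
        candidate = perturb_once(current)
        if total_length(candidate) <= best_length + margin:
            return candidate
        tries += 1
        if tries >= max_tries:
            margin += margin_step          # relax the threshold
            tries = 0                      # reset the step counter
```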
Step 5: when the total number of steps (lifting operations plus disturbance operations) reaches T (T = 40000 in the invention), the search for the optimal solution of the current problem ends. In order to balance exploration and exploitation, the invention adopts an ε-greedy method: with 5% probability the lifting controller (policy network) selects a random action, and otherwise the action is selected according to the action probabilities that the network outputs for the solution.
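The 5% exploration rule can be sketched as follows, assuming `action_probs` is the probability vector output by the policy network for the current solution.

```python
import random

def epsilon_greedy_action(action_probs, epsilon=0.05):
    """With probability epsilon pick a uniformly random lifting operator,
    otherwise sample an operator according to the network's action probabilities."""
    actions = list(range(len(action_probs)))
    if random.random() < epsilon:
        return random.choice(actions)
    return random.choices(actions, weights=action_probs, k=1)[0]
```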
The invention uses an ensemble policy as the final policy of the whole method: 6 different policies are trained with different history lengths of actions in the state feature vector (H = 1, 2, 3, 4, 5, 6), all other parts of the network being identical, and at every time point during operation the solution of best quality among the 6 policies, i.e., the solution with the smallest total path length, is selected to realize the final ensemble solution of the algorithm.
The invention was implemented on 100 CVRP-20 problem instances; each CVRP-20 problem is defined as follows:
for each instance, 20 nodes are randomly generated, with the first node being set as the repository and the remaining nodes being client nodes. Random generation of [1,9 ] for each client node]Is 0, and the capacity of the vehicle is 30. The location ((x) of each node (including the warehouse) i ,y i ) Uniformly sampling from unit square (i.e. x i ,y i Are all uniformly taken from [0,1 ]]In range), and travel cost between two nodes c i,j Then it is simply the euclidean distance between two nodes.
For a problem instance, a policy first generates a random feasible solution for the problem, then iteratively lifts the solution for T = 40000 rounds according to the policy (policy performance at different step counts is shown in detail later), and finally selects the best of the 40000 solutions found as the algorithm's final solution for that problem instance. For the ensemble policy, a set of different policies is trained using actions and effects of different history lengths, and for each problem instance the best solution among these policies is chosen as the final ensemble solution. The experimental results of the invention are all obtained by averaging over 100 randomly sampled problem instances.
As described above, the invention trains 6 different policies based on 6 different history lengths of actions and effects (H ∈ {1, 2, 3, 4, 5, 6}). Ten problem instances were randomly drawn to show the total path lengths of the final solutions of the 6 different policies, and the invention adopts the ensemble policy method to combine the advantages of the individual policies, finally obtaining an ensemble solution close to the optimal solution as the final solution of the algorithm. Table 4 shows the ensemble policy results for the CVRP problem on the 10 problem instances. Both the single policies and the ensemble policy show very fast convergence and good final solutions.
TABLE 4
Claims (1)
1. The logistics scheduling planning method based on the graph neural network and reinforcement learning is characterized by comprising the following steps of:
step 1: constructing a complete solution of a vehicle path planning problem instance;
step 2: transmitting the complete feasible solution to a meta-controller, and selecting a disturbance controller if the current solution has not been lifted for L rounds; otherwise, selecting a lifting controller;
after the lifting controller is selected, selecting an optimal lifting operator for the problem instance and the feasible solution based on the graph neural network trained by reinforcement learning, wherein the set of all lifting operators forms the action space of the lifting controller; training the graph neural network over this action space specifically comprises the following operations:
the network is trained and updated by using the gradient update formula of the classical baseline-based policy gradient method (a gradient descent update), expressed as follows:

$\nabla_\theta J = \mathbb{E}\big[(L(\pi\mid s) - b(s))\,\nabla_\theta \log p_\theta(\pi\mid s)\big]$

where s is the current solution, $\pi$ is the current policy, $L(\pi\mid s)$ is the total path length of the new solution obtained from the current solution under the current policy, b(s) is a baseline function that provides a base value from the current solution to aid training, the whole term $(L(\pi\mid s) - b(s))$ is the return (reward) obtained after the current solution s takes an action according to the policy, and $\log p_\theta(\pi\mid s)$ is the log of the policy action probability;
establishing a solution model comprising the state design, action design, reward value design and policy network design; the specific process is as follows:
(1) The state is composed of the demand of the current node, the position coordinates of the current node, the vehicle capacity remaining when the current node is visited on its path, the actions taken in the previous h steps, and the effects of those actions;
the expression of the state is as follows:
$X_v = [\,c_i,\ (x_i, y_i),\ C_i,\ a_{t-h},\ e_{t-h}\,]$
where $a_{t-h}$ is the action taken h steps before the current step t, $e_{t-h}$ is the effect of that action, $(x_i, y_i)$ are the position coordinates of the current node i, $c_i$ is the demand of the current node, and $C_i$ is the vehicle capacity remaining when the current node is visited on its path;
(2) The action space consists of the set of lifting operators and comprises intra-path operators and inter-path operators, wherein an intra-path operator attempts to reduce the total length of a single path, and an inter-path operator adjusts nodes across several paths to reduce the total path length of the solution and jump out of local optima;
(3) The reward value is the return $R^{(n)}$ obtained by the actions taken within one period, expressed as follows:
$R^{(n)} = r_{t+1} + r_{t+2} + \cdots + r_{t+n-1} + Q(S_{t+n}, a_{t+n})$
where $r_{t+1}$ is the reward at step t+1, $\gamma$ is the decay factor, n is the number of long-term steps considered, and $Q(S_{t+n}, a_{t+n})$ is the Q value corresponding to the state-action pair at step t+n;
(4) The policy network design comprises generating a probability distribution over the action space from the input state by using the graph neural network, and specifically comprises the following operations:
the k-th layer of the graph neural network is calculated by:
wherein ,is the eigenvector of node v at the kth layer,/->Is the eigenvector of node v at layer k-1 and willInitialized to X v I.e. the initial feature vector in which the node is set, N (v) is the set of neighbor nodes of v, +.>Is the problem example high-dimensional information extracted by the kth layer of the graph neural network; the design of the graph neural network pair AGGREGATE and COMBINE functions is as follows:
aggregating the node features after the last layer of iteration with the READOUT function to obtain the graph characterization information $h_G$ of the whole graph:
the graph characterization information $h_G$ is calculated as follows:

$h_G = \mathrm{CONCAT}\big(\mathrm{READOUT}(\{\,h_v^{(k)} : v \in G\,\}) \mid k = 0, 1, \ldots, K\big)$
after obtaining the graph feature vector, mapping the graph feature vector to the action space by using an MLP function, obtaining an action probability vector by using a softmax layer, and selecting an action according to the action probabilities to lift the feasible solution, wherein the expression is as follows:
$p_\theta(\pi\mid s) = \mathrm{SOFTMAX}\big(\mathrm{MLP}(h_G)\big)$
where $p_\theta$ is the distribution of this prediction model;
generating a probability vector for the input state, the probability vector being the action probabilities generated for a given input solution, and then selecting a lifting operator according to the generated action probabilities to attempt to lift the current solution;
step 3: inputting the current problem instance and the current solution into the solution model trained in step 2; after the action probabilities are obtained, selecting a lifting operator according to the action probabilities to lift the current solution and obtain a new solution; if the new solution is better than the original solution, updating the original solution with the new solution and then performing iterative lifting; otherwise keeping the original solution for iterative lifting and performing the next iteration;
step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to shuffle and reconstruct a feasible solution, and iterative lifting is then performed in the next iterations to find the optimal solution;
step 5: when the total number of steps reaches T, ending the search for the optimal solution of the current problem; the solution with the smallest total path length among all feasible solutions visited during the lifting and disturbance processes is selected as the optimal solution and the final output solution of the whole algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110958524.1A CN113850414B (en) | 2021-08-20 | 2021-08-20 | Logistics scheduling planning method based on graph neural network and reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110958524.1A CN113850414B (en) | 2021-08-20 | 2021-08-20 | Logistics scheduling planning method based on graph neural network and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113850414A CN113850414A (en) | 2021-12-28 |
CN113850414B true CN113850414B (en) | 2023-08-04 |
Family
ID=78975656
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110958524.1A Active CN113850414B (en) | 2021-08-20 | 2021-08-20 | Logistics scheduling planning method based on graph neural network and reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113850414B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116187611B (en) * | 2023-04-25 | 2023-07-25 | 南方科技大学 | Multi-agent path planning method and terminal |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797992A (en) * | 2020-05-25 | 2020-10-20 | 华为技术有限公司 | Machine learning optimization method and device |
CN113159432A (en) * | 2021-04-28 | 2021-07-23 | 杭州电子科技大学 | Multi-agent path planning method based on deep reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11443346B2 (en) * | 2019-10-14 | 2022-09-13 | Visa International Service Association | Group item recommendations for ephemeral groups based on mutual information maximization |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111797992A (en) * | 2020-05-25 | 2020-10-20 | 华为技术有限公司 | Machine learning optimization method and device |
CN113159432A (en) * | 2021-04-28 | 2021-07-23 | 杭州电子科技大学 | Multi-agent path planning method based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
A Multi-Graph Attributed Reinforcement Learning based Optimization Algorithm for Large-scale Hybrid Flow Shop Scheduling Problem; Fei Ni et al.; Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; 2021-08-14; full text *
Also Published As
Publication number | Publication date |
---|---|
CN113850414A (en) | 2021-12-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |