CN113850414B  Logistics scheduling planning method based on graph neural network and reinforcement learning
 Publication number
 CN113850414B CN202110958524.1A
 Authority
 CN
 China
 Prior art keywords
 solution
 lifting
 current
 action
 controller
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Active
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING OR COUNTING
 G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
 G06Q10/00—Administration; Management
 G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
 G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem

 G—PHYSICS
 G06—COMPUTING; CALCULATING OR COUNTING
 G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
 G06N3/00—Computing arrangements based on biological models
 G06N3/02—Neural networks
 G06N3/04—Architecture, e.g. interconnection topology
 G06N3/045—Combinations of networks

 G—PHYSICS
 G06—COMPUTING; CALCULATING OR COUNTING
 G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
 G06N3/00—Computing arrangements based on biological models
 G06N3/02—Neural networks
 G06N3/08—Learning methods

 G—PHYSICS
 G06—COMPUTING; CALCULATING OR COUNTING
 G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
 G06Q10/00—Administration; Management
 G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
 G06Q10/063—Operations research, analysis or management
 G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations

 Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
 Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
 Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
 Y02T10/00—Road transport of goods or passengers
 Y02T10/10—Internal combustion engine [ICE] based vehicles
 Y02T10/40—Engine management systems
Abstract
The invention discloses a logistics scheduling planning method based on a graph neural network and reinforcement learning, comprising the following steps. Step 1: construct a complete solution for an instance of the vehicle path planning problem. Step 2: a metacontroller selects either a disturbance controller or a lifting controller; when the lifting controller is selected, the set of lifting operators forms its action space, and a graph neural network is trained over this action space. Step 3: lift the solution. Step 4: if the metacontroller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to disturb and reconstruct a feasible solution, after which iterative lifting continues the search for an optimal solution. Step 5: among all feasible solutions visited during lifting and disturbance, select the one with the smallest total path length as the optimal solution and the final output of the whole algorithm. Compared with the prior art, the method can efficiently search for better solutions to a given problem and is of practical significance for planning problems such as logistics and order allocation.
Description
Technical Field
The invention relates to graph neural network and reinforcement learning technology, and in particular to a method that combines a graph neural network with a policy gradient algorithm from reinforcement learning to control the selection of heuristic operators.
Background
NP-hard combinatorial optimization problems are integer-constrained optimization problems that are difficult to solve at large scale. Robust approximation algorithms for them have many practical applications and underpin modern industries such as transportation, supply chains, energy, finance, and scheduling. A typical example is the Traveling Salesman Problem (TSP): given a graph, the goal is to search the space of permutations for the node sequence with the smallest total edge weight (tour length) that visits every node exactly once. TSP and its variants have numerous applications in planning, manufacturing, genetics, and other fields.
Most successful machine learning techniques belong to supervised learning, i.e. learning a mapping from training inputs to outputs. Supervised learning, however, is not applicable to most combinatorial optimization problems because optimal labels cannot be obtained. Moreover, when solving path planning problems, conventional reinforcement learning methods can only be trained on and solve problems of a fixed node scale, because the scale of their network parameters is fixed.
Disclosure of Invention
To overcome the problems in the prior art, the invention provides a logistics scheduling planning method based on graph neural network and reinforcement learning technology. For an input problem instance, it combines a graph neural network, a policy gradient reinforcement learning algorithm, and heuristic operators so that better solutions to the logistics scheduling planning problem can be searched effectively.
The technical scheme of the invention is as follows:
A logistics scheduling planning method based on a graph neural network and reinforcement learning comprises the following steps:
Step 1: building a complete solution for an instance of the vehicle path planning problem;
Step 2: passing the complete feasible solution to a metacontroller, which selects the disturbance controller if the current solution has not been lifted for L rounds and otherwise selects the lifting controller;
after the lifting controller is selected, selecting an optimal lifting operator for the problem instance and the feasible solution based on the graph neural network trained by reinforcement learning, wherein the set of all lifting operators forms the action space of the lifting controller; training the graph neural network over this action space specifically comprises the following operations:
the network is trained and updated using the gradient update formula of the classical baseline-based gradient descent method, expressed as follows:
$$\nabla_\theta J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\left[\left(L(\pi \mid s) - b(s)\right)\nabla_\theta \log p_\theta(\pi \mid s)\right]$$
where $s$ is the current solution, $\pi$ is the current policy, $L(\pi \mid s)$ is the total path length of the new solution obtained from the current solution under the current policy, $b(s)$ is a baseline function, i.e. a base value computed from the current solution to help training, the whole term $(L(\pi \mid s) - b(s))$ is the return value (reward) after an action is selected for the current solution $s$ according to the policy, and $\log p_\theta(\pi \mid s)$ is the logarithm of the policy's action probability;
establishing a solution model comprising state design, action design, reward value design, and policy network design;
generating a probability vector for the input state, the probability vector being the action probabilities for the given input solution, and then selecting a lifting operator according to these action probabilities to try to lift the current solution;
Step 3: inputting the current problem instance and the current solution into the solution model trained in step 2; after the action probabilities are obtained, selecting a lifting operator according to them and applying it to the current solution to obtain a new solution; if the new solution is better than the original solution, replacing the original solution with the new solution and continuing the iterative lifting from it; otherwise, continuing the iterative lifting from the original solution in the next iteration;
Step 4: if the metacontroller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to disturb and reconstruct a feasible solution, after which iterative lifting resumes in the next iteration to search for an optimal solution;
Step 5: when the total number of steps reaches T, ending the current round of search for the optimal solution of the problem; among all feasible solutions visited during lifting and disturbance, the one with the smallest total path length is selected as the optimal solution and the final output of the whole algorithm.
Establishing the solution model, which comprises state design, action design, reward value design, and policy network design, proceeds as follows:
(1) The state consists of the demand of the current node, the position coordinates of the current node, the vehicle capacity remaining when the current node is visited along its path, the actions taken in the previous h steps, and the effects of those actions;
the state is expressed as follows:
$$X_v = [c_i, (x_i, y_i), C_i, a_{t-h}, e_{t-h}]$$
where $a_{t-h}$ is the action taken h steps before the current step t, $e_{t-h}$ is the effect of that action, $(x_i, y_i)$ are the position coordinates of the current node i, $c_i$ is the demand of the current node, and $C_i$ is the vehicle capacity remaining when the current node is visited along its path;
(2) The action space consists of the set of lifting operators, comprising intra-path operators and inter-path operators, wherein an intra-path operator attempts to reduce the total length of a single path, and an inter-path operator attempts to reduce the total path length of the solution and to escape local optima by adjusting nodes across multiple paths;
(3) The reward value is the return $R^{(n)}$ obtained by the actions taken in one period, expressed as follows:
$$R^{(n)} = r_{t+1} + r_{t+2} + \dots + r_{t+n-1} + Q(S_{t+n}, a_{t+n})$$
where $r_{t+1}$ is the reward at step t+1, n is the number of steps over the long term, and $Q(S_{t+n}, a_{t+n})$ is the Q value of the state-action pair at step t+n; the decay factor $\gamma$ is set to 1 and therefore does not appear in the expression;
(4) The policy network design comprises using the graph neural network to generate a probability distribution over the action space from the input state, and specifically comprises the following operations:
the k-th layer of the graph neural network is calculated by:
$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\{h_u^{(k-1)} : u \in N(v)\}\right)$$
$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)}, a_v^{(k)}\right)$$
where $h_v^{(k)}$ is the feature vector of node v at the k-th layer, $h_v^{(k-1)}$ is the feature vector of node v at layer k−1, $h_v^{(0)}$ is initialized to $X_v$, i.e. the initial feature vector set for the node, $N(v)$ is the set of neighbor nodes of v, and $a_v^{(k)}$ is the high-dimensional problem-instance information extracted by the k-th layer of the graph neural network; in the graph neural network, the AGGREGATE and COMBINE functions are both designed as summation functions;
the node features after the last layer of iteration (layer K) are aggregated with a READOUT function to obtain the graph characterization information $h_G$ of the whole graph:
$$h_G = \mathrm{READOUT}\left(\{h_v^{(K)} \mid v \in G\}\right)$$
the graph characterization information $h_G$ is calculated as follows:
$$h_G = \mathrm{CONCAT}\left(\mathrm{READOUT}\left(\{h_v^{(k)} \mid v \in G\}\right) \mid k = 0, 1, \dots, K\right)$$
after the graph feature vector is obtained, it is mapped to the action space by an MLP function, a softmax layer yields the action probability vector, and an action is selected according to the action probabilities to lift the feasible solution, expressed as follows:
$$p_\theta(\pi \mid s) = \mathrm{SOFTMAX}(\mathrm{MLP}(h_G))$$
where $p_\theta$ is the distribution of this predictive model.
Applying the lifting operator to the current solution and, if the new solution obtained is better than the original solution, replacing the current solution with the new solution for the next iteration specifically comprises the following operations:
after the lifting operator is applied to the current solution, a new solution is obtained; if the total path length of the new solution is smaller than that of the original solution, the current solution is updated to the new solution and the number of rounds without lifting is reset to 0; if the total path length of the new solution is greater than that of the original solution, iterative lifting continues from the original solution and the number of rounds without lifting is increased by one, so that this count influences the selection made by the metacontroller.
The disturbance operator is applied to the current solution to obtain a new solution, which replaces the current solution; subsequent iterative lifting then continues from it and the number of rounds without lifting is reset to 0, so that the search can escape local optima.
Compared with the prior art, the invention has the following beneficial effects:
1) Operators suited to the specific problem instance and the current solution are selected so that the current solution is lifted effectively; better solutions to a given problem can therefore be searched efficiently, which is of practical significance for planning problems such as logistics and order allocation;
2) The graph neural network addresses the model generalization problem: a trained model remains effective on problems of different node scales, which gives the method strong practical value;
3) Training the model with reinforcement learning avoids the need for the labeled training data set required by supervised learning, eliminating the cost of collecting such training data and effectively reducing production cost.
Drawings
FIG. 1 is a schematic overall flow diagram of a logistics scheduling planning method based on a graph neural network and reinforcement learning technology;
FIG. 2 is a schematic diagram of the operation of the metacontroller;
FIG. 3 is a schematic diagram of the operation of the inter-path lifting operator;
FIG. 4 is a schematic diagram of the operation of the intra-path lifting operator.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Fig. 1 is a schematic overall flow chart of the logistics scheduling planning method based on graph neural network and reinforcement learning technology. The overall flow of the invention is detailed as follows:
Step 1: a solution for the vehicle path planning problem instance is generated with a greedy algorithm. The remaining unvisited nodes are sorted in ascending order of distance from the node most recently added to the vehicle's path; nodes at equal distance are sorted in ascending order of demand. The first node in this ordering, i.e. the node whose addition causes the smallest increase in distance and which has the smallest demand, is selected and added to the current feasible solution. This process is iterated until all nodes have been added, forming a complete solution;
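As a rough illustration of the greedy construction in step 1, the following Python sketch builds routes node by node. The function name, the data layout (index 0 as the warehouse), and the capacity handling (a new route is opened once no remaining node fits in the vehicle) are assumptions made for this example; the description above does not specify them.

```python
import math


def greedy_initial_solution(coords, demands, capacity):
    """Build routes greedily; coords[0] is assumed to be the warehouse."""
    unvisited = set(range(1, len(coords)))
    routes = []
    while unvisited:
        route, load, last = [], 0, 0  # every route starts at the warehouse
        while True:
            # Assumption: only nodes that still fit in the vehicle are candidates.
            feasible = [j for j in unvisited if load + demands[j] <= capacity]
            if not feasible:
                break
            # Nearest node first; ties broken by smaller demand, then by index.
            nxt = min(
                feasible,
                key=lambda j: (math.dist(coords[last], coords[j]), demands[j], j),
            )
            route.append(nxt)
            load += demands[nxt]
            unvisited.remove(nxt)
            last = nxt
        if not route:
            raise ValueError("a node demand exceeds the vehicle capacity")
        routes.append(route)
    return routes


routes = greedy_initial_solution(
    coords=[(0, 0), (1, 0), (2, 0), (0, 1)], demands=[0, 2, 2, 2], capacity=4
)
```

In this small example the first vehicle serves nodes 1 and 2 and a second vehicle serves node 3.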
Step 2: the complete current solution is passed to the metacontroller, which selects the disturbance controller if the current solution has not been lifted for L rounds (the invention uses L = 6) and otherwise selects the lifting controller. Fig. 2 is a schematic diagram of the operation of the metacontroller.
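The selection rule of the metacontroller can be sketched in a few lines of Python. The function name, the string labels, and the counter argument are hypothetical; only the threshold L = 6 comes from the description.

```python
def choose_controller(rounds_without_lift, L=6):
    """Return which controller the metacontroller hands the solution to."""
    # After L rounds with no lifting the search is treated as stuck.
    return "disturbance" if rounds_without_lift >= L else "lifting"


choice_early = choose_controller(rounds_without_lift=2)
choice_stuck = choose_controller(rounds_without_lift=6)
```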
If the lifting controller is selected, an optimal lifting operator is chosen by the graph neural network trained with reinforcement learning and applied to the current solution; if the new solution obtained is better than the original solution, it replaces the current solution and the next iteration proceeds. The lifting controller starts from an initial solution (produced by the greedy algorithm in the first iteration period and by the disturbance in each subsequent period) and then reduces the total path cost of the solution by selecting suitable lifting operators. The set of all lifting operators constitutes the action space of the lifting controller. Drawing on the strong ability of graph neural networks to extract and classify graph features, the graph neural network is trained to generate, for an input state, a probability vector over the action space, i.e. the action probabilities for the given input solution. Finally, a lifting operator is selected according to these action probabilities to try to lift the current solution.
The solution model (comprising the state, action, and reward value designs and the policy network) is built and trained as described below:
1. State design
The state is modeled from the problem instance, the current solution, and the running history. Problem features are static: for the same problem they are constant across different solutions, e.g. the demand of the current node and its position coordinates. Solution features change with the current solution, e.g. the vehicle capacity remaining after the node is visited along the path of the current solution. The running history comprises the actions previously taken and their effects: $a_{t-h}$ is the action taken h steps before the current step t, where 1 ≤ h ≤ H (H is the history length, and different values of H correspond to different policies), and $e_{t-h}$ is the effect of that action: if the action reduced the total distance of the solution, $e_{t-h}$ is 1, otherwise $e_{t-h}$ is −1. Table 1 shows the complete state features.
TABLE 1
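To make the state layout concrete, the sketch below assembles the per-node feature vector [demand, x, y, remaining capacity, past actions and effects] in Python. The function name, the flat list encoding, and the use of integer action identifiers are assumptions for illustration only.

```python
def node_features(routes, coords, demands, capacity, history):
    """Return {node: feature list} for every node that appears in a route.

    history is a list of (action_id, effect) pairs for the previous h steps,
    with effect equal to +1 or -1 (hypothetical encoding).
    """
    history_flat = [value for pair in history for value in pair]
    features = {}
    for route in routes:
        load = 0
        for i in route:
            load += demands[i]
            remaining = capacity - load  # capacity left after visiting node i
            x, y = coords[i]
            features[i] = [demands[i], x, y, remaining] + history_flat
    return features


features = node_features(
    routes=[[1, 2], [3]],
    coords=[(0.5, 0.5), (0.1, 0.2), (0.3, 0.4), (0.9, 0.8)],
    demands=[0, 2, 3, 4],
    capacity=10,
    history=[(5, 1), (2, -1)],
)
```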
2. Action design
The action space consists of the set of lifting operators, comprising intra-path operators, which attempt to reduce the total length of a single path (route), and inter-path operators, which attempt to reduce the total path length of the solution by adjusting nodes across multiple paths. Table 2 lists a subset of the lifting operators.
TABLE 2
The same lifting operator with different parameters is treated as different actions. For example, the operator Relocate (2) with m = 1, 2, 3 is treated as 3 actions. Here the lengths m and n refer to numbers of nodes; for example, a node segment with m = 3 is a path segment containing three nodes. Each action may affect different solutions to different degrees, so selecting, for each solution to be lifted, the lifting operator with the greatest lifting effect is a major challenge.
3. Reward value design
The total path length of the best solution obtained in the first lifting period (i.e. the whole process from the start of iterative lifting until L = 6 rounds pass without lifting) is taken as the reference value for the current problem instance. In each subsequent lifting period, every lifting action taken receives a return equal to the difference between the total path length of the best solution obtained in that period and the reference value. Under this baseline-based return design, the return obtained by the actions taken in one period, in the form of the classical reinforcement learning n-step TD error, is expressed as follows:
$$R^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-2} r_{t+n-1} + \gamma^{n-1} Q(S_{t+n}, a_{t+n}) \tag{1}$$
In equation (1), $r_{t+1}$ is the reward at step t+1, $\gamma$ is the decay factor, n is the number of steps over the long term, and $Q(S_{t+n}, a_{t+n})$ is the Q value of the state-action pair at step t+n; the expression combines long-term actions and benefits into an overall return used for training.
During the experiments, however, it was found that the first action taken in a lifting period usually receives a larger return. This is because the initial solution of each period is the new solution produced by the disturbance at the end of the previous period, which is usually of poor quality (its total path length is high), so the first action applied to it achieves a large drop in total path length. As the solution is lifted over the period, the magnitude and frequency of further lifts gradually decrease (the solution becomes harder and harder to lift). Calculating the return with the decay factor above would therefore be unfair to the actions that together produce the locally optimal solution. The invention sets the decay factor γ = 1, so that every action producing a lifting effect within one period receives the same return, and the return obtained by the actions taken in one period becomes:
$$R^{(n)} = r_{t+1} + r_{t+2} + \dots + r_{t+n-1} + Q(S_{t+n}, a_{t+n})$$
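The effect of the decay factor on the return can be checked numerically with a small Python sketch. The function name and the example numbers are hypothetical; the formula follows equation (1), with γ = 1 giving the undiscounted form used by the invention.

```python
def n_step_return(rewards, q_value, gamma=1.0):
    """Return r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1) * Q(S_{t+n}, a_{t+n}).

    rewards holds r_{t+1} .. r_{t+n-1}, so n = len(rewards) + 1.
    """
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** len(rewards) * q_value


undiscounted = n_step_return([1.0, 2.0, 3.0], q_value=4.0)            # gamma = 1
discounted = n_step_return([1.0, 2.0, 3.0], q_value=4.0, gamma=0.5)
```

With γ = 1 every reward in the period carries the same weight, whereas with γ = 0.5 the later rewards and the Q value contribute far less.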
4. Policy network design
The graph neural network generates a probability distribution over the action space from the input state, and an action is then selected according to the action probabilities; this process is the policy decision process. The principle of the graph neural network is described in detail below:
The high-dimensional problem-instance information extracted by the k-th layer of the graph neural network is calculated by:
$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\left(\{h_u^{(k-1)} : u \in N(v)\}\right) \tag{2}$$
$$h_v^{(k)} = \mathrm{COMBINE}^{(k)}\left(h_v^{(k-1)}, a_v^{(k)}\right) \tag{3}$$
In equations (2) and (3), $h_v^{(k)}$ is the feature vector of node v at the k-th layer, with $h_v^{(0)}$ initialized to $X_v$, i.e. the initial feature vector set for the node, and $N(v)$ is the set of nodes adjacent to v.
In the graph neural network, the AGGREGATE and COMBINE functions are designed as summation functions; summation was selected as the aggregation function because it performed best in experiments.
The graph neural network uses a READOUT function to aggregate the node features after the last layer of iteration (layer K), yielding the graph characterization information $h_G$ of the whole graph:
$$h_G = \mathrm{READOUT}\left(\{h_v^{(K)} \mid v \in G\}\right) \tag{5}$$
In equation (5), the READOUT function can be any simple permutation-invariant function (i.e. one whose output does not depend on the order of its inputs), such as summation or averaging.
The graph characterization in the graph neural network is calculated as follows:
$$h_G = \mathrm{CONCAT}\left(\mathrm{READOUT}\left(\{h_v^{(k)} \mid v \in G\}\right) \mid k = 0, 1, \dots, K\right) \tag{6}$$
To preserve structural information as far as possible and to distinguish and classify individual graphs, the graph neural network applies the READOUT function to the node features of each layer to obtain a graph feature vector per layer, and then concatenates the per-layer graph feature vectors into the final graph feature vector. Based on experiments, the invention selects summation, which performed well, as the READOUT function.
After obtaining the graph feature vector, the invention maps it to the action space with an MLP (two layers with a hidden dimension of 64 in the invention), then obtains the action probability vector with a softmax layer, and selects an action according to the action probabilities to lift the solution:
$$p_\theta(\pi \mid s) = \mathrm{SOFTMAX}(\mathrm{MLP}(h_G)) \tag{7}$$
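The forward pass described in this section can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the description does not fix: the per-layer transform after the summed COMBINE is a single linear map with ReLU, the layer-0 features are included in the concatenated readout, weights are drawn at random, and all names, dimensions other than the MLP hidden size of 64, and the toy graph are hypothetical.

```python
import math
import random


def linear(x, W, b):
    # Dense layer: one output per weight row.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def init_linear(rng, n_in, n_out):
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def graph_sum(h):
    # READOUT: element-wise sum over all nodes (permutation invariant).
    return [sum(column) for column in zip(*h.values())]


def policy_forward(X, neighbors, gnn_layers, mlp):
    h = {v: list(x) for v, x in X.items()}
    readouts = [graph_sum(h)]  # assumption: layer-0 features join the readout
    for W, b in gnn_layers:
        new_h = {}
        for v, h_v in h.items():
            agg = [0.0] * len(h_v)
            for u in neighbors[v]:  # AGGREGATE: sum over neighbor features
                agg = [a + x for a, x in zip(agg, h[u])]
            combined = [a + x for a, x in zip(agg, h_v)]  # COMBINE: sum
            new_h[v] = relu(linear(combined, W, b))  # assumed layer transform
        h = new_h
        readouts.append(graph_sum(h))
    h_G = [x for r in readouts for x in r]  # concatenate per-layer readouts
    (W1, b1), (W2, b2) = mlp
    return softmax(linear(relu(linear(h_G, W1, b1)), W2, b2))


rng = random.Random(0)
X = {0: [0.0, 0.5, 0.5, 30.0], 1: [3.0, 0.1, 0.9, 27.0], 2: [5.0, 0.8, 0.2, 22.0]}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
dim, hidden, n_layers, n_actions = 4, 8, 5, 6
gnn_layers = [init_linear(rng, dim if k == 0 else hidden, hidden) for k in range(n_layers)]
mlp = (init_linear(rng, dim + hidden * n_layers, 64), init_linear(rng, 64, n_actions))
probs = policy_forward(X, neighbors, gnn_layers, mlp)
```

Because every aggregation is a sum, the output does not depend on the order in which nodes are listed, and the same weights can be applied to graphs with any number of nodes.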
Graph neural networks are usually trained in a supervised manner, i.e. the network is given a data set with labels. For a specific solution, however, it is difficult to know in advance which action improves it most efficiently, so no ready-made labels exist for supervised learning. The invention therefore trains the network parameters of the whole algorithm with a reinforcement learning method, updating the network with the gradient update formula of the baseline-based gradient descent method:
$$\nabla_\theta J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\left[\left(L(\pi \mid s) - b(s)\right)\nabla_\theta \log p_\theta(\pi \mid s)\right] \tag{8}$$
In equation (8), $s$ is the current solution, $\pi$ is the current policy, $L(\pi \mid s)$ is the total path length of the new solution obtained from the current solution under the current policy, $b(s)$ is the baseline function, i.e. a base value computed from the current solution to help training, the whole term $(L(\pi \mid s) - b(s))$ is the benefit value (reward) after an action is selected for the current solution $s$ according to the policy, and $\log p_\theta(\pi \mid s)$ is the logarithm of the policy's action probability. The formula means that an action is selected for the current solution according to the current policy; if the policy generated the action with a small probability (between 0 and 1), the logarithm of that probability is a negative value of large magnitude. Thus, if this low-probability action instead yields a large benefit (reward), equation (8) produces a large gradient, because a large update is needed to change this situation.
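A single-sample version of the estimate in equation (8) can be written out for a softmax policy, as in the Python sketch below. Taking the gradient with respect to the softmax logits (rather than the full network parameters θ) is a simplification assumed here, and the function and argument names are hypothetical. The sketch uses the identity that the derivative of log p_a with respect to logit k is 1 − p_a for the chosen action and −p_k otherwise.

```python
def reinforce_gradient(probs, action, value, baseline):
    """Single-sample estimate (value - baseline) * grad log p(action).

    The gradient is taken with respect to the softmax logits, which is an
    assumption of this sketch.
    """
    advantage = value - baseline
    return [
        advantage * ((1.0 if k == action else 0.0) - p)
        for k, p in enumerate(probs)
    ]


# The same advantage applied to a likely and to an unlikely action.
grad_likely = reinforce_gradient([0.8, 0.1, 0.1], action=0, value=3.0, baseline=1.0)
grad_unlikely = reinforce_gradient([0.8, 0.1, 0.1], action=1, value=3.0, baseline=1.0)
```

The component belonging to the chosen action is larger for the unlikely action than for the likely one, which matches the remark above that a large return from a low-probability action leads to a large update.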
In the network model described above, the input state features, i.e. the problem features and the solution features, are fed into the network as the features of each node. The network has 5 layers; the node feature vectors are computed and updated layer by layer according to the neighborhood aggregation formula and the combination formula of each layer. After the last layer has been computed, the graph feature vector representing the current solution is obtained from the graph characterization formula and finally mapped to the action space to obtain the action probabilities. This is the workflow and principle of the whole policy network.
Step 3: the current problem instance and the current solution are input into the solution model trained in step 2; after the action probabilities are obtained, a lifting operator is selected according to them and applied to the current solution to obtain a new solution. Fig. 3 is a schematic diagram of the operation of the inter-path lifting operator, and Fig. 4 is a schematic diagram of the operation of the intra-path lifting operator. If the new solution is better than the original solution, it replaces the original solution and iterative lifting continues from it; otherwise iterative lifting continues from the original solution.
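The acceptance rule of step 3, together with the counter that the metacontroller reads, can be sketched as follows. The function and argument names are hypothetical, and treating a new solution of equal length as "not lifted" is an assumption, since the description only covers the strictly shorter and strictly longer cases.

```python
def accept_if_better(current, current_len, candidate, candidate_len, rounds_without_lift):
    """Keep the shorter solution and update the no-lift counter."""
    if candidate_len < current_len:
        return candidate, candidate_len, 0  # lifted: reset the counter
    # Assumption: an equal-length candidate also counts as a round without lifting.
    return current, current_len, rounds_without_lift + 1


state_after_lift = accept_if_better([[1, 2]], 5.0, [[2, 1]], 4.5, rounds_without_lift=3)
state_after_miss = accept_if_better([[1, 2]], 5.0, [[2, 1]], 5.5, rounds_without_lift=3)
```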
Step 4: the metacontroller selects the disturbance controller when the solution has reached a local optimum (in the invention, when L rounds pass without the solution being lifted). The disturbance controller randomly selects a disturbance operator (perturbation operator) to shuffle and reconstruct a feasible solution, after which iterative lifting resumes to search for an optimal solution. Table 3 lists a subset of the disturbance operators.
TABLE 3
The invention sets a dynamic threshold mechanism to ensure that the new solution generated by the disturbance is not much worse than the current solution or the current best solution.
Under the dynamic threshold mechanism, the total path length of the new solution after disturbance is limited to at most the total path length of the best solution plus 0.05. If the disturbance cannot find a new solution satisfying this condition within 50 steps, the threshold is increased by 0.1, the step count is reset to zero, and the procedure is repeated until a solution satisfying the condition is found. For problems of different node scales, the convergence speed and the converged result can be tuned by adjusting the initial value and the increment of the threshold.
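One possible reading of the dynamic threshold mechanism is sketched below in Python. The function signature, the callback arguments, and the choice to disturb the same current solution on every attempt are assumptions; the constants 0.05, 0.1, and 50 are taken from the description.

```python
def disturb_with_threshold(current, best_len, disturb_fn, length_fn,
                           threshold=0.05, increment=0.1, patience=50):
    """Disturb until the new solution is at most best_len + threshold long.

    disturb_fn and length_fn are hypothetical callbacks supplied by the caller.
    Returns the accepted solution and the threshold that finally admitted it.
    """
    failed_steps = 0
    while True:
        candidate = disturb_fn(current)  # assumption: always restart from current
        if length_fn(candidate) <= best_len + threshold:
            return candidate, threshold
        failed_steps += 1
        if failed_steps >= patience:
            threshold += increment  # relax the limit after 50 failed steps
            failed_steps = 0


calls = []


def fake_disturb(solution):
    # Toy disturbance: always yields a "solution" whose length is 10.3.
    calls.append(1)
    return 10.3


accepted, final_threshold = disturb_with_threshold(10.0, 10.0, fake_disturb, lambda s: s)
```

In this toy run the candidate is 0.3 longer than the best length, so it is rejected until the threshold has been relaxed three times.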
Step 5: when the total number of steps (lifting operations plus disturbance operations) reaches T (the invention uses T = 40000), the current round of search for the optimal solution of the problem ends. To balance exploration and exploitation, the invention adopts the ε-greedy method: with 5% probability the lifting controller (policy network) selects a random action; otherwise the action is selected according to the action probabilities output by the network for the solution.
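The ε-greedy selection described in step 5 can be illustrated with the Python standard library. The function name and the `rng` argument are hypothetical; sampling from the network's action probabilities in the non-random branch follows the description above.

```python
import random


def select_action(probs, epsilon=0.05, rng=random):
    """Pick an action index: uniformly at random with probability epsilon,
    otherwise sampled according to the policy's action probabilities."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
greedy_pick = select_action([0.0, 0.0, 1.0], epsilon=0.0, rng=rng)
random_pick = select_action([0.0, 0.0, 1.0], epsilon=1.0, rng=rng)
```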
The invention uses an ensemble strategy as the final strategy of the whole method: 6 different policies are trained with different history lengths (H = 1, 2, 3, 4, 5, 6) in the state feature vector (all other parts of the network being identical), and at each point in time during operation the best-quality solution among the 6 policies, i.e. the one with the smallest total path length, is taken as the final ensemble solution of the algorithm.
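The ensemble step amounts to taking a minimum over the per-policy results, as the following sketch shows. The function name, the dictionary layout, and the lengths in the example are hypothetical placeholders, not experimental values.

```python
def ensemble_solution(best_by_policy):
    """best_by_policy maps a history length H to (total_path_length, solution).

    Returns the H whose policy found the shortest solution, with that entry.
    """
    best_h = min(best_by_policy, key=lambda h: best_by_policy[h][0])
    return best_h, best_by_policy[best_h]


# Illustrative numbers only.
best_h, (best_len, best_routes) = ensemble_solution({
    1: (6.4, [[1, 2], [3]]),
    2: (6.1, [[1, 3], [2]]),
    3: (6.3, [[2, 1], [3]]),
})
```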
The invention is evaluated on 100 CVRP20 problem instances. Each CVRP20 problem is defined as follows:
For each instance, 20 nodes are randomly generated; the first node is the depot (warehouse) and the remaining nodes are customer nodes. Each customer node is assigned a randomly generated demand in [1, 9], the demand of the depot is 0, and the vehicle capacity is 30. The location $(x_i, y_i)$ of each node (including the depot) is sampled uniformly from the unit square (i.e. $x_i$ and $y_i$ are each drawn uniformly from $[0, 1]$), and the travel cost $c_{i,j}$ between two nodes is simply the Euclidean distance between them.
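The instance definition above can be sketched as a small generator. The function name `make_cvrp_instance` and the dictionary layout are hypothetical, and integer demands are an assumption, since the text only gives the range [1, 9].

```python
import math
import random


def make_cvrp_instance(rng, num_nodes=20, capacity=30):
    """Generate one instance: node 0 is the depot, the rest are customers."""
    # Coordinates sampled uniformly from the unit square.
    coords = [(rng.random(), rng.random()) for _ in range(num_nodes)]
    # Depot demand is 0; customer demands are integers in [1, 9] (assumption).
    demands = [0] + [rng.randint(1, 9) for _ in range(num_nodes - 1)]
    # Travel cost is the Euclidean distance between two nodes.
    cost = [[math.dist(a, b) for b in coords] for a in coords]
    return {"coords": coords, "demands": demands,
            "capacity": capacity, "cost": cost}


instance = make_cvrp_instance(random.Random(0))
```

Because every demand is at most 9 and the capacity is 30, any single customer always fits in an empty vehicle, so a feasible solution always exists.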
For a problem instance, a strategy first generates a random feasible solution, then iteratively lifts the solution for T = 40000 rounds according to the strategy (strategy performance for different numbers of steps is shown in detail later), and finally selects the best of the 40000 solutions found as the algorithm's final solution for that instance. For the integrated strategy, a set of different strategies is trained using actions and effects of different history lengths, and for each problem instance the best solution among these strategies is chosen as the final integrated solution. All experimental results of the invention are averages over 100 randomly sampled problem instances.
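The search procedure for a single strategy can be sketched as below. This is an illustrative skeleton only: a random segment reversal stands in for the operator chosen by the trained strategy network, the disturbance simply rebuilds a random feasible solution, the threshold check on disturbed solutions is omitted, and `total_steps=2000` and `patience=10` (the role of L) are arbitrary small values chosen so the example runs quickly. All function names are hypothetical.

```python
import math
import random


def route_length(route, coords):
    """Length of one depot -> customers -> depot tour (node 0 is the depot)."""
    path = [0] + route + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))


def total_length(solution, coords):
    return sum(route_length(r, coords) for r in solution)


def random_feasible_solution(demands, capacity, rng):
    """Shuffle customers; open a new route when capacity would be exceeded."""
    customers = list(range(1, len(demands)))
    rng.shuffle(customers)
    solution, route, load = [], [], 0
    for c in customers:
        if load + demands[c] > capacity:
            solution.append(route)
            route, load = [], 0
        route.append(c)
        load += demands[c]
    solution.append(route)
    return solution


def random_segment_reversal(solution, rng):
    """Stand-in lifting operator: reverse a random segment of one route."""
    new = [list(r) for r in solution]
    route = rng.choice(new)
    if len(route) >= 2:
        i, j = sorted(rng.sample(range(len(route)), 2))
        route[i:j + 1] = reversed(route[i:j + 1])
    return new


def search(coords, demands, capacity, rng, total_steps=2000, patience=10):
    current = random_feasible_solution(demands, capacity, rng)
    best, best_len = current, total_length(current, coords)
    stale = 0  # rounds since the current solution was last lifted
    for _ in range(total_steps):
        if stale >= patience:
            # Meta-controller picks the disturbance controller.
            current = random_feasible_solution(demands, capacity, rng)
            stale = 0
        else:
            # Meta-controller picks the lifting controller.
            candidate = random_segment_reversal(current, rng)
            if total_length(candidate, coords) < total_length(current, coords):
                current, stale = candidate, 0
            else:
                stale += 1
        current_len = total_length(current, coords)
        if current_len < best_len:
            best, best_len = current, current_len
    return best, best_len


rng = random.Random(1)
coords = [(rng.random(), rng.random()) for _ in range(20)]
demands = [0] + [rng.randint(1, 9) for _ in range(19)]
start = random_feasible_solution(demands, 30, random.Random(2))
best, best_len = search(coords, demands, 30, random.Random(2))
```

Tracking `best` separately from `current` matters because a disturbance may replace the current solution with a worse one; the final answer is the best solution visited, not the last one.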
As described above, the invention trains 6 different strategies based on 6 different lengths of historical actions and effects ($H \in \{1, 2, 3, 4, 5, 6\}$). Ten problem instances are randomly selected below to show the total path length of the final solution of each of the 6 strategies, and the integrated strategy combines the strengths of the individual strategies to obtain a final solution close to the optimal solution. Table 4 shows the results of the embodiment's strategies on these 10 CVRP problem instances. Both the single strategies and the integrated strategy converge quickly and reach good final solutions.
TABLE 4
Claims (1)
1. The logistics scheduling planning method based on the graph neural network and reinforcement learning is characterized by comprising the following steps of:
step 1: constructing a complete solution of a vehicle path planning problem instance;
step 2: transmitting the complete feasible solution to a meta-controller; if the current solution has not been lifted for L rounds, selecting a disturbance controller; otherwise, selecting a lifting controller;
after the lifting controller is selected, selecting an optimal lifting operator for the problem instance and the feasible solution based on the graph neural network trained by reinforcement learning, wherein all lifting operator sets form an action space of the lifting controller; training the graph neural network in the action space specifically comprises the following operations:
the network is trained and updated using the gradient update formula of the classical baseline-based gradient descent method, with the expression as follows:
$$\nabla_\theta J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\left[\big(L(\pi \mid s) - b(s)\big)\, \nabla_\theta \log p_\theta(\pi \mid s)\right]$$
where $s$ is the current solution, $\pi$ is the current strategy (policy), $L(\pi \mid s)$ is the total path length of the new solution obtained from the current solution and the current strategy, $b(s)$ is the baseline function, a baseline value obtained from the current solution to aid training, the whole term $(L(\pi \mid s) - b(s))$ is the return (reward) after the current solution $s$ selects an action according to the strategy, and $\log p_\theta(\pi \mid s)$ is the log of the policy action probability;
establishing a solution model comprising state design, action design, rewarding value design and strategy network design; the specific process is as follows:
(1) The state is composed of the demand of the current node, the position coordinates of the current node, the vehicle capacity remaining when the current node is visited in the path, the historical actions taken in the previous h steps, and the effects of those historical actions;
the expression of the state is as follows:
$$X_v = [c_i, (x_i, y_i), C_i, a_{t-h}, e_{t-h}]$$
where $a_{t-h}$ is the action taken h steps before the current step t, $e_{t-h}$ is the effect of that action, $(x_i, y_i)$ are the position coordinates of the current node i, $c_i$ is the demand of the current node, and $C_i$ is the vehicle capacity remaining when the current node is visited in the path;
(2) The action space consists of the set of lifting operators, comprising intra-path operators and inter-path operators, wherein an intra-path operator attempts to reduce the total path length of a single path, and an inter-path operator attempts to reduce the total path length of the solution and to jump out of local optima by adjusting nodes across a plurality of paths;
(3) The reward value is the return $R^{(n)}$ obtained by the actions taken in one period, and the expression is as follows:
$$R^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-2} r_{t+n-1} + \gamma^{n-1} Q(S_{t+n}, a_{t+n})$$
where $r_{t+1}$ is the reward of step t+1, $\gamma$ is the decay factor, n is the number of look-ahead steps, and $Q(S_{t+n}, a_{t+n})$ is the Q value of the state-action pair at step t+n;
(4) The strategy network design comprises the step of generating probability distribution on action space according to the input state by using a graph neural network, and specifically comprises the following operations:
the k-th layer of the graph neural network is calculated by:
$$a_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{h_u^{(k-1)} : u \in N(v)\}\big), \qquad h_v^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_v^{(k-1)}, a_v^{(k)}\big)$$
where $h_v^{(k)}$ is the feature vector of node v at the k-th layer, $h_v^{(k-1)}$ is the feature vector of node v at layer k-1, $h_v^{(0)}$ is initialized to $X_v$, i.e. the initial feature vector of the node, $N(v)$ is the set of neighbor nodes of v, and $a_v^{(k)}$ is the high-dimensional problem-instance information extracted by the k-th layer of the graph neural network; the design of the AGGREGATE and COMBINE functions of the graph neural network is as follows:
aggregating the node features after the last layer of iteration using a READOUT function to obtain the graph characterization information $h_G$ of the whole graph:
the calculation formula of the graph characterization information $h_G$ is as follows:
after the graph feature vector is obtained, it is mapped to the action space by an MLP function, an action probability vector is obtained through a softmax layer, and an action is selected according to the action probabilities to lift the feasible solution, with the expression as follows:
$$p_\theta(\pi \mid S) = \mathrm{SOFTMAX}(\mathrm{MLP}(h_G))$$
where $p_\theta$ is the distribution of this predictive model;
a probability vector is generated for the input state, the probability vector being the action probabilities generated for a given input solution, and a lifting operator is then selected according to the generated action probabilities to try to lift the current solution;
step 3: inputting the current problem instance and the current solution into the solution model trained in step 2; after the action probabilities are obtained, selecting a lifting operator according to the action probabilities to lift the current solution and obtain a new solution; if the new solution is better than the original solution, replacing the original solution with the new solution and continuing iterative lifting from it; otherwise, continuing iterative lifting from the original solution in the next iteration;
step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to disturb and reconstruct a feasible solution, and iterative lifting then continues in the next iteration to find an optimal solution;
step 5: when the total step number reaches T, ending the search of the current problem optimal solution of the round; the solution with the smallest total path length in all the feasible solutions accessed in the lifting and disturbance processes is selected as the optimal solution and the final output solution of the whole algorithm.
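The strategy network recited in the claim above (graph neural network layers, READOUT, MLP, softmax) can be illustrated with a small forward-pass sketch. The concrete choices here are assumptions and not the claimed design: AGGREGATE is a sum over neighbors, COMBINE is a linear map with ReLU applied to the node feature plus the aggregate, READOUT is a sum over nodes, and the MLP is a single linear layer. The weights and node features are random placeholders.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gnn_layer(H, neighbors, W):
    """One layer: sum-AGGREGATE over neighbors, then COMBINE (linear + ReLU)."""
    out = []
    for v, h_v in enumerate(H):
        a_v = [sum(H[u][d] for u in neighbors[v]) for d in range(len(h_v))]
        combined = matvec(W, [x + y for x, y in zip(h_v, a_v)])
        out.append([max(0.0, z) for z in combined])
    return out


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [x / total for x in e]


def policy(X, neighbors, layer_weights, W_out):
    """Map node features X to a probability vector over lifting operators."""
    H = X
    for W in layer_weights:
        H = gnn_layer(H, neighbors, W)
    h_G = [sum(h[d] for h in H) for d in range(len(H[0]))]  # READOUT (sum)
    return softmax(matvec(W_out, h_G))


# Placeholder sizes: 5 nodes, 4 features per node, 3 lifting operators.
rng = random.Random(0)
dim, num_actions, num_nodes = 4, 3, 5
X = [[rng.random() for _ in range(dim)] for _ in range(num_nodes)]
neighbors = [[u for u in range(num_nodes) if u != v] for v in range(num_nodes)]
layer_weights = [[[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                  for _ in range(dim)] for _ in range(2)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
         for _ in range(num_actions)]
probs = policy(X, neighbors, layer_weights, W_out)
```

The output `probs` plays the role of the action probability vector from which a lifting operator is selected.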
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

CN202110958524.1A CN113850414B (en)  2021-08-20  2021-08-20  Logistics scheduling planning method based on graph neural network and reinforcement learning 
Publications (2)
Publication Number  Publication Date 

CN113850414A CN113850414A (en)  2021-12-28 
CN113850414B CN113850414B (en)  2023-08-04 
Family
ID=78975656
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

CN202110958524.1A Active CN113850414B (en)  2021-08-20  2021-08-20  Logistics scheduling planning method based on graph neural network and reinforcement learning 
Country Status (1)
Country  Link 

CN (1)  CN113850414B (en) 
Families Citing this family (1)
Publication number  Priority date  Publication date  Assignee  Title 

CN116187611B (en) *  2023-04-25  2023-07-25  南方科技大学  Multi-agent path planning method and terminal 
Citations (2)
Publication number  Priority date  Publication date  Assignee  Title 

CN111797992A (en) *  2020-05-25  2020-10-20  华为技术有限公司  Machine learning optimization method and device 
CN113159432A (en) *  2021-04-28  2021-07-23  杭州电子科技大学  Multi-agent path planning method based on deep reinforcement learning 
Family Cites Families (1)
Publication number  Priority date  Publication date  Assignee  Title 

US11443346B2 (en) *  2019-10-14  2022-09-13  Visa International Service Association  Group item recommendations for ephemeral groups based on mutual information maximization 

Non-Patent Citations (1)
Title 

A Multi-Graph Attributed Reinforcement Learning based Optimization Algorithm for Large-scale Hybrid Flow Shop Scheduling Problem; Fei Ni et al.; Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; 2021-08-14; full text * 
Also Published As
Publication number  Publication date 

CN113850414A (en)  2021-12-28 
Legal Events
Date  Code  Title  Description 

PB01  Publication  
SE01  Entry into force of request for substantive examination  
GR01  Patent grant  