CN113850414B - Logistics scheduling planning method based on graph neural network and reinforcement learning - Google Patents

Logistics scheduling planning method based on graph neural network and reinforcement learning

Info

Publication number
CN113850414B
CN113850414B CN202110958524.1A
Authority
CN
China
Prior art keywords
solution
lifting
current
action
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110958524.1A
Other languages
Chinese (zh)
Other versions
CN113850414A (en)
Inventor
马亿
李峙钢
郝建业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110958524.1A priority Critical patent/CN113850414B/en
Publication of CN113850414A publication Critical patent/CN113850414A/en
Application granted granted Critical
Publication of CN113850414B publication Critical patent/CN113850414B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047Optimisation of routes or paths, e.g. travelling salesman problem
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Educational Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a logistics scheduling planning method based on a graph neural network and reinforcement learning, comprising the following steps. Step 1: construct a complete solution of a vehicle path planning problem instance. Step 2: the meta-controller selects either the disturbance controller or the lifting controller; after the lifting controller is selected, the set of lifting operators forms the action space of the lifting controller, and a graph neural network is trained over this action space. Step 3: perform solution lifting. Step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to disturb and reconstruct a feasible solution, after which iterative lifting continues to find the optimal solution. Step 5: among all feasible solutions visited during the lifting and disturbance processes, the one with the smallest total path length is selected as the optimal solution and the final output of the whole algorithm. Compared with the prior art, the method can efficiently search for better solutions to a given problem and has practical significance for planning problems such as logistics and order allocation.

Description

Logistics scheduling planning method based on graph neural network and reinforcement learning
Technical Field
The invention relates to graph neural network and reinforcement learning technology, and in particular to a method that combines a graph neural network with a policy-gradient reinforcement learning algorithm to control the selection of heuristic operators.
Background
NP-hard combinatorial optimization problems are integer-constrained optimization problems that are intractable at large scale. Robust approximation algorithms for NP-hard combinatorial optimization have many practical applications and underpin modern industries such as transportation, supply chains, energy, finance and scheduling. A typical example is the Traveling Salesman Problem (TSP): given a graph, the goal is to search the permutation space for the node sequence that visits every node exactly once with the smallest total edge weight (tour length). The TSP and its variants have numerous applications in planning, manufacturing, genetics and other fields.
Although most successful machine learning techniques belong to supervised learning, i.e. learning a mapping from training inputs to outputs, supervised learning is not applicable to most combinatorial optimization problems because optimal labels cannot be obtained. Meanwhile, when solving path planning problems, conventional reinforcement learning methods are limited to training on and solving problems of a fixed node scale, because the scale of the network parameters is fixed.
Disclosure of Invention
In order to overcome the problems of the prior art, the invention provides a logistics scheduling planning method based on graph neural network and reinforcement learning technology. For an input problem instance, it combines a graph neural network, a policy-gradient reinforcement learning algorithm and heuristic operators to search effectively for better solutions to the logistics scheduling planning problem.
The technical scheme of the invention is as follows:
a logistics scheduling planning method based on a graph neural network and reinforcement learning comprises the following steps:
step 1: building a complete solution to an instance of a vehicle path planning problem
Step 2: transmitting the complete feasible solution to the meta-controller; if the current solution has not been lifted for L rounds, selecting the disturbance controller; otherwise, selecting the lifting controller;
after the lifting controller is selected, an optimal lifting operator is selected for the problem instance and the feasible solution by the graph neural network trained with reinforcement learning, where the set of all lifting operators forms the action space of the lifting controller; training the graph neural network over this action space specifically comprises the following operations:
the network is trained and updated by using a gradient update formula of a classical baseline-based gradient descent method, and the expression is as follows:
where s is the current solution, pi is the current policy (policy), L (pi|s) is the way of the new solution from the current solution and the current policyThe total length of the process, b(s), is a baseline function, the current solution is used to obtain a basic value function to help training, and the whole (L (pi|s) -b (s)) is a return value (reward) after the current solution s is selected to act according to a strategy, and the log θ (pi|s) is the log of the policy action probability;
establishing a solution model comprising state design, action design, reward value design and policy network design;
generating a probability vector for the input state, where the probability vector gives the action probabilities for the given input solution, and then selecting a lifting operator according to the generated action probabilities to try to lift the current solution;
Step 3: inputting the current problem instance and the current solution into the solution model trained in step 2; after the action probabilities are obtained, selecting a lifting operator according to them to lift the current solution and obtain a new solution; if the new solution is better than the original solution, updating the original solution with the new solution and continuing iterative lifting; otherwise keeping iteratively lifting the original solution and proceeding to the next iteration;
Step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to disturb and reconstruct a feasible solution, after which iterative lifting continues in the next iteration to find the optimal solution;
Step 5: when the total number of steps reaches T, the search for the optimal solution of the current problem in this round ends; among all feasible solutions visited during the lifting and disturbance processes, the one with the smallest total path length is selected as the optimal solution and the final output of the whole algorithm.
The establishment of the solution model, comprising state design, action design, reward value design and policy network design, specifically proceeds as follows:
(1) The state consists of the demand of the current node, the position coordinates of the current node, the vehicle capacity remaining when the current node is visited on its path, the history action taken h steps earlier and the effect of that action;
the expression of the state is:
X_v = [c_i, (x_i, y_i), C_i, a_{t-h}, e_{t-h}]
where a_{t-h} is the action taken h steps before the current step t, e_{t-h} is the effect of that action, (x_i, y_i) are the position coordinates of the current node i, c_i is the demand of the current node, and C_i is the vehicle capacity remaining when the current node is visited on its path;
(2) The action space consists of a lifting operator set, and comprises an intra-path operator and an inter-path operator, wherein the intra-path operator is used for attempting to reduce the total path length of a single path, and the inter-path operator is used for attempting to reduce the total path length of a solution and jumping out of local optimum by adjusting nodes in a plurality of paths;
(3) The reward value is the return R^(n) obtained by the actions taken in one period:
R^(n) = r_{t+1} + r_{t+2} + ... + r_{t+n-1} + Q(s_{t+n}, a_{t+n})
where r_{t+1} is the reward at step t+1, γ is the decay factor, n is the look-ahead step count, and Q(s_{t+n}, a_{t+n}) is the Q value of the state-action pair at step t+n;
(4) The policy network design uses a graph neural network to generate a probability distribution over the action space from the input state, specifically as follows:
the k-th layer of the graph neural network is calculated by:
h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)({ h_u^(k-1) : u ∈ N(v) }) )
where h_v^(k) is the feature vector of node v at the k-th layer, h_v^(k-1) is the feature vector of node v at layer k-1, h_v^(0) is initialized to X_v, i.e. the initial feature vector of the node, N(v) is the set of neighbour nodes of v, and h_v^(k) carries the high-dimensional problem-instance information extracted by the k-th layer of the graph neural network; the AGGREGATE and COMBINE functions are both designed as summation functions;
the node features after the last layer of iteration are aggregated with a READOUT function to obtain the graph representation h_G of the whole graph:
h_G = CONCAT( READOUT({ h_v^(k) : v ∈ G }) | k = 0, 1, ..., K )
after the graph feature vector is obtained, it is mapped to the action space with an MLP, an action probability vector is obtained with a softmax layer, and an action is selected according to the action probabilities to lift the feasible solution:
p_θ(π|s) = SOFTMAX(MLP(h_G))
where p_θ is the distribution of this prediction model.
If the lifting operator is applied to the current solution and the resulting new solution is better than the original solution, the new solution replaces the current solution for the next iteration, specifically as follows:
after the lifting operator is applied to the current solution, a new solution is obtained; if the total path length of the new solution is smaller than that of the original solution, the current solution is updated to the new solution and the count of non-lifting rounds is reset to 0; if the total distance of the new solution is larger than that of the original solution, the original solution continues to be iteratively lifted and the count of non-lifting rounds is increased by one, so that the number of non-lifting rounds influences the selection made by the meta-controller.
If the disturbance operator is applied to the current solution, a new solution is obtained and replaces the current solution; subsequent iterative lifting is then performed, the count of non-lifting rounds is reset to 0, and the search jumps out of the local optimum.
Compared with the prior art, the invention has the following beneficial effects:
1) The method can select a well-suited operator for the specific problem instance and the current solution so that the current solution is lifted effectively; it can therefore efficiently search for better solutions to a given problem and has practical significance for planning problems such as logistics and order allocation;
2) By means of graph neural network technology, the model generalization problem is handled well: a trained model remains effective on problems of different node scales, which has strong practical value;
3) Training the model with reinforcement learning avoids the labelled training data set required by supervised learning, eliminating the cost of collecting such data and effectively reducing production cost.
Drawings
FIG. 1 is a schematic overall flow diagram of a logistics scheduling planning method based on a graph neural network and reinforcement learning technology;
FIG. 2 is a schematic diagram of the operation of the meta-controller;
FIG. 3 is a schematic diagram of the operation of the inter-path lifting operator;
FIG. 4 is a schematic diagram of the operation of the in-path lifting operator.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Fig. 1 is a schematic overall flow chart of the logistics scheduling planning method based on graph neural network and reinforcement learning technology. The overall flow of the invention is detailed as follows:
step 1: the rest non-access nodes are sorted in ascending order according to the distance from the last vehicle node adding, if the distances are equal, the ascending order is arranged according to the node requirement; selecting a node which is ranked as a first node, namely adding the node, which causes the least increase of distance and the least demand, generating a solution for the vehicle path planning problem example by using a greedy algorithm, and adding the solution into the current feasible solution set; performing the process iteration until all nodes are added into the solution set to form a complete solution;
step 2: the complete current solution is passed to the meta-controller, which is selected to be the meta-controller if there is no lifting of the current solution through the L (the present invention uses l=6) round, otherwise the lifting controller is selected. As shown in fig. 2, a schematic diagram of the operation process of the meta-controller is shown.
If the lifting controller is selected, an optimal lifting operator is chosen by the graph neural network trained with reinforcement learning and applied to the current solution; if the resulting new solution is better than the original solution, it replaces the current solution and the next iteration is performed. The lifting controller starts from an initial solution (the greedy solution in the first iteration period, and the perturbed solution in subsequent iteration periods) and then reduces the total path cost of the solution by selecting suitable lifting operators. The set of all lifting operators constitutes the action space of the lifting controller. Exploiting the strong ability of graph neural networks to extract and classify graph features, the graph neural network is trained to generate, for an input state, a probability vector over the action space, i.e. the action probabilities for the given input solution. Finally, a lifting operator is selected according to the generated action probabilities to try to lift the current solution.
The solution model (comprising the state, action, reward value and policy network designs) is built and trained as follows:
1. state design
The state model is made up of the problem instance, the current solution and the running history. Static features, i.e. features of the problem, stay constant across different solutions of the same problem, such as the demand of the current node and its position coordinates. Features of the solution change with the current solution, such as the vehicle capacity remaining after the node is visited along the path of the current solution. The running history comprises the actions previously taken and their effects: a_{t-h} is the action taken h steps before the current step t, where 1 ≤ h ≤ H (H is the history length, and different h correspond to different policies), and e_{t-h} is the effect of that action, equal to 1 if the action reduced the total distance and -1 otherwise. Table 1 lists the complete state features.
TABLE 1
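A sketch of assembling the per-node state vector X_v = [c_i, (x_i, y_i), C_i, a_{t-h}, e_{t-h}] described above; the encoding of the history action as an integer id and the helper name are assumptions for illustration:

```python
def node_state_vector(i, coords, demands, remaining_capacity, history, h=1):
    """X_v = [c_i, x_i, y_i, C_i, a_{t-h}, e_{t-h}] for node i.

    remaining_capacity[i]: vehicle capacity left when node i is visited in the
    current solution; history: list of (action_id, effect) pairs, where effect
    is +1 if the action reduced the total distance and -1 otherwise."""
    a_prev, e_prev = history[-h] if len(history) >= h else (0, 0)
    return [demands[i], coords[i][0], coords[i][1],
            remaining_capacity[i], a_prev, e_prev]
```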
2. Motion design
The action space is composed of a set of lifting operators, including intra-path operators that attempt to reduce the total path length of a single path (route), and inter-path operators that attempt to reduce the total path length of the solution by adjusting nodes in multiple paths. Table 2 lists a subset of the lifting operators.
TABLE 2
The same lifting operator with different parameters is treated as different actions. For example, the operator Relocate, for m = 1, 2, 3, should be considered as 3 distinct actions. The lengths m and n refer to numbers of nodes; e.g. a node segment with m = 3 contains three consecutive nodes of a path. Different actions can affect different solutions to different degrees, so selecting the lifting operator with the best lifting effect for each solution to be lifted is a major challenge.
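Since the same operator with different length parameters counts as a separate action, the action space can be enumerated as (operator, parameter) pairs. The operator names below are placeholders standing in for the operators of Table 2, which is not reproduced here:

```python
from itertools import product

INTRA_ROUTE_OPS = ["two_opt", "relocate"]        # placeholder names for Table 2 operators
INTER_ROUTE_OPS = ["cross_exchange", "swap"]
SEGMENT_LENGTHS = [1, 2, 3]                      # m = 1, 2, 3 count as distinct actions

ACTION_SPACE = [(op, m) for op, m in product(INTRA_ROUTE_OPS + INTER_ROUTE_OPS,
                                             SEGMENT_LENGTHS)]
# e.g. ("relocate", 2) would move a segment of two consecutive nodes elsewhere
```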
3. Prize value design
The total path length of the best solution found in the first lifting period (i.e. the whole process from one initial solution being iteratively lifted until L = 6 rounds pass without improvement) is taken as the reference value for the current problem instance. In each subsequent lifting period, every lifting action taken receives a return equal to the difference between the total path length of the best solution obtained in that lifting period and the reference value. This is a baseline-based return design. In classical reinforcement learning, the return R^(n) obtained by the actions taken in one period under the n-step TD error is:
R^(n) = r_{t+1} + γ·r_{t+2} + ... + γ^{n-2}·r_{t+n-1} + γ^{n-1}·Q(s_{t+n}, a_{t+n})   (1)
In formula (1), r_{t+1} is the reward at step t+1, γ is the decay factor, n is the look-ahead step count, and Q(s_{t+n}, a_{t+n}) is the Q value of the state-action pair at step t+n; the return combines near-term actions with long-term benefit for training.
However, it was found during experiments that the first action taken after one lifting period normally receives a larger return. This is because the new solution obtained by applying the perturbation operation after one lifting period becomes the initial solution of the next period and is usually of poor quality (its total path length is high), so the first action applied to this poor initial solution achieves a large drop in total path length, while the magnitude and frequency of each subsequent lift gradually decrease over the period (the solution becomes harder and harder to lift). If the return were computed with the decay factor as above, it would therefore be unfair to the later actions that contribute to producing the locally optimal solution. The invention sets the decay factor γ = 1, i.e. every action that produces a lifting effect within one period obtains the same return, and the return obtained by the actions taken in one period becomes:
R^(n) = r_{t+1} + r_{t+2} + ... + r_{t+n-1} + Q(s_{t+n}, a_{t+n}).
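With the decay factor fixed to γ = 1, the n-step return collapses to a plain sum. A minimal sketch, where `q_tail` stands for the bootstrap term Q(s_{t+n}, a_{t+n}) and the reward list is an assumed input format:

```python
def n_step_return(rewards, q_tail, gamma=1.0):
    """R^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-2)*r_{t+n-1} + gamma^(n-1)*q_tail.

    rewards = [r_{t+1}, ..., r_{t+n-1}]. With gamma = 1 (as chosen in the text),
    every improving action in an iteration period receives the same return."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r            # gamma^0 * r_{t+1}, ..., gamma^(n-2) * r_{t+n-1}
    ret += (gamma ** len(rewards)) * q_tail  # gamma^(n-1) * Q(s_{t+n}, a_{t+n})
    return ret
```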
4. strategic network design
The graph neural network generates a probability distribution over the action space from the input state, and an action is then selected according to the action probabilities; this is the policy decision process. The principle of the graph neural network is described in detail below:
problem instance high-dimensional information extracted by the kth layer of the graph neural network is calculated by
In the formulas (2) and (3),is the eigenvector of node v at the kth layer and will +.>Initialized to X v I.e., the initial feature vector for which the node is set, N (v) is the set of contiguous nodes of v. Wherein the design of the graph neural network pair AGGREGATE and COMBINE functions is as follows:
AGGREGATE and COMBINE functions are summation functions, and the summation function with the best effect is selected as an aggregation function after experiments.
The graph neural network aggregates the node features after the last layer of iteration with a READOUT function to obtain the graph representation h_G of the whole graph:
h_G = READOUT({ h_v^(K) : v ∈ G })   (5)
In equation (5), the READOUT function may be some simple permutation-invariant function (i.e., the output is not changed by the order of the inputs), such as a summation function and an averaging function.
The graph representation used by the graph neural network is calculated as:
h_G = CONCAT( READOUT({ h_v^(k) : v ∈ G }) | k = 0, 1, ..., K )   (6)
To preserve structural information as much as possible and to distinguish and classify individual graphs, the graph neural network applies the READOUT function to the node features of each layer to obtain a per-layer graph feature vector, and then concatenates the per-layer graph feature vectors to obtain the final graph feature vector. A summation function, which performed best in experiments, is chosen as READOUT.
After the graph feature vector is obtained, the invention maps it to the action space with an MLP (set to two layers with a hidden dimension of 64), obtains an action probability vector with a softmax layer, and selects an action according to the action probabilities to lift the solution:
p_θ(π|s) = SOFTMAX(MLP(h_G))   (7)
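A compact PyTorch sketch of a policy network in the spirit of formulas (2)-(7): sum aggregation over neighbours, a per-layer combine, a per-layer sum readout concatenated into h_G, then a two-layer MLP with hidden size 64 and a softmax over the actions. The 5 layers, hidden size 64 and sum readout follow the text; the adjacency-matrix aggregation, ReLU combine and all names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GNNPolicy(nn.Module):
    def __init__(self, feat_dim, hidden_dim, n_actions, n_layers=5):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(feat_dim if k == 0 else hidden_dim, hidden_dim)
             for k in range(n_layers)])
        # MLP head: hidden layer of size 64, then action logits (equation (7))
        self.head = nn.Sequential(
            nn.Linear(feat_dim + n_layers * hidden_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, x, adj):
        """x: [n_nodes, feat_dim] node features X_v; adj: [n_nodes, n_nodes] adjacency."""
        readouts = [x.sum(dim=0)]                     # READOUT of layer 0 (sum over nodes)
        h = x
        for layer in self.layers:
            agg = adj @ h                             # AGGREGATE: sum over neighbours N(v)
            h = torch.relu(layer(h + agg))            # COMBINE of h_v^(k-1) and the aggregate
            readouts.append(h.sum(dim=0))             # per-layer graph readout
        h_G = torch.cat(readouts)                     # concatenate layer readouts -> h_G
        return torch.softmax(self.head(h_G), dim=-1)  # p_theta(pi | s), equation (7)
```

For example, `GNNPolicy(feat_dim=6, hidden_dim=32, n_actions=12)` would map the six state features of Table 1 to a distribution over twelve candidate lifting actions (both sizes are illustrative).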
Graph neural networks are normally trained with supervision, i.e. the network is given a data set and labels to train on; but for a specific solution it is difficult to know in advance which action would improve it most efficiently, that is, no labels exist for supervised learning. The invention therefore trains the network parameters of the whole algorithm with reinforcement learning, using the gradient update formula of the classical baseline-based gradient descent method:
∇_θ = (L(π|s) - b(s)) · ∇_θ log p_θ(π|s)   (8)
in equation (8), s is the current solution, pi is the current strategy, L (pi|s) is the total length of the path of the new solution obtained from the current solution and the current strategy, b(s) is the baseline function, the base value function is obtained from the current solution to help training, and the whole (L (pi|s) -b (s)) is the benefit value (reward) after the current solution s selects the action according to the strategy θ (pi|s) is the log of the policy action probability. The formula means that an action is selected by the current solution according to the current policy, and if the probability of the action is small (0-1) when it is generated by the policy, the log value of the probability is a negative larger value. Thus, if a larger benefit (recall) is instead generated from this less probable action, the calculation according to equation (8) will result in a larger gradient, as a larger update is required to change this situation.
As described for the network model above, the problem features and solution features form the input state features, which become the node features of each node in the network. The network is set to 5 layers, and the node feature vectors are computed and updated layer by layer according to the neighbourhood aggregation and combination formulas of each layer. After the last layer is computed, the graph feature vector representing the current solution is obtained from the graph representation formula and finally mapped to the action space to obtain the action probabilities. This is the workflow and principle of the whole policy network.
Step 3: the current problem instance and the current solution are input into the solution model trained in step 2; after the action probabilities are obtained, a lifting operator is selected according to them to lift the current solution and obtain a new solution. Fig. 3 is a schematic diagram of the operation of the inter-path lifting operator, and Fig. 4 of the intra-path lifting operator. If the new solution is better than the original solution, the original solution is updated with the new solution and iterative lifting continues; otherwise the original solution continues to be iteratively lifted.
Step 4: if the meta-controller selects the perturbation controller, i.e. when the solution has reached a local optimum (in the invention, L rounds without lifting), the perturbation controller shuffles and reconstructs a feasible solution by randomly selecting a perturbation operator (Perturbation Operators), after which iterative lifting continues to find the optimal solution. Table 3 lists a subset of the perturbation operators.
TABLE 3
The invention sets a dynamic threshold mechanism to ensure that the new solution generated after the disturbance is not much worse than the current solution or the current optimal solution.
The dynamic threshold mechanism limits the quality of the new solution after disturbance to the total path length of the optimal solution plus 0.05; if no new solution satisfying this condition is found by the disturbance within 50 steps, the threshold is increased by 0.1, the step count is cleared, and the previous steps are repeated until a solution satisfying the condition is found. For problems of different node scales, the convergence speed and the convergence result can be tuned by adjusting the initial value of the threshold and its increment.
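The dynamic threshold mechanism can be sketched as a small acceptance loop. The 0.05 / 0.1 / 50-step constants follow the text; the `perturb` and `length_of` helpers are hypothetical callables:

```python
def perturb_with_threshold(best_solution, best_len, perturb, length_of,
                           init_margin=0.05, margin_step=0.1, max_tries=50):
    """Keep perturbing until the new solution's total path length is within
    `margin` of the best known length; after 50 failed tries the threshold is
    relaxed by 0.1 and the try counter is cleared."""
    margin = init_margin
    while True:
        for _ in range(max_tries):
            candidate = perturb(best_solution)
            if length_of(candidate) <= best_len + margin:
                return candidate
        margin += margin_step        # relax the threshold and repeat the previous steps
```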
Step 5: when the total number of steps (lifting operations plus disturbance operations) reaches T (the invention uses T = 40000), the search for the optimal solution of the current problem in this round ends. To balance exploration and exploitation, the invention adopts an ε-greedy method: with 5% probability the lifting controller (policy network) selects a random action; otherwise the action is selected according to the action probabilities output by the network for the solution.
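The ε-greedy rule with 5% random actions in a few lines; `probs` is assumed to be a plain sequence of action probabilities (e.g. the softmax output converted with `.tolist()`):

```python
import random

def epsilon_greedy_action(probs, epsilon=0.05):
    """With probability epsilon pick a uniformly random action index, otherwise
    sample an action index according to the policy's probabilities."""
    n_actions = len(probs)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return random.choices(range(n_actions), weights=probs, k=1)[0]
```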
The invention uses an ensemble policy as the final policy of the whole method: 6 different policies are trained with different history action lengths in the state feature vector (H = 1, 2, 3, 4, 5 and 6), with all other parts of the network identical, and at each time point during operation the best-quality solution among the 6 policies, i.e. the one with the smallest total path length, is selected as the final ensemble solution of the algorithm.
The invention was run on 100 CVRP-20 problem instances; each CVRP-20 problem is defined as follows:
For each instance, 20 nodes are randomly generated, with the first node set as the depot and the remaining nodes as customer nodes. Each customer node is given a demand drawn uniformly from [1, 9], the depot demand is 0, and the vehicle capacity is 30. The location (x_i, y_i) of each node (including the depot) is sampled uniformly from the unit square (i.e. x_i and y_i are drawn uniformly from [0, 1]), and the travel cost c_{i,j} between two nodes is simply the Euclidean distance between them.
For a problem instance, a policy first generates a random feasible solution, then iteratively lifts it for T = 40000 rounds according to the policy (policy performance at different step counts is shown in detail later), and finally selects the best of the 40000 solutions found as the algorithm's final solution for that instance. For the ensemble policy, a set of policies is trained using actions and effects of different history lengths, and for each problem instance the best solution among these policies is chosen as the final ensemble solution. All experimental results of the invention are averages over 100 randomly sampled problem instances.
As described above, the invention trains 6 different policies based on 6 different history action and effect lengths (H ∈ [1, 2, 3, 4, 5, 6]). Below, 10 problem instances are randomly drawn to show the total path length of the final solution of each of the 6 policies on these instances; the ensemble method combines the advantages of each policy, yielding an ensemble result close to the optimal solution as the final solution of the algorithm. Table 4 shows the results of the individual policies and the ensemble policy for the CVRP problem on the 10 problem instances. Both the single policies and the ensemble policy converge very quickly and reach good final solutions.
TABLE 4

Claims (1)

1. The logistics scheduling planning method based on the graph neural network and reinforcement learning is characterized by comprising the following steps of:
step 1: constructing a complete solution of a vehicle path planning problem instance;
step 2: transmitting the complete feasible solution to the meta-controller; if the current solution has not been lifted for L rounds, selecting the disturbance controller; otherwise, selecting the lifting controller;
after the lifting controller is selected, an optimal lifting operator is selected for the problem instance and the feasible solution by the graph neural network trained with reinforcement learning, where the set of all lifting operators forms the action space of the lifting controller; training the graph neural network over this action space specifically comprises the following operations:
the network is trained and updated with the gradient update formula of the classical baseline-based gradient descent method:
∇_θ = (L(π|s) - b(s)) · ∇_θ log p_θ(π|s)
where s is the current solution, π is the current policy, L(π|s) is the total path length of the new solution obtained from the current solution under the current policy, b(s) is the baseline function giving a basic value estimate from the current solution to aid training, the whole term (L(π|s) - b(s)) is the return value (reward) obtained after the current solution s selects an action according to the policy, and log p_θ(π|s) is the logarithm of the policy action probability;
establishing a solution model comprising state design, action design, reward value design and policy network design; the specific process is as follows:
(1) The state consists of the demand of the current node, the position coordinates of the current node, the vehicle capacity remaining when the current node is visited on its path, the history action taken h steps earlier and the effect of that action;
the expression of the state is:
X_v = [c_i, (x_i, y_i), C_i, a_{t-h}, e_{t-h}]
where a_{t-h} is the action taken h steps before the current step t, e_{t-h} is the effect of that action, (x_i, y_i) are the position coordinates of the current node i, c_i is the demand of the current node, and C_i is the vehicle capacity remaining when the current node is visited on its path;
(2) The action space consists of a lifting operator set, and comprises an intra-path operator and an inter-path operator, wherein the intra-path operator is used for attempting to reduce the total path length of a single path, and the inter-path operator is used for attempting to reduce the total path length of a solution and jumping out of local optimum by adjusting nodes in a plurality of paths;
(3) The reward value is the return R^(n) obtained by the actions taken in one period:
R^(n) = r_{t+1} + r_{t+2} + ... + r_{t+n-1} + Q(s_{t+n}, a_{t+n})
where r_{t+1} is the reward at step t+1, γ is the decay factor, n is the look-ahead step count, and Q(s_{t+n}, a_{t+n}) is the Q value of the state-action pair at step t+n;
(4) The policy network design uses a graph neural network to generate a probability distribution over the action space from the input state, specifically as follows:
the k-th layer of the graph neural network is calculated by:
h_v^(k) = COMBINE^(k)( h_v^(k-1), AGGREGATE^(k)({ h_u^(k-1) : u ∈ N(v) }) )
where h_v^(k) is the feature vector of node v at the k-th layer, h_v^(k-1) is the feature vector of node v at layer k-1, h_v^(0) is initialized to X_v, i.e. the initial feature vector of the node, N(v) is the set of neighbour nodes of v, and h_v^(k) carries the high-dimensional problem-instance information extracted by the k-th layer of the graph neural network; the AGGREGATE and COMBINE functions are both designed as summation functions;
the node features after the last layer of iteration are aggregated with a READOUT function to obtain the graph representation h_G of the whole graph:
h_G = CONCAT( READOUT({ h_v^(k) : v ∈ G }) | k = 0, 1, ..., K )
after the graph feature vector is obtained, it is mapped to the action space with an MLP, an action probability vector is obtained with a softmax layer, and an action is selected according to the action probabilities to lift the feasible solution:
p_θ(π|s) = SOFTMAX(MLP(h_G))
where p_θ is the distribution of this prediction model;
generating a probability vector for the input state, where the probability vector gives the action probabilities for the given input solution, and then selecting a lifting operator according to the generated action probabilities to try to lift the current solution;
step 3: inputting the current problem instance and the current solution into the solution model trained in step 2; after the action probabilities are obtained, selecting a lifting operator according to them to lift the current solution and obtaining a new solution; if the new solution is better than the original solution, updating the original solution with the new solution and continuing iterative lifting; otherwise keeping iteratively lifting the original solution and proceeding to the next iteration;
step 4: if the meta-controller selects the disturbance controller, the disturbance controller randomly selects a disturbance operator to disturb and reconstruct a feasible solution, after which iterative lifting continues in the next iteration to find the optimal solution;
step 5: when the total number of steps reaches T, the search for the optimal solution of the current problem in this round ends; among all feasible solutions visited during the lifting and disturbance processes, the one with the smallest total path length is selected as the optimal solution and the final output of the whole algorithm.
CN202110958524.1A 2021-08-20 2021-08-20 Logistics scheduling planning method based on graph neural network and reinforcement learning Active CN113850414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110958524.1A CN113850414B (en) 2021-08-20 2021-08-20 Logistics scheduling planning method based on graph neural network and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110958524.1A CN113850414B (en) 2021-08-20 2021-08-20 Logistics scheduling planning method based on graph neural network and reinforcement learning

Publications (2)

Publication Number Publication Date
CN113850414A CN113850414A (en) 2021-12-28
CN113850414B true CN113850414B (en) 2023-08-04

Family

ID=78975656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110958524.1A Active CN113850414B (en) 2021-08-20 2021-08-20 Logistics scheduling planning method based on graph neural network and reinforcement learning

Country Status (1)

Country Link
CN (1) CN113850414B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594358B (en) * 2023-04-20 2024-01-02 暨南大学 Multi-layer factory workshop scheduling method based on reinforcement learning
CN116187611B (en) * 2023-04-25 2023-07-25 南方科技大学 Multi-agent path planning method and terminal
CN117129000B (en) * 2023-09-21 2024-03-26 安徽大学 Multi-target freight vehicle path planning method based on seed optimization algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797992A (en) * 2020-05-25 2020-10-20 华为技术有限公司 Machine learning optimization method and device
CN113159432A (en) * 2021-04-28 2021-07-23 杭州电子科技大学 Multi-agent path planning method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11443346B2 (en) * 2019-10-14 2022-09-13 Visa International Service Association Group item recommendations for ephemeral groups based on mutual information maximization

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797992A (en) * 2020-05-25 2020-10-20 华为技术有限公司 Machine learning optimization method and device
CN113159432A (en) * 2021-04-28 2021-07-23 杭州电子科技大学 Multi-agent path planning method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fei Ni et al.; "A Multi-Graph Attributed Reinforcement Learning based Optimization Algorithm for Large-scale Hybrid Flow Shop Scheduling Problem"; Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; 2021-08-14; full text *

Also Published As

Publication number Publication date
CN113850414A (en) 2021-12-28

Similar Documents

Publication Publication Date Title
CN113850414B (en) Logistics scheduling planning method based on graph neural network and reinforcement learning
Luo et al. Species-based particle swarm optimizer enhanced by memory for dynamic optimization
Peng et al. Accelerating minibatch stochastic gradient descent using typicality sampling
CN106529818A (en) Water quality evaluation prediction method based on fuzzy wavelet neural network
CN114167898B (en) Global path planning method and system for collecting data of unmanned aerial vehicle
CN113627471A (en) Data classification method, system, equipment and information data processing terminal
Ducange et al. Multi-objective evolutionary fuzzy systems
Kalinli et al. Training recurrent neural networks by using parallel tabu search algorithm based on crossover operation
Lee et al. Boundary-focused generative adversarial networks for imbalanced and multimodal time series
Zhu et al. An Efficient Hybrid Feature Selection Method Using the Artificial Immune Algorithm for High‐Dimensional Data
Zhang et al. Brain-inspired experience reinforcement model for bin packing in varying environments
CN109993271A (en) Grey neural network forecasting based on theory of games
CN112836846B (en) Multi-depot and multi-direction combined transportation scheduling double-layer optimization algorithm for cigarette delivery
Han et al. A deep reinforcement learning based multiple meta-heuristic methods approach for resource constrained multi-project scheduling problem
Wu et al. An algorithm for solving travelling salesman problem based on improved particle swarm optimisation and dynamic step Hopfield network
Ayres et al. The extreme value evolving predictor
Pontes-Filho et al. EvoDynamic: a framework for the evolution of generally represented dynamical systems and its application to criticality
Oxenstierna Warehouse vehicle routing using deep reinforcement learning
Nolle et al. Intelligent computational optimization in engineering: Techniques and applications
Qiu et al. On the adoption of metaheuristics for solving 0–1 knapsack problems
Lu et al. Corrigendum to “The Fourth-Party Logistics Routing Problem Using Ant Colony System-Improved Grey Wolf Optimization”
Han et al. An Improved Ant Colony Optimization for Large Scale Colored Traveling Salesman Problem
Li et al. Ttnet: Tabular transfer network for few-samples prediction
Ahmed et al. Review on the parameter settings in harmony search algorithm applied to combinatorial optimization problems
Hu et al. Research on flexible job-shop scheduling problem based on the dragonfly algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant