CN114237222A - Method for planning route of delivery vehicle based on reinforcement learning - Google Patents
Method for planning route of delivery vehicle based on reinforcement learning
- Publication number
- CN114237222A (application CN202111355807.3A)
- Authority
- CN
- China
- Prior art keywords
- reinforcement learning
- node
- vehicle
- training
- learning model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000002787 reinforcement Effects 0.000 title claims abstract description 76
- 238000000034 method Methods 0.000 title claims abstract description 57
- 238000012549 training Methods 0.000 claims abstract description 71
- 230000008569 process Effects 0.000 claims abstract description 21
- 238000005457 optimization Methods 0.000 claims abstract description 11
- 239000011159 matrix material Substances 0.000 claims description 35
- 239000003795 chemical substances by application Substances 0.000 claims description 15
- 230000006870 function Effects 0.000 claims description 15
- 230000009471 action Effects 0.000 claims description 10
- 230000008859 change Effects 0.000 claims description 10
- 238000012545 processing Methods 0.000 claims description 9
- 230000003068 static effect Effects 0.000 claims description 9
- 230000004913 activation Effects 0.000 claims description 3
- 230000007423 decrease Effects 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000002068 genetic effect Effects 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 238000010845 search algorithm Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0223—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a method for planning routes of delivery vehicles based on reinforcement learning. The method comprises the following steps: constructing a reinforcement learning model based on the A2C framework together with its optimization objective; initializing all parameter values of the reinforcement learning model and randomly generating a data set; constructing the training process of the reinforcement learning model, inputting the generated data set into the model, and calculating the reward value of each round of training; optimizing the reinforcement learning model with a policy-gradient reinforcement learning method according to the loss value; and setting a maximum number of training rounds, training repeatedly to obtain a trained reinforcement learning model, and planning delivery vehicle routes with the trained model. Unlike traditional exact and heuristic algorithms, the method can quickly solve large-scale path planning problems.
Description
Technical Field
The invention belongs to the field of logistics scheduling, and particularly relates to a method for planning a route of a delivery vehicle based on reinforcement learning.
Background
In recent years, with the popularization and development of the mobile internet, the scale of electronic commerce has expanded continuously, the associated logistics industry has developed rapidly, and its output value has kept growing at a high rate. According to data from the National Bureau of Statistics, the industry's total business volume in 2020 exceeded 2 trillion yuan, a year-on-year increase of 29.7%, and express delivery volume exceeded 80 billion parcels, a year-on-year increase of 31.2%. As urbanization advances, urban distribution has become an important link in the whole logistics industry. At a time when smart cities are being built with high-tech means, improving urban distribution efficiency by combining the latest software and hardware technologies has become a new challenge.
The pickup-and-delivery problem is a classical NP-hard problem in the field of combinatorial optimization. Researchers have studied its theory and practical applications extensively, and a large number of exact algorithms and heuristic algorithms have been proposed. The exact algorithms include branch-and-price, column generation, and the like; their advantage is that an optimal solution can be obtained, but as the problem size grows the solution time increases exponentially, and satisfactory results cannot be obtained within an acceptable time.
More research has therefore turned to heuristic algorithms. A heuristic typically first generates an initial solution, then defines an iteration strategy that imitates certain natural phenomena, and obtains a final solution after a certain number of iterations. Many heuristics have been proposed, such as genetic algorithms, artificial immune algorithms, and tabu search. Although the results obtained by heuristics are not necessarily optimal, relatively good solutions can be reached in a reasonable time, so heuristics are currently the most widely applied. They still perform poorly, however, on large-scale problems that must be solved promptly.
To address this problem, some researchers have begun to introduce new approaches. In recent years, reinforcement learning has achieved remarkable results in related fields. Solving path planning problems such as the pickup-and-delivery problem involves sequential decision making, for which reinforcement learning is well suited. Researchers have therefore started to apply reinforcement learning to path planning problems such as the TSP and the VRP, greatly improving solution time while keeping solution quality competitive (M. Nazari, A. Oroojlooy, M. Takáč, and L. V. Snyder, "Reinforcement learning for solving the vehicle routing problem," Adv. Neural Inf. Process. Syst., 2018, pp. 9839-9849). Compared with traditional exact and heuristic algorithms, this approach solves quickly; even complex large-scale problems can be solved quickly and reasonably. It has drawbacks as well, mainly the following three: model training takes a very long time; the training process requires a large amount of data; and solution quality still lags the traditional algorithms to some extent.
Disclosure of Invention
The invention aims to provide, rapidly and intelligently, a feasible solution to the pickup-and-delivery problem in urban distribution, thereby improving overall efficiency.
The purpose of the invention is realized by at least one of the following technical solutions.
A reinforcement learning-based method for planning routes of pickup-and-delivery vehicles comprises the following steps:
s1: constructing a reinforcement learning model based on an A2C framework and an optimization target thereof;
s2: initializing all parameter values of the reinforcement learning model, and randomly generating a data set;
s3: a training process of constructing a reinforcement learning model, wherein the data set generated in the step S2 is input into the reinforcement learning model, and the reward value of each round of training results is calculated;
s4: optimizing the reinforcement learning model by adopting a reinforcement learning method based on strategy gradient according to the loss value;
s5: and setting the maximum number of training rounds, repeating steps S3-S4 to obtain a trained reinforcement learning model, and planning routes of the pickup-and-delivery vehicles with the trained reinforcement learning model.
Further, each customer order comprises one pickup point and n delivery points, n ∈ [2, 4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be loaded together, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is not limited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and return to it after all orders have been delivered;
for the same customer order, the vehicle must collect all goods from the pickup point before delivering them to the corresponding delivery points; pickups and deliveries of different customer orders may be interleaved;
the optimization goal of the reinforcement learning model is to minimize the sum of the travel distances F of all vehicles:
F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \left\| x_i^m - x_{i+1}^m \right\|_2

where |R| denotes the number of nodes in the current route R, x_i^m denotes the i-th node that vehicle m passes through, \|\cdot\|_2 denotes the L2 norm, so that \|x_i^m - x_{i+1}^m\|_2 is the distance vehicle m travels from the i-th node to the next node, and M denotes the set of all vehicles in use.
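For concreteness, the objective and the associated reward can be computed directly from the node coordinates along each constructed route. The sketch below is illustrative only; the function name and the array layout are assumptions, not details taken from the patent:

```python
import numpy as np

def total_travel_distance(routes):
    """Sum of Euclidean route lengths over all vehicles.

    routes: list of arrays, one per vehicle, each of shape (|R|, 2),
            holding the node coordinates in visiting order
            (depot first and last).
    """
    total = 0.0
    for route in routes:
        route = np.asarray(route, dtype=float)
        # distance from the i-th node to the next node, summed along the route
        total += np.linalg.norm(np.diff(route, axis=0), axis=1).sum()
    return total

# The training reward is simply the negative distance: reward = -F.
```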
Further, in step S1, the reinforcement learning model includes an actor network and a critic network;
the actor network learns a policy in order to obtain the highest possible return and is used to generate actions and interact with the environment; the critic network evaluates the current policy, that is, it assesses the quality of the actor network and guides its actions in the next stage; the two are implemented with different neural networks;
the actor network includes a first encoder, a decoder, and an attention layer; the first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup quantity of each pickup point, and the delivery quantity of each delivery point, and the data input to the encoder pass through the convolution layer in the first encoder to obtain a first vector embed_1; the decoder processes the coordinates of the node where the vehicle is currently located, and the data input to the decoder pass through the convolution layer and the GRU layer in the decoder to obtain a second vector embed_2; the attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v·tanh(W·hidden));
the critic network comprises a second encoder and a fully connected layer; the second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup quantity of each pickup point, and the delivery quantity of each delivery point, and the input passes through the convolution layer in the second encoder to obtain a third vector embed_3; the fully connected layer takes the third vector embed_3 as input, comprises a plurality of convolution layers, and uses a ReLU activation function to remove negative values from the output of each convolution layer.
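A compact PyTorch sketch of the two networks as described above is given below. The layer sizes, the dynamic-feature dimension, and the tensor layouts are illustrative assumptions rather than values fixed by the patent:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: encoder + GRU decoder + additive attention producing the probability matrix p."""
    def __init__(self, static_dim=2, dynamic_dim=3, hidden=128):
        super().__init__()
        # first encoder: 1x1 convolution over all pickup/delivery node features
        self.encoder = nn.Conv1d(static_dim + dynamic_dim, hidden, kernel_size=1)
        # decoder: convolution + GRU over the coordinates of the current node
        self.dec_conv = nn.Conv1d(static_dim, hidden, kernel_size=1)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        # attention parameters v and W ("zero matrices" as literally described;
        # in practice a random initialisation may be needed for gradients to flow)
        self.v = nn.Parameter(torch.zeros(hidden))
        self.W = nn.Parameter(torch.zeros(hidden, hidden))

    def forward(self, node_feats, cur_coords, last_hh=None):
        # node_feats: (batch, static+dynamic, n_nodes); cur_coords: (batch, static, 1)
        embed_1 = self.encoder(node_feats)                   # (batch, hidden, n_nodes)
        dec_in = self.dec_conv(cur_coords).transpose(1, 2)   # (batch, 1, hidden)
        _, last_hh = self.gru(dec_in, last_hh)
        embed_2 = last_hh[-1].unsqueeze(2)                   # (batch, hidden, 1)
        hidden = embed_1 + embed_2                           # broadcast over the node axis
        Wh = torch.einsum('hk,bkn->bhn', self.W, hidden)     # W * hidden, per node
        scores = torch.einsum('h,bhn->bn', self.v, torch.tanh(Wh))
        return torch.softmax(scores, dim=1), last_hh         # probability matrix p

class Critic(nn.Module):
    """Critic: encoder + 'fully connected' head built from 1x1 convolutions with ReLU."""
    def __init__(self, static_dim=2, dynamic_dim=3, hidden=128):
        super().__init__()
        self.encoder = nn.Conv1d(static_dim + dynamic_dim, hidden, kernel_size=1)
        self.head = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1))

    def forward(self, node_feats):
        embed_3 = self.encoder(node_feats)                   # (batch, hidden, n_nodes)
        return self.head(embed_3).sum(dim=2).squeeze(1)      # scalar baseline estimate per instance
```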
Further, basic elements of the reinforcement learning model are defined, including the agent, the state, and the reward value, as follows:
The intelligent agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed, it updates the policy according to the feedback obtained;
the state is as follows: the states are divided into static states and dynamic states; the static state is an attribute which does not change along with time and comprises the coordinates of each node; the dynamic state is an attribute which can be changed along with the training process, and comprises the current load and the position of the vehicle and the demand of each node;
the reward value is as follows: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization goal is to minimize the distance traveled F, with-F as the reward value.
Further, in step S2, the initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value; because training the reinforcement learning model requires very large amounts of data to reach good accuracy, data are generated randomly from an existing pickup-and-delivery instance data set, specifically as follows:
for each instance in the data set, all nodes are divided into pickup points and delivery points; the coordinates of each node are randomly perturbed within a certain range to generate a new instance; several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
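A possible implementation of this instance-generation scheme is sketched below; the perturbation range, the field names, and the handling of the case where delivery points run out are assumptions for illustration:

```python
import numpy as np

def generate_instance(coords, is_pickup, demands, noise=0.05, rng=None):
    """Derive a new training instance from an existing pickup-and-delivery instance.

    coords:    (n, 2) node coordinates, assumed scaled to [0, 1]
    is_pickup: (n,) boolean mask, True for pickup points
    demands:   (n,) demand at each pickup point (0 for delivery points)
    """
    rng = np.random.default_rng() if rng is None else rng
    # randomly perturb every coordinate within a small range
    new_coords = np.clip(coords + rng.uniform(-noise, noise, size=coords.shape), 0.0, 1.0)

    pickups = np.flatnonzero(is_pickup)
    deliveries = rng.permutation(np.flatnonzero(~is_pickup)).tolist()

    orders = []
    for p in pickups:
        k = int(rng.integers(2, 5))                  # n delivery points per order, n in [2, 4]
        assigned = [deliveries.pop() for _ in range(min(k, len(deliveries)))]
        if not assigned:                             # no delivery points left to assign
            continue
        share = demands[p] / len(assigned)           # split the pickup demand equally
        orders.append({'pickup': int(p), 'deliveries': assigned, 'delivery_qty': float(share)})
    return new_coords, orders
```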
Further, in step S3, after each selection step in the training process the input static state remains unchanged while the dynamic state changes; the change of the dynamic state is applied with a state update function; meanwhile, because of the various constraints, the vehicle cannot visit every node during training; to speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated, whose value is 1 if a node may be visited and 0 otherwise, so that nodes the vehicle cannot reach in the next step are masked; a position update function refreshes the accessible-point matrix immediately after each selection step is completed (a code sketch of both update functions is given after the rules below);
inputting the data set generated in the step S2 into an actor network, randomly selecting a next destination of the vehicle according to a probability matrix p output by the actor network and in combination with the accessible point matrix, and recording the probability corresponding to the selected location;
after each training round is completed, namely the complete paths of all vehicles are constructed, the sum F of the driving distances of all vehicles is calculated, and then the reward value is calculated.
Further, the state update function is defined by the following rules:
assuming that the node selected by the vehicle m as the destination of the next step is i, the change in the dynamic state after completion of the selection is as follows:
(1.1) updating the current position of the vehicle m to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of the node i is updated to 0;
the location update function is defined by the following rules:
assuming that the current position of the vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of the node j is 0, which indicates that the node j has been served;
(2.2) the node j is a goods taking point, and the demand of the node j exceeds the residual load of the vehicle m;
(2.3) node j is the delivery point, but vehicle m has not yet taken the delivery from the pick point corresponding to node j.
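The following sketch shows one way the state update function and the position (mask) update function described by rules (1.1)-(1.4) and (2.1)-(2.3) could be implemented. The dictionary-based state layout, the per-order "on board" bookkeeping, and the helper names are illustrative assumptions:

```python
import numpy as np

def update_state(state, m, i):
    """Dynamic-state update after vehicle m selects node i as its next destination (rules 1.1-1.4)."""
    state['position'][m] = i                          # (1.1) current position of vehicle m becomes i
    q = state['demand'][i]
    if state['is_pickup'][i]:
        state['load'][m] += q                         # (1.2) pickup point: load increases by q_i
        state['on_board'][state['order_of'][i]][m] = True
    else:
        state['load'][m] -= q                         # (1.3) delivery point: load decreases by q_i
    state['demand'][i] = 0.0                          # (1.4) demand of node i is set to 0

def accessible_points(state, m, capacity):
    """Accessible-point matrix for vehicle m: 1 = may be visited next, 0 = masked (rules 2.1-2.3)."""
    n = len(state['demand'])
    mask = np.ones(n, dtype=np.int8)
    for j in range(n):
        if state['demand'][j] == 0:                   # (2.1) node j has already been served
            mask[j] = 0
        elif state['is_pickup'][j] and state['demand'][j] > capacity - state['load'][m]:
            mask[j] = 0                               # (2.2) pickup would exceed the remaining capacity
        elif not state['is_pickup'][j] and not state['on_board'][state['order_of'][j]][m]:
            mask[j] = 0                               # (2.3) corresponding pickup not yet on board vehicle m
    return mask
```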
Further, in step S4, the reinforcement learning model is trained based on the principle of the strategy gradient algorithm, which is as follows:
obtaining the reward value of each round of training result output by the actor network, obtaining the estimation value output by the critic network, subtracting the reward value from the estimation value to obtain the loss value of the critic network, and multiplying the loss value of the critic network by the probability matrix output by the actor network to obtain the loss value of the actor network;
updating the actor network and the critic network through the optimizer, where the update formula is:

\nabla_\theta J(\theta) \approx \frac{1}{S} \sum_{k=1}^{S} \left( R(\pi_k, G_k) - b(G_k) \right) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy selected in the k-th training sample, G_k denotes the corresponding node map, R(π_k, G_k) denotes the reward of that training result, b(G_k) denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameters θ, assigns to strategy π_k on node map G_k.
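One A2C update step consistent with the description above might look as follows. The optimizer handling, gradient clipping, and loss scaling follow common actor-critic practice and are assumptions rather than details taken from the patent:

```python
import torch

def a2c_update(actor, critic, actor_opt, critic_opt, rewards, values, log_probs):
    """One policy-gradient update over a batch of S sampled routes.

    rewards:   (S,) reward of each training result (the negative distance -F)
    values:    (S,) critic estimate b(G_k) for each node map
    log_probs: (S,) sum of log p_theta(pi_k | G_k) over the actions of route k
    """
    advantage = rewards - values                             # reward minus baseline estimate
    actor_loss = -(advantage.detach() * log_probs).mean()    # policy-gradient term
    critic_loss = advantage.pow(2).mean()                    # regress the baseline toward the reward

    actor_opt.zero_grad()
    actor_loss.backward()
    torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=1.0)
    actor_opt.step()

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```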
Further, in step S5, a maximum number of training rounds epoch is set; if it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan pickup-and-delivery vehicle routes.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention considers the one-pickup-multiple-delivery scenario of the pickup-and-delivery problem; because different goods cannot be loaded together, a pickup point and all of its corresponding delivery points are served by the same vehicle.
2. The invention adopts a reinforcement learning method and builds a model for the pickup-and-delivery problem based on the A2C framework; using the trained model, a solution can be obtained more quickly than with the prior art.
Drawings
Fig. 1 is a flowchart of a method for planning a route of a pickup vehicle based on reinforcement learning according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a vehicle distribution route according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a reinforcement learning model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution and advantages of the present invention more clearly understood, the following detailed description is made with reference to the accompanying drawings and examples, but the present invention is not limited thereto.
Example (b):
A reinforcement learning-based method for planning routes of pickup-and-delivery vehicles, as shown in Fig. 1, includes the following steps:
s1: constructing a reinforcement learning model based on an A2C framework and an optimization target thereof;
each customer order comprises one pickup point and n delivery points, n ∈ [2, 4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be loaded together, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is not limited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and must return to it after all orders have been delivered;
for the same customer order, the vehicle must collect all goods from the pickup point before delivering them to the corresponding delivery points; pickups and deliveries of different customer orders may be interleaved;
the optimization goal of the reinforcement learning model is to minimize the sum of the travel distances F of all vehicles:
F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \left\| x_i^m - x_{i+1}^m \right\|_2

where |R| denotes the number of nodes in the current route R, x_i^m denotes the i-th node that vehicle m passes through, \|\cdot\|_2 denotes the L2 norm, so that \|x_i^m - x_{i+1}^m\|_2 is the distance vehicle m travels from the i-th node to the next node, and M denotes the set of all vehicles in use.
As shown in FIG. 3, the reinforcement learning model comprises an actor network and a critic network;
the actor network learns a policy in order to obtain the highest possible return and is used to generate actions and interact with the environment; the critic network evaluates the current policy, that is, it assesses the quality of the actor network and guides its actions in the next stage; the two are implemented with different neural networks;
the actor network includes a first encoder, a decoder, and an attention layer; the first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup quantity of each pickup point, and the delivery quantity of each delivery point, and the data input to the encoder pass through the convolution layer in the first encoder to obtain a first vector embed_1; the decoder processes the coordinates of the node where the vehicle is currently located, and the data input to the decoder pass through the convolution layer and the GRU layer in the decoder to obtain a second vector embed_2; the attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v·tanh(W·hidden));
the critic network comprises a second encoder and a fully connected layer; the second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup quantity of each pickup point, and the delivery quantity of each delivery point, and the input passes through the convolution layer in the second encoder to obtain a third vector embed_3; the fully connected layer takes the third vector embed_3 as input, comprises a plurality of convolution layers, and uses a ReLU activation function to remove negative values from the output of each convolution layer.
Basic elements of the reinforcement learning model are defined, including the agent, the state, and the reward value, specifically as follows:
The intelligent agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed, it updates the policy according to the feedback obtained;
the state is as follows: the states are divided into static states and dynamic states; the static state is an attribute which does not change along with time and comprises the coordinates of each node; the dynamic state is an attribute which can change along with the training process and comprises the current load and the position of the vehicle and the demand of each node;
the reward value is as follows: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization goal is to minimize the distance traveled F, with-F as the reward value.
In this embodiment, the schematic diagram of the vehicle delivery routes is shown in Fig. 2, where R1, R2, and R3 are customer order numbers, circles represent customer points, solid circles represent pickup points, and dashed circles represent delivery points. The first number in a circle is the customer order number; the letter 'P' marks the pickup point of the order, the letter 'D' marks a delivery point, and the number following 'D' is the index of that delivery point within the order. V1 and V2 are the vehicle delivery routes scheduled for the orders, and 0 represents the depot. In this case, the goods of orders R1 and R2 can be loaded together and are therefore delivered by the same vehicle over a short distance, whereas the goods of order R3 cannot be mixed with them and must be delivered by another vehicle. The delivery routes can be seen to satisfy the constraints.
S2: initializing all parameter values of the model, and randomly generating a data set;
initializing parameter values, including the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value; because training the reinforcement learning model requires very large amounts of data to reach good accuracy, in this embodiment data are generated randomly based on the Li & Lim data set, specifically as follows:
for each instance in the data set, all nodes are divided into pickup points and delivery points; the coordinates of each node are randomly perturbed within a certain range to generate a new instance; several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
S3: constructing a training process of a reinforcement learning model, and calculating the reward value of each round of training results;
in the training process, after each selection step the input static state remains unchanged while the dynamic state changes; the change of the dynamic state is applied with a state update function; meanwhile, because of the various constraints, the vehicle cannot visit every node during training; to speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated, whose value is 1 if a node may be visited and 0 otherwise, so that nodes the vehicle cannot reach in the next step are masked; a position update function refreshes the accessible-point matrix immediately after each selection step is completed;
inputting the data set generated in the step S2 into an actor network, randomly selecting a next destination of the vehicle according to the probability matrix output by the actor network and the accessible point matrix, and recording the probability corresponding to the selected location;
after each training round is completed, namely the complete paths of all vehicles are constructed, the sum F of the driving distances of all vehicles is calculated, and then the reward value is calculated.
The state update function is defined by the following rules:
assuming that the node selected by the vehicle m as the destination of the next step is i, the change in the dynamic state after completion of the selection is as follows:
(1.1) updating the current position of the vehicle m to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of the node i is updated to 0;
the location update function is defined by the following rules:
assuming that the current position of the vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of the node j is 0, which indicates that the node j has been served;
(2.2) the node j is a goods taking point, and the demand of the node j exceeds the residual load of the vehicle m;
(2.3) node j is the delivery point, but vehicle m has not yet taken the delivery from the pick point corresponding to node j.
S4: optimizing the reinforcement learning model by adopting a reinforcement learning method based on strategy gradient according to the loss value;
training a reinforcement learning model based on the principle of a strategy gradient algorithm, which comprises the following specific steps:
obtaining the reward value of each round of training result output by the actor network, obtaining the estimation value output by the critic network, subtracting the reward value from the estimation value to obtain the loss value of the critic network, and multiplying the loss value of the critic network by the probability matrix output by the actor network to obtain the loss value of the actor network;
updating the actor network and the critic network through the optimizer, where the update formula is:

\nabla_\theta J(\theta) \approx \frac{1}{S} \sum_{k=1}^{S} \left( R(\pi_k, G_k) - b(G_k) \right) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy selected in the k-th training sample, G_k denotes the corresponding node map, R(π_k, G_k) denotes the reward of that training result, b(G_k) denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameters θ, assigns to strategy π_k on node map G_k.
S5: setting the maximum number of training rounds epoch; if it has not been reached, returning to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan delivery vehicle routes.
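As an illustration of how the saved model could be used for planning, the sketch below rolls out the trained actor greedily (argmax instead of sampling), reusing the masking and state-update helpers assumed earlier; the interface names are hypothetical:

```python
import torch

@torch.no_grad()
def plan_route(actor, build_inputs, state, mask_fn, update_fn, depot=0, max_steps=1000):
    """Greedy rollout of the trained actor for a single vehicle.

    build_inputs(state) -> (node_feats, cur_coords) tensors expected by the actor.
    mask_fn(state)      -> 0/1 vector of currently reachable nodes (accessible-point matrix).
    update_fn(state, j)  applies the dynamic-state update after each choice.
    """
    route, last_hh = [depot], None
    for _ in range(max_steps):
        node_feats, cur_coords = build_inputs(state)
        probs, last_hh = actor(node_feats, cur_coords, last_hh)
        probs = probs.squeeze(0) * torch.as_tensor(mask_fn(state), dtype=probs.dtype)
        if probs.sum() == 0:                    # no reachable node left: return to the depot
            route.append(depot)
            break
        nxt = int(torch.argmax(probs))          # greedy choice instead of sampling
        update_fn(state, nxt)
        route.append(nxt)
    return route
```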
The above examples merely serve to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims.
Claims (10)
1. A reinforcement learning-based pickup-and-delivery vehicle route planning method, characterized by comprising the following steps:
s1: constructing a reinforcement learning model based on an A2C framework and an optimization target thereof;
s2: initializing all parameter values of the reinforcement learning model, and randomly generating a data set;
s3: a training process of constructing a reinforcement learning model, wherein the data set generated in the step S2 is input into the reinforcement learning model, and the reward value of each round of training results is calculated;
s4: optimizing the reinforcement learning model by adopting a reinforcement learning method based on strategy gradient according to the loss value;
s5: and setting the maximum number of training rounds, repeating steps S3-S4 to obtain a trained reinforcement learning model, and planning routes of the pickup-and-delivery vehicles with the trained reinforcement learning model.
2. The reinforcement learning-based pickup-and-delivery vehicle route planning method of claim 1, wherein each customer order comprises one pickup point and n delivery points, n ∈ [2, 4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be loaded together, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is not limited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and return to it after all orders have been delivered;
for the same customer order, before goods are delivered, the vehicle must take all goods from the goods-taking points and then deliver the goods to each corresponding delivery point; for different customer orders, goods taking and delivery can be performed in a crossed manner;
the optimization goal of the reinforcement learning model is to minimize the sum F of the travel distances of all vehicles: F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \| x_i^m - x_{i+1}^m \|_2, where |R| is the number of nodes in the current route R, x_i^m is the i-th node that vehicle m passes through, \|\cdot\|_2 is the L2 norm, and M is the set of all vehicles in use.
3. The reinforcement learning-based pickup-and-delivery vehicle route planning method of claim 1, wherein in step S1, the reinforcement learning model comprises an actor network and a critic network;
the actor network includes a first encoder, a decoder, and an attention layer; the first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup quantity of each pickup point, and the delivery quantity of each delivery point, and the data input to the encoder pass through the convolution layer in the first encoder to obtain a first vector embed_1; the decoder processes the coordinates of the node where the vehicle is currently located, and the data input to the decoder pass through the convolution layer and the GRU layer in the decoder to obtain a second vector embed_2; the attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v·tanh(W·hidden));
the critic network comprises a second encoder and a fully connected layer; the second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup quantity of each pickup point, and the delivery quantity of each delivery point, and the input passes through the convolution layer in the second encoder to obtain a third vector embed_3; the fully connected layer takes the third vector embed_3 as input, comprises a plurality of convolution layers, and uses a ReLU activation function to remove negative values from the output of each convolution layer.
4. The reinforcement learning-based pickup-and-delivery vehicle route planning method of claim 3, wherein basic elements of the reinforcement learning model are defined, including the agent, the state, and the reward value, specifically as follows:
the intelligent agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed, it updates the policy according to the feedback obtained;
the state is as follows: the states are divided into static states and dynamic states; the static state is an attribute which does not change along with time and comprises the coordinates of each node; the dynamic state is an attribute which can be changed along with the training process, and comprises the current load and the position of the vehicle and the demand of each node;
the reward value is as follows: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization goal is to minimize the distance traveled F, with-F as the reward value.
5. The reinforcement learning-based delivery vehicle route planning method of claim 1, wherein in step S2, the initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value; data are generated randomly from an existing pickup-and-delivery instance data set, specifically as follows:
for each instance in the data set, all nodes are divided into pickup points and delivery points; the coordinates of each node are randomly perturbed within a certain range to generate a new instance; several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
6. The reinforcement learning-based pickup-and-delivery vehicle route planning method of claim 1, wherein in step S3, after each selection step in the training process the input static state remains unchanged while the dynamic state changes; the change of the dynamic state is applied with a state update function; meanwhile, because of the various constraints, the vehicle cannot visit every node during training; to speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated, whose value is 1 if a node may be visited and 0 otherwise, so that nodes the vehicle cannot reach in the next step are masked; and a position update function refreshes the accessible-point matrix immediately after each selection step is completed;
inputting the data set generated in the step S2 into an actor network, randomly selecting a next destination of the vehicle according to a probability matrix p output by the actor network and in combination with the accessible point matrix, and recording the probability corresponding to the selected location;
after each training round is completed, namely the complete paths of all vehicles are constructed, the sum F of the driving distances of all vehicles is calculated, and then the reward value is calculated.
7. The reinforcement learning-based pick-up and delivery vehicle path planning method according to claim 6, wherein the state updating function is defined according to the following rules:
assuming that the node selected by the vehicle m as the destination of the next step is i, the change in the dynamic state after completion of the selection is as follows:
(1.1) updating the current position of the vehicle m to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of the node i is updated to 0.
8. The reinforcement learning-based pick-up and delivery vehicle path planning method according to claim 6, wherein the location update function is defined according to the following rules:
assuming that the current position of the vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of the node j is 0, which indicates that the node j has been served;
(2.2) the node j is a goods taking point, and the demand of the node j exceeds the residual load of the vehicle m;
(2.3) node j is the delivery point, but vehicle m has not yet taken the delivery from the pick point corresponding to node j.
9. The pickup delivery vehicle path planning method based on reinforcement learning as claimed in claim 1, wherein in step S4, a reinforcement learning model is trained based on a strategy gradient algorithm, specifically as follows:
obtaining the reward value of each round of training result output by the actor network, obtaining the estimation value output by the critic network, subtracting the reward value from the estimation value to obtain the loss value of the critic network, and multiplying the loss value of the critic network by the probability matrix output by the actor network to obtain the loss value of the actor network;
updating the actor network and the critic network through the optimizer, where the update formula is:

\nabla_\theta J(\theta) \approx \frac{1}{S} \sum_{k=1}^{S} \left( R(\pi_k, G_k) - b(G_k) \right) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy selected in the k-th training sample, G_k denotes the corresponding node map, R(π_k, G_k) denotes the reward of that training result, b(G_k) denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameters θ, assigns to strategy π_k on node map G_k.
10. The reinforcement learning-based delivery vehicle path planning method according to any one of claims 1 to 9, wherein in step S5, a maximum number of training rounds epoch is set, and if the maximum number of training rounds epoch is not reached, the method returns to step S3, otherwise the method indicates that the whole training process is completed, the current reinforcement learning model is saved, the reinforcement learning model after training is obtained, and the delivery vehicle path planning is performed by using the reinforcement learning model after training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111355807.3A CN114237222B (en) | 2021-11-16 | 2021-11-16 | Delivery vehicle path planning method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111355807.3A CN114237222B (en) | 2021-11-16 | 2021-11-16 | Delivery vehicle path planning method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114237222A true CN114237222A (en) | 2022-03-25 |
CN114237222B CN114237222B (en) | 2024-06-21 |
Family
ID=80749548
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111355807.3A Active CN114237222B (en) | 2021-11-16 | 2021-11-16 | Delivery vehicle path planning method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114237222B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063066A (en) * | 2022-05-26 | 2022-09-16 | 电子科技大学 | Part supply circulation packing box distribution scheduling method based on graph convolution |
CN115545350A (en) * | 2022-11-28 | 2022-12-30 | 湖南工商大学 | Comprehensive deep neural network and reinforcement learning vehicle path problem solving method |
CN116562738A (en) * | 2023-07-10 | 2023-08-08 | 深圳市汉德网络科技有限公司 | Intelligent freight dispatching method, device, equipment and storage medium |
CN117875535A (en) * | 2024-03-13 | 2024-04-12 | 中南大学 | Method and system for planning picking and delivering paths based on historical information embedding |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110014428A (en) * | 2019-04-23 | 2019-07-16 | 北京理工大学 | A kind of sequential logic mission planning method based on intensified learning |
US20200073399A1 (en) * | 2018-08-30 | 2020-03-05 | Canon Kabushiki Kaisha | Information processing apparatus, information processing method, information processing system, and storage medium |
CN110956311A (en) * | 2019-11-15 | 2020-04-03 | 浙江工业大学 | Vehicle path optimization method based on super heuristic algorithm of reinforcement learning |
CN111415048A (en) * | 2020-04-10 | 2020-07-14 | 大连海事大学 | Vehicle path planning method based on reinforcement learning |
US20200273346A1 (en) * | 2019-02-26 | 2020-08-27 | Didi Research America, Llc | Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching |
CN111695700A (en) * | 2020-06-16 | 2020-09-22 | 华东师范大学 | Boxing method based on deep reinforcement learning |
CN112784481A (en) * | 2021-01-15 | 2021-05-11 | 中国人民解放军国防科技大学 | Deep reinforcement learning method and system for relay charging path planning |
CN112965499A (en) * | 2021-03-08 | 2021-06-15 | 哈尔滨工业大学(深圳) | Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning |
- 2021-11-16: CN application CN202111355807.3A, granted as patent CN114237222B (Active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200073399A1 (en) * | 2018-08-30 | 2020-03-05 | Canon Kabushiki Kaisha | Information processing apparatus, information processing method, information processing system, and storage medium |
US20200273346A1 (en) * | 2019-02-26 | 2020-08-27 | Didi Research America, Llc | Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching |
CN110014428A (en) * | 2019-04-23 | 2019-07-16 | 北京理工大学 | A kind of sequential logic mission planning method based on intensified learning |
CN110956311A (en) * | 2019-11-15 | 2020-04-03 | 浙江工业大学 | Vehicle path optimization method based on super heuristic algorithm of reinforcement learning |
CN111415048A (en) * | 2020-04-10 | 2020-07-14 | 大连海事大学 | Vehicle path planning method based on reinforcement learning |
CN111695700A (en) * | 2020-06-16 | 2020-09-22 | 华东师范大学 | Boxing method based on deep reinforcement learning |
CN112784481A (en) * | 2021-01-15 | 2021-05-11 | 中国人民解放军国防科技大学 | Deep reinforcement learning method and system for relay charging path planning |
CN112965499A (en) * | 2021-03-08 | 2021-06-15 | 哈尔滨工业大学(深圳) | Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning |
Non-Patent Citations (3)
Title |
---|
- Wang Huajian: "Control Design and Research of a Flapping-Wing Flying Robot Based on Reinforcement Learning", China Master's Theses Full-text Database (Electronic Journal), no. 01, pages 1-74 *
- Wang Bingchen et al.: "Research on Control Algorithms for Autonomous Vehicles Based on Deep Reinforcement Learning", Journal of Zhengzhou University (Engineering Science), vol. 41, no. 4, pages 41-45 *
- Ma Qiongxiong et al.: "Optimal Trajectory Control of Underwater Robots Based on Deep Reinforcement Learning", Journal of South China Normal University (Natural Science Edition), vol. 50, no. 1, pages 118-123 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063066A (en) * | 2022-05-26 | 2022-09-16 | 电子科技大学 | Part supply circulation packing box distribution scheduling method based on graph convolution |
CN115545350A (en) * | 2022-11-28 | 2022-12-30 | 湖南工商大学 | Comprehensive deep neural network and reinforcement learning vehicle path problem solving method |
CN115545350B (en) * | 2022-11-28 | 2024-01-16 | 湖南工商大学 | Vehicle path problem solving method integrating deep neural network and reinforcement learning |
CN116562738A (en) * | 2023-07-10 | 2023-08-08 | 深圳市汉德网络科技有限公司 | Intelligent freight dispatching method, device, equipment and storage medium |
CN116562738B (en) * | 2023-07-10 | 2024-01-12 | 深圳市汉德网络科技有限公司 | Intelligent freight dispatching method, device, equipment and storage medium |
CN117875535A (en) * | 2024-03-13 | 2024-04-12 | 中南大学 | Method and system for planning picking and delivering paths based on historical information embedding |
CN117875535B (en) * | 2024-03-13 | 2024-06-04 | 中南大学 | Method and system for planning picking and delivering paths based on historical information embedding |
Also Published As
Publication number | Publication date |
---|---|
CN114237222B (en) | 2024-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114237222B (en) | Delivery vehicle path planning method based on reinforcement learning | |
WO2022262469A1 (en) | Industrial park logistics scheduling method and system based on game theory | |
WO2022116225A1 (en) | Multi-vehicle task assignment and routing optimization simulation platform and implementation method therefor | |
CN107578119A (en) | A kind of resource allocation global optimization method of intelligent dispatching system | |
Wang et al. | Ant colony optimization with an improved pheromone model for solving MTSP with capacity and time window constraint | |
CN111047087B (en) | Intelligent optimization method and device for path under cooperation of unmanned aerial vehicle and vehicle | |
Bogyrbayeva et al. | A reinforcement learning approach for rebalancing electric vehicle sharing systems | |
CN113359702B (en) | Intelligent warehouse AGV operation optimization scheduling method based on water wave optimization-tabu search | |
Wang et al. | Solving task scheduling problems in cloud manufacturing via attention mechanism and deep reinforcement learning | |
CN113837628B (en) | Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning | |
CN116739466A (en) | Distribution center vehicle path planning method based on multi-agent deep reinforcement learning | |
CN109934388A (en) | One kind sorting optimization system for intelligence | |
CN115454005A (en) | Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene | |
CN114861368B (en) | Construction method of railway longitudinal section design learning model based on near-end strategy | |
CN113205220A (en) | Unmanned aerial vehicle logistics distribution global planning method facing real-time order data | |
CN117236541A (en) | Distributed logistics distribution path planning method and system based on attention pointer network | |
Zhang et al. | Transformer-based reinforcement learning for pickup and delivery problems with late penalties | |
Liu et al. | An adaptive large neighborhood search method for rebalancing free-floating electric vehicle sharing systems | |
Liu et al. | Graph convolution-based deep reinforcement learning for multi-agent decision-making in interactive traffic scenarios | |
CN117273590B (en) | Neural combination optimization method and system for solving vehicle path optimization problem | |
CN117666495A (en) | Goods picking path planning method, system and electronic equipment | |
Wang et al. | Towards optimization of path planning: An RRT*-ACO algorithm | |
Liu et al. | Graph convolution-based deep reinforcement learning for multi-agent decision-making in mixed traffic environments | |
CN108492020B (en) | Polluted vehicle scheduling method and system based on simulated annealing and branch cutting optimization | |
CN115841286A (en) | Takeout delivery path planning method based on deep reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |