CN114237222A - Method for planning route of delivery vehicle based on reinforcement learning - Google Patents

Method for planning route of delivery vehicle based on reinforcement learning

Info

Publication number
CN114237222A
CN114237222A (application CN202111355807.3A)
Authority
CN
China
Prior art keywords
reinforcement learning
node
vehicle
training
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111355807.3A
Other languages
Chinese (zh)
Other versions
CN114237222B (en)
Inventor
刘发贵
赖承启
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111355807.3A priority Critical patent/CN114237222B/en
Publication of CN114237222A publication Critical patent/CN114237222A/en
Application granted granted Critical
Publication of CN114237222B publication Critical patent/CN114237222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a reinforcement learning-based method for planning delivery vehicle routes. The method comprises the following steps: constructing a reinforcement learning model based on the A2C framework together with its optimization objective; initializing all parameter values of the reinforcement learning model and randomly generating a data set; constructing the training process of the reinforcement learning model, inputting the generated data set into the model, and calculating the reward value of each round of training; optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value; and setting the maximum number of training rounds, training repeatedly to obtain a trained reinforcement learning model, and using the trained model to plan delivery vehicle routes. Unlike traditional exact algorithms and heuristic algorithms, the method can quickly solve large-scale path planning problems.

Description

Method for planning route of delivery vehicle based on reinforcement learning
Technical Field
The invention belongs to the field of logistics scheduling, and particularly relates to a method for planning a route of a delivery vehicle based on reinforcement learning.
Background
In recent years, with the popularization and development of the mobile internet, the scale of e-commerce has continued to expand, the related logistics industry has grown rapidly, and the output value of the logistics industry has kept increasing at a high rate. According to data from the National Bureau of Statistics, the total business volume of the industry exceeded 2 trillion yuan in 2020, a year-on-year increase of 29.7%, and the express delivery volume exceeded 80 billion parcels, a year-on-year increase of 31.2%. With continuing urbanization, urban distribution has become an important link in the whole logistics industry. In today's society, where smart cities are being built with high-tech means, improving urban distribution efficiency by combining the latest software and hardware technologies has become a new challenge.
The pickup-and-delivery problem is a classical NP-hard problem in the field of combinatorial optimization. Researchers have carried out a great deal of work on both its theory and its practical applications, and many exact algorithms and heuristic algorithms have been proposed. The exact algorithms include branch-and-price, column generation, and so on; their advantage is that an optimal solution of the problem can be obtained, but as the problem scale grows the solution time increases exponentially, and satisfactory results cannot be obtained within an acceptable time.
Thus, more research has turned to heuristic algorithms. The solving process of a heuristic algorithm generally first generates an initial solution, formulates an iteration strategy by imitating certain phenomena in nature, and obtains a final solution after a certain number of iterations. Researchers have proposed many heuristic algorithms, such as genetic algorithms, artificial immune algorithms, and tabu search. Although the results obtained by heuristic algorithms are not necessarily optimal, they can produce relatively good results in a reasonable time and are currently the most widely applied. However, they still do not perform well on large-scale problems that must be solved promptly.
To address this problem, some researchers have begun to introduce new approaches. In recent years, reinforcement learning has achieved remarkable results in related fields. Solving path planning problems such as the pickup-and-delivery problem involves sequential decision-making, and reinforcement learning is well suited to making sequential decisions. Researchers have therefore started to apply reinforcement learning to path planning problems such as the TSP and the VRP, greatly improving solution time while keeping the solution quality competitive (M. Nazari, A. Oroojlooy, M. Takáč, and L. V. Snyder, "Reinforcement learning for solving the vehicle routing problem," Adv. Neural Inf. Process. Syst., vol. 2018-Decem, pp. 9839-9849, 2018). Compared with traditional exact and heuristic algorithms, this approach has the advantage of solution speed: even complex large-scale problems can be solved quickly and reasonably well. However, it also has disadvantages, mainly the following three: model training takes a very long time; the training process requires a large amount of data; and the solution quality still differs somewhat from that of traditional algorithms.
Disclosure of Invention
The invention aims to provide, rapidly and intelligently, a feasible solution to the pickup-and-delivery problem in urban distribution, thereby improving overall efficiency.
The purpose of the invention is realized by at least one of the following technical solutions.
A reinforcement learning-based method for planning pickup-and-delivery vehicle routes comprises the following steps:
S1: constructing a reinforcement learning model based on the A2C framework and its optimization objective;
S2: initializing all parameter values of the reinforcement learning model and randomly generating a data set;
S3: constructing the training process of the reinforcement learning model, inputting the data set generated in step S2 into the model, and calculating the reward value of each round of training;
S4: optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value;
S5: setting the maximum number of training rounds, repeating steps S3-S4 to obtain a trained reinforcement learning model, and using the trained model to plan pickup-and-delivery vehicle routes.
Further, each customer order comprises one pickup point and n delivery points, n ∈ [2,4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be mixed in one load, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is unlimited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and return to the depot after all orders have been delivered;
For the same customer order, the vehicle must collect all goods from the pickup point before delivering them to each corresponding delivery point; pickups and deliveries of different customer orders may be interleaved;
the optimization goal of the reinforcement learning model is to minimize the sum of the travel distances F of all vehicles:
Figure BDA0003357528730000021
where | R | represents the number of nodes in the current route R,
Figure BDA0003357528730000022
represents the ith node that vehicle m passes through, | · | | non-woven phosphor2The norm of L2 is shown,
Figure BDA0003357528730000023
representing the distance vehicle M travels from the ith node to the next node and M represents all vehicles in use.
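For illustration, a minimal sketch of computing this objective over a set of constructed routes might look as follows; the route representation (a list of per-vehicle coordinate sequences that start and end at the depot) is an assumption made here, not part of the patent.

```python
import numpy as np

def total_travel_distance(routes):
    """Sum of L2 travel distances over all vehicle routes (the objective F)."""
    total = 0.0
    for route in routes:  # one coordinate sequence per vehicle in use
        pts = np.asarray(route, dtype=float)
        # distance from the i-th node to the (i+1)-th node, summed along the route
        total += np.linalg.norm(pts[1:] - pts[:-1], axis=1).sum()
    return total

# example: one vehicle leaving the depot at (0, 0), visiting two nodes, returning
print(total_travel_distance([[(0, 0), (3, 4), (3, 0), (0, 0)]]))  # 5 + 4 + 3 = 12
```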
Further, in step S1, the reinforcement learning model comprises an actor network and a critic network;
The actor network learns a policy so as to obtain the highest possible return; it generates actions and interacts with the environment. The critic network evaluates the current policy, that is, it assesses the quality of the actor network and guides the actor network's actions in the next stage. The two are implemented with different neural networks;
The actor network comprises a first encoder, a decoder, and an attention layer. The first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the data input to the encoder passes through the convolution layer in the first encoder to give a first vector embed_1. The decoder processes the coordinates of the node where the vehicle is currently located; the data input to the decoder passes through the convolution layer and the GRU layer in the decoder to give a second vector embed_2. The attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v · tanh(W · hidden));
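A minimal PyTorch sketch of such an actor is given below; the tensor layouts, feature dimensions, and class and parameter names are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Encoder + GRU decoder + attention, producing a probability over nodes."""

    def __init__(self, static_dim=2, dynamic_dim=3, hidden_dim=128):
        super().__init__()
        # first encoder: 1x1 convolution over the node axis -> embed_1
        self.node_encoder = nn.Conv1d(static_dim + dynamic_dim, hidden_dim, kernel_size=1)
        # decoder: embeds the coordinates of the vehicle's current node, then a GRU
        self.loc_encoder = nn.Conv1d(static_dim, hidden_dim, kernel_size=1)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # attention parameters v and W, zero-initialized as described in the text
        # (random initialization would be the more common practical choice)
        self.v = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.W = nn.Parameter(torch.zeros(hidden_dim, hidden_dim))

    def forward(self, node_feats, current_loc, gru_state=None):
        # node_feats: (batch, static_dim + dynamic_dim, n_nodes)
        # current_loc: (batch, static_dim, 1), coordinates of the vehicle's node
        embed_1 = self.node_encoder(node_feats)              # (B, H, N)
        loc = self.loc_encoder(current_loc).transpose(1, 2)  # (B, 1, H)
        embed_2, gru_state = self.gru(loc, gru_state)        # (B, 1, H)
        hidden = embed_1 + embed_2.transpose(1, 2)           # broadcast over nodes
        scores = (self.v * torch.tanh(
            torch.einsum('hk,bkn->bhn', self.W, hidden)).transpose(1, 2)).sum(-1)
        p = torch.softmax(scores, dim=-1)                    # (B, N) probability matrix
        return p, gru_state
```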
The critic network comprises a second encoder and a fully connected layer. The second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the input passes through the convolution layer in the second encoder to give a third vector embed_3. The fully connected layer takes embed_3 as input; it comprises several convolution layers, and a ReLU activation function is used to remove negative values from the output of each convolution layer.
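Correspondingly, a minimal sketch of the critic could look like this (layer sizes are assumptions); it maps the encoded nodes to a single scalar estimate per instance.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Second encoder + stack of 1x1 convolutions with ReLU -> scalar estimate."""

    def __init__(self, static_dim=2, dynamic_dim=3, hidden_dim=128):
        super().__init__()
        self.node_encoder = nn.Conv1d(static_dim + dynamic_dim, hidden_dim, kernel_size=1)
        # the "fully connected layer" realized as 1x1 convolutions with ReLU in between
        self.head = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden_dim, 20, kernel_size=1), nn.ReLU(),
            nn.Conv1d(20, 1, kernel_size=1),
        )

    def forward(self, node_feats):
        embed_3 = self.node_encoder(node_feats)            # (B, H, N)
        # aggregate over nodes to a single baseline value per instance
        return self.head(embed_3).sum(dim=2).squeeze(-1)   # (B,)
```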
Further, the basic elements of the reinforcement learning model, namely the agent, the state, and the reward value, are defined as follows:
Agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed the policy is updated according to the feedback obtained;
State: the state is divided into a static state and a dynamic state; the static state comprises attributes that do not change over time, namely the coordinates of each node; the dynamic state comprises attributes that change during training, namely the current load and position of each vehicle and the demand of each node;
Reward value: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization objective is to minimize the travel distance F, so -F is used as the reward value.
Further, in step S2, the initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value. Because the training process of the reinforcement learning model needs very large-scale data to achieve good accuracy, data are generated randomly from an existing pickup-and-delivery instance data set, as follows:
For each instance in the data set, all nodes are divided into pickup points and delivery points, the coordinates of each node are randomly perturbed within a certain range to generate a new instance, several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
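A sketch of such an instance generator is given below; the perturbation range, demand range, and field names are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def generate_instance(base_coords, pickup_ids, rng, jitter=0.05, max_deliv=4):
    """Build a new pickup-and-delivery instance from an existing one."""
    coords = np.asarray(base_coords, dtype=float)
    # randomly perturb every node coordinate within a small range
    coords = coords + rng.uniform(-jitter, jitter, size=coords.shape)

    delivery_ids = [i for i in range(len(coords)) if i not in set(pickup_ids)]
    rng.shuffle(delivery_ids)

    orders = []
    for p in pickup_ids:
        # assign between 2 and max_deliv delivery points to this pickup point
        k = int(rng.integers(2, max_deliv + 1))
        assigned, delivery_ids = delivery_ids[:k], delivery_ids[k:]
        demand = float(rng.integers(1, 10))        # pickup demand (assumed range)
        orders.append({
            'pickup': int(p),
            'deliveries': assigned,
            'pickup_demand': demand,
            # split the pickup demand equally among its delivery points
            'delivery_demand': demand / max(len(assigned), 1),
        })
    return coords, orders

rng = np.random.default_rng(0)
coords, orders = generate_instance([(0.1, 0.2), (0.5, 0.9), (0.3, 0.3),
                                    (0.8, 0.1), (0.6, 0.6)], pickup_ids=[0], rng=rng)
```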
Further, in step S3, during training, after each selection step the static state of the input remains unchanged but the dynamic state changes; the change of the dynamic state is applied with a state update function. Meanwhile, because of the various constraints, a vehicle cannot visit every node during training. To speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated: an entry is 1 if the corresponding node is reachable and 0 otherwise, so that nodes the vehicle cannot visit next are masked. A position update function updates the accessible-point matrix immediately after each selection step;
The data set generated in step S2 is input into the actor network, the vehicle's next destination is randomly selected according to the probability matrix p output by the actor network combined with the accessible-point matrix, and the probability corresponding to the selected location is recorded (a minimal sketch of this masked sampling step is given below);
After each training round is completed, that is, once the complete routes of all vehicles have been constructed, the sum F of the travel distances of all vehicles is calculated, and the reward value is then computed.
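The masked sampling step could be sketched as follows; tensor names and shapes are assumptions made for illustration.

```python
import torch

def select_next_node(prob_matrix, access_mask):
    """Sample the next destination from the actor's probabilities, restricted
    to reachable nodes, and record the log-probability of the choice.

    prob_matrix: (batch, n_nodes) probability matrix p output by the actor
    access_mask: (batch, n_nodes) accessible-point matrix, 1 = reachable, 0 = masked
    """
    masked = prob_matrix * access_mask
    masked = masked / masked.sum(dim=1, keepdim=True)        # renormalize
    choice = torch.multinomial(masked, num_samples=1)         # random selection
    log_prob = torch.log(masked.gather(1, choice) + 1e-10)    # probability of the chosen node
    return choice.squeeze(1), log_prob.squeeze(1)
```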
Further, the state update function is defined by the following rules (an illustrative sketch of both update functions follows the masking rules below):
Assuming the node selected by vehicle m as its next destination is i, the dynamic state changes as follows after the selection is completed:
(1.1) the current position of vehicle m is updated to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of node i is updated to 0;
The position update function is defined by the following rules:
Assuming the current position of vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of node j is 0, indicating that node j has already been served;
(2.2) node j is a pickup point and its demand exceeds the remaining load capacity of vehicle m;
(2.3) node j is a delivery point, but vehicle m has not yet collected the goods from the pickup point corresponding to node j.
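The two update functions could be sketched as below; the state representation (a dict with per-node fields demand, is_pickup, order_of and per-vehicle fields load, position, picked_up) is an assumption introduced here for illustration only.

```python
def update_state(state, m, i):
    """State-update rules (1.1)-(1.4), applied after vehicle m selects node i."""
    q = state['demand'][i]
    state['position'][m] = i                       # (1.1) move the vehicle to node i
    if state['is_pickup'][i]:
        state['load'][m] += q                      # (1.2) picking up increases the load
        state['picked_up'][m].add(state['order_of'][i])
    else:
        state['load'][m] -= q                      # (1.3) delivering decreases the load
    state['demand'][i] = 0                         # (1.4) node i has been served

def update_mask(state, m, capacity):
    """Masking rules (2.1)-(2.3): entry 1 means node j may be visited next."""
    n_nodes = len(state['demand'])
    mask = [1] * n_nodes
    for j in range(n_nodes):
        if state['demand'][j] == 0:                                   # (2.1) already served
            mask[j] = 0
        elif state['is_pickup'][j] and \
                state['demand'][j] > capacity - state['load'][m]:     # (2.2) exceeds remaining load
            mask[j] = 0
        elif not state['is_pickup'][j] and \
                state['order_of'][j] not in state['picked_up'][m]:    # (2.3) goods not yet picked up
            mask[j] = 0
    return mask
```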
Further, in step S4, the reinforcement learning model is trained according to the principle of the policy gradient algorithm, as follows:
The reward value of each round of training results produced by the actor network is obtained, the estimate output by the critic network is obtained, the reward value is subtracted from the estimate to give the loss value of the critic network, and the loss value of the critic network is multiplied by the probability matrix output by the actor network to give the loss value of the actor network;
The actor network and the critic network are updated through the optimizer; the update formula is:

\nabla_\theta J(\theta) = \frac{1}{S} \sum_{k=1}^{S} \big( R(\pi_k) - b(G_k) \big) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy (route) selected in the k-th training instance, G_k denotes the node graph of the k-th training instance, R(π_k) denotes the reward obtained by π_k, b denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameter θ, outputs π_k for node graph G_k.
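Under those definitions, one update step could be sketched as follows. The use of summed log-probabilities for the actor loss and a squared-advantage loss for the critic is a common A2C realization and an assumption here, not a literal transcription of the patent.

```python
import torch

def a2c_update(reward, baseline, sum_log_prob, actor_opt, critic_opt):
    """One policy-gradient update of the actor and critic networks.

    reward:       (batch,) reward of each sampled solution, here -F
    baseline:     (batch,) critic estimate b(G_k), with gradient attached
    sum_log_prob: (batch,) sum of log p_theta over the decisions of each solution
    """
    advantage = reward - baseline
    actor_loss = -(advantage.detach() * sum_log_prob).mean()   # policy gradient term
    critic_loss = advantage.pow(2).mean()                       # critic regresses toward the reward

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```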
Further, in step S5, the maximum number of training rounds epoch is set. If it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan pickup-and-delivery vehicle routes.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention considers the one-pickup-multiple-delivery scenario of the pickup-and-delivery problem; because different goods cannot be mixed in one load, the pickup point and all of its corresponding delivery points are served by the same vehicle.
2. The invention adopts a reinforcement learning method and builds a model for the pickup-and-delivery problem based on the A2C framework; using the trained model, a solution can be obtained faster than with the prior art.
Drawings
Fig. 1 is a flowchart of the reinforcement learning-based pickup-and-delivery vehicle route planning method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a vehicle distribution route according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a reinforcement learning model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution and advantages of the present invention more clearly understood, the following detailed description is made with reference to the accompanying drawings and examples, but the present invention is not limited thereto.
Embodiment:
a method for planning a route of a pickup truck based on reinforcement learning, as shown in fig. 1, includes the following steps:
S1: constructing a reinforcement learning model based on the A2C framework and its optimization objective;
Each customer order comprises one pickup point and n delivery points, n ∈ [2,4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be mixed in one load, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is unlimited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and must return to the depot after all orders have been delivered;
For the same customer order, the vehicle must collect all goods from the pickup point before delivering them to each corresponding delivery point; pickups and deliveries of different customer orders may be interleaved;
The optimization objective of the reinforcement learning model is to minimize the sum F of the travel distances of all vehicles:

F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \left\| x_i^m - x_{i+1}^m \right\|_2

where |R| denotes the number of nodes in the current route R, x_i^m denotes the i-th node visited by vehicle m, \|\cdot\|_2 denotes the L2 norm, so \|x_i^m - x_{i+1}^m\|_2 is the distance vehicle m travels from the i-th node to the next node, and M denotes the set of vehicles in use.
As shown in Fig. 3, the reinforcement learning model comprises an actor network and a critic network;
The actor network learns a policy so as to obtain the highest possible return; it generates actions and interacts with the environment. The critic network evaluates the current policy, that is, it assesses the quality of the actor network and guides the actor network's actions in the next stage. The two are implemented with different neural networks;
The actor network comprises a first encoder, a decoder, and an attention layer. The first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the data input to the encoder passes through the convolution layer in the first encoder to give a first vector embed_1. The decoder processes the coordinates of the node where the vehicle is currently located; the data input to the decoder passes through the convolution layer and the GRU layer in the decoder to give a second vector embed_2. The attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v · tanh(W · hidden));
The critic network comprises a second encoder and a fully connected layer. The second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the input passes through the convolution layer in the second encoder to give a third vector embed_3. The fully connected layer takes embed_3 as input; it comprises several convolution layers, and a ReLU activation function is used to remove negative values from the output of each convolution layer.
The basic elements of the reinforcement learning model, namely the agent, the state, and the reward value, are defined as follows:
Agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed the policy is updated according to the feedback obtained;
State: the state is divided into a static state and a dynamic state; the static state comprises attributes that do not change over time, namely the coordinates of each node; the dynamic state comprises attributes that change during training, namely the current load and position of each vehicle and the demand of each node;
Reward value: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization objective is to minimize the travel distance F, so -F is used as the reward value.
In this embodiment, the schematic diagram of the vehicle delivery routes is shown in Fig. 2, where R1, R2, and R3 denote customer order numbers, circles denote customer points, solid circles denote pickup points, and dashed circles denote delivery points. The first number in a circle is the customer order number, the letter 'P' marks the pickup point of the order, the letter 'D' marks a delivery point of the order, and the number following 'D' is the index of the delivery point within the order. V1 and V2 are the vehicle delivery routes scheduled for the orders, and 0 denotes the depot. In this case, the goods of orders R1 and R2 can be mixed and are therefore delivered by the same vehicle over a short distance, but the goods of order R3 cannot be mixed with those of the first two orders and must therefore be delivered by the other vehicle. It can be seen that the delivery routes satisfy the constraints.
S2: initializing all parameter values of the model and randomly generating a data set;
The initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value. Because the training process of the reinforcement learning model needs very large-scale data to achieve good accuracy, in this embodiment data are generated randomly on the basis of the Li & Lim data set, as follows:
For each instance in the data set, all nodes are divided into pickup points and delivery points, the coordinates of each node are randomly perturbed within a certain range to generate a new instance, several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
S3: constructing the training process of the reinforcement learning model and calculating the reward value of each round of training;
During training, after each selection step the static state of the input remains unchanged but the dynamic state changes; the change of the dynamic state is applied with a state update function. Meanwhile, because of the various constraints, a vehicle cannot visit every node during training. To speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated: an entry is 1 if the corresponding node is reachable and 0 otherwise, so that nodes the vehicle cannot visit next are masked. A position update function updates the accessible-point matrix immediately after each selection step;
The data set generated in step S2 is input into the actor network, the vehicle's next destination is randomly selected according to the probability matrix output by the actor network combined with the accessible-point matrix, and the probability corresponding to the selected location is recorded;
After each training round is completed, that is, once the complete routes of all vehicles have been constructed, the sum F of the travel distances of all vehicles is calculated, and the reward value is then computed.
The state update function is defined by the following rules:
Assuming the node selected by vehicle m as its next destination is i, the dynamic state changes as follows after the selection is completed:
(1.1) the current position of vehicle m is updated to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of node i is updated to 0;
The position update function is defined by the following rules:
Assuming the current position of vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of node j is 0, indicating that node j has already been served;
(2.2) node j is a pickup point and its demand exceeds the remaining load capacity of vehicle m;
(2.3) node j is a delivery point, but vehicle m has not yet collected the goods from the pickup point corresponding to node j.
S4: optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value;
The reinforcement learning model is trained according to the principle of the policy gradient algorithm, as follows:
The reward value of each round of training results produced by the actor network is obtained, the estimate output by the critic network is obtained, the reward value is subtracted from the estimate to give the loss value of the critic network, and the loss value of the critic network is multiplied by the probability matrix output by the actor network to give the loss value of the actor network;
The actor network and the critic network are updated through the optimizer; the update formula is:

\nabla_\theta J(\theta) = \frac{1}{S} \sum_{k=1}^{S} \big( R(\pi_k) - b(G_k) \big) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy (route) selected in the k-th training instance, G_k denotes the node graph of the k-th training instance, R(π_k) denotes the reward obtained by π_k, b denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameter θ, outputs π_k for node graph G_k.
S5: setting the maximum number of training rounds epoch; if it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan delivery vehicle routes.
The above embodiments are merely illustrative examples for clearly describing the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A reinforcement learning-based pickup-and-delivery vehicle route planning method, characterized by comprising the following steps:
S1: constructing a reinforcement learning model based on the A2C framework and its optimization objective;
S2: initializing all parameter values of the reinforcement learning model and randomly generating a data set;
S3: constructing the training process of the reinforcement learning model, inputting the data set generated in step S2 into the model, and calculating the reward value of each round of training;
S4: optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value;
S5: setting the maximum number of training rounds, repeating steps S3-S4 to obtain a trained reinforcement learning model, and using the trained model to plan pickup-and-delivery vehicle routes.
2. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein each customer order comprises one pickup point and n delivery points, n ∈ [2,4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be mixed in one load, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is unlimited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and return to the depot after all orders have been delivered;
For the same customer order, the vehicle must collect all goods from the pickup point before delivering them to each corresponding delivery point; pickups and deliveries of different customer orders may be interleaved;
The optimization objective of the reinforcement learning model is to minimize the sum F of the travel distances of all vehicles:

F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \left\| x_i^m - x_{i+1}^m \right\|_2

where |R| denotes the number of nodes in the current route R, x_i^m denotes the i-th node visited by vehicle m, \|\cdot\|_2 denotes the L2 norm, so \|x_i^m - x_{i+1}^m\|_2 is the distance vehicle m travels from the i-th node to the next node, and M denotes the set of vehicles in use.
3. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S1 the reinforcement learning model comprises an actor network and a critic network;
The actor network comprises a first encoder, a decoder, and an attention layer. The first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the data input to the encoder passes through the convolution layer in the first encoder to give a first vector embed_1. The decoder processes the coordinates of the node where the vehicle is currently located; the data input to the decoder passes through the convolution layer and the GRU layer in the decoder to give a second vector embed_2. The attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v · tanh(W · hidden));
The critic network comprises a second encoder and a fully connected layer. The second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the input passes through the convolution layer in the second encoder to give a third vector embed_3. The fully connected layer takes embed_3 as input; it comprises several convolution layers, and a ReLU activation function is used to remove negative values from the output of each convolution layer.
4. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 3, wherein the basic elements of the reinforcement learning model, namely the agent, the state, and the reward value, are defined as follows:
Agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed the policy is updated according to the feedback obtained;
State: the state is divided into a static state and a dynamic state; the static state comprises attributes that do not change over time, namely the coordinates of each node; the dynamic state comprises attributes that change during training, namely the current load and position of each vehicle and the demand of each node;
Reward value: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization objective is to minimize the travel distance F, so -F is used as the reward value.
5. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S2 the initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value; data are generated randomly from an existing pickup-and-delivery instance data set, as follows:
For each instance in the data set, all nodes are divided into pickup points and delivery points, the coordinates of each node are randomly perturbed within a certain range to generate a new instance, several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
6. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S3, during training, after each selection step the static state of the input remains unchanged but the dynamic state changes; the change of the dynamic state is applied with a state update function; meanwhile, because of the various constraints, a vehicle cannot visit every node during training; to speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated, in which an entry is 1 if the corresponding node is reachable and 0 otherwise, so that nodes the vehicle cannot visit next are masked; a position update function updates the accessible-point matrix immediately after each selection step;
The data set generated in step S2 is input into the actor network, the vehicle's next destination is randomly selected according to the probability matrix p output by the actor network combined with the accessible-point matrix, and the probability corresponding to the selected location is recorded;
After each training round is completed, that is, once the complete routes of all vehicles have been constructed, the sum F of the travel distances of all vehicles is calculated, and the reward value is then computed.
7. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 6, wherein the state update function is defined by the following rules:
Assuming the node selected by vehicle m as its next destination is i, the dynamic state changes as follows after the selection is completed:
(1.1) the current position of vehicle m is updated to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of node i is updated to 0.
8. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 6, wherein the position update function is defined by the following rules:
Assuming the current position of vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of node j is 0, indicating that node j has already been served;
(2.2) node j is a pickup point and its demand exceeds the remaining load capacity of vehicle m;
(2.3) node j is a delivery point, but vehicle m has not yet collected the goods from the pickup point corresponding to node j.
9. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S4 the reinforcement learning model is trained according to the policy gradient algorithm, as follows:
The reward value of each round of training results produced by the actor network is obtained, the estimate output by the critic network is obtained, the reward value is subtracted from the estimate to give the loss value of the critic network, and the loss value of the critic network is multiplied by the probability matrix output by the actor network to give the loss value of the actor network;
The actor network and the critic network are updated through the optimizer; the update formula is:

\nabla_\theta J(\theta) = \frac{1}{S} \sum_{k=1}^{S} \big( R(\pi_k) - b(G_k) \big) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy (route) selected in the k-th training instance, G_k denotes the node graph of the k-th training instance, R(π_k) denotes the reward obtained by π_k, b denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameter θ, outputs π_k for node graph G_k.
10. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to any one of claims 1 to 9, wherein in step S5 the maximum number of training rounds epoch is set; if it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan delivery vehicle routes.
CN202111355807.3A 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning Active CN114237222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355807.3A CN114237222B (en) 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355807.3A CN114237222B (en) 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN114237222A true CN114237222A (en) 2022-03-25
CN114237222B CN114237222B (en) 2024-06-21

Family

ID=80749548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355807.3A Active CN114237222B (en) 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114237222B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063066A (en) * 2022-05-26 2022-09-16 电子科技大学 Part supply circulation packing box distribution scheduling method based on graph convolution
CN115545350A (en) * 2022-11-28 2022-12-30 湖南工商大学 Comprehensive deep neural network and reinforcement learning vehicle path problem solving method
CN116562738A (en) * 2023-07-10 2023-08-08 深圳市汉德网络科技有限公司 Intelligent freight dispatching method, device, equipment and storage medium
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110014428A (en) * 2019-04-23 2019-07-16 北京理工大学 A kind of sequential logic mission planning method based on intensified learning
US20200073399A1 (en) * 2018-08-30 2020-03-05 Canon Kabushiki Kaisha Information processing apparatus, information processing method, information processing system, and storage medium
CN110956311A (en) * 2019-11-15 2020-04-03 浙江工业大学 Vehicle path optimization method based on super heuristic algorithm of reinforcement learning
CN111415048A (en) * 2020-04-10 2020-07-14 大连海事大学 Vehicle path planning method based on reinforcement learning
US20200273346A1 (en) * 2019-02-26 2020-08-27 Didi Research America, Llc Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching
CN111695700A (en) * 2020-06-16 2020-09-22 华东师范大学 Boxing method based on deep reinforcement learning
CN112784481A (en) * 2021-01-15 2021-05-11 中国人民解放军国防科技大学 Deep reinforcement learning method and system for relay charging path planning
CN112965499A (en) * 2021-03-08 2021-06-15 哈尔滨工业大学(深圳) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200073399A1 (en) * 2018-08-30 2020-03-05 Canon Kabushiki Kaisha Information processing apparatus, information processing method, information processing system, and storage medium
US20200273346A1 (en) * 2019-02-26 2020-08-27 Didi Research America, Llc Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching
CN110014428A (en) * 2019-04-23 2019-07-16 北京理工大学 A kind of sequential logic mission planning method based on intensified learning
CN110956311A (en) * 2019-11-15 2020-04-03 浙江工业大学 Vehicle path optimization method based on super heuristic algorithm of reinforcement learning
CN111415048A (en) * 2020-04-10 2020-07-14 大连海事大学 Vehicle path planning method based on reinforcement learning
CN111695700A (en) * 2020-06-16 2020-09-22 华东师范大学 Boxing method based on deep reinforcement learning
CN112784481A (en) * 2021-01-15 2021-05-11 中国人民解放军国防科技大学 Deep reinforcement learning method and system for relay charging path planning
CN112965499A (en) * 2021-03-08 2021-06-15 哈尔滨工业大学(深圳) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
汪华健: "Control Design and Research of a Flapping-Wing Flying Robot Based on Reinforcement Learning", China Masters' Theses Full-text Database (electronic journal), no. 01, pp. 1-74 *
王丙琛 et al.: "Research on Control Algorithms for Autonomous Vehicles Based on Deep Reinforcement Learning", Journal of Zhengzhou University (Engineering Science), vol. 41, no. 4, pp. 41-45 *
马琼雄 et al.: "Optimal Trajectory Control of Underwater Robots Based on Deep Reinforcement Learning", Journal of South China Normal University (Natural Science Edition), vol. 50, no. 1, pp. 118-123 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063066A (en) * 2022-05-26 2022-09-16 电子科技大学 Part supply circulation packing box distribution scheduling method based on graph convolution
CN115545350A (en) * 2022-11-28 2022-12-30 湖南工商大学 Comprehensive deep neural network and reinforcement learning vehicle path problem solving method
CN115545350B (en) * 2022-11-28 2024-01-16 湖南工商大学 Vehicle path problem solving method integrating deep neural network and reinforcement learning
CN116562738A (en) * 2023-07-10 2023-08-08 深圳市汉德网络科技有限公司 Intelligent freight dispatching method, device, equipment and storage medium
CN116562738B (en) * 2023-07-10 2024-01-12 深圳市汉德网络科技有限公司 Intelligent freight dispatching method, device, equipment and storage medium
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding
CN117875535B (en) * 2024-03-13 2024-06-04 中南大学 Method and system for planning picking and delivering paths based on historical information embedding

Also Published As

Publication number Publication date
CN114237222B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN114237222B (en) Delivery vehicle path planning method based on reinforcement learning
WO2022262469A1 (en) Industrial park logistics scheduling method and system based on game theory
WO2022116225A1 (en) Multi-vehicle task assignment and routing optimization simulation platform and implementation method therefor
CN107578119A (en) A kind of resource allocation global optimization method of intelligent dispatching system
Wang et al. Ant colony optimization with an improved pheromone model for solving MTSP with capacity and time window constraint
CN111047087B (en) Intelligent optimization method and device for path under cooperation of unmanned aerial vehicle and vehicle
Bogyrbayeva et al. A reinforcement learning approach for rebalancing electric vehicle sharing systems
CN113359702B (en) Intelligent warehouse AGV operation optimization scheduling method based on water wave optimization-tabu search
Wang et al. Solving task scheduling problems in cloud manufacturing via attention mechanism and deep reinforcement learning
CN113837628B (en) Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN109934388A (en) One kind sorting optimization system for intelligence
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN113205220A (en) Unmanned aerial vehicle logistics distribution global planning method facing real-time order data
CN117236541A (en) Distributed logistics distribution path planning method and system based on attention pointer network
Zhang et al. Transformer-based reinforcement learning for pickup and delivery problems with late penalties
Liu et al. An adaptive large neighborhood search method for rebalancing free-floating electric vehicle sharing systems
Liu et al. Graph convolution-based deep reinforcement learning for multi-agent decision-making in interactive traffic scenarios
CN117273590B (en) Neural combination optimization method and system for solving vehicle path optimization problem
CN117666495A (en) Goods picking path planning method, system and electronic equipment
Wang et al. Towards optimization of path planning: An RRT*-ACO algorithm
Liu et al. Graph convolution-based deep reinforcement learning for multi-agent decision-making in mixed traffic environments
CN108492020B (en) Polluted vehicle scheduling method and system based on simulated annealing and branch cutting optimization
CN115841286A (en) Takeout delivery path planning method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant