CN114237222A - Method for planning route of delivery vehicle based on reinforcement learning - Google Patents

Method for planning route of delivery vehicle based on reinforcement learning

Info

Publication number
CN114237222A
CN114237222A (application CN202111355807.3A)
Authority
CN
China
Prior art keywords
reinforcement learning
node
vehicle
training
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111355807.3A
Other languages
Chinese (zh)
Other versions
CN114237222B (en)
Inventor
刘发贵
赖承启
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202111355807.3A priority Critical patent/CN114237222B/en
Publication of CN114237222A publication Critical patent/CN114237222A/en
Application granted granted Critical
Publication of CN114237222B publication Critical patent/CN114237222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a reinforcement learning-based method for planning delivery vehicle routes. The method comprises the following steps: constructing a reinforcement learning model based on the A2C framework together with its optimization objective; initializing all parameter values of the reinforcement learning model and randomly generating a data set; constructing the training process of the reinforcement learning model, inputting the generated data set into the model, and calculating the reward value of each round of training; optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value; and setting the maximum number of training rounds, training repeatedly to obtain a trained reinforcement learning model, and using the trained model to plan delivery vehicle routes. Unlike traditional exact algorithms and heuristic algorithms, the method can quickly solve large-scale path planning problems.

Description

Method for planning route of delivery vehicle based on reinforcement learning
Technical Field
The invention belongs to the field of logistics scheduling, and particularly relates to a method for planning a route of a delivery vehicle based on reinforcement learning.
Background
In recent years, with the popularization and development of the mobile internet, the scale of e-commerce has continued to expand, the related logistics industry has grown rapidly, and the output value of the logistics industry has kept increasing at a high rate. According to data from the National Bureau of Statistics, the total business volume of the industry exceeded 2 trillion yuan in 2020, a year-on-year increase of 29.7%, and the express delivery volume exceeded 80 billion parcels, a year-on-year increase of 31.2%. With continuing urbanization, urban distribution has become an important link in the whole logistics industry. In today's society, where smart cities are being built with high-tech means, improving urban distribution efficiency by combining the latest software and hardware technologies has become a new challenge.
The pickup-and-delivery problem is a classical NP-hard problem in the field of combinatorial optimization. Researchers have carried out a great deal of work on both its theory and its practical applications, and many exact algorithms and heuristic algorithms have been proposed. The exact algorithms include branch-and-price, column generation, and so on; their advantage is that an optimal solution of the problem can be obtained, but as the problem scale grows the solution time increases exponentially, and satisfactory results cannot be obtained within an acceptable time.
Thus, more research has turned to heuristic algorithms. The solving process of a heuristic algorithm generally first generates an initial solution, formulates an iteration strategy by imitating certain phenomena in nature, and obtains a final solution after a certain number of iterations. Researchers have proposed many heuristic algorithms, such as genetic algorithms, artificial immune algorithms, and tabu search. Although the results obtained by heuristic algorithms are not necessarily optimal, they can produce relatively good results in a reasonable time and are currently the most widely applied. However, they still do not perform well on large-scale problems that must be solved promptly.
To address this problem, some researchers have begun to introduce new approaches. In recent years, reinforcement learning has achieved remarkable results in related fields. Solving path planning problems such as the pickup-and-delivery problem involves sequential decision-making, and reinforcement learning is well suited to making sequential decisions. Researchers have therefore started to apply reinforcement learning to path planning problems such as the TSP and the VRP, greatly improving solution time while keeping the solution quality competitive (M. Nazari, A. Oroojlooy, M. Takáč, and L. V. Snyder, "Reinforcement learning for solving the vehicle routing problem," Adv. Neural Inf. Process. Syst., vol. 2018-Decem, pp. 9839-9849, 2018). Compared with traditional exact and heuristic algorithms, this approach has the advantage of solution speed: even complex large-scale problems can be solved quickly and reasonably well. However, it also has disadvantages, mainly the following three: model training takes a very long time; the training process requires a large amount of data; and the solution quality still differs somewhat from that of traditional algorithms.
Disclosure of Invention
The invention aims to provide, rapidly and intelligently, a feasible solution to the pickup-and-delivery problem in urban distribution, thereby improving overall efficiency.
The purpose of the invention is realized by at least one of the following technical solutions.
A reinforcement learning-based method for planning pickup-and-delivery vehicle routes comprises the following steps:
S1: constructing a reinforcement learning model based on the A2C framework and its optimization objective;
S2: initializing all parameter values of the reinforcement learning model and randomly generating a data set;
S3: constructing the training process of the reinforcement learning model, inputting the data set generated in step S2 into the model, and calculating the reward value of each round of training;
S4: optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value;
S5: setting the maximum number of training rounds, repeating steps S3-S4 to obtain a trained reinforcement learning model, and using the trained model to plan pickup-and-delivery vehicle routes.
Further, each customer order comprises one pickup point and n delivery points, n ∈ [2,4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be mixed in one load, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is unlimited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and return to the depot after all orders have been delivered;
For the same customer order, the vehicle must collect all goods from the pickup point before delivering them to each corresponding delivery point; pickups and deliveries of different customer orders may be interleaved;
the optimization goal of the reinforcement learning model is to minimize the sum of the travel distances F of all vehicles:
Figure BDA0003357528730000021
where | R | represents the number of nodes in the current route R,
Figure BDA0003357528730000022
represents the ith node that vehicle m passes through, | · | | non-woven phosphor2The norm of L2 is shown,
Figure BDA0003357528730000023
representing the distance vehicle M travels from the ith node to the next node and M represents all vehicles in use.
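For illustration, a minimal sketch of computing this objective over a set of constructed routes might look as follows; the route representation (a list of per-vehicle coordinate sequences that start and end at the depot) is an assumption made here, not part of the patent.

```python
import numpy as np

def total_travel_distance(routes):
    """Sum of L2 travel distances over all vehicle routes (the objective F)."""
    total = 0.0
    for route in routes:  # one coordinate sequence per vehicle in use
        pts = np.asarray(route, dtype=float)
        # distance from the i-th node to the (i+1)-th node, summed along the route
        total += np.linalg.norm(pts[1:] - pts[:-1], axis=1).sum()
    return total

# example: one vehicle leaving the depot at (0, 0), visiting two nodes, returning
print(total_travel_distance([[(0, 0), (3, 4), (3, 0), (0, 0)]]))  # 5 + 4 + 3 = 12
```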
Further, in step S1, the reinforcement learning model comprises an actor network and a critic network;
The actor network learns a policy so as to obtain the highest possible return; it generates actions and interacts with the environment. The critic network evaluates the current policy, that is, it assesses the quality of the actor network and guides the actor network's actions in the next stage. The two are implemented with different neural networks;
The actor network comprises a first encoder, a decoder, and an attention layer. The first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the data input to the encoder passes through the convolution layer in the first encoder to give a first vector embed_1. The decoder processes the coordinates of the node where the vehicle is currently located; the data input to the decoder passes through the convolution layer and the GRU layer in the decoder to give a second vector embed_2. The attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v · tanh(W · hidden));
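A minimal PyTorch sketch of such an actor is given below; the tensor layouts, feature dimensions, and class and parameter names are illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Encoder + GRU decoder + attention, producing a probability over nodes."""

    def __init__(self, static_dim=2, dynamic_dim=3, hidden_dim=128):
        super().__init__()
        # first encoder: 1x1 convolution over the node axis -> embed_1
        self.node_encoder = nn.Conv1d(static_dim + dynamic_dim, hidden_dim, kernel_size=1)
        # decoder: embeds the coordinates of the vehicle's current node, then a GRU
        self.loc_encoder = nn.Conv1d(static_dim, hidden_dim, kernel_size=1)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # attention parameters v and W, zero-initialized as described in the text
        # (random initialization would be the more common practical choice)
        self.v = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.W = nn.Parameter(torch.zeros(hidden_dim, hidden_dim))

    def forward(self, node_feats, current_loc, gru_state=None):
        # node_feats: (batch, static_dim + dynamic_dim, n_nodes)
        # current_loc: (batch, static_dim, 1), coordinates of the vehicle's node
        embed_1 = self.node_encoder(node_feats)              # (B, H, N)
        loc = self.loc_encoder(current_loc).transpose(1, 2)  # (B, 1, H)
        embed_2, gru_state = self.gru(loc, gru_state)        # (B, 1, H)
        hidden = embed_1 + embed_2.transpose(1, 2)           # broadcast over nodes
        scores = (self.v * torch.tanh(
            torch.einsum('hk,bkn->bhn', self.W, hidden)).transpose(1, 2)).sum(-1)
        p = torch.softmax(scores, dim=-1)                    # (B, N) probability matrix
        return p, gru_state
```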
The critic network comprises a second encoder and a fully connected layer. The second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the input passes through the convolution layer in the second encoder to give a third vector embed_3. The fully connected layer takes embed_3 as input; it comprises several convolution layers, and a ReLU activation function is used to remove negative values from the output of each convolution layer.
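Correspondingly, a minimal sketch of the critic could look like this (layer sizes are assumptions); it maps the encoded nodes to a single scalar estimate per instance.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Second encoder + stack of 1x1 convolutions with ReLU -> scalar estimate."""

    def __init__(self, static_dim=2, dynamic_dim=3, hidden_dim=128):
        super().__init__()
        self.node_encoder = nn.Conv1d(static_dim + dynamic_dim, hidden_dim, kernel_size=1)
        # the "fully connected layer" realized as 1x1 convolutions with ReLU in between
        self.head = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=1), nn.ReLU(),
            nn.Conv1d(hidden_dim, 20, kernel_size=1), nn.ReLU(),
            nn.Conv1d(20, 1, kernel_size=1),
        )

    def forward(self, node_feats):
        embed_3 = self.node_encoder(node_feats)            # (B, H, N)
        # aggregate over nodes to a single baseline value per instance
        return self.head(embed_3).sum(dim=2).squeeze(-1)   # (B,)
```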
Further, the basic elements of the reinforcement learning model, namely the agent, the state, and the reward value, are defined as follows:
Agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed the policy is updated according to the feedback obtained;
State: the state is divided into a static state and a dynamic state; the static state comprises attributes that do not change over time, namely the coordinates of each node; the dynamic state comprises attributes that change during training, namely the current load and position of each vehicle and the demand of each node;
Reward value: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization objective is to minimize the travel distance F, so -F is used as the reward value.
Further, in step S2, the initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value. Because the training process of the reinforcement learning model needs very large-scale data to achieve good accuracy, data are generated randomly from an existing pickup-and-delivery instance data set, as follows:
For each instance in the data set, all nodes are divided into pickup points and delivery points, the coordinates of each node are randomly perturbed within a certain range to generate a new instance, several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
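A sketch of such an instance generator is given below; the perturbation range, demand range, and field names are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def generate_instance(base_coords, pickup_ids, rng, jitter=0.05, max_deliv=4):
    """Build a new pickup-and-delivery instance from an existing one."""
    coords = np.asarray(base_coords, dtype=float)
    # randomly perturb every node coordinate within a small range
    coords = coords + rng.uniform(-jitter, jitter, size=coords.shape)

    delivery_ids = [i for i in range(len(coords)) if i not in set(pickup_ids)]
    rng.shuffle(delivery_ids)

    orders = []
    for p in pickup_ids:
        # assign between 2 and max_deliv delivery points to this pickup point
        k = int(rng.integers(2, max_deliv + 1))
        assigned, delivery_ids = delivery_ids[:k], delivery_ids[k:]
        demand = float(rng.integers(1, 10))        # pickup demand (assumed range)
        orders.append({
            'pickup': int(p),
            'deliveries': assigned,
            'pickup_demand': demand,
            # split the pickup demand equally among its delivery points
            'delivery_demand': demand / max(len(assigned), 1),
        })
    return coords, orders

rng = np.random.default_rng(0)
coords, orders = generate_instance([(0.1, 0.2), (0.5, 0.9), (0.3, 0.3),
                                    (0.8, 0.1), (0.6, 0.6)], pickup_ids=[0], rng=rng)
```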
Further, in step S3, during training, after each selection step the static state of the input remains unchanged but the dynamic state changes; the change of the dynamic state is applied with a state update function. Meanwhile, because of the various constraints, a vehicle cannot visit every node during training. To speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated: an entry is 1 if the corresponding node is reachable and 0 otherwise, so that nodes the vehicle cannot visit next are masked. A position update function updates the accessible-point matrix immediately after each selection step;
The data set generated in step S2 is input into the actor network, the vehicle's next destination is randomly selected according to the probability matrix p output by the actor network combined with the accessible-point matrix, and the probability corresponding to the selected location is recorded (a minimal sketch of this masked sampling step is given below);
After each training round is completed, that is, once the complete routes of all vehicles have been constructed, the sum F of the travel distances of all vehicles is calculated, and the reward value is then computed.
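The masked sampling step could be sketched as follows; tensor names and shapes are assumptions made for illustration.

```python
import torch

def select_next_node(prob_matrix, access_mask):
    """Sample the next destination from the actor's probabilities, restricted
    to reachable nodes, and record the log-probability of the choice.

    prob_matrix: (batch, n_nodes) probability matrix p output by the actor
    access_mask: (batch, n_nodes) accessible-point matrix, 1 = reachable, 0 = masked
    """
    masked = prob_matrix * access_mask
    masked = masked / masked.sum(dim=1, keepdim=True)        # renormalize
    choice = torch.multinomial(masked, num_samples=1)         # random selection
    log_prob = torch.log(masked.gather(1, choice) + 1e-10)    # probability of the chosen node
    return choice.squeeze(1), log_prob.squeeze(1)
```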
Further, the state update function is defined by the following rules (an illustrative sketch of both update functions follows the masking rules below):
Assuming the node selected by vehicle m as its next destination is i, the dynamic state changes as follows after the selection is completed:
(1.1) the current position of vehicle m is updated to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of node i is updated to 0;
The position update function is defined by the following rules:
Assuming the current position of vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of node j is 0, indicating that node j has already been served;
(2.2) node j is a pickup point and its demand exceeds the remaining load capacity of vehicle m;
(2.3) node j is a delivery point, but vehicle m has not yet collected the goods from the pickup point corresponding to node j.
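The two update functions could be sketched as below; the state representation (a dict with per-node fields demand, is_pickup, order_of and per-vehicle fields load, position, picked_up) is an assumption introduced here for illustration only.

```python
def update_state(state, m, i):
    """State-update rules (1.1)-(1.4), applied after vehicle m selects node i."""
    q = state['demand'][i]
    state['position'][m] = i                       # (1.1) move the vehicle to node i
    if state['is_pickup'][i]:
        state['load'][m] += q                      # (1.2) picking up increases the load
        state['picked_up'][m].add(state['order_of'][i])
    else:
        state['load'][m] -= q                      # (1.3) delivering decreases the load
    state['demand'][i] = 0                         # (1.4) node i has been served

def update_mask(state, m, capacity):
    """Masking rules (2.1)-(2.3): entry 1 means node j may be visited next."""
    n_nodes = len(state['demand'])
    mask = [1] * n_nodes
    for j in range(n_nodes):
        if state['demand'][j] == 0:                                   # (2.1) already served
            mask[j] = 0
        elif state['is_pickup'][j] and \
                state['demand'][j] > capacity - state['load'][m]:     # (2.2) exceeds remaining load
            mask[j] = 0
        elif not state['is_pickup'][j] and \
                state['order_of'][j] not in state['picked_up'][m]:    # (2.3) goods not yet picked up
            mask[j] = 0
    return mask
```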
Further, in step S4, the reinforcement learning model is trained according to the principle of the policy gradient algorithm, as follows:
The reward value of each round of training results produced by the actor network is obtained, the estimate output by the critic network is obtained, the reward value is subtracted from the estimate to give the loss value of the critic network, and the loss value of the critic network is multiplied by the probability matrix output by the actor network to give the loss value of the actor network;
The actor network and the critic network are updated through the optimizer; the update formula is:

\nabla_\theta J(\theta) = \frac{1}{S} \sum_{k=1}^{S} \big( R(\pi_k) - b(G_k) \big) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy (route) selected in the k-th training instance, G_k denotes the node graph of the k-th training instance, R(π_k) denotes the reward obtained by π_k, b denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameter θ, outputs π_k for node graph G_k.
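Under those definitions, one update step could be sketched as follows. The use of summed log-probabilities for the actor loss and a squared-advantage loss for the critic is a common A2C realization and an assumption here, not a literal transcription of the patent.

```python
import torch

def a2c_update(reward, baseline, sum_log_prob, actor_opt, critic_opt):
    """One policy-gradient update of the actor and critic networks.

    reward:       (batch,) reward of each sampled solution, here -F
    baseline:     (batch,) critic estimate b(G_k), with gradient attached
    sum_log_prob: (batch,) sum of log p_theta over the decisions of each solution
    """
    advantage = reward - baseline
    actor_loss = -(advantage.detach() * sum_log_prob).mean()   # policy gradient term
    critic_loss = advantage.pow(2).mean()                       # critic regresses toward the reward

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```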
Further, in step S5, the maximum number of training rounds epoch is set. If it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan pickup-and-delivery vehicle routes.
Compared with the prior art, the invention has the following advantages and technical effects:
1. The invention considers the one-pickup-multiple-delivery scenario of the pickup-and-delivery problem; because different goods cannot be mixed in one load, the pickup point and all of its corresponding delivery points are served by the same vehicle.
2. The invention adopts a reinforcement learning method and builds a model for the pickup-and-delivery problem based on the A2C framework; using the trained model, a solution can be obtained faster than with the prior art.
Drawings
Fig. 1 is a flowchart of the reinforcement learning-based pickup-and-delivery vehicle route planning method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a vehicle distribution route according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a reinforcement learning model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution and advantages of the present invention more clearly understood, the following detailed description is made with reference to the accompanying drawings and examples, but the present invention is not limited thereto.
Embodiment:
a method for planning a route of a pickup truck based on reinforcement learning, as shown in fig. 1, includes the following steps:
S1: constructing a reinforcement learning model based on the A2C framework and its optimization objective;
Each customer order comprises one pickup point and n delivery points, n ∈ [2,4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be mixed in one load, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is unlimited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and must return to the depot after all orders have been delivered;
For the same customer order, the vehicle must collect all goods from the pickup point before delivering them to each corresponding delivery point; pickups and deliveries of different customer orders may be interleaved;
The optimization objective of the reinforcement learning model is to minimize the sum F of the travel distances of all vehicles:

F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \left\| x_i^m - x_{i+1}^m \right\|_2

where |R| denotes the number of nodes in the current route R, x_i^m denotes the i-th node visited by vehicle m, \|\cdot\|_2 denotes the L2 norm, so \|x_i^m - x_{i+1}^m\|_2 is the distance vehicle m travels from the i-th node to the next node, and M denotes the set of vehicles in use.
As shown in Fig. 3, the reinforcement learning model comprises an actor network and a critic network;
The actor network learns a policy so as to obtain the highest possible return; it generates actions and interacts with the environment. The critic network evaluates the current policy, that is, it assesses the quality of the actor network and guides the actor network's actions in the next stage. The two are implemented with different neural networks;
The actor network comprises a first encoder, a decoder, and an attention layer. The first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the data input to the encoder passes through the convolution layer in the first encoder to give a first vector embed_1. The decoder processes the coordinates of the node where the vehicle is currently located; the data input to the decoder passes through the convolution layer and the GRU layer in the decoder to give a second vector embed_2. The attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v · tanh(W · hidden));
The critic network comprises a second encoder and a fully connected layer. The second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the input passes through the convolution layer in the second encoder to give a third vector embed_3. The fully connected layer takes embed_3 as input; it comprises several convolution layers, and a ReLU activation function is used to remove negative values from the output of each convolution layer.
The basic elements of the reinforcement learning model, namely the agent, the state, and the reward value, are defined as follows:
Agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed the policy is updated according to the feedback obtained;
State: the state is divided into a static state and a dynamic state; the static state comprises attributes that do not change over time, namely the coordinates of each node; the dynamic state comprises attributes that change during training, namely the current load and position of each vehicle and the demand of each node;
Reward value: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization objective is to minimize the travel distance F, so -F is used as the reward value.
In this embodiment, the schematic diagram of the vehicle delivery routes is shown in Fig. 2, where R1, R2, and R3 denote customer order numbers, circles denote customer points, solid circles denote pickup points, and dashed circles denote delivery points. The first number in a circle is the customer order number, the letter 'P' marks the pickup point of the order, the letter 'D' marks a delivery point of the order, and the number following 'D' is the index of the delivery point within the order. V1 and V2 are the vehicle delivery routes scheduled for the orders, and 0 denotes the depot. In this case, the goods of orders R1 and R2 can be mixed and are therefore delivered by the same vehicle over a short distance, but the goods of order R3 cannot be mixed with those of the first two orders and must therefore be delivered by the other vehicle. It can be seen that the delivery routes satisfy the constraints.
S2: initializing all parameter values of the model and randomly generating a data set;
The initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value. Because the training process of the reinforcement learning model needs very large-scale data to achieve good accuracy, in this embodiment data are generated randomly on the basis of the Li & Lim data set, as follows:
For each instance in the data set, all nodes are divided into pickup points and delivery points, the coordinates of each node are randomly perturbed within a certain range to generate a new instance, several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
S3: constructing the training process of the reinforcement learning model and calculating the reward value of each round of training;
During training, after each selection step the static state of the input remains unchanged but the dynamic state changes; the change of the dynamic state is applied with a state update function. Meanwhile, because of the various constraints, a vehicle cannot visit every node during training. To speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated: an entry is 1 if the corresponding node is reachable and 0 otherwise, so that nodes the vehicle cannot visit next are masked. A position update function updates the accessible-point matrix immediately after each selection step;
The data set generated in step S2 is input into the actor network, the vehicle's next destination is randomly selected according to the probability matrix output by the actor network combined with the accessible-point matrix, and the probability corresponding to the selected location is recorded;
After each training round is completed, that is, once the complete routes of all vehicles have been constructed, the sum F of the travel distances of all vehicles is calculated, and the reward value is then computed.
The state update function is defined by the following rules:
Assuming the node selected by vehicle m as its next destination is i, the dynamic state changes as follows after the selection is completed:
(1.1) the current position of vehicle m is updated to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of node i is updated to 0;
The position update function is defined by the following rules:
Assuming the current position of vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of node j is 0, indicating that node j has already been served;
(2.2) node j is a pickup point and its demand exceeds the remaining load capacity of vehicle m;
(2.3) node j is a delivery point, but vehicle m has not yet collected the goods from the pickup point corresponding to node j.
S4: optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value;
The reinforcement learning model is trained according to the principle of the policy gradient algorithm, as follows:
The reward value of each round of training results produced by the actor network is obtained, the estimate output by the critic network is obtained, the reward value is subtracted from the estimate to give the loss value of the critic network, and the loss value of the critic network is multiplied by the probability matrix output by the actor network to give the loss value of the actor network;
The actor network and the critic network are updated through the optimizer; the update formula is:

\nabla_\theta J(\theta) = \frac{1}{S} \sum_{k=1}^{S} \big( R(\pi_k) - b(G_k) \big) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy (route) selected in the k-th training instance, G_k denotes the node graph of the k-th training instance, R(π_k) denotes the reward obtained by π_k, b denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameter θ, outputs π_k for node graph G_k.
S5: setting the maximum number of training rounds epoch; if it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan delivery vehicle routes.
The above embodiments are merely illustrative examples for clearly describing the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A reinforcement learning-based pickup-and-delivery vehicle route planning method, characterized by comprising the following steps:
S1: constructing a reinforcement learning model based on the A2C framework and its optimization objective;
S2: initializing all parameter values of the reinforcement learning model and randomly generating a data set;
S3: constructing the training process of the reinforcement learning model, inputting the data set generated in step S2 into the model, and calculating the reward value of each round of training;
S4: optimizing the reinforcement learning model with a policy-gradient-based reinforcement learning method according to the loss value;
S5: setting the maximum number of training rounds, repeating steps S3-S4 to obtain a trained reinforcement learning model, and using the trained model to plan pickup-and-delivery vehicle routes.
2. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein each customer order comprises one pickup point and n delivery points, n ∈ [2,4]; the volume and weight of the goods at the pickup point equal the sum of the volumes and weights of the goods at the n delivery points; because different goods cannot be mixed in one load, the pickup point of each customer order and all of its corresponding delivery points are served by the same vehicle; the number of available vehicles is unlimited, and all vehicles have the same maximum load, maximum travel distance, and maximum volume; all vehicles start from the same depot and return to the depot after all orders have been delivered;
For the same customer order, the vehicle must collect all goods from the pickup point before delivering them to each corresponding delivery point; pickups and deliveries of different customer orders may be interleaved;
The optimization objective of the reinforcement learning model is to minimize the sum F of the travel distances of all vehicles:

F = \sum_{m \in M} \sum_{i=1}^{|R|-1} \left\| x_i^m - x_{i+1}^m \right\|_2

where |R| denotes the number of nodes in the current route R, x_i^m denotes the i-th node visited by vehicle m, \|\cdot\|_2 denotes the L2 norm, so \|x_i^m - x_{i+1}^m\|_2 is the distance vehicle m travels from the i-th node to the next node, and M denotes the set of vehicles in use.
3. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S1 the reinforcement learning model comprises an actor network and a critic network;
The actor network comprises a first encoder, a decoder, and an attention layer. The first encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the data input to the encoder passes through the convolution layer in the first encoder to give a first vector embed_1. The decoder processes the coordinates of the node where the vehicle is currently located; the data input to the decoder passes through the convolution layer and the GRU layer in the decoder to give a second vector embed_2. The attention layer maintains a first zero matrix v and a second zero matrix W, adds the first vector embed_1 and the second vector embed_2 to obtain a third vector hidden, and computes the probability matrix p = softmax(v · tanh(W · hidden));
The critic network comprises a second encoder and a fully connected layer. The second encoder processes the input coordinates of all pickup and delivery points, the current vehicle load, the pickup amount of each pickup point, and the delivery amount of each delivery point; the input passes through the convolution layer in the second encoder to give a third vector embed_3. The fully connected layer takes embed_3 as input; it comprises several convolution layers, and a ReLU activation function is used to remove negative values from the output of each convolution layer.
4. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 3, wherein the basic elements of the reinforcement learning model, namely the agent, the state, and the reward value, are defined as follows:
Agent: the vehicle is the agent; starting from the initial state, the agent selects its next action according to the policy, and after each action is completed the policy is updated according to the feedback obtained;
State: the state is divided into a static state and a dynamic state; the static state comprises attributes that do not change over time, namely the coordinates of each node; the dynamic state comprises attributes that change during training, namely the current load and position of each vehicle and the demand of each node;
Reward value: the training goal of the reinforcement learning model is to maximize the reward value, while the optimization objective is to minimize the travel distance F, so -F is used as the reward value.
5. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S2 the initialized parameter values include the optimizer learning rate e, the vector dimension d, the training batch size S, the maximum number of training rounds epoch, the number of nodes, and the dropout value; data are generated randomly from an existing pickup-and-delivery instance data set, as follows:
For each instance in the data set, all nodes are divided into pickup points and delivery points, the coordinates of each node are randomly perturbed within a certain range to generate a new instance, several delivery points are then randomly assigned to each pickup point, and the demand of each pickup point is split equally among its assigned delivery points.
6. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S3, during training, after each selection step the static state of the input remains unchanged but the dynamic state changes; the change of the dynamic state is applied with a state update function; meanwhile, because of the various constraints, a vehicle cannot visit every node during training; to speed up training of the reinforcement learning model, an accessible-point matrix covering all nodes is generated, in which an entry is 1 if the corresponding node is reachable and 0 otherwise, so that nodes the vehicle cannot visit next are masked; a position update function updates the accessible-point matrix immediately after each selection step;
The data set generated in step S2 is input into the actor network, the vehicle's next destination is randomly selected according to the probability matrix p output by the actor network combined with the accessible-point matrix, and the probability corresponding to the selected location is recorded;
After each training round is completed, that is, once the complete routes of all vehicles have been constructed, the sum F of the travel distances of all vehicles is calculated, and the reward value is then computed.
7. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 6, wherein the state update function is defined by the following rules:
Assuming the node selected by vehicle m as its next destination is i, the dynamic state changes as follows after the selection is completed:
(1.1) the current position of vehicle m is updated to i;
(1.2) if node i is a pickup point, the load of vehicle m increases by q_i;
(1.3) if node i is a delivery point, the load of vehicle m decreases by q_i;
(1.4) the demand of node i is updated to 0.
8. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 6, wherein the position update function is defined by the following rules:
Assuming the current position of vehicle m is node i, node j is masked when it satisfies one of the following conditions:
(2.1) the demand of node j is 0, indicating that node j has already been served;
(2.2) node j is a pickup point and its demand exceeds the remaining load capacity of vehicle m;
(2.3) node j is a delivery point, but vehicle m has not yet collected the goods from the pickup point corresponding to node j.
9. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to claim 1, wherein in step S4 the reinforcement learning model is trained according to the policy gradient algorithm, as follows:
The reward value of each round of training results produced by the actor network is obtained, the estimate output by the critic network is obtained, the reward value is subtracted from the estimate to give the loss value of the critic network, and the loss value of the critic network is multiplied by the probability matrix output by the actor network to give the loss value of the actor network;
The actor network and the critic network are updated through the optimizer; the update formula is:

\nabla_\theta J(\theta) = \frac{1}{S} \sum_{k=1}^{S} \big( R(\pi_k) - b(G_k) \big) \nabla_\theta \log p_\theta(\pi_k \mid G_k)

where θ denotes a parameter of the actor network or the critic network, ∇_θ denotes the gradient with respect to θ, J(θ) denotes the expected reward under θ, S denotes the size of each training batch, π_k denotes the strategy (route) selected in the k-th training instance, G_k denotes the node graph of the k-th training instance, R(π_k) denotes the reward obtained by π_k, b denotes the baseline estimated by the critic network, and p_θ(π_k | G_k) denotes the probability that the actor network, with parameter θ, outputs π_k for node graph G_k.
10. The reinforcement learning-based pickup-and-delivery vehicle route planning method according to any one of claims 1 to 9, wherein in step S5 the maximum number of training rounds epoch is set; if it has not been reached, the process returns to step S3; otherwise, the whole training process is complete, the current reinforcement learning model is saved as the trained reinforcement learning model, and the trained model is used to plan delivery vehicle routes.
CN202111355807.3A 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning Active CN114237222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355807.3A CN114237222B (en) 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355807.3A CN114237222B (en) 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN114237222A true CN114237222A (en) 2022-03-25
CN114237222B CN114237222B (en) 2024-06-21

Family

ID=80749548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355807.3A Active CN114237222B (en) 2021-11-16 2021-11-16 Delivery vehicle path planning method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114237222B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063066A (en) * 2022-05-26 2022-09-16 电子科技大学 Part supply circulation packing box distribution scheduling method based on graph convolution
CN115545350A (en) * 2022-11-28 2022-12-30 湖南工商大学 Comprehensive deep neural network and reinforcement learning vehicle path problem solving method
CN116562738A (en) * 2023-07-10 2023-08-08 深圳市汉德网络科技有限公司 Intelligent freight dispatching method, device, equipment and storage medium
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110014428A (en) * 2019-04-23 2019-07-16 北京理工大学 A kind of sequential logic mission planning method based on intensified learning
US20200073399A1 (en) * 2018-08-30 2020-03-05 Canon Kabushiki Kaisha Information processing apparatus, information processing method, information processing system, and storage medium
CN110956311A (en) * 2019-11-15 2020-04-03 浙江工业大学 Vehicle path optimization method based on super heuristic algorithm of reinforcement learning
CN111415048A (en) * 2020-04-10 2020-07-14 大连海事大学 Vehicle path planning method based on reinforcement learning
US20200273346A1 (en) * 2019-02-26 2020-08-27 Didi Research America, Llc Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching
CN111695700A (en) * 2020-06-16 2020-09-22 华东师范大学 Boxing method based on deep reinforcement learning
CN112784481A (en) * 2021-01-15 2021-05-11 中国人民解放军国防科技大学 Deep reinforcement learning method and system for relay charging path planning
CN112965499A (en) * 2021-03-08 2021-06-15 哈尔滨工业大学(深圳) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200073399A1 (en) * 2018-08-30 2020-03-05 Canon Kabushiki Kaisha Information processing apparatus, information processing method, information processing system, and storage medium
US20200273346A1 (en) * 2019-02-26 2020-08-27 Didi Research America, Llc Multi-agent reinforcement learning for order-dispatching via order-vehicle distribution matching
CN110014428A (en) * 2019-04-23 2019-07-16 北京理工大学 A kind of sequential logic mission planning method based on intensified learning
CN110956311A (en) * 2019-11-15 2020-04-03 浙江工业大学 Vehicle path optimization method based on super heuristic algorithm of reinforcement learning
CN111415048A (en) * 2020-04-10 2020-07-14 大连海事大学 Vehicle path planning method based on reinforcement learning
CN111695700A (en) * 2020-06-16 2020-09-22 华东师范大学 Boxing method based on deep reinforcement learning
CN112784481A (en) * 2021-01-15 2021-05-11 中国人民解放军国防科技大学 Deep reinforcement learning method and system for relay charging path planning
CN112965499A (en) * 2021-03-08 2021-06-15 哈尔滨工业大学(深圳) Unmanned vehicle driving decision-making method based on attention model and deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
汪华健: "Control Design and Research of a Flapping-Wing Flying Robot Based on Reinforcement Learning", China Masters' Theses Full-text Database (electronic journal), no. 01, pp. 1-74 *
王丙琛 et al.: "Research on Control Algorithms for Autonomous Vehicles Based on Deep Reinforcement Learning", Journal of Zhengzhou University (Engineering Science), vol. 41, no. 4, pp. 41-45 *
马琼雄 et al.: "Optimal Trajectory Control of Underwater Robots Based on Deep Reinforcement Learning", Journal of South China Normal University (Natural Science Edition), vol. 50, no. 1, pp. 118-123 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063066A (en) * 2022-05-26 2022-09-16 电子科技大学 Part supply circulation packing box distribution scheduling method based on graph convolution
CN115545350A (en) * 2022-11-28 2022-12-30 湖南工商大学 Comprehensive deep neural network and reinforcement learning vehicle path problem solving method
CN115545350B (en) * 2022-11-28 2024-01-16 湖南工商大学 Vehicle path problem solving method integrating deep neural network and reinforcement learning
CN116562738A (en) * 2023-07-10 2023-08-08 深圳市汉德网络科技有限公司 Intelligent freight dispatching method, device, equipment and storage medium
CN116562738B (en) * 2023-07-10 2024-01-12 深圳市汉德网络科技有限公司 Intelligent freight dispatching method, device, equipment and storage medium
CN117875535A (en) * 2024-03-13 2024-04-12 中南大学 Method and system for planning picking and delivering paths based on historical information embedding
CN117875535B (en) * 2024-03-13 2024-06-04 中南大学 Method and system for planning picking and delivering paths based on historical information embedding

Also Published As

Publication number Publication date
CN114237222B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN114237222B (en) Delivery vehicle path planning method based on reinforcement learning
WO2022262469A1 (en) Industrial park logistics scheduling method and system based on game theory
WO2022116225A1 (en) Multi-vehicle task assignment and routing optimization simulation platform and implementation method therefor
CN107578119A (en) A kind of resource allocation global optimization method of intelligent dispatching system
Wang et al. Ant colony optimization with an improved pheromone model for solving MTSP with capacity and time window constraint
CN111047087B (en) Intelligent optimization method and device for path under cooperation of unmanned aerial vehicle and vehicle
Bogyrbayeva et al. A reinforcement learning approach for rebalancing electric vehicle sharing systems
CN113359702B (en) Intelligent warehouse AGV operation optimization scheduling method based on water wave optimization-tabu search
Wang et al. Solving task scheduling problems in cloud manufacturing via attention mechanism and deep reinforcement learning
CN113837628B (en) Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN109934388A (en) One kind sorting optimization system for intelligence
CN115454005A (en) Manufacturing workshop dynamic intelligent scheduling method and device oriented to limited transportation resource scene
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN113205220A (en) Unmanned aerial vehicle logistics distribution global planning method facing real-time order data
CN117236541A (en) Distributed logistics distribution path planning method and system based on attention pointer network
Zhang et al. Transformer-based reinforcement learning for pickup and delivery problems with late penalties
Liu et al. An adaptive large neighborhood search method for rebalancing free-floating electric vehicle sharing systems
Liu et al. Graph convolution-based deep reinforcement learning for multi-agent decision-making in interactive traffic scenarios
CN117273590B (en) Neural combination optimization method and system for solving vehicle path optimization problem
CN117666495A (en) Goods picking path planning method, system and electronic equipment
Wang et al. Towards optimization of path planning: An RRT*-ACO algorithm
Liu et al. Graph convolution-based deep reinforcement learning for multi-agent decision-making in mixed traffic environments
CN108492020B (en) Polluted vehicle scheduling method and system based on simulated annealing and branch cutting optimization
CN115841286A (en) Takeout delivery path planning method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant