CN116739466A - Distribution center vehicle path planning method based on multi-agent deep reinforcement learning - Google Patents

Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Info

Publication number
CN116739466A
Authority
CN
China
Prior art keywords
vehicle
network
reinforcement learning
distribution center
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310565816.8A
Other languages
Chinese (zh)
Inventor
朱光宇
黄世哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310565816.8A priority Critical patent/CN116739466A/en
Publication of CN116739466A publication Critical patent/CN116739466A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q 10/083 - Shipping
    • G06Q 10/0835 - Relationships between shipper or supplier and carriers
    • G06Q 10/08355 - Routing methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a distribution center vehicle path planning method based on multi-agent deep reinforcement learning, which comprises the following steps: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning; constructing an encoder network model; building a decoder network model embedded with a soft-hard two-stage attention mechanism; building a critic network model under the MAC-AC algorithm framework; introducing a mask mechanism to accelerate training; and training the neural network models with the MAC-AC algorithm to obtain the vehicle path planning result. The application solves the multi-distribution-center vehicle path planning problem end to end from a global perspective and improves solution quality.

Description

Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
Technical Field
The application relates to the technical field of intelligent transportation, in particular to a distribution center vehicle path planning method based on multi-agent deep reinforcement learning.
Background
The vehicle path planning problem is to design optimal routes for a fleet of vehicles that serves a set of customers under given constraints. Large logistics companies often operate multiple warehouses within a distribution network, and the same goods can be shipped to a customer from any of them. When vehicle carrying capacity is also taken into account, this variant is known as the multi-distribution-center vehicle path planning problem.
As a classical NP-hard problem, multi-distribution-center vehicle path planning has long received attention. Over the past decades many researchers have solved it with heuristic methods. Most heuristics adopt a "group first, plan later" strategy: customers are assigned to a warehouse in advance, and the multi-distribution-center problem is thereby cut into several independent single-distribution-center problems that are solved separately. As a result, the grouping is decoupled from the problem as a whole; the quality of the grouping directly determines the quality of the heuristic solution, a proper grouping requires considerable expertise, and manual grouping easily loses the optimal solution.
In recent years, with the rapid development of machine learning, solving the vehicle path planning problem with deep reinforcement learning has attracted attention. Vinyals et al. proposed the Pointer Network model and solved combinatorial optimization problems with supervised learning, whose solution quality is limited by the quality of the training set. Bello et al. trained the Pointer Network with deep reinforcement learning and introduced a baseline to reduce training variance. Nazari et al. used a simplified Pointer Network to solve the vehicle path planning problem, and the model can find near-optimal solutions. Dai et al. modeled the problem with a graph neural network and, following a greedy strategy, computed the action value Q of each remaining candidate node from the graph neural network to select the next node added to the current solution, with results similar to those of Bello.
In summary, deep reinforcement learning has yielded substantial results for the single-distribution-center vehicle path planning problem, but few studies apply it to the multi-distribution-center problem. Doing so requires overcoming several difficulties: the multi-distribution-center problem has a much larger solution space than the single-center problem, so a reliable solution is harder to find; the problem is difficult to model as a single-agent reinforcement learning problem; and using a multi-agent reinforcement learning algorithm requires careful design to balance the implicit game relationship among the agents.
Disclosure of Invention
In view of the above, the present application aims to provide a distribution center vehicle path planning method based on multi-agent deep reinforcement learning, which solves the multi-distribution-center vehicle path planning problem end to end from a global perspective and improves solution quality.
In order to achieve the above purpose, the application adopts the following technical scheme: the distribution center vehicle path planning method based on multi-agent deep reinforcement learning comprises the following steps:
step S1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning;
step S2: constructing an encoder network model;
step S3: constructing a decoder network model embedded with a soft-hard two-stage attention mechanism;
step S4: establishing a critic network model under the MAC-AC algorithm framework;
step S5: introducing a mask mechanism to accelerate training;
step S6: determining an action selection mechanism;
step S7: training the neural network models with the MAC-AC algorithm to obtain the vehicle path planning result.
In a preferred embodiment, the reinforcement learning basic component defined in step S1 includes:
(1) Defining the vehicles dispatched from the respective warehouses as independent agents;
(2) Defining the vehicle's decision of its next target node as an action;
(3) The state is composed of an environment state and an agent state, wherein,
the environmental state includes: all warehouse coordinates, all customer coordinates and their remaining demands, and the current coordinates of each vehicle and its remaining cargo capacity;
the agent state includes:
the position coordinate differences between the vehicle and the nodes not yet served, the absolute value of the difference between the vehicle's remaining cargo and the customers' remaining demand, and the current coordinates and load of the vehicle;
(4) Defining a state transition function:
after vehicle i visits any node k, its position coordinate L_i is updated as L_i ← x_k, where x_k is the coordinate of the visited node; after the vehicle visits a customer node k, the remaining vehicle load and the customer demand are updated as l ← l − d_k followed by d_k ← 0, where d denotes each customer's demand and l denotes the vehicle's current remaining load; after the vehicle visits its corresponding warehouse node, it is refilled with goods: l ← C, where C denotes the vehicle's load capacity;
(5) Defining a reward function:
taking the negative of the total distance travelled by all vehicles in one time step as the reward function: r_t = −Σ_{i=1}^{M} ‖L_i^{t+1} − L_i^{t}‖, where M is the total number of warehouses (each warehouse dispatching one vehicle).
In a preferred embodiment, the encoder network model in step S2 is specifically:
the encoder network consists of a single-layer linear network and a gated recurrent unit network (GRU); it receives the local observation o_i^t of vehicle i at time t, computes an initial embedding by linear projection, combines the initial embedding with the GRU hidden state h_t, and computes through the GRU the feature vector e_i of each agent's local observation.
In a preferred embodiment, the decoder network model in step S3 is specifically:
the decoder is composed of a soft-hard two-stage attention mechanism module:
the hard attention mechanism module receives the feature vectors e_i of all vehicles, which contain their independent observation information, and learns from them the hard attention weight W_h; W_h consists of one-hot vectors that determine which other vehicles each vehicle needs to communicate with at the current time t; the hard attention mechanism module is implemented with a bi-GRU network: the feature vectors (e_i, e_j) of vehicle i and vehicle j (i ≠ j) are fed into the bi-GRU, and the output embedding h_{i,j} is obtained through a fully connected layer f, i.e. h_{i,j} = f(bi-GRU(e_i, e_j)); the hard attention weights are then computed with Gumbel-Softmax;
the soft attention mechanism module gathers the observation features e_i of all vehicles and, combined with the hard attention weight W_h, computes for each vehicle i a correlation weight X_i with the other vehicles; X_i is the weighted sum of the other vehicles' values, X_i = Σ_j α_{i,j} V_j, where the value V_j is obtained by linearly transforming the corresponding feature vector e_j with the matrix W_V; the attention weight α compares feature vectors e_i and e_j through a query-key system, in which W_q converts e_i into a query and W_k converts e_j into a key, and the matching value is fed into a Softmax function; the matching value is scaled according to the dimensions of W_q and W_k and combined with the hard attention weight W_h to obtain the attention weight α, i.e. α_{i,j} = softmax_j((W_q e_i)·(W_k e_j)/√d_att)·W_h^{i,j}, where d_att is the dimension of the query and key vectors; the feature vector e_i of each vehicle i is then merged with its correlation weight X_i to compute the action value function Q_i of each vehicle i; the action value function Q_i is computed according to Q_i = f(g(e_i, X_i)), where f is a multi-layer linear network and g is a single-layer linear network.
In a preferred embodiment, step S4 is specifically: the critic network model consists of an evaluation network and a target critic network, which are multi-layer linear networks with the same dimensions but different parameters; the evaluation network receives the environmental state S_t at time t and estimates the state value V(S_t); the target critic network receives the environmental state at time t+1 and estimates the state value V′(S_{t+1}).
In a preferred embodiment, the masking mechanism introduced in step S5 is specifically:
during training, the mask sets the log-probability of every node that a vehicle must not visit to −∞, thereby shielding infeasible actions, and forces a specific choice when particular conditions are met; the working mechanism is as follows: (1) a vehicle is not allowed to visit customers whose demand is 0; (2) when the vehicle's remaining load cannot satisfy any customer, the vehicle is forced to return to its corresponding warehouse to replenish goods; (3) all customers whose current demand exceeds the vehicle's load are masked; (4) according to the characteristics of multi-distribution-center vehicle path planning, a candidate list of length H is set for each vehicle, restricting it to choose the next target node from the H nearest available nodes, which accelerates convergence.
In a preferred embodiment, step S6 is specifically: the ε-Greedy method is adopted as the action selection strategy; ε-Greedy uses a parameter ε (0 ≤ ε ≤ 1) to trade off between exploring uncertain strategies and exploiting the current best strategy; with probability 1−ε the agent selects the action with the largest Q value in the action value function, and with probability ε the agent selects an action at random from the available action set; ε gradually decreases during training, meaning that as information and experience accumulate, the exploitation of learned information gradually increases and exploration gradually decreases.
In a preferred embodiment, step S7 is specifically: the parameter θ parameterizes the encoder and decoder, while W and W⁻ parameterize all trainable variables of the evaluation network and the target critic network, respectively; the advantage function is computed as A_t = Q_tot − V_W(S_t), where the joint action value Q_tot is approximated by r_t + γ V_{W⁻}(S_{t+1}), γ being the reward discount rate, and V_W and V_{W⁻} being the estimates of the evaluation network and the target critic network, respectively; the actor parameters θ_i are updated by policy-gradient ascent weighted by the advantage function; the evaluation network parameters W are updated with the TD algorithm by minimizing the squared TD error δ_t = r_t + γ V_{W⁻}(S_{t+1}) − V_W(S_t); every T training rounds the evaluation network parameters W are copied to the target critic network parameters W⁻; after training reaches the set maximum number of rounds, the solution with the largest reward obtained during training is taken as the solution of the problem.
Compared with the prior art, the application has the following beneficial effects: the application provides a multi-agent deep reinforcement learning method for solving the multi-distribution-center vehicle path planning problem. Unlike the "group first, plan later" idea of traditional heuristic algorithms, the cooperating agents learn to communicate and use high-level feature information to plan vehicle paths over the problem as a whole, improving solution quality.
Drawings
FIG. 1 is a schematic diagram illustrating the operation of an encoder and decoder according to a preferred embodiment of the present application.
FIG. 2 is a schematic diagram of the environment and agent interaction under a single training of a preferred embodiment of the present application.
Fig. 3 is a schematic diagram of a training process of a network model according to a preferred embodiment of the present application.
Detailed Description
The application will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present application. As used herein, the singular forms are intended to include the plural forms as well unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
The application provides a multi-distribution-center vehicle path planning method based on multi-agent deep reinforcement learning; with reference to FIG. 1 to FIG. 3, the method specifically comprises the following steps:
step S1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning;
step S2: constructing an encoder network model;
step S3: constructing a decoder network model; to avoid unnecessary communication interfering with the agents' decisions, a soft-hard two-stage attention mechanism is embedded in the model, helping the agents autonomously learn their real-time communication needs with one another;
step S4: establishing a critic network model under the MAC-AC algorithm framework;
step S5: introducing a mask mechanism to accelerate training;
step S6: determining an action selection mechanism;
step S7: training the neural network models with the MAC-AC (Multi-Agent Cooperative Actor-Critic, an actor-critic algorithm for multiple agents in a fully cooperative setting) algorithm to obtain the vehicle path planning result.
Step 1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning:
(1) Defining the vehicles dispatched from the respective warehouses as independent agents;
(2) Defining the vehicle's decision of its next target node as an action;
(3) The state is composed of an environment state and an agent state,
the environmental state includes:
1. all warehouse coordinates; 2. all customer coordinates and their remaining demands; 3. the current coordinates of each vehicle and its remaining cargo capacity;
the agent state includes:
1. the position coordinate differences between the vehicle and the nodes not yet served; 2. the absolute value of the difference between the vehicle's remaining cargo and the remaining demand of unserved customers; 3. the current coordinates and load of the vehicle;
(4) Defining a state transition function:
the warehouse and customer coordinates are static elements and remain unchanged throughout;
after vehicle i visits any node k, its position coordinate L_i is updated as L_i ← x_k, where x_k is the coordinate of the visited node; after the vehicle visits a customer node k, the remaining vehicle load and the customer demand are updated as l ← l − d_k followed by d_k ← 0, where d denotes each customer's demand and l denotes the vehicle's current remaining load; after the vehicle visits its corresponding warehouse node, it is refilled with goods: l ← C, where C denotes the vehicle's load capacity;
(5) Defining a reward function:
for multi-distribution-center path planning the objective is to minimize the total distance travelled by the vehicles, and the smaller the total distance, the higher the agents' cumulative reward; the negative of the total distance travelled by all vehicles in one time step is therefore taken as the reward function: r_t = −Σ_{i=1}^{M} ‖L_i^{t+1} − L_i^{t}‖, where M is the total number of warehouses (each warehouse dispatching one vehicle).
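As a non-limiting illustration, the following Python sketch shows how the state transition and reward defined above could be implemented. The Vehicle dataclass, the function names and the full-service assumption (a vehicle only visits customers it can serve completely, consistent with mask rule (3) in Step 5 below) are assumptions of the sketch rather than elements of the claimed method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Vehicle:
    pos: np.ndarray      # current position coordinate L_i
    load: float          # current remaining load l_i
    capacity: float      # vehicle load capacity C
    depot: int           # index of the vehicle's own warehouse node

def step(vehicles, coords, demand, depot_ids, actions):
    """Apply one joint action and return the shared reward (negative total distance)."""
    total_dist = 0.0
    for veh, k in zip(vehicles, actions):
        x_k = coords[k]
        total_dist += float(np.linalg.norm(veh.pos - x_k))  # distance travelled this step
        veh.pos = x_k                                        # L_i <- x_k
        if k in depot_ids:                                   # warehouse visit: refill
            veh.load = veh.capacity                          # l <- C
        else:                                                # customer visit (served fully)
            veh.load -= demand[k]                            # l <- l - d_k
            demand[k] = 0.0                                  # d_k <- 0
    return -total_dist                                       # shared reward r_t
```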
Step 2: building an encoder network model:
the encoder network consists of a single-layer linear network and a GRU (Gated Recurrent Unit) network, which enables the agent to remember its recent visiting trajectory. The encoder network receives the local observation o_i^t of vehicle i at time t, computes an initial embedding by linear projection, combines the initial embedding with the GRU hidden state h_t, and computes through the GRU the feature vector e_i of each agent's local observation.
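The encoder described in Step 2 could be sketched as follows in PyTorch; the framework choice, the class name and the observation dimension argument are assumptions, while the 64-unit hidden layer follows the parameter settings given later.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Single-layer linear projection followed by a GRU (Step 2)."""
    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden_dim)          # initial embedding
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, obs, h_prev=None):
        # obs: (n_vehicles, obs_dim) local observations o_i^t at time t
        x = self.embed(obs).unsqueeze(1)                     # (n_vehicles, 1, hidden_dim)
        out, h_next = self.gru(x, h_prev)                    # combine with hidden state h_t
        return out.squeeze(1), h_next                        # feature vectors e_i, new hidden
```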
Step 3: building a decoder network model:
the decoder is composed of a soft-hard two-stage attention mechanism module:
First, the hard attention mechanism module receives from the encoder the feature vectors e_i of all vehicles i, which contain their independent observation information, and learns from them the hard attention weight W_h. W_h consists of one-hot vectors that determine which other vehicles a vehicle needs to communicate with at the current time t. The hard attention mechanism module is implemented with a bi-GRU (bidirectional GRU) network: the feature vectors (e_i, e_j) of vehicle i and vehicle j (i ≠ j) are fed into the bi-GRU, and the output embedding h_{i,j} is obtained through a fully connected layer f, i.e. h_{i,j} = f(bi-GRU(e_i, e_j)). The hard attention module requires a sampling operation, which makes it difficult to back-propagate gradients; the Gumbel-Softmax function is an effective tool for this problem, so Gumbel-Softmax is used to compute the hard attention weights.
Then, the soft attention mechanism module gathers the observation features e_i of all vehicles i and, combined with the hard attention weight W_h, computes for each vehicle i a correlation weight X_i with the other vehicles; X_i is the weighted sum of the other vehicles' values, X_i = Σ_j α_{i,j} V_j, where the value V_j is obtained by linearly transforming the corresponding feature vector e_j with the matrix W_V. The attention weight α compares feature vectors e_i and e_j through a query-key system, in which W_q converts e_i into a query and W_k converts e_j into a key, and the matching value is fed into a Softmax function. To prevent vanishing gradients, the matching value is scaled according to the dimensions of W_q and W_k and combined with the hard attention weight W_h to obtain the attention weight α: α_{i,j} = softmax_j((W_q e_i)·(W_k e_j)/√d_att)·W_h^{i,j}, where d_att is the dimension of the query and key vectors.
Finally, the feature vector e_i of each vehicle i is merged with its correlation weight X_i to compute the action value function Q_i of each vehicle i: Q_i = f(g(e_i, X_i)), where f is a multi-layer linear network and g is a single linear layer.
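A possible reading of the soft-hard two-stage attention decoder of Step 3 is sketched below; the layer sizes, the Gumbel-Softmax temperature and the exact way the hard weights gate the soft attention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftHardAttentionDecoder(nn.Module):
    """Sketch of the two-stage attention decoder of Step 3 (layer sizes assumed)."""
    def __init__(self, dim, n_actions):
        super().__init__()
        self.bi_gru = nn.GRU(2 * dim, dim, bidirectional=True, batch_first=True)
        self.hard_fc = nn.Linear(2 * dim, 2)          # logits: communicate / ignore
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.g = nn.Linear(2 * dim, dim)              # merges (e_i, X_i)
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_actions))

    def forward(self, e):                             # e: (n_agents, dim) feature vectors
        n, d = e.shape
        # ---- hard attention: one weight per ordered pair (i, j) ----
        pairs = torch.cat([e.unsqueeze(1).expand(n, n, d),
                           e.unsqueeze(0).expand(n, n, d)], dim=-1)    # (n, n, 2d)
        h_ij, _ = self.bi_gru(pairs)                  # bi-GRU over agent pairs
        logits = self.hard_fc(h_ij)                   # (n, n, 2)
        w_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]  # (n, n) in {0, 1}
        # ---- soft attention over the vehicles selected by the hard stage ----
        q, k, v = self.W_q(e), self.W_k(e), self.W_v(e)
        scores = q @ k.t() / d ** 0.5                 # scaled query-key matching
        scores = scores.masked_fill(torch.eye(n, dtype=torch.bool), float('-inf'))
        alpha = F.softmax(scores, dim=-1) * w_hard    # combine with hard weights
        X = alpha @ v                                 # correlation weights X_i
        return self.f(torch.relu(self.g(torch.cat([e, X], dim=-1))))   # Q_i per vehicle
```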
Step 4: establishing a critic network model under the MAC-AC algorithm framework:
the critic network model consists of an evaluation network and a target critic network, which are multi-layer linear networks with the same dimensions but different parameters; the evaluation network receives the environmental state S_t at time t and estimates the state value V(S_t); the target critic network receives the environmental state at time t+1 and estimates the state value V′(S_{t+1}).
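The critic of Step 4 could be realized, for example, as the following pair of value networks; the depth of the multi-layer linear network is an assumption, while the 128-unit hidden layer follows the parameter settings given later.

```python
import torch.nn as nn

class Critic(nn.Module):
    """Multi-layer linear network estimating the state value (Step 4)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)

# Evaluation network and target critic share the architecture but not the parameters:
# eval_critic = Critic(state_dim)
# target_critic = Critic(state_dim)
# target_critic.load_state_dict(eval_critic.state_dict())   # synced every T rounds
```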
Step 5: introducing a mask mechanism to accelerate training:
during training, the mask sets the log-probability of every node that a vehicle must not visit to −∞, thereby shielding infeasible actions, and forces a specific choice when particular conditions are met. The masking mechanism works as follows: (1) a vehicle is not allowed to visit customers whose demand is 0; (2) when the vehicle's remaining load cannot satisfy any customer, the vehicle is forced to return to its corresponding warehouse to replenish goods; (3) all customers whose current demand exceeds the vehicle's load are masked; (4) according to the characteristics of multi-distribution-center vehicle path planning, a candidate list of length H is set for each vehicle, restricting it to choose the next target node from the H nearest available nodes, which accelerates convergence.
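A minimal sketch of the masking rules of Step 5 is given below; it reuses the Vehicle fields from the transition sketch above, H = 6 follows the candidate-list length given in the parameter settings, and the data layout is an assumption.

```python
import numpy as np

def build_mask(vehicle, coords, demand, depot_ids, H=6):
    """Boolean mask of feasible next nodes for one vehicle (Step 5)."""
    n = len(coords)
    feasible = np.ones(n, dtype=bool)
    for k in range(n):
        if k in depot_ids:
            feasible[k] = (k == vehicle.depot)        # only the vehicle's own warehouse
        elif demand[k] <= 0:                          # rule (1): already served customers
            feasible[k] = False
        elif demand[k] > vehicle.load:                # rule (3): demand exceeds load
            feasible[k] = False
    # rule (2): no customer can be served -> force return to the warehouse
    customer_ok = feasible.copy()
    for d_id in depot_ids:
        customer_ok[d_id] = False
    if not customer_ok.any():
        feasible[:] = False
        feasible[vehicle.depot] = True
        return feasible
    # rule (4): keep only the H closest feasible nodes (candidate list)
    dists = np.linalg.norm(coords - vehicle.pos, axis=1)
    dists[~feasible] = np.inf
    keep = np.argsort(dists)[:H]
    out = np.zeros(n, dtype=bool)
    out[keep] = feasible[keep]
    return out
```

The −∞ of the patent corresponds to setting the log-probability (or Q value) of every masked node to negative infinity before action selection.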
Step 6: determining an action selection mechanism:
The ε-Greedy method is adopted as the action selection strategy; ε-Greedy uses a parameter ε (0 ≤ ε ≤ 1) to trade off between exploring uncertain strategies and exploiting the current best strategy. With probability 1−ε the agent selects the action with the largest Q value in the action value function; with probability ε the agent selects an action at random from the available action set. ε gradually decreases during training, meaning that as information and experience accumulate, the exploitation of learned information gradually increases and exploration gradually decreases.
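The ε-Greedy selection of Step 6 could be implemented along the following lines; the linear decay shown in the comment follows the small-scale parameter settings described later.

```python
import numpy as np

def select_action(q_values, mask, epsilon):
    """Epsilon-greedy selection over the masked action set (Step 6)."""
    feasible = np.flatnonzero(mask)
    if np.random.rand() < epsilon:
        return int(np.random.choice(feasible))        # explore: random feasible action
    q = np.where(mask, q_values, -np.inf)             # exploit: best feasible Q value
    return int(np.argmax(q))

# Example linear decay (small-scale setting): epsilon = max(0.02, 0.6 - 1.28e-3 * episode)
```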
Step 7: training a neural network model through an MAC-AC algorithm to obtain a vehicle path planning result:
The parameter θ parameterizes the encoder and decoder, while W and W⁻ parameterize all trainable variables of the evaluation network and the target critic network, respectively. The advantage function is computed as A_t = Q_tot − V_W(S_t), where the joint action value Q_tot is approximated by r_t + γ V_{W⁻}(S_{t+1}), γ being the reward discount rate, and V_W and V_{W⁻} being the estimates of the evaluation network and the target critic network, respectively. The actor parameters θ_i are updated by policy-gradient ascent weighted by the advantage function. The evaluation network parameters W are updated with the TD algorithm by minimizing the squared TD error δ_t = r_t + γ V_{W⁻}(S_{t+1}) − V_W(S_t). For more stable training, the evaluation network parameters W are copied to the target critic network parameters W⁻ every T training rounds. After training reaches the set maximum number of rounds, the solution with the largest reward obtained during training is taken as the solution of the problem.
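One possible form of the MAC-AC update of Step 7 is sketched below; the discount rate, the mean-squared TD loss and the exact actor loss are assumptions, since the text only states that the actor is updated with an advantage-weighted gradient and the evaluation network with a TD update.

```python
import torch
import torch.nn.functional as F

def train_step(eval_critic, target_critic, critic_opt, actor_opt,
               s_t, s_next, reward, log_probs, gamma=0.99):
    """One update as read from Step 7: TD(0) critic update + advantage-weighted actor update."""
    v_t = eval_critic(s_t)                                  # V_W(S_t)
    with torch.no_grad():
        v_next = target_critic(s_next)                      # V_{W-}(S_{t+1})
        q_tot = reward + gamma * v_next                     # approximates Q_tot
        advantage = q_tot - v_t.detach()                    # A_t = Q_tot - V_W(S_t)
    critic_loss = F.mse_loss(v_t, q_tot)                    # squared TD error
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # log_probs: (batch, n_agents) log-probabilities of the chosen actions
    actor_loss = -(advantage * log_probs.sum(dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

def sync_target(eval_critic, target_critic):
    """Copy W into W- every T training rounds."""
    target_critic.load_state_dict(eval_critic.state_dict())
```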
Description of the interaction process between the environment and the agent in the training process:
as shown in fig. 2 of the specification:
(1) At time step t, each vehicle obtains its local observation o_i^t from the environment, and the vehicles communicate with one another.
(2) Under the constraint of the mask matrix, each vehicle decides its action a_i^t for the next moment through the policy network.
(3) The environment updates the customer node information, the information of each vehicle and the mask matrix according to the actions taken by the vehicles, and gives a common global immediate reward used to improve the critic network.
(4) The critic network feeds an evaluation back to the policy network to improve the distribution policy.
The above process is repeated until all customers have been served, completing a single training episode.
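This interaction loop could be sketched as follows, reusing the encoder, decoder and ε-greedy helper from the earlier sketches; the env object and its reset/step/mask methods are illustrative assumptions, not part of the claimed method.

```python
def run_episode(env, encoder, decoder, epsilon):
    """Single training episode following the interaction loop of FIG. 2."""
    obs, hidden = env.reset(), None
    episode_reward, done = 0.0, False
    while not done:                                        # until all customers are served
        e, hidden = encoder(obs, hidden)                   # local observations -> features e_i
        q_values = decoder(e)                              # per-vehicle action values Q_i
        actions = [select_action(q_values[i].detach().numpy(),
                                 env.mask(i), epsilon)     # masked epsilon-greedy choice
                   for i in range(env.n_vehicles)]
        obs, reward, done = env.step(actions)              # environment transition
        episode_reward += reward                           # shared global reward
    return episode_reward
```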
Preferred parameter settings and optimizer selections of the present application:
The RMS optimizer is used to optimize the encoder-decoder and the critic network parameters with learning rates α = 1×10⁻⁴ and β = 1×10⁻³, respectively. The hidden layer dimension of the GRU in the encoder network is 64, and the linear hidden layer dimension of the critic network is 128. Different parameter settings are used for small-scale instances (50 nodes or fewer), medium-scale instances (between 50 and 100 nodes) and large-scale instances (more than 100 nodes). Small-scale instances are trained for 3×10⁴ rounds, medium-scale instances for 5×10⁴ rounds and large-scale instances for 6×10⁴ rounds. In the small-scale setting, the parameter ε of the action selection strategy ε-Greedy decreases from 0.6 to 0.02 in steps of 1.28×10⁻³ per training round; in the medium-scale setting, ε drops from 0.8 to 0.02 in steps of 6.4×10⁻⁴; in the large-scale setting, ε drops from 0.9 to 0.02 in steps of 6.4×10⁻⁴. The target critic network parameters W⁻ are updated every 25, 50 and 50 training rounds in the small-, medium- and large-scale settings respectively, and the candidate list length H is 6.
The foregoing description is only of the preferred embodiments of the application, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (8)

1. The distribution center vehicle path planning method based on multi-agent deep reinforcement learning is characterized by comprising the following steps of:
step S1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning;
step S2: constructing an encoder network model;
step S3: constructing a decoder network model embedded with a soft-hard two-stage attention mechanism;
step S4: establishing a critic network model under the MAC-AC algorithm framework;
step S5: introducing a mask mechanism to accelerate training;
step S6: determining an action selection mechanism;
step S7: training the neural network models with the MAC-AC algorithm to obtain the vehicle path planning result.
2. The multi-agent deep reinforcement learning based distribution center vehicle path planning method of claim 1, wherein the reinforcement learning basic components defined in step S1 include:
(1) Defining the vehicles dispatched from the respective warehouses as independent agents;
(2) Defining the vehicle's decision of its next target node as an action;
(3) The state is composed of an environment state and an agent state, wherein,
the environmental state includes: all warehouse coordinates, all customer coordinates and their remaining demands, and the current coordinates of each vehicle and its remaining cargo capacity;
the agent state includes:
the position coordinate differences between the vehicle and the nodes not yet served, the absolute value of the difference between the vehicle's remaining cargo and the customers' remaining demand, and the current coordinates and load of the vehicle;
(4) Defining a state transition function:
after vehicle i visits any node k, its position coordinate L_i is updated as L_i ← x_k, where x_k is the coordinate of the visited node; after the vehicle visits a customer node k, the remaining vehicle load and the customer demand are updated as l ← l − d_k followed by d_k ← 0, where d denotes each customer's demand and l denotes the vehicle's current remaining load; after the vehicle visits its corresponding warehouse node, it is refilled with goods: l ← C, where C denotes the vehicle's load capacity;
(5) Defining a reward function:
taking the negative of the total distance travelled by all vehicles in one time step as the reward function: r_t = −Σ_{i=1}^{M} ‖L_i^{t+1} − L_i^{t}‖, where M is the total number of warehouses (each warehouse dispatching one vehicle).
3. The multi-agent deep reinforcement learning-based distribution center vehicle path planning method according to claim 1, wherein the encoder network model in step S2 is specifically:
the encoder network consists of a single-layer linear network and a gated recurrent unit network (GRU); it receives the local observation o_i^t of vehicle i at time t, computes an initial embedding by linear projection, combines the initial embedding with the GRU hidden state h_t, and computes through the GRU the feature vector e_i of each agent's local observation.
4. The multi-agent deep reinforcement learning-based distribution center vehicle path planning method according to claim 3, wherein the decoder network model in step S3 is specifically:
the decoder is composed of a soft-hard two-stage attention mechanism module:
the hard attention mechanism module receives the feature vectors e_i of all vehicles, which contain their independent observation information, and learns from them the hard attention weight W_h; W_h consists of one-hot vectors that determine which other vehicles each vehicle needs to communicate with at the current time t; the hard attention mechanism module is implemented with a bi-GRU network: the feature vectors (e_i, e_j) of vehicle i and vehicle j (i ≠ j) are fed into the bi-GRU, and the output embedding h_{i,j} is obtained through a fully connected layer f, i.e. h_{i,j} = f(bi-GRU(e_i, e_j)); the hard attention weights are then computed with Gumbel-Softmax;
the soft attention mechanism module gathers the observation features e_i of all vehicles and, combined with the hard attention weight W_h, computes for each vehicle i a correlation weight X_i with the other vehicles; X_i is the weighted sum of the other vehicles' values, X_i = Σ_j α_{i,j} V_j, where the value V_j is obtained by linearly transforming the corresponding feature vector e_j with the matrix W_V; the attention weight α compares feature vectors e_i and e_j through a query-key system, in which W_q converts e_i into a query and W_k converts e_j into a key, and the matching value is fed into a Softmax function; the matching value is scaled according to the dimensions of W_q and W_k and combined with the hard attention weight W_h to obtain the attention weight α, i.e. α_{i,j} = softmax_j((W_q e_i)·(W_k e_j)/√d_att)·W_h^{i,j}, where d_att is the dimension of the query and key vectors; the feature vector e_i of each vehicle i is then merged with its correlation weight X_i to compute the action value function Q_i of each vehicle i; the action value function Q_i is computed according to Q_i = f(g(e_i, X_i)), where f is a multi-layer linear network and g is a single-layer linear network.
5. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein step S4 is specifically: the critic network model consists of an evaluation network and a target critic network, which are multi-layer linear networks with the same dimensions but different parameters; the evaluation network receives the environmental state S_t at time t and estimates the state value V(S_t); the target critic network receives the environmental state at time t+1 and estimates the state value V′(S_{t+1}).
6. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein the masking mechanism introduced in step S5 is specifically:
during training, the mask sets the log-probability of every node that a vehicle must not visit to −∞, thereby shielding infeasible actions, and forces a specific choice when particular conditions are met; the working mechanism is as follows: (1) a vehicle is not allowed to visit customers whose demand is 0; (2) when the vehicle's remaining load cannot satisfy any customer, the vehicle is forced to return to its corresponding warehouse to replenish goods; (3) all customers whose current demand exceeds the vehicle's load are masked; (4) according to the characteristics of multi-distribution-center vehicle path planning, a candidate list of length H is set for each vehicle, restricting it to choose the next target node from the H nearest available nodes, which accelerates convergence.
7. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein step S6 is specifically: the ε-Greedy method is adopted as the action selection strategy; ε-Greedy uses a parameter ε (0 ≤ ε ≤ 1) to trade off between exploring uncertain strategies and exploiting the current best strategy; with probability 1−ε the agent selects the action with the largest Q value in the action value function; with probability ε the agent selects an action at random from the available action set; ε gradually decreases during training, meaning that as information and experience accumulate, the exploitation of learned information gradually increases and exploration gradually decreases.
8. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein step S7 is specifically: the parameter θ parameterizes the encoder and decoder, while W and W⁻ parameterize all trainable variables of the evaluation network and the target critic network, respectively; the advantage function is computed as A_t = Q_tot − V_W(S_t), where the joint action value Q_tot is approximated by r_t + γ V_{W⁻}(S_{t+1}), γ being the reward discount rate, and V_W and V_{W⁻} being the estimates of the evaluation network and the target critic network, respectively; the actor parameters θ_i are updated by policy-gradient ascent weighted by the advantage function; the evaluation network parameters W are updated with the TD algorithm by minimizing the squared TD error δ_t = r_t + γ V_{W⁻}(S_{t+1}) − V_W(S_t); every T training rounds the evaluation network parameters W are copied to the target critic network parameters W⁻; after training reaches the set maximum number of rounds, the solution with the largest reward obtained during training is taken as the solution of the problem.
CN202310565816.8A 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning Pending CN116739466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310565816.8A CN116739466A (en) 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310565816.8A CN116739466A (en) 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116739466A true CN116739466A (en) 2023-09-12

Family

ID=87902080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310565816.8A Pending CN116739466A (en) 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116739466A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273590A (en) * 2023-10-19 2023-12-22 苏州大学 Neural combination optimization method and system for solving vehicle path optimization problem


Similar Documents

Publication Publication Date Title
WO2021248607A1 (en) Deep reinforcement learning-based taxi dispatching method and system
Gupta et al. Half a dozen real-world applications of evolutionary multitasking, and more
Yang et al. Cooperative traffic signal control using multi-step return and off-policy asynchronous advantage actor-critic graph algorithm
Sáez et al. Hybrid adaptive predictive control for the multi-vehicle dynamic pick-up and delivery problem based on genetic algorithms and fuzzy clustering
CN112799386B (en) Robot path planning method based on artificial potential field and reinforcement learning
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
Wang et al. Ant colony optimization with an improved pheromone model for solving MTSP with capacity and time window constraint
Qin et al. Multi-agent reinforcement learning-based dynamic task assignment for vehicles in urban transportation system
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN113051815A (en) Agile imaging satellite task planning method based on independent pointer network
Brajevic Artificial bee colony algorithm for the capacitated vehicle routing problem
CN113537580B (en) Public transportation passenger flow prediction method and system based on self-adaptive graph learning
CN113033072A (en) Imaging satellite task planning method based on multi-head attention pointer network
Tarkesh et al. Facility layout design using virtual multi-agent system
CN106295864A (en) A kind of method solving single home-delivery center logistics transportation scheduling problem
CN115759915A (en) Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning
Jang et al. Offline-online reinforcement learning for energy pricing in office demand response: lowering energy and data costs
Xi et al. Hmdrl: Hierarchical mixed deep reinforcement learning to balance vehicle supply and demand
Wang et al. Optimization of ride-sharing with passenger transfer via deep reinforcement learning
Zong et al. Deep reinforcement learning for demand driven services in logistics and transportation systems: A survey
Zhou et al. Optimization of multi-echelon spare parts inventory systems using multi-agent deep reinforcement learning
CN117361013A (en) Multi-machine shelf storage scheduling method based on deep reinforcement learning
CN116187610A (en) Tobacco order vehicle distribution optimization method based on deep reinforcement learning
Li et al. Congestion-aware path coordination game with markov decision process dynamics
CN113890112B (en) Power grid look-ahead scheduling method based on multi-scene parallel learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination