CN116739466A - Distribution center vehicle path planning method based on multi-agent deep reinforcement learning - Google Patents

Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Info

Publication number
CN116739466A
Authority
CN
China
Prior art keywords
vehicle
network
reinforcement learning
distribution center
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310565816.8A
Other languages
Chinese (zh)
Inventor
朱光宇
黄世哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310565816.8A priority Critical patent/CN116739466A/en
Publication of CN116739466A publication Critical patent/CN116739466A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/08 - Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q 10/083 - Shipping
    • G06Q 10/0835 - Relationships between shipper or supplier and carriers
    • G06Q 10/08355 - Routing methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a distribution center vehicle path planning method based on multi-agent deep reinforcement learning, which comprises the following steps: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning; constructing an encoder network model; building a decoder network model embedded with a soft-hard two-stage attention mechanism; building a critic network model under the MAC-AC algorithm framework; introducing a mask mechanism to accelerate training; and training the neural network models with the MAC-AC algorithm to obtain the vehicle path planning result. The application solves the multi-distribution-center vehicle path planning problem end to end from a global perspective and improves solution quality.

Description

Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
Technical Field
The application relates to the technical field of intelligent transportation, in particular to a distribution center vehicle path planning method based on multi-agent deep reinforcement learning.
Background
The vehicle path planning problem is to design optimal routes for a fleet of vehicles that serves a set of customers under given constraints. Large logistics companies often operate multiple warehouses within a distribution network, and the same goods can be shipped to a customer from any of them. When vehicle carrying capacity is also taken into account, this variant is known as the multi-distribution-center vehicle path planning problem.
As a classical NP-hard problem, multi-distribution-center vehicle path planning has long received attention. Over the past decades many researchers have solved it with heuristic methods. Most heuristics adopt a "group first, plan later" strategy: customers are assigned to a warehouse in advance, and the multi-distribution-center problem is thereby cut into several independent single-distribution-center problems that are solved separately. As a result, the grouping is decoupled from the problem as a whole; the quality of the grouping directly determines the quality of the heuristic solution, a proper grouping requires considerable expertise, and manual grouping easily loses the optimal solution.
In recent years, with the rapid development of machine learning, solving the vehicle path planning problem with deep reinforcement learning has attracted attention. Vinyals et al. proposed the Pointer Network model and solved combinatorial optimization problems with supervised learning, whose solution quality is limited by the quality of the training set. Bello et al. trained the Pointer Network with deep reinforcement learning and introduced a baseline to reduce training variance. Nazari et al. used a simplified Pointer Network to solve the vehicle path planning problem, and the model can find near-optimal solutions. Dai et al. modeled the problem with a graph neural network and, following a greedy strategy, computed the action value Q of each remaining candidate node from the graph neural network to select the next node added to the current solution, with results similar to those of Bello.
In summary, deep reinforcement learning has yielded substantial results for the single-distribution-center vehicle path planning problem, but few studies apply it to the multi-distribution-center problem. Doing so requires overcoming several difficulties: the multi-distribution-center problem has a much larger solution space than the single-center problem, so a reliable solution is harder to find; the problem is difficult to model as a single-agent reinforcement learning problem; and using a multi-agent reinforcement learning algorithm requires careful design to balance the implicit game relationship among the agents.
Disclosure of Invention
In view of the above, the present application aims to provide a distribution center vehicle path planning method based on multi-agent deep reinforcement learning, which solves the multi-distribution-center vehicle path planning problem end to end from a global perspective and improves solution quality.
In order to achieve the above purpose, the application adopts the following technical scheme: the distribution center vehicle path planning method based on multi-agent deep reinforcement learning comprises the following steps:
step S1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning;
step S2: constructing an encoder network model;
step S3: constructing a decoder network model embedded with a soft-hard two-stage attention mechanism;
step S4: establishing a critic network model under the MAC-AC algorithm framework;
step S5: introducing a mask mechanism to accelerate training;
step S6: determining an action selection mechanism;
step S7: training the neural network models with the MAC-AC algorithm to obtain the vehicle path planning result.
In a preferred embodiment, the reinforcement learning basic component defined in step S1 includes:
(1) Defining the vehicles dispatched from the respective warehouses as independent agents;
(2) Defining the vehicle's decision of its next target node as an action;
(3) The state is composed of an environment state and an agent state, wherein,
the environmental state includes: all warehouse coordinates, all customer coordinates and their remaining demands, and the current coordinates of each vehicle and its remaining cargo capacity;
the agent state includes:
the position coordinate differences between the vehicle and the nodes not yet served, the absolute value of the difference between the vehicle's remaining cargo and the customers' remaining demand, and the current coordinates and load of the vehicle;
(4) Defining a state transition function:
after vehicle i visits any node k, its position coordinate L_i is updated as L_i ← x_k, where x_k is the coordinate of the visited node; after the vehicle visits a customer node k, the remaining vehicle load and the customer demand are updated as l ← l − d_k followed by d_k ← 0, where d denotes each customer's demand and l denotes the vehicle's current remaining load; after the vehicle visits its corresponding warehouse node, it is refilled with goods: l ← C, where C denotes the vehicle's load capacity;
(5) Defining a reward function:
taking the negative of the total distance travelled by all vehicles in one time step as the reward function: r_t = −Σ_{i=1}^{M} ‖L_i^{t+1} − L_i^{t}‖, where M is the total number of warehouses (each warehouse dispatching one vehicle).
In a preferred embodiment, the encoder network model in step S2 is specifically:
the encoder network consists of a single-layer linear network and a gated recurrent unit network (GRU); it receives the local observation o_i^t of vehicle i at time t, computes an initial embedding by linear projection, combines the initial embedding with the GRU hidden state h_t, and computes through the GRU the feature vector e_i of each agent's local observation.
In a preferred embodiment, the decoder network model in step S3 is specifically:
the decoder is composed of a soft-hard two-stage attention mechanism module:
the hard attention mechanism module receives the feature vectors e_i of all vehicles, which contain their independent observation information, and learns from them the hard attention weight W_h; W_h consists of one-hot vectors that determine which other vehicles each vehicle needs to communicate with at the current time t; the hard attention mechanism module is implemented with a bi-GRU network: the feature vectors (e_i, e_j) of vehicle i and vehicle j (i ≠ j) are fed into the bi-GRU, and the output embedding h_{i,j} is obtained through a fully connected layer f, i.e. h_{i,j} = f(bi-GRU(e_i, e_j)); the hard attention weights are then computed with Gumbel-Softmax;
the soft attention mechanism module gathers the observation features e_i of all vehicles and, combined with the hard attention weight W_h, computes for each vehicle i a correlation weight X_i with the other vehicles; X_i is the weighted sum of the other vehicles' values, X_i = Σ_j α_{i,j} V_j, where the value V_j is obtained by linearly transforming the corresponding feature vector e_j with the matrix W_V; the attention weight α compares feature vectors e_i and e_j through a query-key system, in which W_q converts e_i into a query and W_k converts e_j into a key, and the matching value is fed into a Softmax function; the matching value is scaled according to the dimensions of W_q and W_k and combined with the hard attention weight W_h to obtain the attention weight α, i.e. α_{i,j} = softmax_j((W_q e_i)·(W_k e_j)/√d_att)·W_h^{i,j}, where d_att is the dimension of the query and key vectors; the feature vector e_i of each vehicle i is then merged with its correlation weight X_i to compute the action value function Q_i of each vehicle i; the action value function Q_i is computed according to Q_i = f(g(e_i, X_i)), where f is a multi-layer linear network and g is a single-layer linear network.
In a preferred embodiment, step S4 is specifically: the critic network model consists of an evaluation network and a target critic network, which are multi-layer linear networks with the same dimensions but different parameters; the evaluation network receives the environmental state S_t at time t and estimates the state value V(S_t); the target critic network receives the environmental state at time t+1 and estimates the state value V′(S_{t+1}).
In a preferred embodiment, the masking mechanism introduced in step S5 is specifically:
during training, the mask sets the log-probability of every node that a vehicle must not visit to −∞, thereby shielding infeasible actions, and forces a specific choice when particular conditions are met; the working mechanism is as follows: (1) a vehicle is not allowed to visit customers whose demand is 0; (2) when the vehicle's remaining load cannot satisfy any customer, the vehicle is forced to return to its corresponding warehouse to replenish goods; (3) all customers whose current demand exceeds the vehicle's load are masked; (4) according to the characteristics of multi-distribution-center vehicle path planning, a candidate list of length H is set for each vehicle, restricting it to choose the next target node from the H nearest available nodes, which accelerates convergence.
In a preferred embodiment, step S6 is specifically: the ε-Greedy method is adopted as the action selection strategy; ε-Greedy uses a parameter ε (0 ≤ ε ≤ 1) to trade off between exploring uncertain strategies and exploiting the current best strategy; with probability 1−ε the agent selects the action with the largest Q value in the action value function, and with probability ε the agent selects an action at random from the available action set; ε gradually decreases during training, meaning that as information and experience accumulate, the exploitation of learned information gradually increases and exploration gradually decreases.
In a preferred embodiment, step S7 is specifically: the parameter θ parameterizes the encoder and decoder, while W and W⁻ parameterize all trainable variables of the evaluation network and the target critic network, respectively; the advantage function is computed as A_t = Q_tot − V_W(S_t), where the joint action value Q_tot is approximated by r_t + γ V_{W⁻}(S_{t+1}), γ being the reward discount rate, and V_W and V_{W⁻} being the estimates of the evaluation network and the target critic network, respectively; the actor parameters θ_i are updated by policy-gradient ascent weighted by the advantage function; the evaluation network parameters W are updated with the TD algorithm by minimizing the squared TD error δ_t = r_t + γ V_{W⁻}(S_{t+1}) − V_W(S_t); every T training rounds the evaluation network parameters W are copied to the target critic network parameters W⁻; after training reaches the set maximum number of rounds, the solution with the largest reward obtained during training is taken as the solution of the problem.
Compared with the prior art, the application has the following beneficial effects: the application provides a multi-agent deep reinforcement learning method for solving the multi-distribution-center vehicle path planning problem. Unlike the "group first, plan later" idea of traditional heuristic algorithms, the cooperating agents learn to communicate and use high-level feature information to plan vehicle paths over the problem as a whole, improving solution quality.
Drawings
FIG. 1 is a schematic diagram illustrating the operation of an encoder and decoder according to a preferred embodiment of the present application.
FIG. 2 is a schematic diagram of the environment and agent interaction under a single training of a preferred embodiment of the present application.
Fig. 3 is a schematic diagram of a training process of a network model according to a preferred embodiment of the present application.
Detailed Description
The application will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present application. As used herein, the singular forms are intended to include the plural forms as well unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
The application provides a multi-distribution-center vehicle path planning method based on multi-agent deep reinforcement learning; with reference to FIG. 1 to FIG. 3, the method specifically comprises the following steps:
step S1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning;
step S2: constructing an encoder network model;
step S3: constructing a decoder network model; to avoid unnecessary communication interfering with the agents' decisions, a soft-hard two-stage attention mechanism is embedded in the model, helping the agents autonomously learn their real-time communication needs with one another;
step S4: establishing a critic network model under the MAC-AC algorithm framework;
step S5: introducing a mask mechanism to accelerate training;
step S6: determining an action selection mechanism;
step S7: training the neural network models with the MAC-AC (Multi-Agent Cooperative Actor-Critic, an actor-critic algorithm for multiple agents in a fully cooperative setting) algorithm to obtain the vehicle path planning result.
Step 1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning:
(1) Defining the vehicles dispatched from the respective warehouses as independent agents;
(2) Defining the vehicle's decision of its next target node as an action;
(3) The state is composed of an environment state and an agent state,
the environmental state includes:
1. all warehouse coordinates; 2. all customer coordinates and their remaining demands; 3. the current coordinates of each vehicle and its remaining cargo capacity;
the agent state includes:
1. the position coordinate differences between the vehicle and the nodes not yet served; 2. the absolute value of the difference between the vehicle's remaining cargo and the remaining demand of unserved customers; 3. the current coordinates and load of the vehicle;
(4) Defining a state transition function:
the warehouse and customer coordinates are static elements and remain unchanged throughout;
after vehicle i visits any node k, its position coordinate L_i is updated as L_i ← x_k, where x_k is the coordinate of the visited node; after the vehicle visits a customer node k, the remaining vehicle load and the customer demand are updated as l ← l − d_k followed by d_k ← 0, where d denotes each customer's demand and l denotes the vehicle's current remaining load; after the vehicle visits its corresponding warehouse node, it is refilled with goods: l ← C, where C denotes the vehicle's load capacity;
(5) Defining a reward function:
for multi-distribution-center path planning the objective is to minimize the total distance travelled by the vehicles, and the smaller the total distance, the higher the agents' cumulative reward; the negative of the total distance travelled by all vehicles in one time step is therefore taken as the reward function: r_t = −Σ_{i=1}^{M} ‖L_i^{t+1} − L_i^{t}‖, where M is the total number of warehouses (each warehouse dispatching one vehicle).
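As a non-limiting illustration, the following Python sketch shows how the state transition and reward defined above could be implemented. The Vehicle dataclass, the function names and the full-service assumption (a vehicle only visits customers it can serve completely, consistent with mask rule (3) in Step 5 below) are assumptions of the sketch rather than elements of the claimed method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Vehicle:
    pos: np.ndarray      # current position coordinate L_i
    load: float          # current remaining load l_i
    capacity: float      # vehicle load capacity C
    depot: int           # index of the vehicle's own warehouse node

def step(vehicles, coords, demand, depot_ids, actions):
    """Apply one joint action and return the shared reward (negative total distance)."""
    total_dist = 0.0
    for veh, k in zip(vehicles, actions):
        x_k = coords[k]
        total_dist += float(np.linalg.norm(veh.pos - x_k))  # distance travelled this step
        veh.pos = x_k                                        # L_i <- x_k
        if k in depot_ids:                                   # warehouse visit: refill
            veh.load = veh.capacity                          # l <- C
        else:                                                # customer visit (served fully)
            veh.load -= demand[k]                            # l <- l - d_k
            demand[k] = 0.0                                  # d_k <- 0
    return -total_dist                                       # shared reward r_t
```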
Step 2: building an encoder network model:
the encoder network consists of a single-layer linear network and a GRU (Gated Recurrent Unit) network, which enables the agent to remember its recent visiting trajectory. The encoder network receives the local observation o_i^t of vehicle i at time t, computes an initial embedding by linear projection, combines the initial embedding with the GRU hidden state h_t, and computes through the GRU the feature vector e_i of each agent's local observation.
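The encoder described in Step 2 could be sketched as follows in PyTorch; the framework choice, the class name and the observation dimension argument are assumptions, while the 64-unit hidden layer follows the parameter settings given later.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Single-layer linear projection followed by a GRU (Step 2)."""
    def __init__(self, obs_dim, hidden_dim=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden_dim)          # initial embedding
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, obs, h_prev=None):
        # obs: (n_vehicles, obs_dim) local observations o_i^t at time t
        x = self.embed(obs).unsqueeze(1)                     # (n_vehicles, 1, hidden_dim)
        out, h_next = self.gru(x, h_prev)                    # combine with hidden state h_t
        return out.squeeze(1), h_next                        # feature vectors e_i, new hidden
```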
Step 3: building a decoder network model:
the decoder is composed of a soft-hard two-stage attention mechanism module:
First, the hard attention mechanism module receives from the encoder the feature vectors e_i of all vehicles i, which contain their independent observation information, and learns from them the hard attention weight W_h. W_h consists of one-hot vectors that determine which other vehicles a vehicle needs to communicate with at the current time t. The hard attention mechanism module is implemented with a bi-GRU (bidirectional GRU) network: the feature vectors (e_i, e_j) of vehicle i and vehicle j (i ≠ j) are fed into the bi-GRU, and the output embedding h_{i,j} is obtained through a fully connected layer f, i.e. h_{i,j} = f(bi-GRU(e_i, e_j)). The hard attention module requires a sampling operation, which makes it difficult to back-propagate gradients; the Gumbel-Softmax function is an effective tool for this problem, so Gumbel-Softmax is used to compute the hard attention weights.
Then, the soft attention mechanism module gathers the observation features e_i of all vehicles i and, combined with the hard attention weight W_h, computes for each vehicle i a correlation weight X_i with the other vehicles; X_i is the weighted sum of the other vehicles' values, X_i = Σ_j α_{i,j} V_j, where the value V_j is obtained by linearly transforming the corresponding feature vector e_j with the matrix W_V. The attention weight α compares feature vectors e_i and e_j through a query-key system, in which W_q converts e_i into a query and W_k converts e_j into a key, and the matching value is fed into a Softmax function. To prevent vanishing gradients, the matching value is scaled according to the dimensions of W_q and W_k and combined with the hard attention weight W_h to obtain the attention weight α: α_{i,j} = softmax_j((W_q e_i)·(W_k e_j)/√d_att)·W_h^{i,j}, where d_att is the dimension of the query and key vectors.
Finally, the feature vector e_i of each vehicle i is merged with its correlation weight X_i to compute the action value function Q_i of each vehicle i: Q_i = f(g(e_i, X_i)), where f is a multi-layer linear network and g is a single linear layer.
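A possible reading of the soft-hard two-stage attention decoder of Step 3 is sketched below; the layer sizes, the Gumbel-Softmax temperature and the exact way the hard weights gate the soft attention are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftHardAttentionDecoder(nn.Module):
    """Sketch of the two-stage attention decoder of Step 3 (layer sizes assumed)."""
    def __init__(self, dim, n_actions):
        super().__init__()
        self.bi_gru = nn.GRU(2 * dim, dim, bidirectional=True, batch_first=True)
        self.hard_fc = nn.Linear(2 * dim, 2)          # logits: communicate / ignore
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.g = nn.Linear(2 * dim, dim)              # merges (e_i, X_i)
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, n_actions))

    def forward(self, e):                             # e: (n_agents, dim) feature vectors
        n, d = e.shape
        # ---- hard attention: one weight per ordered pair (i, j) ----
        pairs = torch.cat([e.unsqueeze(1).expand(n, n, d),
                           e.unsqueeze(0).expand(n, n, d)], dim=-1)    # (n, n, 2d)
        h_ij, _ = self.bi_gru(pairs)                  # bi-GRU over agent pairs
        logits = self.hard_fc(h_ij)                   # (n, n, 2)
        w_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]  # (n, n) in {0, 1}
        # ---- soft attention over the vehicles selected by the hard stage ----
        q, k, v = self.W_q(e), self.W_k(e), self.W_v(e)
        scores = q @ k.t() / d ** 0.5                 # scaled query-key matching
        scores = scores.masked_fill(torch.eye(n, dtype=torch.bool), float('-inf'))
        alpha = F.softmax(scores, dim=-1) * w_hard    # combine with hard weights
        X = alpha @ v                                 # correlation weights X_i
        return self.f(torch.relu(self.g(torch.cat([e, X], dim=-1))))   # Q_i per vehicle
```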
Step 4: establishing a critic network model under the MAC-AC algorithm framework:
the critic network model consists of an evaluation network and a target critic network, which are multi-layer linear networks with the same dimensions but different parameters; the evaluation network receives the environmental state S_t at time t and estimates the state value V(S_t); the target critic network receives the environmental state at time t+1 and estimates the state value V′(S_{t+1}).
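The critic of Step 4 could be realized, for example, as the following pair of value networks; the depth of the multi-layer linear network is an assumption, while the 128-unit hidden layer follows the parameter settings given later.

```python
import torch.nn as nn

class Critic(nn.Module):
    """Multi-layer linear network estimating the state value (Step 4)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)

# Evaluation network and target critic share the architecture but not the parameters:
# eval_critic = Critic(state_dim)
# target_critic = Critic(state_dim)
# target_critic.load_state_dict(eval_critic.state_dict())   # synced every T rounds
```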
Step 5: introducing a mask mechanism to accelerate training:
during training, the mask sets the log-probability of every node that a vehicle must not visit to −∞, thereby shielding infeasible actions, and forces a specific choice when particular conditions are met. The masking mechanism works as follows: (1) a vehicle is not allowed to visit customers whose demand is 0; (2) when the vehicle's remaining load cannot satisfy any customer, the vehicle is forced to return to its corresponding warehouse to replenish goods; (3) all customers whose current demand exceeds the vehicle's load are masked; (4) according to the characteristics of multi-distribution-center vehicle path planning, a candidate list of length H is set for each vehicle, restricting it to choose the next target node from the H nearest available nodes, which accelerates convergence.
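A minimal sketch of the masking rules of Step 5 is given below; it reuses the Vehicle fields from the transition sketch above, H = 6 follows the candidate-list length given in the parameter settings, and the data layout is an assumption.

```python
import numpy as np

def build_mask(vehicle, coords, demand, depot_ids, H=6):
    """Boolean mask of feasible next nodes for one vehicle (Step 5)."""
    n = len(coords)
    feasible = np.ones(n, dtype=bool)
    for k in range(n):
        if k in depot_ids:
            feasible[k] = (k == vehicle.depot)        # only the vehicle's own warehouse
        elif demand[k] <= 0:                          # rule (1): already served customers
            feasible[k] = False
        elif demand[k] > vehicle.load:                # rule (3): demand exceeds load
            feasible[k] = False
    # rule (2): no customer can be served -> force return to the warehouse
    customer_ok = feasible.copy()
    for d_id in depot_ids:
        customer_ok[d_id] = False
    if not customer_ok.any():
        feasible[:] = False
        feasible[vehicle.depot] = True
        return feasible
    # rule (4): keep only the H closest feasible nodes (candidate list)
    dists = np.linalg.norm(coords - vehicle.pos, axis=1)
    dists[~feasible] = np.inf
    keep = np.argsort(dists)[:H]
    out = np.zeros(n, dtype=bool)
    out[keep] = feasible[keep]
    return out
```

The −∞ of the patent corresponds to setting the log-probability (or Q value) of every masked node to negative infinity before action selection.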
Step 6: determining an action selection mechanism:
The ε-Greedy method is adopted as the action selection strategy; ε-Greedy uses a parameter ε (0 ≤ ε ≤ 1) to trade off between exploring uncertain strategies and exploiting the current best strategy. With probability 1−ε the agent selects the action with the largest Q value in the action value function; with probability ε the agent selects an action at random from the available action set. ε gradually decreases during training, meaning that as information and experience accumulate, the exploitation of learned information gradually increases and exploration gradually decreases.
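The ε-Greedy selection of Step 6 could be implemented along the following lines; the linear decay shown in the comment follows the small-scale parameter settings described later.

```python
import numpy as np

def select_action(q_values, mask, epsilon):
    """Epsilon-greedy selection over the masked action set (Step 6)."""
    feasible = np.flatnonzero(mask)
    if np.random.rand() < epsilon:
        return int(np.random.choice(feasible))        # explore: random feasible action
    q = np.where(mask, q_values, -np.inf)             # exploit: best feasible Q value
    return int(np.argmax(q))

# Example linear decay (small-scale setting): epsilon = max(0.02, 0.6 - 1.28e-3 * episode)
```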
Step 7: training a neural network model through an MAC-AC algorithm to obtain a vehicle path planning result:
The parameter θ parameterizes the encoder and decoder, while W and W⁻ parameterize all trainable variables of the evaluation network and the target critic network, respectively. The advantage function is computed as A_t = Q_tot − V_W(S_t), where the joint action value Q_tot is approximated by r_t + γ V_{W⁻}(S_{t+1}), γ being the reward discount rate, and V_W and V_{W⁻} being the estimates of the evaluation network and the target critic network, respectively. The actor parameters θ_i are updated by policy-gradient ascent weighted by the advantage function. The evaluation network parameters W are updated with the TD algorithm by minimizing the squared TD error δ_t = r_t + γ V_{W⁻}(S_{t+1}) − V_W(S_t). For more stable training, the evaluation network parameters W are copied to the target critic network parameters W⁻ every T training rounds. After training reaches the set maximum number of rounds, the solution with the largest reward obtained during training is taken as the solution of the problem.
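One possible form of the MAC-AC update of Step 7 is sketched below; the discount rate, the mean-squared TD loss and the exact actor loss are assumptions, since the text only states that the actor is updated with an advantage-weighted gradient and the evaluation network with a TD update.

```python
import torch
import torch.nn.functional as F

def train_step(eval_critic, target_critic, critic_opt, actor_opt,
               s_t, s_next, reward, log_probs, gamma=0.99):
    """One update as read from Step 7: TD(0) critic update + advantage-weighted actor update."""
    v_t = eval_critic(s_t)                                  # V_W(S_t)
    with torch.no_grad():
        v_next = target_critic(s_next)                      # V_{W-}(S_{t+1})
        q_tot = reward + gamma * v_next                     # approximates Q_tot
        advantage = q_tot - v_t.detach()                    # A_t = Q_tot - V_W(S_t)
    critic_loss = F.mse_loss(v_t, q_tot)                    # squared TD error
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # log_probs: (batch, n_agents) log-probabilities of the chosen actions
    actor_loss = -(advantage * log_probs.sum(dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

def sync_target(eval_critic, target_critic):
    """Copy W into W- every T training rounds."""
    target_critic.load_state_dict(eval_critic.state_dict())
```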
Description of the interaction process between the environment and the agent in the training process:
as shown in fig. 2 of the specification:
(1) At time step t, each vehicle obtains its local observation o_i^t from the environment, and the vehicles communicate with one another.
(2) Under the constraint of the mask matrix, each vehicle decides its action a_i^t for the next moment through the policy network.
(3) The environment updates the customer node information, the information of each vehicle and the mask matrix according to the actions taken by the vehicles, and gives a common global immediate reward used to improve the critic network.
(4) The critic network feeds an evaluation back to the policy network to improve the distribution policy.
The above process is repeated until all customers have been served, completing a single training episode.
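This interaction loop could be sketched as follows, reusing the encoder, decoder and ε-greedy helper from the earlier sketches; the env object and its reset/step/mask methods are illustrative assumptions, not part of the claimed method.

```python
def run_episode(env, encoder, decoder, epsilon):
    """Single training episode following the interaction loop of FIG. 2."""
    obs, hidden = env.reset(), None
    episode_reward, done = 0.0, False
    while not done:                                        # until all customers are served
        e, hidden = encoder(obs, hidden)                   # local observations -> features e_i
        q_values = decoder(e)                              # per-vehicle action values Q_i
        actions = [select_action(q_values[i].detach().numpy(),
                                 env.mask(i), epsilon)     # masked epsilon-greedy choice
                   for i in range(env.n_vehicles)]
        obs, reward, done = env.step(actions)              # environment transition
        episode_reward += reward                           # shared global reward
    return episode_reward
```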
Preferred parameter settings and optimizer selections of the present application:
The RMS optimizer is used to optimize the encoder-decoder and the critic network parameters with learning rates α = 1×10⁻⁴ and β = 1×10⁻³, respectively. The hidden layer dimension of the GRU in the encoder network is 64, and the linear hidden layer dimension of the critic network is 128. Different parameter settings are used for small-scale instances (50 nodes or fewer), medium-scale instances (between 50 and 100 nodes) and large-scale instances (more than 100 nodes). Small-scale instances are trained for 3×10⁴ rounds, medium-scale instances for 5×10⁴ rounds and large-scale instances for 6×10⁴ rounds. In the small-scale setting, the parameter ε of the action selection strategy ε-Greedy decreases from 0.6 to 0.02 in steps of 1.28×10⁻³ per training round; in the medium-scale setting, ε drops from 0.8 to 0.02 in steps of 6.4×10⁻⁴; in the large-scale setting, ε drops from 0.9 to 0.02 in steps of 6.4×10⁻⁴. The target critic network parameters W⁻ are updated every 25, 50 and 50 training rounds in the small-, medium- and large-scale settings respectively, and the candidate list length H is 6.
The foregoing description is only of the preferred embodiments of the application, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (8)

1. The distribution center vehicle path planning method based on multi-agent deep reinforcement learning is characterized by comprising the following steps of:
step S1: defining the basic multi-agent deep reinforcement learning components for multi-distribution-center vehicle path planning;
step S2: constructing an encoder network model;
step S3: constructing a decoder network model embedded with a soft-hard two-stage attention mechanism;
step S4: establishing a critic network model under the MAC-AC algorithm framework;
step S5: introducing a mask mechanism to accelerate training;
step S6: determining an action selection mechanism;
step S7: training the neural network models with the MAC-AC algorithm to obtain the vehicle path planning result.
2. The multi-agent deep reinforcement learning based distribution center vehicle path planning method of claim 1, wherein the reinforcement learning basic components defined in step S1 include:
(1) Defining the vehicles dispatched from the respective warehouses as independent agents;
(2) Defining the vehicle's decision of its next target node as an action;
(3) The state is composed of an environment state and an agent state, wherein,
the environmental state includes: all warehouse coordinates, all customer coordinates and their remaining demands, and the current coordinates of each vehicle and its remaining cargo capacity;
the agent state includes:
the position coordinate differences between the vehicle and the nodes not yet served, the absolute value of the difference between the vehicle's remaining cargo and the customers' remaining demand, and the current coordinates and load of the vehicle;
(4) Defining a state transition function:
after vehicle i visits any node k, its position coordinate L_i is updated as L_i ← x_k, where x_k is the coordinate of the visited node; after the vehicle visits a customer node k, the remaining vehicle load and the customer demand are updated as l ← l − d_k followed by d_k ← 0, where d denotes each customer's demand and l denotes the vehicle's current remaining load; after the vehicle visits its corresponding warehouse node, it is refilled with goods: l ← C, where C denotes the vehicle's load capacity;
(5) Defining a reward function:
taking the negative of the total distance travelled by all vehicles in one time step as the reward function: r_t = −Σ_{i=1}^{M} ‖L_i^{t+1} − L_i^{t}‖, where M is the total number of warehouses (each warehouse dispatching one vehicle).
3. The multi-agent deep reinforcement learning-based distribution center vehicle path planning method according to claim 1, wherein the encoder network model in step S2 is specifically:
the encoder network consists of a single-layer linear network and a gated recurrent unit network (GRU); it receives the local observation o_i^t of vehicle i at time t, computes an initial embedding by linear projection, combines the initial embedding with the GRU hidden state h_t, and computes through the GRU the feature vector e_i of each agent's local observation.
4. The multi-agent deep reinforcement learning-based distribution center vehicle path planning method according to claim 3, wherein the decoder network model in step S3 is specifically:
the decoder is composed of a soft-hard two-stage attention mechanism module:
the hard attention mechanism module receives the feature vectors e_i of all vehicles, which contain their independent observation information, and learns from them the hard attention weight W_h; W_h consists of one-hot vectors that determine which other vehicles each vehicle needs to communicate with at the current time t; the hard attention mechanism module is implemented with a bi-GRU network: the feature vectors (e_i, e_j) of vehicle i and vehicle j (i ≠ j) are fed into the bi-GRU, and the output embedding h_{i,j} is obtained through a fully connected layer f, i.e. h_{i,j} = f(bi-GRU(e_i, e_j)); the hard attention weights are then computed with Gumbel-Softmax;
the soft attention mechanism module gathers the observation features e_i of all vehicles and, combined with the hard attention weight W_h, computes for each vehicle i a correlation weight X_i with the other vehicles; X_i is the weighted sum of the other vehicles' values, X_i = Σ_j α_{i,j} V_j, where the value V_j is obtained by linearly transforming the corresponding feature vector e_j with the matrix W_V; the attention weight α compares feature vectors e_i and e_j through a query-key system, in which W_q converts e_i into a query and W_k converts e_j into a key, and the matching value is fed into a Softmax function; the matching value is scaled according to the dimensions of W_q and W_k and combined with the hard attention weight W_h to obtain the attention weight α, i.e. α_{i,j} = softmax_j((W_q e_i)·(W_k e_j)/√d_att)·W_h^{i,j}, where d_att is the dimension of the query and key vectors; the feature vector e_i of each vehicle i is then merged with its correlation weight X_i to compute the action value function Q_i of each vehicle i; the action value function Q_i is computed according to Q_i = f(g(e_i, X_i)), where f is a multi-layer linear network and g is a single-layer linear network.
5. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein step S4 is specifically: the critic network model consists of an evaluation network and a target critic network, which are multi-layer linear networks with the same dimensions but different parameters; the evaluation network receives the environmental state S_t at time t and estimates the state value V(S_t); the target critic network receives the environmental state at time t+1 and estimates the state value V′(S_{t+1}).
6. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein the masking mechanism introduced in step S5 is specifically:
during training, the mask sets the log-probability of every node that a vehicle must not visit to −∞, thereby shielding infeasible actions, and forces a specific choice when particular conditions are met; the working mechanism is as follows: (1) a vehicle is not allowed to visit customers whose demand is 0; (2) when the vehicle's remaining load cannot satisfy any customer, the vehicle is forced to return to its corresponding warehouse to replenish goods; (3) all customers whose current demand exceeds the vehicle's load are masked; (4) according to the characteristics of multi-distribution-center vehicle path planning, a candidate list of length H is set for each vehicle, restricting it to choose the next target node from the H nearest available nodes, which accelerates convergence.
7. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein step S6 is specifically: the ε-Greedy method is adopted as the action selection strategy; ε-Greedy uses a parameter ε (0 ≤ ε ≤ 1) to trade off between exploring uncertain strategies and exploiting the current best strategy; with probability 1−ε the agent selects the action with the largest Q value in the action value function; with probability ε the agent selects an action at random from the available action set; ε gradually decreases during training, meaning that as information and experience accumulate, the exploitation of learned information gradually increases and exploration gradually decreases.
8. The method for planning a vehicle path in a distribution center based on multi-agent deep reinforcement learning according to claim 1, wherein step S7 is specifically: the parameter θ parameterizes the encoder and decoder, while W and W⁻ parameterize all trainable variables of the evaluation network and the target critic network, respectively; the advantage function is computed as A_t = Q_tot − V_W(S_t), where the joint action value Q_tot is approximated by r_t + γ V_{W⁻}(S_{t+1}), γ being the reward discount rate, and V_W and V_{W⁻} being the estimates of the evaluation network and the target critic network, respectively; the actor parameters θ_i are updated by policy-gradient ascent weighted by the advantage function; the evaluation network parameters W are updated with the TD algorithm by minimizing the squared TD error δ_t = r_t + γ V_{W⁻}(S_{t+1}) − V_W(S_t); every T training rounds the evaluation network parameters W are copied to the target critic network parameters W⁻; after training reaches the set maximum number of rounds, the solution with the largest reward obtained during training is taken as the solution of the problem.
CN202310565816.8A 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning Pending CN116739466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310565816.8A CN116739466A (en) 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310565816.8A CN116739466A (en) 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116739466A true CN116739466A (en) 2023-09-12

Family

ID=87902080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310565816.8A Pending CN116739466A (en) 2023-05-19 2023-05-19 Distribution center vehicle path planning method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116739466A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273590A (en) * 2023-10-19 2023-12-22 苏州大学 Neural combination optimization method and system for solving vehicle path optimization problem


Similar Documents

Publication Publication Date Title
WO2021248607A1 (en) Deep reinforcement learning-based taxi dispatching method and system
Gupta et al. Half a dozen real-world applications of evolutionary multitasking, and more
Yang et al. Cooperative traffic signal control using multi-step return and off-policy asynchronous advantage actor-critic graph algorithm
Sáez et al. Hybrid adaptive predictive control for the multi-vehicle dynamic pick-up and delivery problem based on genetic algorithms and fuzzy clustering
CN112799386B (en) Robot path planning method based on artificial potential field and reinforcement learning
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
Wang et al. Ant colony optimization with an improved pheromone model for solving MTSP with capacity and time window constraint
Qin et al. Multi-agent reinforcement learning-based dynamic task assignment for vehicles in urban transportation system
CN116739466A (en) Distribution center vehicle path planning method based on multi-agent deep reinforcement learning
CN113051815A (en) Agile imaging satellite task planning method based on independent pointer network
Brajevic Artificial bee colony algorithm for the capacitated vehicle routing problem
CN113537580B (en) Public transportation passenger flow prediction method and system based on self-adaptive graph learning
CN113033072A (en) Imaging satellite task planning method based on multi-head attention pointer network
Tarkesh et al. Facility layout design using virtual multi-agent system
CN106295864A (en) A kind of method solving single home-delivery center logistics transportation scheduling problem
CN115759915A (en) Multi-constraint vehicle path planning method based on attention mechanism and deep reinforcement learning
Jang et al. Offline-online reinforcement learning for energy pricing in office demand response: lowering energy and data costs
Xi et al. Hmdrl: Hierarchical mixed deep reinforcement learning to balance vehicle supply and demand
Wang et al. Optimization of ride-sharing with passenger transfer via deep reinforcement learning
Zong et al. Deep reinforcement learning for demand driven services in logistics and transportation systems: A survey
Zhou et al. Optimization of multi-echelon spare parts inventory systems using multi-agent deep reinforcement learning
CN117361013A (en) Multi-machine shelf storage scheduling method based on deep reinforcement learning
CN116187610A (en) Tobacco order vehicle distribution optimization method based on deep reinforcement learning
Li et al. Congestion-aware path coordination game with markov decision process dynamics
CN113890112B (en) Power grid look-ahead scheduling method based on multi-scene parallel learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination