CN113159681A - Multi-type intermodal dynamic path planning method based on game reinforcement learning - Google Patents

Multi-type intermodal dynamic path planning method based on game reinforcement learning

Info

Publication number
CN113159681A
Authority
CN
China
Prior art keywords
state
network
learning
game
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110423315.7A
Other languages
Chinese (zh)
Other versions
CN113159681B (en)
Inventor
叶峰
覃诗
赖乙宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110423315.7A
Publication of CN113159681A
Application granted
Publication of CN113159681B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083Shipping
    • G06Q10/0835Relationships between shipper or supplier and carriers
    • G06Q10/08355Routing methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/042Backward inferencing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-type intermodal dynamic path planning method based on game reinforcement learning. The method comprises the following steps: S1: an order processing module receives the user's order information; S2: a game module calculates game influence factors from the incoming order information; S3: a parallel reinforcement learning module constructs a state transition model in the reinforcement learning environment from the incoming order information, constructs rewards in reinforcement learning for each required objective, and learns a Q network for each single objective; S4: the game factors and the single-objective Q networks are combined to calculate a multi-objective Q table and generate the initial order strategy; S5: the order is executed, the game influence factors are adjusted according to the execution situation, and the multi-objective Q table is adjusted until the order is completed.

Description

Multi-type intermodal dynamic path planning method based on game reinforcement learning
Technical Field
The invention relates to multi-type intermodal transportation path planning, in particular to a multi-type intermodal transportation dynamic path planning method based on game reinforcement learning.
Background
With the continuous improvement of China's comprehensive transportation system, multimodal transport, as an advanced form of transport organization, is being applied ever more widely in transportation practice; as multimodal service networks gradually mature, the economic and social benefits of transportation have also improved markedly.
The multimodal transport path planning is a multi-constraint and multi-objective optimization problem under an uncertain environment.
The general goals pursued by the logistics industry include minimizing overall cost, minimizing overall transit time, and minimizing carbon emissions. Because these objectives are mutually exclusive, achieving global multi-objective optimization when the number of path nodes is large is a typical NP-hard problem. Uncertain factors such as the transportation scenario, user preferences, the environment during transport, and personnel and equipment conditions, together with the space-time constraints of the means of transport, further increase the complexity of the problem. Most current solutions only address path planning for a single objective in a static scenario.
Disclosure of Invention
To solve the multi-objective optimization problem of multimodal transport with space-time constraints in an uncertain environment, the invention provides a multi-type intermodal dynamic path planning method based on game reinforcement learning.
According to the invention, multiple Q networks are established for objectives such as time, cost and social benefit, while a game mechanism weighs time, cost and social benefit according to the transportation scenario, user preferences and other factors.
During order execution, a feedback regulation mechanism is introduced to dynamically adjust the path so as to achieve a dynamic balance among the objectives.
The invention is realized by the following technical scheme:
a multi-type intermodal dynamic path planning method based on game reinforcement learning comprises the following steps:
s1: the order processing module receives user order information;
s2: the game module is used for calculating game influence factors according to the transmitted order information;
s3: the parallel reinforcement learning module is used for constructing a state transition model in a reinforcement learning environment according to the transmitted order information, constructing rewards in reinforcement learning according to required targets, learning a Q network under a single target and performing the learning of a plurality of targets in parallel;
s4: the Q table calculation module under the multiple targets combines the game factors and the Q network under the multiple single targets to calculate the Q table under the multiple targets, and generates an order initial strategy;
s5: and the dynamic adjustment module executes the order, adjusts the game influence factors according to the execution condition and adjusts the Q table under the multiple targets until the order execution is completed.
In step S1, the order processing module receives the order information of a user, which includes the shipping origin, the destination, the shipping time, the cargo type, the cargo weight and the user preference. The user preference may be, for example, shortest transit time or lowest transport cost.
Step S2, the game module calculates an initial game influence factor according to the incoming order information, and specifically includes the following sub-steps:
s2-1: setting the importance degree to be 1 grade to 5 grades, wherein the corresponding numerical values are 1, 3, 5, 7 and 9 respectively, and the higher the grade is, namely the larger the numerical value is, the higher the importance degree is;
s2-2: setting the initial importance degrees of all the targets to be 3 levels, namely setting the value to be 5, and setting different rules according to requirements by a user to adjust the importance degrees of all the targets;
s2-3: and calculating game influence factors according to the grades of the targets.
The parallel reinforcement learning module in step S3 constructs a state transition model in a reinforcement learning environment according to the transmitted order information, constructs rewards in reinforcement learning according to the required targets, learns Q networks under a single target, and learns multiple targets in parallel, and specifically includes the following sub-steps:
s3-1: according to the origin and destination in the order, a complete routing network graph from the origin to the destination, including the two points and the routes between them, is separated from the map storage module. The routing network graph is represented as G(N, V, M), where N = {1, 2, …, n} is the set of nodes, M = {0, 1, 2, …, m} is the set of transportation modes, and V = {v_k | v_k = (v_ij, m), i ∈ N, j ∈ N, m ∈ M} is the set of edges; v_k denotes an edge from node i to node j using transportation mode m. Each node represents a city; between two connected nodes there may be one or more edges corresponding to different transportation modes, and different edges have different time costs, transportation costs, carbon emissions and other costs. The time cost of an edge is denoted T_vk, its transportation cost C_vk, and its carbon emissions CE_vk;
S3-2: a reinforcement learning environment is constructed from the routing network graph; each node is a state S, and the current node selecting one of its edges to a next node is an action A. At each node, i.e. each state, an action is selected from the set of edges selectable at that node, transferring the agent to the next state and yielding a reward. The number of states equals the number of city nodes in the routing network graph; nodes are connected by edges, and the set of edges selectable from node i is v_ki = {v_k | v_k = (v_ij, m), j ∈ N, m ∈ M}. The total number of elements in v_ki is L, i.e. L edges can be selected starting from node i, so the number of actions selectable at node i is L (an illustrative data-structure sketch is given after this list of sub-steps);
s3-3: formulating a corresponding environment reward for each Q learning network; the Q learning networks may include a Q network aiming to minimize order execution time, a Q network aiming to minimize order execution cost, a Q network aiming to minimize carbon emissions, and so on, and the number of objectives can be increased or decreased according to user requirements;
for the Q network that minimizes transit time, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -T_vk; if delivery is delayed by Td relative to the estimated time, a penalty of -Td is incurred. The goal is to maximize the sum of all state rewards, i.e. to minimize execution time;
for Q-networks that minimize transportation costs, the setting of rewards in the environment:
firstly, the transportation cost and the transit cost are assigned corresponding penalty values in the environment according to the selected services, i.e. the differences among road, rail and waterway, the distances, and the differences in loading and unloading charges at freight stations and wharfs;
secondly, the stacking cost, detention cost and delay cost are assigned corresponding penalty values in the environment according to the stacking time, detention time and delay time;
in addition, penalty values are set when the maximum transportation capacity or the maximum node transfer capacity is exceeded. For the Q network that minimizes transportation cost, when the agent transfers from the current state to the next state via v_k, its reward is set to R = -C_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize transportation cost;
for the Q network aiming to minimize carbon emissions, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -CE_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize carbon emissions;
for all objectives, some common rewards are set according to the co-existing constraints:
(1) when the carrying capacity exceeds the limit, the reward value is set to -θ·100;
(2) when the goods are turned back in the transportation process, setting the reward value to be-1000;
s3-4: setting the exploration strategy of reinforcement learning, which is the ε-greedy method;
s3-5: setting a network model of Q network learning under a single target to comprise two deep neural networks, namely a Q-current network and a Q-target network, wherein the two networks have the same structure and inconsistent updating frequency; updating the Q-current network in real time, wherein the updating frequency of the Q-target network is lower than that of the Q-current network, the Q-target network is updated once when the iteration number of the Q-current network reaches a set value C, and the parameter of the Q-current network is updated to the Q-target network; the Q network takes the state as input and outputs the value of all optional actions in the state; the network model comprises a memory pool E for storing experience; the memories in the memory pool are randomly extracted for updating the network, so that the correlation among the memories can be disturbed, and the training efficiency is improved;
s3-6: learning for a Q network under a single target;
s3-7: when the learning times reach the maximum learning times, the single-target Q network completes learning;
s3-8: and learning a plurality of single-target learning Q networks in parallel.
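By way of illustration, the routing network graph G(N, V, M) of S3-1 and the environment of S3-2 might be represented as in the following minimal Python sketch; all class, attribute and method names are illustrative assumptions rather than part of the patented method.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Edge:
        """Edge v_k = (v_ij, m): transportation mode m from city node i to city node j."""
        i: int          # origin node
        j: int          # destination node
        mode: int       # transportation mode m (e.g. 0 = road, 1 = rail, 2 = waterway)
        time: float     # time cost T_vk
        cost: float     # transportation cost C_vk
        carbon: float   # carbon emissions CE_vk

    class RoutingEnv:
        """States are city nodes; the actions at node i are its selectable outgoing edges v_ki."""
        def __init__(self, edges, origin, destination):
            self.origin, self.destination = origin, destination
            self.out_edges = {}
            for e in edges:
                self.out_edges.setdefault(e.i, []).append(e)

        def actions(self, state):
            return self.out_edges.get(state, [])     # the set v_ki, of size L

        def step(self, state, edge, reward_fn):
            next_state = edge.j
            done = next_state == self.destination
            return next_state, reward_fn(edge), done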
The multi-objective Q table calculation module in step S4 combines the game influence factors with the Q network trained for each single objective to calculate the value of every state-action pair, i.e. the action value of each action in each state under the game factors, and generates the initial optimal order strategy by a greedy method; the specific sub-steps are as follows:
s4-1, calculating the value of selectable actions in the state S under multiple targets by combining game influence factors;
and S4-2, generating an order initial strategy according to the generated Q table under multiple targets.
The dynamic adjustment module in step S5 executes the order, adjusts the game impact factor according to the execution condition, and adjusts the Q network under multiple targets until the order execution is completed, and the specific substeps are as follows:
s5-1, the order is executed and its execution is monitored; the game influence factors are adjusted according to the order's execution situation, for example by monitoring time nodes, to reduce the influence of uncertain factors on timeliness and the like. The concrete sub-steps are as follows:
s5-1-1, while the order is executed, the time nodes are monitored; if, due to uncertain factors, the order arrives at state s later than the estimated time node T_s, the importance level of the time objective is raised by one level;
s5-1-2 recalculating game influence factors;
s5-1-3 recalculating the Q table under multiple targets;
s5-1-4, updating the path according to the new Q table and updating the order execution strategy;
s5-1-5, verifying whether the new order strategy meets the requirement, if yes, executing the new strategy, and if not, turning to S5-1-2 until the new strategy meets the requirement.
The step S2-3 of calculating the game influence factor according to the level of each target includes the following sub-steps:
s2-3-1: after the importance levels of the n objectives have been adjusted according to the requirements, the importance level of objective i is L_i;
s2-3-2: the value q_i corresponding to the level of objective i is determined from L_i (level 1 → 1, level 2 → 3, level 3 → 5, level 4 → 7, level 5 → 9);
S2-3-3: the influence factor δ_i of objective i is calculated according to the following formula:
δ_i = q_i / (q_1 + q_2 + … + q_n)
S2-3-4: the targets can be adjusted according to requirements, and the number and the content of the targets can be adjusted.
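By way of illustration, the level-to-value mapping of S2-3-2 and the normalized influence factors of S2-3-3 can be computed as in the following minimal Python sketch; the function and variable names are illustrative assumptions, and the normalization δ_i = q_i / Σ_j q_j follows the reading of the formula given above.

    # Minimal sketch of S2-3-1 .. S2-3-3 (illustrative names only)
    LEVEL_VALUE = {1: 1, 2: 3, 3: 5, 4: 7, 5: 9}   # importance level -> numeric value q_i

    def game_factors(levels):
        """levels: dict mapping objective name -> importance level (1..5).
        Returns the influence factor delta_i = q_i / sum_j q_j for each objective."""
        values = {name: LEVEL_VALUE[lvl] for name, lvl in levels.items()}
        total = sum(values.values())
        return {name: q / total for name, q in values.items()}

    # e.g. time at level 4, cost at level 3, carbon emissions at level 2:
    # game_factors({"time": 4, "cost": 3, "carbon": 2})
    # -> {"time": 7/15, "cost": 5/15, "carbon": 3/15}, i.e. about 0.467, 0.333, 0.200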
In the above step S3-4, the exploration policy is the ε-greedy method, which includes the following sub-steps:
s3-4-1: setting a greedy factor epsilon, wherein the initial value of the greedy factor epsilon ranges from 0 to 1;
s3-4-2: randomly generating a number beta between 0 and 1;
s3-4-3: if β ≥ ε, the action with the largest action value in the action range, i.e. the action with the largest Q value, is selected; if β < ε, an action is selected at random from the action range;
s3-4-4: as training progresses, the exploration rate epsilon becomes smaller as iteration progresses;
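The following minimal Python sketch illustrates the ε-greedy selection of S3-4-1 to S3-4-4 as described above (if β ≥ ε exploit, otherwise explore); the helper names and the decay schedule are illustrative assumptions.

    import random

    def epsilon_greedy(q_values, epsilon):
        """q_values: Q values of the actions selectable in the current state (S3-4-3)."""
        beta = random.random()                  # S3-4-2: random number between 0 and 1
        if beta >= epsilon:                     # exploit: the action with the largest Q value
            return max(range(len(q_values)), key=lambda a: q_values[a])
        return random.randrange(len(q_values))  # explore: a random action

    def decay_epsilon(epsilon, rate=0.995, floor=0.01):
        # S3-4-4: the exploration rate becomes smaller as training proceeds
        return max(floor, epsilon * rate)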
in the above step S3-6, the learning of the Q network under a single target specifically includes the following sub-steps: setting a maximum learning time T, a state set S, an action set A, a step length alpha, an attenuation factor gamma, an exploration rate E, Q-a current network Q, Q-a target network Q', the number of samples m of batch gradient descent and Q-a target network parameter updating frequency C;
s3-6-1: initializing the learning times i to be 1;
s3-6-2: when the learning frequency i is less than the maximum learning frequency T, go to step S3-6-3, otherwise go to step S3-7;
s3-6-3: initializing S to the state of the shipping origin;
s3-6-4: using S as input in the Q network to obtain Q value output corresponding to all actions of the Q network;
s3-6-5: selecting the corresponding action A from the current Q value output by the ε-greedy method, executing action A in state S to obtain a new state S′ and a reward R, judging whether S′ is the destination D and storing the result in the variable is_D; storing the five-tuple {S, A, R, S′, is_D} in the experience replay set E, and switching the state from S to S′;
s3-6-6: sampling m samples {S_j, A_j, R_j, S′_j, is_D_j}, j = 1, 2, …, m, from the experience replay set E, and calculating the current target action value ζ_j:
ζ_j = R_j + γ·max_a′ Q′(S′_j, A′_j)
where R_j is the reward obtained when the state transfers from S_j to S′_j, γ is the decay factor of the reward, and max_a′ Q′(S′_j, A′_j) is the maximum selectable action value at state S′_j;
s3-6-7: using the mean square error loss function
L(w) = (1/m) · Σ_{j=1..m} (ζ_j − Q(S_j, A_j; w))²,
updating all parameters w of the Q network through gradient back propagation of the neural network;
s3-6-8: if the learning count i is a multiple of the update frequency C, i.e. i % C == 0, updating the target Q network parameters as w′ = w;
s3-6-9: if S′ is not a terminal state, go to step S3-6-4 and continue learning; if S′ is a terminal state, the current episode ends and the learning count is incremented, i = i + 1; if the new count is less than the maximum number of learning episodes, go to step S3-6-3, otherwise go to step S3-7.
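The learning procedure S3-6-1 to S3-6-9 can be sketched as follows in Python with PyTorch; this is an illustrative reconstruction only, and the environment interface (env.reset() returning a state index, env.step(state, action) returning the next state, the reward and a done flag), the network size and the hyperparameter values are all assumptions.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Q-current / Q-target network: one-hot state -> value of every action."""
        def __init__(self, n_states, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))
        def forward(self, s):
            return self.net(s)

    def train_single_objective(env, n_states, n_actions, episodes=500, gamma=0.95,
                               eps=1.0, batch=32, target_update=50, lr=1e-3):
        q, q_target = QNet(n_states, n_actions), QNet(n_states, n_actions)
        q_target.load_state_dict(q.state_dict())
        opt = torch.optim.Adam(q.parameters(), lr=lr)
        memory = deque(maxlen=10000)                       # experience pool E
        one_hot = lambda s: torch.eye(n_states)[s]
        for i in range(1, episodes + 1):                   # S3-6-2
            s, done = env.reset(), False                   # S3-6-3
            while not done:
                qs = q(one_hot(s))                         # S3-6-4
                a = (random.randrange(n_actions) if random.random() < eps
                     else int(qs.argmax()))                # S3-6-5: epsilon-greedy choice
                s2, r, done = env.step(s, a)
                memory.append((s, a, r, s2, done))
                s = s2
                if len(memory) >= batch:                   # S3-6-6 / S3-6-7
                    ss, aa, rr, ss2, dd = zip(*random.sample(memory, batch))
                    with torch.no_grad():
                        zeta = (torch.tensor(rr, dtype=torch.float)
                                + gamma
                                * q_target(torch.stack([one_hot(x) for x in ss2])).max(1).values
                                * (1 - torch.tensor(dd, dtype=torch.float)))
                    qs_batch = q(torch.stack([one_hot(x) for x in ss]))
                    pred = qs_batch.gather(1, torch.tensor(aa).unsqueeze(1)).squeeze(1)
                    loss = nn.functional.mse_loss(pred, zeta)
                    opt.zero_grad(); loss.backward(); opt.step()
            if i % target_update == 0:                     # S3-6-8: copy w -> w'
                q_target.load_state_dict(q.state_dict())
            eps = max(0.05, eps * 0.995)                   # decay the exploration rate
        return q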
In the step S4-1, the value of the selectable action in the state S under multiple targets is calculated by combining the game influence factors, and the specific sub-steps are as follows:
s4-1-1, the state s is input into each trained Q network to obtain the values of the selectable actions output by each network in that state; the network for objective i outputs the value set Q_is = {Q_is1, Q_is2, …, Q_isl, …, Q_isL} of the actions in state s;
S4-1-2, according to the game influence factors δ_A, δ_B, …, δ_i, …, δ_n and Q_is, the value of action A_sl in state s is calculated by the following formula:
Q_sl = δ_A·Q_Asl + δ_B·Q_Bsl + … + δ_n·Q_nsl = Σ_i δ_i·Q_isl
s4-1-3, traversing and calculating the values of all actions in all states to form a multi-target Q value table, wherein the horizontal axis of the Q value table is a state, the vertical axis of the Q value table is an action, and each state-action pair corresponds to a unique action value;
step S4-2, generating an order initial strategy according to the generated Q table under multiple targets, and the specific substeps are as follows:
s4-2-1 sets state S to an initial state;
s4-2-2 looks up a table according to the state S to obtain a value set of the optional actions in the state S;
s4-2-3, selecting the action with the highest action value from the selectable actions by a greedy method, and refreshing the state to the next state;
s4-2-4 repeatedly executes S4-2-2 to S4-2-3 until the next state reached by the refresh state is the terminating state destination point D;
s4-2-5, backtracking from the terminal state to the initial state yields the optimal strategy, i.e. the execution route of the order and the predicted time node T_s at which each state s is reached from the initial state.
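As an illustrative sketch (assumptions: the trained networks are the PyTorch Q networks from the sketch above, and the l-th network output corresponds to the l-th selectable edge at the node), the multi-objective Q table of S4-1-1 to S4-1-3 could be built as follows; the initial strategy of S4-2 is then obtained by a greedy walk over this table from the origin to the destination.

    def multi_objective_q_table(q_nets, deltas, states, encode):
        """q_nets: {objective name: trained Q network}; deltas: {objective name: delta_i};
        states: iterable of state indices; encode: state -> network input tensor.
        Returns Q[s] as an array whose l-th entry is sum_i delta_i * Q_i(s, l) (S4-1-2)."""
        table = {}
        for s in states:
            per_objective = {name: net(encode(s)).detach().numpy()
                             for name, net in q_nets.items()}
            table[s] = sum(deltas[name] * values
                           for name, values in per_objective.items())
        return table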
Compared with the prior art, the invention has the following advantages and effects:
the invention provides a multi-target reinforcement learning framework based on a parallel DQN network, a plurality of Q networks are established according to three targets of time, cost and social benefit, and dynamic balance of the time, the cost and the social benefit is realized by adopting a game mechanism.
In the reinforcement learning training process, the invention adopts an experience replay mechanism to accelerate network training, uses an ε-greedy strategy to better balance exploration and exploitation, and effectively reduces training time through multi-network parallel learning.
The invention can dynamically solve the problem of multi-target optimization under multi-mode combined transport, can consider more factors compared with single-target optimization, can realize dynamic balance of time, cost and social benefit, and is more in line with actual conditions; and secondly, the influence of uncertain factors and the like on order execution is considered, dynamic adjustment can be performed according to the execution condition, the robustness is higher, and the adaptability is stronger.
Drawings
Fig. 1 is a flow chart of a dynamic path planning scheme of multimodal transport based on game reinforcement learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
As shown in fig. 1, the invention discloses a multi-type intermodal dynamic path planning method based on game reinforcement learning, which adopts the following scheme:
the order processing module is used for processing order information input by a user;
the game module is used for calculating game influence factors;
a parallel reinforcement learning module for Q network learning under each target
The Q table calculation module under the multiple targets is used for calculating the Q table under the multiple targets;
and the dynamic adjustment module is used for a closed-loop dynamic adjustment feedback module in the order processing process.
Step1. Receiving order information and performing simple processing
The order processing module receives the order information of a user, which includes the shipping origin, the destination, the shipping time, the cargo type, the cargo weight and the user preferences. The user preferences include shortest transit time, lowest transport cost, etc.
Step2. the game module calculates the initial game influence factor according to the transmitted order information
Step21 divides the importance degree into five levels, 1 to 5, with corresponding values 1, 3, 5, 7 and 9; the higher the level, i.e. the larger the value, the higher the importance.
Step22 sets the initial importance level of each target to be 3 grade, that is, the value is 5, and the user makes different rules according to the requirement to adjust the importance level of each target.
Step23 sets the default rules as follows: 1. if the goods require cold-chain transportation or the like, the importance level of the time target is raised by one level; 2. if the user requires timeliness first, the importance level of the time target is raised to level 5; 3. if the user requires lowest transportation cost first, the cost target is raised to level 5; 4. the importance level of carbon emissions must not fall below level 2.
Step24, calculating game influence factors according to the grades of the targets, and the specific steps are as follows:
Step241. After adjustment according to the requirements and the like, the importance level of the time target is A, that of the cost target is B, and that of the carbon emission target is C;
Step242. The values a, b and c corresponding to these levels are determined from A, B and C;
Step243. The influence factor of each target is calculated according to the following formulas, where δ_A is the influence factor of the time target, δ_B that of the transportation cost target, and δ_C that of the minimum carbon emission target:
δ_A = a / (a + b + c)
δ_B = b / (a + b + c)
δ_C = c / (a + b + c)
Step244. The targets can be adjusted according to the requirements; both the number and the content of the targets may be changed.
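As an illustrative example (numbers assumed for illustration only): if after adjustment the time target is at level 4, the cost target at level 3 and the carbon emission target at level 2, then a = 7, b = 5 and c = 3, so δ_A = 7/15 ≈ 0.47, δ_B = 5/15 ≈ 0.33 and δ_C = 3/15 = 0.20; the three influence factors always sum to 1.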
The Step3 parallel reinforcement learning module constructs a state transition model in the reinforcement learning environment according to the transmitted order information, constructs rewards in the reinforcement learning according to the required targets, and learns the Q network under the single target
Step31. According to the origin and destination in the order, a complete routing network graph from the origin to the destination, including the two points and the routes between them, is separated from the map storage module. The routing network graph is represented as G(N, V, M), where N = {1, 2, …, n} is the set of nodes, M = {0, 1, 2, …, m} is the set of transportation modes, and V = {v_k | v_k = (v_ij, m), i ∈ N, j ∈ N, m ∈ M} is the set of edges; v_k denotes an edge from node i to node j using transportation mode m. Each node represents a city; between two connected nodes there may be one or more edges corresponding to different transportation modes, and different edges have different time costs, transportation costs and carbon emission costs. The time cost of an edge is denoted T_vk, its transportation cost C_vk, and its carbon emissions CE_vk.
Step32. A reinforcement learning environment is constructed from the routing network graph; each node is a state S, and the current node selecting one of its edges to a next node is an action A. At each node, i.e. each state, an action is selected from the set of edges selectable at that node, transferring the agent to the next state and yielding a reward. The number of states equals the number of city nodes in the routing network graph; nodes are connected by edges, and the set of edges selectable from node i is v_ki = {v_k | v_k = (v_ij, m), j ∈ N, m ∈ M}. The total number of elements in v_ki is L, i.e. L edges can be selected starting from node i, so the number of actions selectable at node i is L.
Step33. A corresponding environment reward is formulated for each Q-learning network. The Q-learning networks comprise a Q network aiming to minimize order execution time, a Q network aiming to minimize order execution cost, a Q network aiming to minimize carbon emissions, and so on; the number of targets can be increased or decreased according to user requirements.
For the Q network that minimizes transit time, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -T_vk; if delivery is delayed by Td relative to the estimated time, a penalty of -Td is incurred. The goal is to maximize the sum of all state rewards, i.e. to minimize execution time;
for Q-networks that minimize transportation costs, the setting of rewards in the environment:
firstly, the transportation cost and the transit cost are assigned corresponding penalty values in the environment according to the selected services, i.e. the differences among road, rail and waterway, the distances, and the differences in loading and unloading charges at freight stations and wharfs;
secondly, the stacking cost, detention cost and delay cost are assigned corresponding penalty values in the environment according to the stacking time, detention time and delay time;
in addition, penalty values are set when the maximum transportation capacity or the maximum node transfer capacity is exceeded. For the Q network that minimizes transportation cost, when the agent transfers from the current state to the next state via v_k, its reward is set to R = -C_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize transportation cost;
for the Q network aiming to minimize carbon emissions, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -CE_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize carbon emissions;
for all objectives, some common rewards are set according to the co-existing constraints:
(1) when the carrying capacity exceeds the limit, the reward value is set to -θ·100;
(2) when a return occurs during the transportation of the goods, the reward value is set to-1000.
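The reward settings above can be sketched as follows, building on the illustrative Edge record given earlier; θ, the delay Td and the extra cost penalties are assumed to be supplied by the caller, since their exact form is not fixed here.

    # Illustrative reward functions for the three objectives and the common penalties.
    def reward_time(edge, delay_td=0.0):
        return -edge.time - delay_td           # R = -T_vk, plus the -Td delay penalty

    def reward_cost(edge, extra_penalty=0.0):
        return -edge.cost - extra_penalty      # R = -C_vk, plus transit/stacking/detention penalties

    def reward_carbon(edge):
        return -edge.carbon                    # R = -CE_vk

    def common_penalty(over_capacity, turned_back, theta=1.0):
        p = 0.0
        if over_capacity:                      # carrying capacity exceeds the limit
            p -= theta * 100
        if turned_back:                        # goods are routed backwards
            p -= 1000
        return p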
Step34. The exploration strategy of reinforcement learning is set as the ε-greedy method; the specific steps are as follows:
step341 sets the greedy factor ε, and its initial value is between 0 and 1;
step342 randomly generates a number β between 0 and 1;
step343, if β ≥ ε, selects the action with the largest action value in the action range, namely the action with the largest Q value; if β < ε, randomly selects one action in the action range;
step344 is as training progresses, the exploration rate ∈ becomes smaller as iteration progresses.
Step35, setting that the network model of the Q network learning under the single target comprises two deep neural networks, namely a Q-current network and a Q-target network, wherein the two networks have the same structure and inconsistent update frequency. And updating the Q-current network immediately, wherein the updating frequency of the Q-target network is slower than that of the Q-current network, the Q-target network is updated once when the iteration number of the Q-current network reaches a set value C, and the parameters of the Q-current network are updated to the Q-target network. The Q network takes the state as input and outputs the value of all optional actions in the state. The network model comprises a memory pool E for storing experience. The memories in the memory pool are randomly extracted for updating the network, so that the correlation among the memories can be disturbed, and the training efficiency is improved.
Step36 is used for learning the Q network under a single target, and the specific steps are as follows:
setting the maximum number of learning episodes T, the state set S, the action set A, the step size α, the decay factor γ, the exploration rate ε, the Q-current network Q, the Q-target network Q′, the number of samples m for batch gradient descent, and the Q-target network parameter update frequency C.
Step361 initializes the learning times i to 1;
step362, when the learning time i is less than the maximum learning time T, turning to Step363, otherwise, turning to Step 37;
step363 initializes S as the state of the start point;
step364 uses S as input in the Q network to obtain Q value output corresponding to all actions of the Q network;
step365 selects the corresponding action A from the current Q value output by the ε-greedy method, executes action A in state S to obtain a new state S′ and a reward R, judges whether S′ is the destination D and stores the result in the variable is_D. The five-tuple {S, A, R, S′, is_D} is stored in the experience replay set E, and the state is switched from S to S′;
step366 samples m samples {S_j, A_j, R_j, S′_j, is_D_j}, j = 1, 2, …, m, from the experience replay set E, and calculates the current target action value ζ_j:
ζ_j = R_j + γ·max_a′ Q′(S′_j, A′_j)
where R_j is the reward obtained when the state transfers from S_j to S′_j, γ is the decay factor of the reward, and max_a′ Q′(S′_j, A′_j) is the maximum selectable action value at state S′_j;
step367 uses the mean square error loss function
L(w) = (1/m) · Σ_{j=1..m} (ζ_j − Q(S_j, A_j; w))²,
updating all parameters w of the Q network through gradient back propagation of the neural network;
step368, if the learning count i is a multiple of the update frequency C, i.e. i % C == 0, updates the target Q network parameters as w′ = w;
step369, if S′ is not a terminal state, turns to Step364 to continue learning; if S′ is a terminal state, the current episode ends and the learning count is incremented, i = i + 1; if the new count is less than the maximum number of learning episodes, turn to Step363, and if the maximum number is reached, turn to Step37;
step37 when the learning times reach the maximum learning times, the single-target Q network completes the learning.
Step38 learns multiple single-target learning Q-networks in parallel.
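Step38 can be sketched as follows, reusing the train_single_objective function from the earlier illustrative sketch; running the three trainings in threads is only one possible realization (with real GPU training, separate processes or sequential runs may be preferable), and env_factory is an assumed hook that builds an environment using the reward of the given objective.

    from concurrent.futures import ThreadPoolExecutor

    def train_all(env_factory, objectives, n_states, n_actions):
        """Train one Q network per objective in parallel (illustrative sketch)."""
        def run(objective):
            env = env_factory(objective)       # environment with this objective's reward
            return objective, train_single_objective(env, n_states, n_actions)
        with ThreadPoolExecutor(max_workers=len(objectives)) as pool:
            return dict(pool.map(run, objectives))

    # q_nets = train_all(make_env, ["time", "cost", "carbon"], n_states, n_actions)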
Step4, calculating a value table of each state-action pair by combining the game influence factor and the Q network trained under each single target, calculating action values of corresponding actions under each state under the game factor, and generating an order initial optimal strategy according to a greedy method
Step41. The values of the selectable actions in state n under the multiple targets are calculated in combination with the game influence factors; the specific steps are as follows:
step411 inputs the state n into each trained Q network to obtain the values of the selectable actions output by each network in that state; for example, for the time target (the A target network) the values of the actions output in state n are Q_AN = {Q_AN1, Q_AN2, Q_AN3, …, Q_ANl}; for the transportation cost target (the B target network), Q_BN = {Q_BN1, Q_BN2, Q_BN3, …, Q_BNl}; for the minimum carbon emission target (the C target network), Q_CN = {Q_CN1, Q_CN2, Q_CN3, …, Q_CNl};
step412, according to the game influence factors δ_A, δ_B and δ_C obtained in Step243 and the values Q_ANl, Q_BNl, Q_CNl, calculates the value of action A_nl in state n by the following formula:
Q_Nl = δ_A·Q_ANl + δ_B·Q_BNl + δ_C·Q_CNl
step413 calculates values of all actions in all states in a traversal mode, a multi-target Q value table is formed, the horizontal axis of the Q value table is a state, the vertical axis of the Q value table is an action, and each state-action pair corresponds to a unique action value.
Step42, generating an order initial strategy according to the generated Q table under multiple targets, and specifically comprising the following steps:
step421 sets state S to the initial state;
step422 looks up a table according to the state S to obtain a value set of the optional actions in the state S;
step423 selects the action with the highest action value from the selectable actions by using a greedy method, and refreshes the state to the next state;
step424 repeatedly executes steps 422 to 423 until the next state where the refresh state reaches is the destination point D of the termination state;
step425 backtracks from the terminal state to the initial state and obtains the optimal strategy, i.e. the execution route of the order and the predicted time node T_s at which each state node is reached.
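An illustrative sketch of the greedy rollout of Step42 follows, assuming the combined Q table from the earlier sketch, the RoutingEnv interface introduced above, and that the l-th table entry of a state corresponds to its l-th selectable edge; the recorded arrival times play the role of the estimated time nodes T_s used in Step5.

    def initial_strategy(q_table, env, origin, destination):
        """Greedy walk over the multi-objective Q table from origin to destination."""
        route, s, elapsed = [origin], origin, 0.0
        predicted_time = {origin: 0.0}                 # estimated time node T_s per state
        while s != destination:
            edges = env.actions(s)                     # selectable edges at state s
            best = max(range(len(edges)), key=lambda l: q_table[s][l])
            edge = edges[best]
            elapsed += edge.time
            s = edge.j
            route.append(s)
            predicted_time[s] = elapsed
        return route, predicted_time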
Step5, adjusting game influence factors according to the execution condition, and adjusting the Q network under multiple targets until the order execution is completed.
Step51. The order is executed and its execution is monitored; the game influence factors are adjusted according to the order's execution situation, for example by monitoring the time nodes, to reduce the influence of uncertain factors on timeliness and the like. The specific steps are as follows:
step511, monitoring the time node when executing the order, and if the order is later than the estimated time node due to the influence of uncertain factors, improving the time importance degree by one level;
step512, recalculating the game influence factors;
step513 recalculates the Q table under multiple targets;
step514 updates the path according to the new Q table, and updates the order execution strategy;
step515 verifies whether the new order strategy meets the requirements; if so, the new strategy is executed, otherwise go to Step512.
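The feedback loop of Step511 to Step515 might look like the following sketch (illustrative only; game_factors is the helper from the earlier sketch, and recompute_q_table, extract_strategy and satisfies are assumed hooks wrapping Step513, Step514 and Step515).

    def dynamic_adjust(levels, game_factors, recompute_q_table, extract_strategy, satisfies):
        """Raise the time importance, recompute the factors, Q table and route,
        and repeat until the new strategy meets the requirements (capped at level 5)."""
        while True:
            levels["time"] = min(5, levels["time"] + 1)     # Step511: delay detected
            deltas = game_factors(levels)                   # Step512: new influence factors
            q_table = recompute_q_table(deltas)             # Step513: new multi-objective Q table
            strategy = extract_strategy(q_table)            # Step514: new execution route
            if satisfies(strategy) or levels["time"] == 5:  # Step515
                return strategy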
As described above, the present invention can be preferably realized.
The embodiments of the present invention are not limited to the above-described embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims (10)

1. A multi-type intermodal dynamic path planning method based on game reinforcement learning is characterized by comprising the following steps:
s1: the order processing module receives user order information;
s2: the game module is used for calculating game influence factors according to the transmitted order information;
s3: the parallel reinforcement learning module is used for constructing a state transition model in a reinforcement learning environment according to the transmitted order information, constructing rewards in reinforcement learning according to required targets, learning a Q network under a single target and performing the learning of a plurality of targets in parallel;
s4: the Q table calculation module under the multiple targets combines the game factors and the Q network under the multiple single targets to calculate the Q table under the multiple targets, and generates an order initial strategy;
s5: and the dynamic adjustment module executes the order, adjusts the game influence factors according to the execution condition and adjusts the Q table under the multiple targets until the order execution is completed.
2. The multimodal transport dynamic path planning method based on the game reinforcement learning as claimed in claim 1, wherein: in step S1, the order processing module receives the order information of a user, where the order information includes a shipping origin, a destination, a shipping time, a cargo type, a cargo weight, and a user preference.
3. The game reinforcement learning-based multimodal transportation dynamic path planning method of claim 2, wherein the game module calculates an initial game influence factor according to the incoming order information in step S2, and specifically comprises the following sub-steps:
s2-1: setting the importance degree to be 1 grade to 5 grades, wherein the corresponding numerical values are 1, 3, 5, 7 and 9 respectively, and the higher the grade is, namely the larger the numerical value is, the higher the importance degree is;
s2-2: setting the initial importance degrees of all the targets to be 3 levels, namely setting the value to be 5, and setting different rules according to requirements by a user to adjust the importance degrees of all the targets;
s2-3: and calculating game influence factors according to the grades of the targets.
4. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 3, wherein the parallel reinforcement learning module in step S3 constructs a state transition model in a reinforcement learning environment according to the incoming order information, constructs rewards in reinforcement learning according to the required targets, learns Q network under a single target, and learns Q network under multiple targets in parallel, specifically including the following sub-steps:
s3-1: according to the origin and destination in the order, a complete routing network graph from the origin to the destination, including the two points and the routes between them, is separated from the map storage module; the routing network graph is represented as G(N, V, M), where N = {1, 2, …, n} is the set of nodes, M = {0, 1, 2, …, m} is the set of transportation modes, and V = {v_k | v_k = (v_ij, m), i ∈ N, j ∈ N, m ∈ M} is the set of edges; v_k denotes an edge from node i to node j using transportation mode m; each node represents a city; between two connected nodes there may be one or more edges corresponding to different transportation modes, and different edges have different time costs, transportation costs and carbon emission costs; the time cost of an edge is denoted T_vk, its transportation cost C_vk, and its carbon emissions CE_vk;
S3-2: a reinforcement learning environment is constructed from the routing network graph; each node is a state S, and the current node selecting one of its edges to a next node is an action A; at each node, i.e. each state, an action is selected from the set of edges selectable at that node, transferring the agent to the next state and yielding a reward; the number of states equals the number of city nodes in the routing network graph; nodes are connected by edges, and the set of edges selectable from node i is v_ki = {v_k | v_k = (v_ij, m), j ∈ N, m ∈ M}; the total number of elements in v_ki is L, i.e. L edges can be selected starting from node i, so the number of actions selectable at node i is L;
s3-3: formulating a corresponding environment reward for each Q learning network; the Q learning networks may include a Q network aiming to minimize order execution time, a Q network aiming to minimize order execution cost and a Q network aiming to minimize carbon emissions, wherein the number of targets can be increased or decreased according to user requirements;
for the Q network that minimizes transit time, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -T_vk; if delivery is delayed by Td relative to the estimated time, a penalty of -Td is incurred. The goal is to maximize the sum of all state rewards, i.e. to minimize execution time;
for Q-networks that minimize transportation costs, the setting of rewards in the environment:
firstly, the transportation cost and the transit cost are assigned corresponding penalty values in the environment according to the selected services, i.e. the differences among road, rail and waterway, the distances, and the differences in loading and unloading charges at freight stations and wharfs;
secondly, the stacking cost, detention cost and delay cost are assigned corresponding penalty values in the environment according to the stacking time, detention time and delay time;
in addition, penalty values are set when the maximum transportation capacity or the maximum node transfer capacity is exceeded. For the Q network that minimizes transportation cost, when the agent transfers from the current state to the next state via v_k, its reward is set to R = -C_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize transportation cost;
for the Q network aiming to minimize carbon emissions, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -CE_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize carbon emissions;
for all objectives, a common reward is set according to a co-existing constraint:
(1) when the carrying capacity exceeds the limit, the reward value is set to -θ·100;
(2) when the goods are turned back in the transportation process, setting the reward value to be-1000;
s3-4: setting the exploration strategy of reinforcement learning, which is the ε-greedy method;
s3-5: setting a network model of Q network learning under a single target to comprise two deep neural networks, namely a Q-current network and a Q-target network, wherein the two networks have the same structure and inconsistent updating frequency; updating the Q-current network in real time, wherein the updating frequency of the Q-target network is lower than that of the Q-current network, the Q-target network is updated once when the iteration number of the Q-current network reaches a set value C, and the parameter of the Q-current network is updated to the Q-target network; the Q network takes the state as input and outputs the value of all optional actions in the state; the network model comprises a memory pool E for storing experience; the memories in the memory pool are randomly extracted for updating the network, so that the correlation among the memories can be disturbed, and the training efficiency is improved;
s3-6: learning for a Q network under a single target;
s3-7: when the learning times reach the maximum learning times, the single-target Q network completes learning;
s3-8: and learning a plurality of single-target learning Q networks in parallel.
5. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 4, wherein the multi-objective Q table calculation module in step S4 calculates the value table of each state-action pair in combination with the game influence factors and the Q network trained for each single target, calculates the action value of each action in each state under the game factors, and generates the initial optimal order strategy by a greedy method, and the specific sub-steps are as follows:
s4-1: calculating the value of selectable actions under the state s of multiple targets by combining game influence factors;
s4-2: and generating an order initial strategy according to the generated Q table under multiple targets.
6. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 5, wherein the dynamic adjustment module in step S5 executes the order, adjusts game influence factors according to the execution condition, and adjusts the Q network under multiple targets until the order execution is completed, and the specific sub-steps are as follows:
s5-1: executing the order, monitoring the execution condition of the order, and adjusting the game influence factors according to the execution condition of the order to reduce the influence of uncertain factors on timeliness, for example by monitoring the time nodes, wherein the specific sub-steps are as follows:
s5-1-1: when the order is executed, the time nodes are monitored; if, due to the influence of uncertain factors, the order reaches state s later than the estimated time node T_s, the importance level of the time objective is raised by one level;
s5-1-2: recalculating the game influence factor;
s5-1-3: recalculating a Q table under multiple targets;
s5-1-4: updating the path according to the new Q table and updating the order execution strategy;
s5-1-5: and verifying whether the new order strategy meets the requirement, if so, executing the new strategy, and if not, turning to S5-1-2 until the new strategy meets the requirement.
7. The multimodal transportation dynamic path planning method based on game reinforcement learning as claimed in claim 3, wherein the step S2-3 of calculating game influence factors according to the levels of the targets comprises the following sub-steps:
s2-3-1: after the importance levels of the n objectives have been adjusted according to the requirements, the importance level of objective i is L_i;
s2-3-2: determining the value q_i corresponding to the level of objective i from L_i (level 1 → 1, level 2 → 3, level 3 → 5, level 4 → 7, level 5 → 9);
s2-3-3: calculating the influence factor δ_i of objective i according to the following formula:
δ_i = q_i / (q_1 + q_2 + … + q_n)
S2-3-4: the targets can be adjusted according to requirements, and the number and the content of the targets can be adjusted.
8. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 4, wherein in step S3-4, the exploration policy is the ε-greedy method, which includes the following sub-steps:
s3-4-1: setting a greedy factor epsilon, wherein the initial value of the greedy factor epsilon ranges from 0 to 1;
s3-4-2: randomly generating a number beta between 0 and 1;
s3-4-3: if β ≥ ε, the action with the largest action value in the action range, i.e. the action with the largest Q value, is selected; if β < ε, an action is selected at random from the action range;
s3-4-4: as training progresses, the exploration rate e becomes smaller as the iteration progresses.
9. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 4, wherein in step S3-6, the learning for the Q network under a single target specifically includes the following sub-steps: setting the maximum number of learning episodes T, the state set S, the action set A, the step size α, the decay factor γ, the exploration rate ε, the Q-current network Q, the Q-target network Q′, the number of samples m for batch gradient descent, and the Q-target network parameter update frequency C;
s3-6-1: initializing the learning times i to be 1;
s3-6-2: when the learning frequency i is less than the maximum learning frequency T, go to step S3-6-3, otherwise go to step S3-7;
s3-6-3: initializing S to the state of the shipping origin;
s3-6-4: using S as input in the Q network to obtain Q value output corresponding to all actions of the Q network;
s3-6-5: selecting the corresponding action A from the current Q value output by the ε-greedy method, executing action A in state S to obtain a new state S′ and a reward R, judging whether S′ is the destination D and storing the result in the variable is_D; storing the five-tuple {S, A, R, S′, is_D} in the experience replay set E, and switching the state from S to S′;
s3-6-6: sampling m samples {S_j, A_j, R_j, S′_j, is_D_j}, j = 1, 2, …, m, from the experience replay set E, and calculating the current target action value ζ_j:
ζ_j = R_j + γ·max_a′ Q′(S′_j, A′_j)
where R_j is the reward obtained when the state transfers from S_j to S′_j, γ is the decay factor of the reward, and max_a′ Q′(S′_j, A′_j) is the maximum selectable action value at state S′_j;
s3-6-7: using the mean square error loss function
L(w) = (1/m) · Σ_{j=1..m} (ζ_j − Q(S_j, A_j; w))²,
updating all parameters w of the Q network through gradient back propagation of the neural network;
s3-6-8: if the learning count i is a multiple of the update frequency C, i.e. i % C == 0, updating the target Q network parameters as w′ = w;
s3-6-9: if S′ is not a terminal state, go to step S3-6-4 and continue learning; if S′ is a terminal state, the current episode ends and the learning count is incremented, i = i + 1; if the new count is less than the maximum number of learning episodes, go to step S3-6-3, and if the maximum number is reached, go to step S3-7;
s3-7: and when the learning times reach the maximum learning times, the single-target Q network completes the learning.
10. The game reinforcement learning-based multimodal transportation dynamic path planning method according to claim 5, wherein in step S4-1, the value of the selectable action in the state S under multiple targets is calculated by combining the game influence factors, and the specific sub-steps are as follows:
s4-1-1: inputting the state s into each trained Q network to obtain the values of the selectable actions output by each network in that state; the network for objective i outputs the value set Q_is = {Q_is1, Q_is2, …, Q_isl, …, Q_isL} of the actions in state s;
S4-1-2: according to the game influence factors δ_A, δ_B, …, δ_i, …, δ_n and Q_is, calculating the value of action A_sl in state s by the following formula:
Q_sl = δ_A·Q_Asl + δ_B·Q_Bsl + … + δ_n·Q_nsl = Σ_i δ_i·Q_isl
s4-1-3: traversing and calculating the values of all actions in all states to form a multi-target Q value table, wherein the horizontal axis of the Q value table is a state, the vertical axis of the Q value table is an action, and each state-action pair corresponds to a unique action value;
step S4-2, generating an order initial strategy according to the generated Q table under multiple targets, and the specific substeps are as follows:
s4-2-1: setting the state S to the initial state;
s4-2-2: looking up a table according to the state S to obtain a value set of the optional actions in the state S;
s4-2-3: selecting the action with the highest action value from the selectable actions by using a greedy method, and refreshing the state to the next state;
s4-2-4: repeatedly performing S4-2-2-S4-2-3 until the next state where the refresh state arrives is the terminating state destination point D;
s4-2-5: backtracking from the terminal state to the initial state, obtaining the optimal strategy, i.e. the execution route of the order and the predicted time node T_s at which each state s is reached from the initial state.
CN202110423315.7A 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning Active CN113159681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110423315.7A CN113159681B (en) 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110423315.7A CN113159681B (en) 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning

Publications (2)

Publication Number Publication Date
CN113159681A true CN113159681A (en) 2021-07-23
CN113159681B CN113159681B (en) 2023-02-14

Family

ID=76868994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110423315.7A Active CN113159681B (en) 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning

Country Status (1)

Country Link
CN (1) CN113159681B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101782982A (en) * 2009-07-07 2010-07-21 上海海事大学 Multiple-target integer linear programming method for path choice of container multimodal transport
US20200300644A1 (en) * 2019-03-18 2020-09-24 Uber Technologies, Inc. Multi-Modal Transportation Service Planning and Fulfillment
CN111415048A (en) * 2020-04-10 2020-07-14 大连海事大学 Vehicle path planning method based on reinforcement learning
CN111626477A (en) * 2020-04-29 2020-09-04 河海大学 Multi-type joint transport path optimization method considering uncertain conditions
CN112381271A (en) * 2020-10-30 2021-02-19 广西大学 Distributed multi-objective optimization acceleration method for rapidly resisting deep belief network
CN112434849A (en) * 2020-11-19 2021-03-02 上海交通大学 Dangerous goods transportation path dynamic planning method based on improved multi-objective algorithm
CN112330070A (en) * 2020-11-27 2021-02-05 科技谷(厦门)信息技术有限公司 Multi-type intermodal transportation path optimization method for refrigerated container under carbon emission limit
CN112612207A (en) * 2020-11-27 2021-04-06 合肥工业大学 Multi-target game solving method and system under uncertain environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
甄远迪 et al.: "Multi-objective planning of container multimodal transport under uncertainty" (不确定情况下集装箱多式联运多目标规划), Computer Applications and Software (计算机应用与软件), No. 05, 12 May 2018 (2018-05-12), pages 21-26 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612049A (en) * 2022-05-11 2022-06-10 弥费实业(上海)有限公司 Path generation method and device, computer equipment and storage medium
CN116107276A (en) * 2022-12-30 2023-05-12 福州大学 Logistics storage optimal coordination method based on distributed differential game

Also Published As

Publication number Publication date
CN113159681B (en) 2023-02-14

Similar Documents

Publication Publication Date Title
CN113159681B (en) Multi-type intermodal dynamic path planning method based on game reinforcement learning
CN108847037B (en) Non-global information oriented urban road network path planning method
CN110225535B (en) Heterogeneous wireless network vertical switching method based on depth certainty strategy gradient
CN111967668A (en) Cold chain logistics path optimization method based on improved ant colony algorithm
CN107330561B (en) Multi-target shore bridge-berth scheduling optimization method based on ant colony algorithm
CN104766484B (en) Traffic Control and Guidance system and method based on Evolutionary multiobjective optimization and ant group algorithm
CN107944605A (en) A kind of dynamic traffic paths planning method based on data prediction
CN111445084B (en) Logistics distribution path optimization method considering traffic conditions and double time windows
CN102571570A (en) Network flow load balancing control method based on reinforcement learning
CN113296513B (en) Rolling time domain-based emergency vehicle dynamic path planning method in networking environment
CN113343575A (en) Multi-target vehicle path optimization method based on improved ant colony algorithm
CN112633555A (en) Method and system for optimizing logistics transportation scheme
CN112001064A (en) Full-autonomous water transport scheduling method and system between container terminals
CN108764805A (en) A kind of multi-model self-adapting recommendation method and system of collaborative logistics Services Composition
Siswanto et al. Maritime inventory routing problem with multiple time windows
CN114048924A (en) Multi-distribution center site selection-distribution path planning method based on hybrid genetic algorithm
CN115619065A (en) Multi-type intermodal transport path optimization method and system, electronic equipment and medium
CN114253215B (en) Civil cabin door automatic drilling and riveting path planning method based on improved ant colony algorithm
CN109800911B (en) Unified navigation method for delivery paths of multiple couriers
CN108830401B (en) Dynamic congestion charging optimal rate calculation method based on cellular transmission model
CN112330006A (en) Optimal path planning method applied to logistics distribution based on improved ant colony algorithm
Fallah et al. A green competitive vehicle routing problem under uncertainty solved by an improved differential evolution algorithm
CN103595652A (en) Method for grading QoS energy efficiency in power communication network
CN112561160A (en) Dynamic target traversal access sequence planning method and system
CN115470651A (en) Ant colony algorithm-based vehicle path optimization method with road and time window

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant