CN113159681A - Multi-type intermodal dynamic path planning method based on game reinforcement learning - Google Patents

Multi-type intermodal dynamic path planning method based on game reinforcement learning

Info

Publication number
CN113159681A
Authority
CN
China
Prior art keywords
state
network
learning
game
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110423315.7A
Other languages
Chinese (zh)
Other versions
CN113159681B (en)
Inventor
叶峰
覃诗
赖乙宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202110423315.7A
Publication of CN113159681A
Application granted
Publication of CN113159681B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083Shipping
    • G06Q10/0835Relationships between shipper or supplier and carriers
    • G06Q10/08355Routing methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/042Backward inferencing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-type intermodal dynamic path planning method based on game reinforcement learning. The method comprises the following steps: S1: an order processing module receives the user's order information; S2: a game module calculates game influence factors from the incoming order information; S3: a parallel reinforcement learning module constructs a state transition model in the reinforcement learning environment from the incoming order information, constructs rewards in reinforcement learning for each required objective, and learns a Q network for each single objective; S4: the game factors and the single-objective Q networks are combined to calculate a multi-objective Q table and generate the initial order strategy; S5: the order is executed, the game influence factors are adjusted according to the execution situation, and the multi-objective Q table is adjusted until the order is completed.

Description

Multi-type intermodal dynamic path planning method based on game reinforcement learning
Technical Field
The invention relates to multi-type intermodal transportation path planning, in particular to a multi-type intermodal transportation dynamic path planning method based on game reinforcement learning.
Background
With the continuous improvement of China's comprehensive transportation system, multimodal transport, as an advanced form of transport organization, is being applied ever more widely in transportation practice; as multimodal service networks gradually mature, the economic and social benefits of transportation have also improved markedly.
The multimodal transport path planning is a multi-constraint and multi-objective optimization problem under an uncertain environment.
The general goals pursued by the logistics industry include minimizing overall cost, minimizing overall transit time, and minimizing carbon emissions. Because these objectives are mutually exclusive, achieving global multi-objective optimization when the number of path nodes is large is a typical NP-hard problem. Uncertain factors such as the transportation scenario, user preferences, the environment during transport, and personnel and equipment conditions, together with the space-time constraints of the means of transport, further increase the complexity of the problem. Most current solutions only address path planning for a single objective in a static scenario.
Disclosure of Invention
To solve the multi-objective optimization problem of multimodal transport with space-time constraints in an uncertain environment, the invention provides a multi-type intermodal dynamic path planning method based on game reinforcement learning.
According to the invention, multiple Q networks are established for objectives such as time, cost and social benefit, while a game mechanism weighs time, cost and social benefit according to the transportation scenario, user preferences and other factors.
During order execution, a feedback regulation mechanism is introduced to dynamically adjust the path so as to achieve a dynamic balance among the objectives.
The invention is realized by the following technical scheme:
a multi-type intermodal dynamic path planning method based on game reinforcement learning comprises the following steps:
s1: the order processing module receives user order information;
s2: the game module is used for calculating game influence factors according to the transmitted order information;
s3: the parallel reinforcement learning module is used for constructing a state transition model in a reinforcement learning environment according to the transmitted order information, constructing rewards in reinforcement learning according to required targets, learning a Q network under a single target and performing the learning of a plurality of targets in parallel;
s4: the Q table calculation module under the multiple targets combines the game factors and the Q network under the multiple single targets to calculate the Q table under the multiple targets, and generates an order initial strategy;
s5: and the dynamic adjustment module executes the order, adjusts the game influence factors according to the execution condition and adjusts the Q table under the multiple targets until the order execution is completed.
In step S1, the order processing module receives the order information of a user, which includes the shipping origin, the destination, the shipping time, the cargo type, the cargo weight and the user preference. The user preference may be, for example, shortest transit time or lowest transport cost.
Step S2, the game module calculates an initial game influence factor according to the incoming order information, and specifically includes the following sub-steps:
s2-1: setting the importance degree to be 1 grade to 5 grades, wherein the corresponding numerical values are 1, 3, 5, 7 and 9 respectively, and the higher the grade is, namely the larger the numerical value is, the higher the importance degree is;
s2-2: setting the initial importance degrees of all the targets to be 3 levels, namely setting the value to be 5, and setting different rules according to requirements by a user to adjust the importance degrees of all the targets;
s2-3: and calculating game influence factors according to the grades of the targets.
The parallel reinforcement learning module in step S3 constructs a state transition model in a reinforcement learning environment according to the transmitted order information, constructs rewards in reinforcement learning according to the required targets, learns Q networks under a single target, and learns multiple targets in parallel, and specifically includes the following sub-steps:
s3-1: according to the origin and destination in the order, a complete routing network graph from the origin to the destination, including the two points and the routes between them, is separated from the map storage module. The routing network graph is represented as G(N, V, M), where N = {1, 2, …, n} is the set of nodes, M = {0, 1, 2, …, m} is the set of transportation modes, and V = {v_k | v_k = (v_ij, m), i ∈ N, j ∈ N, m ∈ M} is the set of edges; v_k denotes an edge from node i to node j using transportation mode m. Each node represents a city; between two connected nodes there may be one or more edges corresponding to different transportation modes, and different edges have different time costs, transportation costs, carbon emissions and other costs. The time cost of an edge is denoted T_vk, its transportation cost C_vk, and its carbon emissions CE_vk;
S3-2: a reinforcement learning environment is constructed from the routing network graph; each node is a state S, and the current node selecting one of its edges to a next node is an action A. At each node, i.e. each state, an action is selected from the set of edges selectable at that node, transferring the agent to the next state and yielding a reward. The number of states equals the number of city nodes in the routing network graph; nodes are connected by edges, and the set of edges selectable from node i is v_ki = {v_k | v_k = (v_ij, m), j ∈ N, m ∈ M}. The total number of elements in v_ki is L, i.e. L edges can be selected starting from node i, so the number of actions selectable at node i is L (an illustrative data-structure sketch is given after this list of sub-steps);
s3-3: formulating a corresponding environment reward for each Q learning network; the Q learning networks may include a Q network aiming to minimize order execution time, a Q network aiming to minimize order execution cost, a Q network aiming to minimize carbon emissions, and so on, and the number of objectives can be increased or decreased according to user requirements;
for the Q network that minimizes transit time, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -T_vk; if delivery is delayed by Td relative to the estimated time, a penalty of -Td is incurred. The goal is to maximize the sum of all state rewards, i.e. to minimize execution time;
for Q-networks that minimize transportation costs, the setting of rewards in the environment:
firstly, the transportation cost and the transit cost are assigned corresponding penalty values in the environment according to the selected services, i.e. the differences among road, rail and waterway, the distances, and the differences in loading and unloading charges at freight stations and wharfs;
secondly, the stacking cost, detention cost and delay cost are assigned corresponding penalty values in the environment according to the stacking time, detention time and delay time;
in addition, penalty values are set when the maximum transportation capacity or the maximum node transfer capacity is exceeded. For the Q network that minimizes transportation cost, when the agent transfers from the current state to the next state via v_k, its reward is set to R = -C_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize transportation cost;
for the Q network aiming to minimize carbon emissions, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -CE_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize carbon emissions;
for all objectives, some common rewards are set according to the co-existing constraints:
(1) when the carrying capacity exceeds the limit, the reward value is set to -θ·100;
(2) when the goods are turned back in the transportation process, setting the reward value to be-1000;
s3-4: setting the exploration strategy of reinforcement learning, which is the ε-greedy method;
s3-5: setting a network model of Q network learning under a single target to comprise two deep neural networks, namely a Q-current network and a Q-target network, wherein the two networks have the same structure and inconsistent updating frequency; updating the Q-current network in real time, wherein the updating frequency of the Q-target network is lower than that of the Q-current network, the Q-target network is updated once when the iteration number of the Q-current network reaches a set value C, and the parameter of the Q-current network is updated to the Q-target network; the Q network takes the state as input and outputs the value of all optional actions in the state; the network model comprises a memory pool E for storing experience; the memories in the memory pool are randomly extracted for updating the network, so that the correlation among the memories can be disturbed, and the training efficiency is improved;
s3-6: learning for a Q network under a single target;
s3-7: when the learning times reach the maximum learning times, the single-target Q network completes learning;
s3-8: and learning a plurality of single-target learning Q networks in parallel.
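By way of illustration, the routing network graph G(N, V, M) of S3-1 and the environment of S3-2 might be represented as in the following minimal Python sketch; all class, attribute and method names are illustrative assumptions rather than part of the patented method.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Edge:
        """Edge v_k = (v_ij, m): transportation mode m from city node i to city node j."""
        i: int          # origin node
        j: int          # destination node
        mode: int       # transportation mode m (e.g. 0 = road, 1 = rail, 2 = waterway)
        time: float     # time cost T_vk
        cost: float     # transportation cost C_vk
        carbon: float   # carbon emissions CE_vk

    class RoutingEnv:
        """States are city nodes; the actions at node i are its selectable outgoing edges v_ki."""
        def __init__(self, edges, origin, destination):
            self.origin, self.destination = origin, destination
            self.out_edges = {}
            for e in edges:
                self.out_edges.setdefault(e.i, []).append(e)

        def actions(self, state):
            return self.out_edges.get(state, [])     # the set v_ki, of size L

        def step(self, state, edge, reward_fn):
            next_state = edge.j
            done = next_state == self.destination
            return next_state, reward_fn(edge), done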
The multi-objective Q table calculation module in step S4 combines the game influence factors with the Q network trained for each single objective to calculate the value of every state-action pair, i.e. the action value of each action in each state under the game factors, and generates the initial optimal order strategy by a greedy method; the specific sub-steps are as follows:
s4-1, calculating the value of selectable actions in the state S under multiple targets by combining game influence factors;
and S4-2, generating an order initial strategy according to the generated Q table under multiple targets.
The dynamic adjustment module in step S5 executes the order, adjusts the game impact factor according to the execution condition, and adjusts the Q network under multiple targets until the order execution is completed, and the specific substeps are as follows:
s5-1, the order is executed and its execution is monitored; the game influence factors are adjusted according to the order's execution situation, for example by monitoring time nodes, to reduce the influence of uncertain factors on timeliness and the like. The concrete sub-steps are as follows:
s5-1-1, while the order is executed, the time nodes are monitored; if, due to uncertain factors, the order arrives at state s later than the estimated time node T_s, the importance level of the time objective is raised by one level;
s5-1-2 recalculating game influence factors;
s5-1-3 recalculating the Q table under multiple targets;
s5-1-4, updating the path according to the new Q table and updating the order execution strategy;
s5-1-5, verifying whether the new order strategy meets the requirement, if yes, executing the new strategy, and if not, turning to S5-1-2 until the new strategy meets the requirement.
The step S2-3 of calculating the game influence factor according to the level of each target includes the following sub-steps:
s2-3-1: after the importance levels of the n objectives have been adjusted according to the requirements, the importance level of objective i is L_i;
s2-3-2: the value q_i corresponding to the level of objective i is determined from L_i (level 1 → 1, level 2 → 3, level 3 → 5, level 4 → 7, level 5 → 9);
S2-3-3: the influence factor δ_i of objective i is calculated according to the following formula:
δ_i = q_i / (q_1 + q_2 + … + q_n)
S2-3-4: the targets can be adjusted according to requirements, and the number and the content of the targets can be adjusted.
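By way of illustration, the level-to-value mapping of S2-3-2 and the normalized influence factors of S2-3-3 can be computed as in the following minimal Python sketch; the function and variable names are illustrative assumptions, and the normalization δ_i = q_i / Σ_j q_j follows the reading of the formula given above.

    # Minimal sketch of S2-3-1 .. S2-3-3 (illustrative names only)
    LEVEL_VALUE = {1: 1, 2: 3, 3: 5, 4: 7, 5: 9}   # importance level -> numeric value q_i

    def game_factors(levels):
        """levels: dict mapping objective name -> importance level (1..5).
        Returns the influence factor delta_i = q_i / sum_j q_j for each objective."""
        values = {name: LEVEL_VALUE[lvl] for name, lvl in levels.items()}
        total = sum(values.values())
        return {name: q / total for name, q in values.items()}

    # e.g. time at level 4, cost at level 3, carbon emissions at level 2:
    # game_factors({"time": 4, "cost": 3, "carbon": 2})
    # -> {"time": 7/15, "cost": 5/15, "carbon": 3/15}, i.e. about 0.467, 0.333, 0.200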
In the above step S3-4, the exploration policy is the ε-greedy method, which includes the following sub-steps:
s3-4-1: setting a greedy factor epsilon, wherein the initial value of the greedy factor epsilon ranges from 0 to 1;
s3-4-2: randomly generating a number beta between 0 and 1;
s3-4-3: if β ≥ ε, the action with the largest action value in the action range, i.e. the action with the largest Q value, is selected; if β < ε, an action is selected at random from the action range;
s3-4-4: as training progresses, the exploration rate epsilon becomes smaller as iteration progresses;
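The following minimal Python sketch illustrates the ε-greedy selection of S3-4-1 to S3-4-4 as described above (if β ≥ ε exploit, otherwise explore); the helper names and the decay schedule are illustrative assumptions.

    import random

    def epsilon_greedy(q_values, epsilon):
        """q_values: Q values of the actions selectable in the current state (S3-4-3)."""
        beta = random.random()                  # S3-4-2: random number between 0 and 1
        if beta >= epsilon:                     # exploit: the action with the largest Q value
            return max(range(len(q_values)), key=lambda a: q_values[a])
        return random.randrange(len(q_values))  # explore: a random action

    def decay_epsilon(epsilon, rate=0.995, floor=0.01):
        # S3-4-4: the exploration rate becomes smaller as training proceeds
        return max(floor, epsilon * rate)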
in the above step S3-6, the learning of the Q network under a single target specifically includes the following sub-steps: setting a maximum learning time T, a state set S, an action set A, a step length alpha, an attenuation factor gamma, an exploration rate E, Q-a current network Q, Q-a target network Q', the number of samples m of batch gradient descent and Q-a target network parameter updating frequency C;
s3-6-1: initializing the learning times i to be 1;
s3-6-2: when the learning frequency i is less than the maximum learning frequency T, go to step S3-6-3, otherwise go to step S3-7;
s3-6-3: initializing S to the state of the shipping origin;
s3-6-4: using S as input in the Q network to obtain Q value output corresponding to all actions of the Q network;
s3-6-5: selecting the corresponding action A from the current Q value output by the ε-greedy method, executing action A in state S to obtain a new state S′ and a reward R, judging whether S′ is the destination D and storing the result in the variable is_D; storing the five-tuple {S, A, R, S′, is_D} in the experience replay set E, and switching the state from S to S′;
s3-6-6: sampling m samples {S_j, A_j, R_j, S′_j, is_D_j}, j = 1, 2, …, m, from the experience replay set E, and calculating the current target action value ζ_j:
ζ_j = R_j + γ·max_a′ Q′(S′_j, A′_j)
where R_j is the reward obtained when the state transfers from S_j to S′_j, γ is the decay factor of the reward, and max_a′ Q′(S′_j, A′_j) is the maximum selectable action value at state S′_j;
s3-6-7: using the mean square error loss function
L(w) = (1/m) · Σ_{j=1..m} (ζ_j − Q(S_j, A_j; w))²,
updating all parameters w of the Q network through gradient back propagation of the neural network;
s3-6-8: if the learning count i is a multiple of the update frequency C, i.e. i % C == 0, updating the target Q network parameters as w′ = w;
s3-6-9: if S′ is not a terminal state, go to step S3-6-4 and continue learning; if S′ is a terminal state, the current episode ends and the learning count is incremented, i = i + 1; if the new count is less than the maximum number of learning episodes, go to step S3-6-3, otherwise go to step S3-7.
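The learning procedure S3-6-1 to S3-6-9 can be sketched as follows in Python with PyTorch; this is an illustrative reconstruction only, and the environment interface (env.reset() returning a state index, env.step(state, action) returning the next state, the reward and a done flag), the network size and the hyperparameter values are all assumptions.

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Q-current / Q-target network: one-hot state -> value of every action."""
        def __init__(self, n_states, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_states, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))
        def forward(self, s):
            return self.net(s)

    def train_single_objective(env, n_states, n_actions, episodes=500, gamma=0.95,
                               eps=1.0, batch=32, target_update=50, lr=1e-3):
        q, q_target = QNet(n_states, n_actions), QNet(n_states, n_actions)
        q_target.load_state_dict(q.state_dict())
        opt = torch.optim.Adam(q.parameters(), lr=lr)
        memory = deque(maxlen=10000)                       # experience pool E
        one_hot = lambda s: torch.eye(n_states)[s]
        for i in range(1, episodes + 1):                   # S3-6-2
            s, done = env.reset(), False                   # S3-6-3
            while not done:
                qs = q(one_hot(s))                         # S3-6-4
                a = (random.randrange(n_actions) if random.random() < eps
                     else int(qs.argmax()))                # S3-6-5: epsilon-greedy choice
                s2, r, done = env.step(s, a)
                memory.append((s, a, r, s2, done))
                s = s2
                if len(memory) >= batch:                   # S3-6-6 / S3-6-7
                    ss, aa, rr, ss2, dd = zip(*random.sample(memory, batch))
                    with torch.no_grad():
                        zeta = (torch.tensor(rr, dtype=torch.float)
                                + gamma
                                * q_target(torch.stack([one_hot(x) for x in ss2])).max(1).values
                                * (1 - torch.tensor(dd, dtype=torch.float)))
                    qs_batch = q(torch.stack([one_hot(x) for x in ss]))
                    pred = qs_batch.gather(1, torch.tensor(aa).unsqueeze(1)).squeeze(1)
                    loss = nn.functional.mse_loss(pred, zeta)
                    opt.zero_grad(); loss.backward(); opt.step()
            if i % target_update == 0:                     # S3-6-8: copy w -> w'
                q_target.load_state_dict(q.state_dict())
            eps = max(0.05, eps * 0.995)                   # decay the exploration rate
        return q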
In the step S4-1, the value of the selectable action in the state S under multiple targets is calculated by combining the game influence factors, and the specific sub-steps are as follows:
s4-1-1, the state s is input into each trained Q network to obtain the values of the selectable actions output by each network in that state; the network for objective i outputs the value set Q_is = {Q_is1, Q_is2, …, Q_isl, …, Q_isL} of the actions in state s;
S4-1-2, according to the game influence factors δ_A, δ_B, …, δ_i, …, δ_n and Q_is, the value of action A_sl in state s is calculated by the following formula:
Q_sl = δ_A·Q_Asl + δ_B·Q_Bsl + … + δ_n·Q_nsl = Σ_i δ_i·Q_isl
s4-1-3, traversing and calculating the values of all actions in all states to form a multi-target Q value table, wherein the horizontal axis of the Q value table is a state, the vertical axis of the Q value table is an action, and each state-action pair corresponds to a unique action value;
step S4-2, generating an order initial strategy according to the generated Q table under multiple targets, and the specific substeps are as follows:
s4-2-1 sets state S to an initial state;
s4-2-2 looks up a table according to the state S to obtain a value set of the optional actions in the state S;
s4-2-3, selecting the action with the highest action value from the selectable actions by a greedy method, and refreshing the state to the next state;
s4-2-4 repeatedly executes S4-2-2 to S4-2-3 until the next state reached by the refresh state is the terminating state destination point D;
s4-2-5, backtracking from the terminal state to the initial state yields the optimal strategy, i.e. the execution route of the order and the predicted time node T_s at which each state s is reached from the initial state.
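As an illustrative sketch (assumptions: the trained networks are the PyTorch Q networks from the sketch above, and the l-th network output corresponds to the l-th selectable edge at the node), the multi-objective Q table of S4-1-1 to S4-1-3 could be built as follows; the initial strategy of S4-2 is then obtained by a greedy walk over this table from the origin to the destination.

    def multi_objective_q_table(q_nets, deltas, states, encode):
        """q_nets: {objective name: trained Q network}; deltas: {objective name: delta_i};
        states: iterable of state indices; encode: state -> network input tensor.
        Returns Q[s] as an array whose l-th entry is sum_i delta_i * Q_i(s, l) (S4-1-2)."""
        table = {}
        for s in states:
            per_objective = {name: net(encode(s)).detach().numpy()
                             for name, net in q_nets.items()}
            table[s] = sum(deltas[name] * values
                           for name, values in per_objective.items())
        return table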
Compared with the prior art, the invention has the following advantages and effects:
the invention provides a multi-target reinforcement learning framework based on a parallel DQN network, a plurality of Q networks are established according to three targets of time, cost and social benefit, and dynamic balance of the time, the cost and the social benefit is realized by adopting a game mechanism.
In the reinforcement learning training process, the invention adopts an experience replay mechanism to accelerate network training, uses an ε-greedy strategy to better balance exploration and exploitation, and effectively reduces training time through multi-network parallel learning.
The invention can dynamically solve the problem of multi-target optimization under multi-mode combined transport, can consider more factors compared with single-target optimization, can realize dynamic balance of time, cost and social benefit, and is more in line with actual conditions; and secondly, the influence of uncertain factors and the like on order execution is considered, dynamic adjustment can be performed according to the execution condition, the robustness is higher, and the adaptability is stronger.
Drawings
Fig. 1 is a flow chart of a dynamic path planning scheme of multimodal transport based on game reinforcement learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
As shown in fig. 1, the invention discloses a multi-type intermodal dynamic path planning method based on game reinforcement learning, which adopts the following scheme:
the order processing module is used for processing order information input by a user;
the game module is used for calculating game influence factors;
a parallel reinforcement learning module for Q network learning under each target
The Q table calculation module under the multiple targets is used for calculating the Q table under the multiple targets;
and the dynamic adjustment module is used for a closed-loop dynamic adjustment feedback module in the order processing process.
Step1. Receiving order information and performing simple processing
The order processing module receives the order information of a user, which includes the shipping origin, the destination, the shipping time, the cargo type, the cargo weight and the user preferences. The user preferences include shortest transit time, lowest transport cost, etc.
Step2. the game module calculates the initial game influence factor according to the transmitted order information
Step21 divides the importance degree into five levels, 1 to 5, with corresponding values 1, 3, 5, 7 and 9; the higher the level, i.e. the larger the value, the higher the importance.
Step22 sets the initial importance level of each target to be 3 grade, that is, the value is 5, and the user makes different rules according to the requirement to adjust the importance level of each target.
Step23 sets the default rules as follows: 1. if the goods require cold-chain transportation or the like, the importance level of the time target is raised by one level; 2. if the user requires timeliness first, the importance level of the time target is raised to level 5; 3. if the user requires lowest transportation cost first, the cost target is raised to level 5; 4. the importance level of carbon emissions must not fall below level 2.
Step24, calculating game influence factors according to the grades of the targets, and the specific steps are as follows:
Step241. After adjustment according to the requirements and the like, the importance level of the time target is A, that of the cost target is B, and that of the carbon emission target is C;
Step242. The values a, b and c corresponding to these levels are determined from A, B and C;
Step243. The influence factor of each target is calculated according to the following formulas, where δ_A is the influence factor of the time target, δ_B that of the transportation cost target, and δ_C that of the minimum carbon emission target:
δ_A = a / (a + b + c)
δ_B = b / (a + b + c)
δ_C = c / (a + b + c)
Step244. The targets can be adjusted according to the requirements; both the number and the content of the targets may be changed.
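As an illustrative example (numbers assumed for illustration only): if after adjustment the time target is at level 4, the cost target at level 3 and the carbon emission target at level 2, then a = 7, b = 5 and c = 3, so δ_A = 7/15 ≈ 0.47, δ_B = 5/15 ≈ 0.33 and δ_C = 3/15 = 0.20; the three influence factors always sum to 1.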
The Step3 parallel reinforcement learning module constructs a state transition model in the reinforcement learning environment according to the transmitted order information, constructs rewards in the reinforcement learning according to the required targets, and learns the Q network under the single target
Step31. According to the origin and destination in the order, a complete routing network graph from the origin to the destination, including the two points and the routes between them, is separated from the map storage module. The routing network graph is represented as G(N, V, M), where N = {1, 2, …, n} is the set of nodes, M = {0, 1, 2, …, m} is the set of transportation modes, and V = {v_k | v_k = (v_ij, m), i ∈ N, j ∈ N, m ∈ M} is the set of edges; v_k denotes an edge from node i to node j using transportation mode m. Each node represents a city; between two connected nodes there may be one or more edges corresponding to different transportation modes, and different edges have different time costs, transportation costs and carbon emission costs. The time cost of an edge is denoted T_vk, its transportation cost C_vk, and its carbon emissions CE_vk.
Step32. A reinforcement learning environment is constructed from the routing network graph; each node is a state S, and the current node selecting one of its edges to a next node is an action A. At each node, i.e. each state, an action is selected from the set of edges selectable at that node, transferring the agent to the next state and yielding a reward. The number of states equals the number of city nodes in the routing network graph; nodes are connected by edges, and the set of edges selectable from node i is v_ki = {v_k | v_k = (v_ij, m), j ∈ N, m ∈ M}. The total number of elements in v_ki is L, i.e. L edges can be selected starting from node i, so the number of actions selectable at node i is L.
Step33. A corresponding environment reward is formulated for each Q-learning network. The Q-learning networks comprise a Q network aiming to minimize order execution time, a Q network aiming to minimize order execution cost, a Q network aiming to minimize carbon emissions, and so on; the number of targets can be increased or decreased according to user requirements.
For the Q network that minimizes transit time, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -T_vk; if delivery is delayed by Td relative to the estimated time, a penalty of -Td is incurred. The goal is to maximize the sum of all state rewards, i.e. to minimize execution time;
for Q-networks that minimize transportation costs, the setting of rewards in the environment:
firstly, the transportation cost and the transit cost are assigned corresponding penalty values in the environment according to the selected services, i.e. the differences among road, rail and waterway, the distances, and the differences in loading and unloading charges at freight stations and wharfs;
secondly, the stacking cost, detention cost and delay cost are assigned corresponding penalty values in the environment according to the stacking time, detention time and delay time;
in addition, penalty values are set when the maximum transportation capacity or the maximum node transfer capacity is exceeded. For the Q network that minimizes transportation cost, when the agent transfers from the current state to the next state via v_k, its reward is set to R = -C_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize transportation cost;
for the Q network aiming to minimize carbon emissions, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -CE_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize carbon emissions;
for all objectives, some common rewards are set according to the co-existing constraints:
(1) when the carrying capacity exceeds the limit, the reward value is set to -θ·100;
(2) when a return occurs during the transportation of the goods, the reward value is set to-1000.
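The reward settings above can be sketched as follows, building on the illustrative Edge record given earlier; θ, the delay Td and the extra cost penalties are assumed to be supplied by the caller, since their exact form is not fixed here.

    # Illustrative reward functions for the three objectives and the common penalties.
    def reward_time(edge, delay_td=0.0):
        return -edge.time - delay_td           # R = -T_vk, plus the -Td delay penalty

    def reward_cost(edge, extra_penalty=0.0):
        return -edge.cost - extra_penalty      # R = -C_vk, plus transit/stacking/detention penalties

    def reward_carbon(edge):
        return -edge.carbon                    # R = -CE_vk

    def common_penalty(over_capacity, turned_back, theta=1.0):
        p = 0.0
        if over_capacity:                      # carrying capacity exceeds the limit
            p -= theta * 100
        if turned_back:                        # goods are routed backwards
            p -= 1000
        return p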
Step34. The exploration strategy of reinforcement learning is set as the ε-greedy method; the specific steps are as follows:
step341 sets the greedy factor ε, and its initial value is between 0 and 1;
step342 randomly generates a number β between 0 and 1;
step343, if β ≥ ε, selects the action with the largest action value in the action range, namely the action with the largest Q value; if β < ε, randomly selects one action in the action range;
step344 is as training progresses, the exploration rate ∈ becomes smaller as iteration progresses.
Step35, setting that the network model of the Q network learning under the single target comprises two deep neural networks, namely a Q-current network and a Q-target network, wherein the two networks have the same structure and inconsistent update frequency. And updating the Q-current network immediately, wherein the updating frequency of the Q-target network is slower than that of the Q-current network, the Q-target network is updated once when the iteration number of the Q-current network reaches a set value C, and the parameters of the Q-current network are updated to the Q-target network. The Q network takes the state as input and outputs the value of all optional actions in the state. The network model comprises a memory pool E for storing experience. The memories in the memory pool are randomly extracted for updating the network, so that the correlation among the memories can be disturbed, and the training efficiency is improved.
Step36 is used for learning the Q network under a single target, and the specific steps are as follows:
setting the maximum number of learning episodes T, the state set S, the action set A, the step size α, the decay factor γ, the exploration rate ε, the Q-current network Q, the Q-target network Q′, the number of samples m for batch gradient descent, and the Q-target network parameter update frequency C.
Step361 initializes the learning times i to 1;
step362, when the learning time i is less than the maximum learning time T, turning to Step363, otherwise, turning to Step 37;
step363 initializes S as the state of the start point;
step364 uses S as input in the Q network to obtain Q value output corresponding to all actions of the Q network;
step365 selects the corresponding action A from the current Q value output by the ε-greedy method, executes action A in state S to obtain a new state S′ and a reward R, judges whether S′ is the destination D and stores the result in the variable is_D. The five-tuple {S, A, R, S′, is_D} is stored in the experience replay set E, and the state is switched from S to S′;
step366 samples m samples {S_j, A_j, R_j, S′_j, is_D_j}, j = 1, 2, …, m, from the experience replay set E, and calculates the current target action value ζ_j:
ζ_j = R_j + γ·max_a′ Q′(S′_j, A′_j)
where R_j is the reward obtained when the state transfers from S_j to S′_j, γ is the decay factor of the reward, and max_a′ Q′(S′_j, A′_j) is the maximum selectable action value at state S′_j;
step367 uses the mean square error loss function
L(w) = (1/m) · Σ_{j=1..m} (ζ_j − Q(S_j, A_j; w))²,
updating all parameters w of the Q network through gradient back propagation of the neural network;
step368, if the learning count i is a multiple of the update frequency C, i.e. i % C == 0, updates the target Q network parameters as w′ = w;
step369, if S′ is not a terminal state, turns to Step364 to continue learning; if S′ is a terminal state, the current episode ends and the learning count is incremented, i = i + 1; if the new count is less than the maximum number of learning episodes, turn to Step363, and if the maximum number is reached, turn to Step37;
step37 when the learning times reach the maximum learning times, the single-target Q network completes the learning.
Step38 learns multiple single-target learning Q-networks in parallel.
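Step38 can be sketched as follows, reusing the train_single_objective function from the earlier illustrative sketch; running the three trainings in threads is only one possible realization (with real GPU training, separate processes or sequential runs may be preferable), and env_factory is an assumed hook that builds an environment using the reward of the given objective.

    from concurrent.futures import ThreadPoolExecutor

    def train_all(env_factory, objectives, n_states, n_actions):
        """Train one Q network per objective in parallel (illustrative sketch)."""
        def run(objective):
            env = env_factory(objective)       # environment with this objective's reward
            return objective, train_single_objective(env, n_states, n_actions)
        with ThreadPoolExecutor(max_workers=len(objectives)) as pool:
            return dict(pool.map(run, objectives))

    # q_nets = train_all(make_env, ["time", "cost", "carbon"], n_states, n_actions)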
Step4, calculating a value table of each state-action pair by combining the game influence factor and the Q network trained under each single target, calculating action values of corresponding actions under each state under the game factor, and generating an order initial optimal strategy according to a greedy method
Step41. The values of the selectable actions in state n under the multiple targets are calculated in combination with the game influence factors; the specific steps are as follows:
step411 inputs the state n into each trained Q network to obtain the values of the selectable actions output by each network in that state; for example, for the time target (the A target network) the values of the actions output in state n are Q_AN = {Q_AN1, Q_AN2, Q_AN3, …, Q_ANl}; for the transportation cost target (the B target network), Q_BN = {Q_BN1, Q_BN2, Q_BN3, …, Q_BNl}; for the minimum carbon emission target (the C target network), Q_CN = {Q_CN1, Q_CN2, Q_CN3, …, Q_CNl};
step412, according to the game influence factors δ_A, δ_B and δ_C obtained in Step243 and the values Q_ANl, Q_BNl, Q_CNl, calculates the value of action A_nl in state n by the following formula:
Q_Nl = δ_A·Q_ANl + δ_B·Q_BNl + δ_C·Q_CNl
step413 calculates values of all actions in all states in a traversal mode, a multi-target Q value table is formed, the horizontal axis of the Q value table is a state, the vertical axis of the Q value table is an action, and each state-action pair corresponds to a unique action value.
Step42, generating an order initial strategy according to the generated Q table under multiple targets, and specifically comprising the following steps:
step421 sets state S to the initial state;
step422 looks up a table according to the state S to obtain a value set of the optional actions in the state S;
step423 selects the action with the highest action value from the selectable actions by using a greedy method, and refreshes the state to the next state;
step424 repeatedly executes steps 422 to 423 until the next state where the refresh state reaches is the destination point D of the termination state;
step425 backtracks from the terminal state to the initial state and obtains the optimal strategy, i.e. the execution route of the order and the predicted time node T_s at which each state node is reached.
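An illustrative sketch of the greedy rollout of Step42 follows, assuming the combined Q table from the earlier sketch, the RoutingEnv interface introduced above, and that the l-th table entry of a state corresponds to its l-th selectable edge; the recorded arrival times play the role of the estimated time nodes T_s used in Step5.

    def initial_strategy(q_table, env, origin, destination):
        """Greedy walk over the multi-objective Q table from origin to destination."""
        route, s, elapsed = [origin], origin, 0.0
        predicted_time = {origin: 0.0}                 # estimated time node T_s per state
        while s != destination:
            edges = env.actions(s)                     # selectable edges at state s
            best = max(range(len(edges)), key=lambda l: q_table[s][l])
            edge = edges[best]
            elapsed += edge.time
            s = edge.j
            route.append(s)
            predicted_time[s] = elapsed
        return route, predicted_time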
Step5, adjusting game influence factors according to the execution condition, and adjusting the Q network under multiple targets until the order execution is completed.
Step51. The order is executed and its execution is monitored; the game influence factors are adjusted according to the order's execution situation, for example by monitoring the time nodes, to reduce the influence of uncertain factors on timeliness and the like. The specific steps are as follows:
step511, monitoring the time node when executing the order, and if the order is later than the estimated time node due to the influence of uncertain factors, improving the time importance degree by one level;
step512, recalculating the game influence factors;
step513 recalculates the Q table under multiple targets;
step514 updates the path according to the new Q table, and updates the order execution strategy;
step515 verifies whether the new order strategy meets the requirements; if so, the new strategy is executed, otherwise go to Step512.
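The feedback loop of Step511 to Step515 might look like the following sketch (illustrative only; game_factors is the helper from the earlier sketch, and recompute_q_table, extract_strategy and satisfies are assumed hooks wrapping Step513, Step514 and Step515).

    def dynamic_adjust(levels, game_factors, recompute_q_table, extract_strategy, satisfies):
        """Raise the time importance, recompute the factors, Q table and route,
        and repeat until the new strategy meets the requirements (capped at level 5)."""
        while True:
            levels["time"] = min(5, levels["time"] + 1)     # Step511: delay detected
            deltas = game_factors(levels)                   # Step512: new influence factors
            q_table = recompute_q_table(deltas)             # Step513: new multi-objective Q table
            strategy = extract_strategy(q_table)            # Step514: new execution route
            if satisfies(strategy) or levels["time"] == 5:  # Step515
                return strategy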
As described above, the present invention can be preferably realized.
The embodiments of the present invention are not limited to the above-described embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims (10)

1. A multi-type intermodal dynamic path planning method based on game reinforcement learning is characterized by comprising the following steps:
s1: the order processing module receives user order information;
s2: the game module is used for calculating game influence factors according to the transmitted order information;
s3: the parallel reinforcement learning module is used for constructing a state transition model in a reinforcement learning environment according to the transmitted order information, constructing rewards in reinforcement learning according to required targets, learning a Q network under a single target and performing the learning of a plurality of targets in parallel;
s4: the Q table calculation module under the multiple targets combines the game factors and the Q network under the multiple single targets to calculate the Q table under the multiple targets, and generates an order initial strategy;
s5: and the dynamic adjustment module executes the order, adjusts the game influence factors according to the execution condition and adjusts the Q table under the multiple targets until the order execution is completed.
2. The multimodal transport dynamic path planning method based on the game reinforcement learning as claimed in claim 1, wherein: in step S1, the order processing module receives the order information of a user, where the order information includes a shipping origin, a destination, a shipping time, a cargo type, a cargo weight, and a user preference.
3. The game reinforcement learning-based multimodal transportation dynamic path planning method of claim 2, wherein the game module calculates an initial game influence factor according to the incoming order information in step S2, and specifically comprises the following sub-steps:
s2-1: setting the importance degree to be 1 grade to 5 grades, wherein the corresponding numerical values are 1, 3, 5, 7 and 9 respectively, and the higher the grade is, namely the larger the numerical value is, the higher the importance degree is;
s2-2: setting the initial importance degrees of all the targets to be 3 levels, namely setting the value to be 5, and setting different rules according to requirements by a user to adjust the importance degrees of all the targets;
s2-3: and calculating game influence factors according to the grades of the targets.
4. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 3, wherein the parallel reinforcement learning module in step S3 constructs a state transition model in a reinforcement learning environment according to the incoming order information, constructs rewards in reinforcement learning according to the required targets, learns Q network under a single target, and learns Q network under multiple targets in parallel, specifically including the following sub-steps:
s3-1: according to the origin and destination in the order, a complete routing network graph from the origin to the destination, including the two points and the routes between them, is separated from the map storage module; the routing network graph is represented as G(N, V, M), where N = {1, 2, …, n} is the set of nodes, M = {0, 1, 2, …, m} is the set of transportation modes, and V = {v_k | v_k = (v_ij, m), i ∈ N, j ∈ N, m ∈ M} is the set of edges; v_k denotes an edge from node i to node j using transportation mode m; each node represents a city; between two connected nodes there may be one or more edges corresponding to different transportation modes, and different edges have different time costs, transportation costs and carbon emission costs; the time cost of an edge is denoted T_vk, its transportation cost C_vk, and its carbon emissions CE_vk;
S3-2: a reinforcement learning environment is constructed from the routing network graph; each node is a state S, and the current node selecting one of its edges to a next node is an action A; at each node, i.e. each state, an action is selected from the set of edges selectable at that node, transferring the agent to the next state and yielding a reward; the number of states equals the number of city nodes in the routing network graph; nodes are connected by edges, and the set of edges selectable from node i is v_ki = {v_k | v_k = (v_ij, m), j ∈ N, m ∈ M}; the total number of elements in v_ki is L, i.e. L edges can be selected starting from node i, so the number of actions selectable at node i is L;
s3-3: formulating a corresponding environment reward for each Q learning network; the Q learning networks may include a Q network aiming to minimize order execution time, a Q network aiming to minimize order execution cost and a Q network aiming to minimize carbon emissions, wherein the number of targets can be increased or decreased according to user requirements;
for the Q network that minimizes transit time, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -T_vk; if delivery is delayed by Td relative to the estimated time, a penalty of -Td is incurred. The goal is to maximize the sum of all state rewards, i.e. to minimize execution time;
for Q-networks that minimize transportation costs, the setting of rewards in the environment:
firstly, the transportation cost and the transit cost are assigned corresponding penalty values in the environment according to the selected services, i.e. the differences among road, rail and waterway, the distances, and the differences in loading and unloading charges at freight stations and wharfs;
secondly, the stacking cost, detention cost and delay cost are assigned corresponding penalty values in the environment according to the stacking time, detention time and delay time;
in addition, penalty values are set when the maximum transportation capacity or the maximum node transfer capacity is exceeded. For the Q network that minimizes transportation cost, when the agent transfers from the current state to the next state via v_k, its reward is set to R = -C_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize transportation cost;
for the Q network aiming to minimize carbon emissions, the reward in the environment is set as follows: when the agent transfers from the current state to the next state via v_k, its reward is set to R = -CE_vk; the goal is to maximize the sum of all state rewards, i.e. to minimize carbon emissions;
for all objectives, a common reward is set according to a co-existing constraint:
(1) when the carrying capacity exceeds the limit, the reward value is set to -θ·100;
(2) when the goods are turned back in the transportation process, setting the reward value to be-1000;
s3-4: setting the exploration strategy of reinforcement learning, which is the ε-greedy method;
s3-5: setting a network model of Q network learning under a single target to comprise two deep neural networks, namely a Q-current network and a Q-target network, wherein the two networks have the same structure and inconsistent updating frequency; updating the Q-current network in real time, wherein the updating frequency of the Q-target network is lower than that of the Q-current network, the Q-target network is updated once when the iteration number of the Q-current network reaches a set value C, and the parameter of the Q-current network is updated to the Q-target network; the Q network takes the state as input and outputs the value of all optional actions in the state; the network model comprises a memory pool E for storing experience; the memories in the memory pool are randomly extracted for updating the network, so that the correlation among the memories can be disturbed, and the training efficiency is improved;
s3-6: learning for a Q network under a single target;
s3-7: when the learning times reach the maximum learning times, the single-target Q network completes learning;
s3-8: and learning a plurality of single-target learning Q networks in parallel.
5. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 4, wherein the multi-objective Q table calculation module in step S4 calculates the value table of each state-action pair in combination with the game influence factors and the Q network trained for each single target, calculates the action value of each action in each state under the game factors, and generates the initial optimal order strategy by a greedy method, and the specific sub-steps are as follows:
s4-1: calculating the value of selectable actions under the state s of multiple targets by combining game influence factors;
s4-2: and generating an order initial strategy according to the generated Q table under multiple targets.
6. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 5, wherein the dynamic adjustment module in step S5 executes the order, adjusts game influence factors according to the execution condition, and adjusts the Q network under multiple targets until the order execution is completed, and the specific sub-steps are as follows:
s5-1: executing the order, monitoring the execution condition of the order, and adjusting the game influence factors according to the execution condition of the order to reduce the influence of uncertain factors on timeliness, for example by monitoring the time nodes, wherein the specific sub-steps are as follows:
s5-1-1: when the order is executed, the time nodes are monitored; if, due to the influence of uncertain factors, the order reaches state s later than the estimated time node T_s, the importance level of the time objective is raised by one level;
s5-1-2: recalculating the game influence factor;
s5-1-3: recalculating a Q table under multiple targets;
s5-1-4: updating the path according to the new Q table and updating the order execution strategy;
s5-1-5: and verifying whether the new order strategy meets the requirement, if so, executing the new strategy, and if not, turning to S5-1-2 until the new strategy meets the requirement.
7. The multimodal transportation dynamic path planning method based on game reinforcement learning as claimed in claim 3, wherein the step S2-3 of calculating game influence factors according to the levels of the targets comprises the following sub-steps:
s2-3-1: after the importance levels of the n objectives have been adjusted according to the requirements, the importance level of objective i is L_i;
s2-3-2: determining the value q_i corresponding to the level of objective i from L_i (level 1 → 1, level 2 → 3, level 3 → 5, level 4 → 7, level 5 → 9);
s2-3-3: calculating the influence factor δ_i of objective i according to the following formula:
δ_i = q_i / (q_1 + q_2 + … + q_n)
S2-3-4: the targets can be adjusted according to requirements, and the number and the content of the targets can be adjusted.
8. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 4, wherein in step S3-4, the exploration policy is the ε-greedy method, which includes the following sub-steps:
s3-4-1: setting a greedy factor epsilon, wherein the initial value of the greedy factor epsilon ranges from 0 to 1;
s3-4-2: randomly generating a number beta between 0 and 1;
s3-4-3: if β ≥ ε, the action with the largest action value in the action range, i.e. the action with the largest Q value, is selected; if β < ε, an action is selected at random from the action range;
s3-4-4: as training progresses, the exploration rate e becomes smaller as the iteration progresses.
9. The multi-type intermodal dynamic path planning method based on game reinforcement learning as claimed in claim 4, wherein in step S3-6, the learning for the Q network under a single target specifically includes the following sub-steps: setting the maximum number of learning episodes T, the state set S, the action set A, the step size α, the decay factor γ, the exploration rate ε, the Q-current network Q, the Q-target network Q′, the number of samples m for batch gradient descent, and the Q-target network parameter update frequency C;
s3-6-1: initializing the learning times i to be 1;
s3-6-2: when the learning frequency i is less than the maximum learning frequency T, go to step S3-6-3, otherwise go to step S3-7;
s3-6-3: initializing S to the state of the shipping origin;
s3-6-4: using S as input in the Q network to obtain Q value output corresponding to all actions of the Q network;
s3-6-5: selecting the corresponding action A from the current Q value output by the ε-greedy method, executing action A in state S to obtain a new state S′ and a reward R, judging whether S′ is the destination D and storing the result in the variable is_D; storing the five-tuple {S, A, R, S′, is_D} in the experience replay set E, and switching the state from S to S′;
s3-6-6: sampling m samples {S_j, A_j, R_j, S′_j, is_D_j}, j = 1, 2, …, m, from the experience replay set E, and calculating the current target action value ζ_j:
ζ_j = R_j + γ·max_a′ Q′(S′_j, A′_j)
where R_j is the reward obtained when the state transfers from S_j to S′_j, γ is the decay factor of the reward, and max_a′ Q′(S′_j, A′_j) is the maximum selectable action value at state S′_j;
s3-6-7: using the mean square error loss function
L(w) = (1/m) · Σ_{j=1..m} (ζ_j − Q(S_j, A_j; w))²,
updating all parameters w of the Q network through gradient back propagation of the neural network;
s3-6-8: if the learning count i is a multiple of the update frequency C, i.e. i % C == 0, updating the target Q network parameters as w′ = w;
s3-6-9: if S′ is not a terminal state, go to step S3-6-4 and continue learning; if S′ is a terminal state, the current episode ends and the learning count is incremented, i = i + 1; if the new count is less than the maximum number of learning episodes, go to step S3-6-3, and if the maximum number is reached, go to step S3-7;
s3-7: and when the learning times reach the maximum learning times, the single-target Q network completes the learning.
10. The game reinforcement learning-based multimodal transportation dynamic path planning method according to claim 5, wherein in step S4-1, the value of the selectable action in the state S under multiple targets is calculated by combining the game influence factors, and the specific sub-steps are as follows:
s4-1-1: inputting the state s into each trained Q network to obtain the values of the selectable actions output by each network in that state; the network for objective i outputs the value set Q_is = {Q_is1, Q_is2, …, Q_isl, …, Q_isL} of the actions in state s;
S4-1-2: according to the game influence factors δ_A, δ_B, …, δ_i, …, δ_n and Q_is, calculating the value of action A_sl in state s by the following formula:
Q_sl = δ_A·Q_Asl + δ_B·Q_Bsl + … + δ_n·Q_nsl = Σ_i δ_i·Q_isl
s4-1-3: traversing and calculating the values of all actions in all states to form a multi-target Q value table, wherein the horizontal axis of the Q value table is a state, the vertical axis of the Q value table is an action, and each state-action pair corresponds to a unique action value;
step S4-2, generating an order initial strategy according to the generated Q table under multiple targets, and the specific substeps are as follows:
s4-2-1: setting the state S to the initial state;
s4-2-2: looking up a table according to the state S to obtain a value set of the optional actions in the state S;
s4-2-3: selecting the action with the highest action value from the selectable actions by using a greedy method, and refreshing the state to the next state;
s4-2-4: repeatedly performing S4-2-2-S4-2-3 until the next state where the refresh state arrives is the terminating state destination point D;
s4-2-5: backtracking from the terminal state to the initial state, obtaining the optimal strategy, i.e. the execution route of the order and the predicted time node T_s at which each state s is reached from the initial state.
CN202110423315.7A 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning Active CN113159681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110423315.7A CN113159681B (en) 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110423315.7A CN113159681B (en) 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning

Publications (2)

Publication Number Publication Date
CN113159681A true CN113159681A (en) 2021-07-23
CN113159681B CN113159681B (en) 2023-02-14

Family

ID=76868994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110423315.7A Active CN113159681B (en) 2021-04-20 2021-04-20 Multi-type intermodal dynamic path planning method based on game reinforcement learning

Country Status (1)

Country Link
CN (1) CN113159681B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101782982A (en) * 2009-07-07 2010-07-21 上海海事大学 Multiple-target integer linear programming method for path choice of container multimodal transport
US20200300644A1 (en) * 2019-03-18 2020-09-24 Uber Technologies, Inc. Multi-Modal Transportation Service Planning and Fulfillment
CN111415048A (en) * 2020-04-10 2020-07-14 大连海事大学 Vehicle path planning method based on reinforcement learning
CN111626477A (en) * 2020-04-29 2020-09-04 河海大学 Multi-type joint transport path optimization method considering uncertain conditions
CN112381271A (en) * 2020-10-30 2021-02-19 广西大学 Distributed multi-objective optimization acceleration method for rapidly resisting deep belief network
CN112434849A (en) * 2020-11-19 2021-03-02 上海交通大学 Dangerous goods transportation path dynamic planning method based on improved multi-objective algorithm
CN112330070A (en) * 2020-11-27 2021-02-05 科技谷(厦门)信息技术有限公司 Multi-type intermodal transportation path optimization method for refrigerated container under carbon emission limit
CN112612207A (en) * 2020-11-27 2021-04-06 合肥工业大学 Multi-target game solving method and system under uncertain environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
甄远迪 et al.: "Multi-objective planning of container multimodal transport under uncertainty" (不确定情况下集装箱多式联运多目标规划), Computer Applications and Software (计算机应用与软件), No. 05, 12 May 2018 (2018-05-12), pages 21-26 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114612049A (en) * 2022-05-11 2022-06-10 弥费实业(上海)有限公司 Path generation method and device, computer equipment and storage medium
CN116107276A (en) * 2022-12-30 2023-05-12 福州大学 Logistics storage optimal coordination method based on distributed differential game

Also Published As

Publication number Publication date
CN113159681B (en) 2023-02-14

Similar Documents

Publication Publication Date Title
CN113159681B (en) Multi-type intermodal dynamic path planning method based on game reinforcement learning
CN108847037B (en) Non-global information oriented urban road network path planning method
CN110225535B (en) Heterogeneous wireless network vertical switching method based on depth certainty strategy gradient
CN111967668A (en) Cold chain logistics path optimization method based on improved ant colony algorithm
CN107330561B (en) Multi-target shore bridge-berth scheduling optimization method based on ant colony algorithm
CN104766484B (en) Traffic Control and Guidance system and method based on Evolutionary multiobjective optimization and ant group algorithm
CN107944605A (en) A kind of dynamic traffic paths planning method based on data prediction
CN111445084B (en) Logistics distribution path optimization method considering traffic conditions and double time windows
CN102571570A (en) Network flow load balancing control method based on reinforcement learning
CN113296513B (en) Rolling time domain-based emergency vehicle dynamic path planning method in networking environment
CN113343575A (en) Multi-target vehicle path optimization method based on improved ant colony algorithm
CN112633555A (en) Method and system for optimizing logistics transportation scheme
CN112001064A (en) Full-autonomous water transport scheduling method and system between container terminals
CN108764805A (en) A kind of multi-model self-adapting recommendation method and system of collaborative logistics Services Composition
Siswanto et al. Maritime inventory routing problem with multiple time windows
CN114048924A (en) Multi-distribution center site selection-distribution path planning method based on hybrid genetic algorithm
CN115619065A (en) Multi-type intermodal transport path optimization method and system, electronic equipment and medium
CN114253215B (en) Civil cabin door automatic drilling and riveting path planning method based on improved ant colony algorithm
CN109800911B (en) Unified navigation method for delivery paths of multiple couriers
CN108830401B (en) Dynamic congestion charging optimal rate calculation method based on cellular transmission model
CN112330006A (en) Optimal path planning method applied to logistics distribution based on improved ant colony algorithm
Fallah et al. A green competitive vehicle routing problem under uncertainty solved by an improved differential evolution algorithm
CN103595652A (en) Method for grading QoS energy efficiency in power communication network
CN112561160A (en) Dynamic target traversal access sequence planning method and system
CN115470651A (en) Ant colony algorithm-based vehicle path optimization method with road and time window

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant