CN110716550B - Gear shifting strategy dynamic optimization method based on deep reinforcement learning - Google Patents
Gear shifting strategy dynamic optimization method based on deep reinforcement learning
- Publication number
- CN110716550B (Application No. CN201911076016.XA)
- Authority
- CN
- China
- Prior art keywords
- network
- strategy
- prediction
- shifting
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0223—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Control Of Transmission Device (AREA)
Abstract
The invention belongs to the field of construction machinery and vehicle engineering, and in particular relates to a method for dynamically optimizing a gear-shifting strategy based on deep reinforcement learning. The method comprises the following steps: (1) determine the state input variables and action output variables of the shifting strategy; (2) determine the Markov decision process of the shifting strategy from the state input variables and action output variables; (3) establish the reward function of the reinforcement-learning shifting strategy according to the shifting-strategy objective; (4) solve the deep reinforcement learning shifting strategy from the Markov decision process and the reward function; (5) load the prediction Q network computed in step (4) into the shift-strategy controller, so that the construction machine or vehicle selects gears according to the shift-strategy controller while driving; (6) periodically update the prediction Q network during driving. The invention updates the shifting strategy through a deep reinforcement learning method, realizing dynamic optimization of the shifting strategy.
Description
Technical Field
The invention belongs to the field of construction machinery and vehicle engineering, and in particular relates to a method for dynamically optimizing a gear-shifting strategy based on deep reinforcement learning.
Background
The shifting strategy is one of the core technologies of current construction-machinery and vehicle control; it describes how the gear changes with the selected parameters while the machine or vehicle is driving. The choice of solution method is a key consideration when establishing a shifting strategy. Solution methods for shifting strategies include graphical methods, analytical methods, genetic algorithms, dynamic programming and so on. Solving and optimizing the shifting strategy is the core direction of shifting-strategy research, and dynamic optimization of the shifting strategy in particular is one of its difficult problems.
"Correction of the optimal-dynamic-performance AMT shifting schedule based on variable load", Li Hao, Control Engineering, Vol. 22, No. 1, pp. 50-54, January 2015. On the basis of the two-parameter shifting strategy, acceleration is introduced as an additional shifting parameter, realizing dynamic three-parameter shifting that takes acceleration into account. The solution method is analytical: the acceleration-speed curve must be fitted for every throttle opening, so the solution is complicated and computationally expensive; it can also only address a single performance index and cannot be optimized dynamically for the actual driving conditions.
"Performance Evaluation Approach Improvement for Individualized Gearshift Schedule Optimization", Yin X, May 2016. A genetic algorithm is used to optimize the shifting strategy, improving its overall performance and overcoming the limitation of the analytical method to a single performance index, but it is likewise unable to optimize dynamically for actual driving conditions.
"Optimal gear shift strategies for fuel economy and driveability", Viet Dac Ngo, Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, Vol. 227, No. 10, pp. 1398-1413, October 2013. The shifting strategy is solved for a specific driving cycle by dynamic programming. The drawback is that dynamic programming must construct a complex state map, represented as a table, whose complexity depends on the discretization used in the algorithm. An overly complex state map converges slowly or fails to converge because of the Bellman curse of dimensionality. Moreover, since the optimization targets a specific driving cycle, it cannot be performed dynamically during driving.
Among existing patents, Patent Application No. 201710887558.X discloses a method for optimizing the vehicle shifting schedule with a dynamic programming algorithm. In its embodiments, shifting schedules based on economy and on dynamic performance are formulated respectively. When dynamic programming solves the shifting schedule it must construct a complex state map whose complexity depends on the discretization used in the algorithm; an overly complex state map converges slowly or fails to converge because of the Bellman curse of dimensionality. It also cannot be optimized dynamically for actual driving conditions.
Among existing patents, Patent Application No. 201811306659.4 discloses a method and system for correcting a shifting strategy based on driving intention. The current shift correction coefficient and compensation offset are updated according to the driver's driving behaviour, and the original shifting strategy is corrected, realizing a dynamic update of the shifting strategy. However, the dynamic update rules of the shifting strategy must be formulated manually, the optimization result depends heavily on that manual formulation, and the method is not general: it applies only to a single vehicle model. Its degree of intelligence is low.
In general, most existing methods for solving or optimizing shifting strategies cannot perform dynamic optimization for actual driving conditions and adapt poorly. Those shifting strategies that can be optimized dynamically require manually formulated update rules, so their intelligence and generality are low.
SUMMARY OF THE INVENTION
The purpose of the present invention is to provide a method for dynamically optimizing a shifting strategy based on deep reinforcement learning. The invention constructs the Markov decision process and the reward function of the shifting strategy, solves the shifting strategy with a deep reinforcement learning method, and then loads the prediction Q network obtained by the deep reinforcement learning method into the shift-strategy controller to carry out gear selection. Meanwhile, driving data are collected during everyday driving and the shifting strategy is updated with the deep reinforcement learning method, realizing dynamic optimization of the shifting strategy.
The technical solution for achieving the purpose of the present invention is a method for dynamically optimizing a shifting strategy based on deep reinforcement learning, comprising the following steps:
Step (1): determine the state input variables and action output variables of the shifting strategy;
Step (2): determine the Markov decision process of the shifting strategy from the state input variables and action output variables of step (1);
Step (3): establish the reward function of the reinforcement-learning shifting strategy according to the shifting-strategy objective;
Step (4): solve the deep reinforcement learning shifting strategy from the Markov decision process of step (2) and the reward function of step (3); specifically, first compute Markov chains through the Markov decision process and the reward function, save the Markov chains into the experience pool, and then update the prediction Q network of the deep reinforcement learning shifting strategy from the data in the experience pool;
Step (5): load the prediction Q network computed in step (4) into the shift-strategy controller; while the construction machine or vehicle is driving, it selects the gear according to the shift-strategy controller;
Step (6): during driving, collect the driving data of the construction machine or vehicle and save them into the experience pool, and periodically update the prediction Q network; after each update, load the prediction Q network into the shift-strategy controller, realizing dynamic optimization of the shifting strategy.
Further, the state input variables in step (1) include vehicle speed v, acceleration, throttle opening α_t, road gradient and ground friction resistance coefficient; the action output variables include gear operations and shift operations, where a gear operation is an upshift, a downshift or holding the current gear, and a shift operation is the selected gear n_g.
Further, the Markov decision process of the shifting strategy in step (2) is expressed as a transfer function that maps the current state and the selected action to the state at the next time step; the transfer function has the form:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next time step, s_t is the current state variable and a_t is the selected action variable, with s ∈ S and a ∈ A, where S is the set of state variables and A is the set of action variables.
Further, the shifting-strategy reward function in step (3) is positively correlated with the shifting-strategy objective, and the shifting-strategy objective includes dynamic performance, economy and comfort.
Further, the shifting-strategy objective is a dynamic-performance shifting strategy, described as the construction machine or vehicle reaching the maximum vehicle speed in the shortest time t subject to a comfort constraint; the reward and penalty mechanism is given by a piecewise expression in which r is the reward computed by the reward and penalty mechanism; r_t is the temporary reward, r_t = -0.001·||v_{Tmax} − v||; v_{Tmax} is the maximum vehicle speed at the current throttle opening α_t; J is the jerk of the construction machine or vehicle; and J_{max} is the designed maximum allowable jerk.
Further, the Markov chain in step (4) has the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward computed according to the reward objective.
Further, the deep reinforcement learning method in step (4) contains two neural networks with the same structure but different parameters, called the prediction Q network and the target Q network; the prediction Q network computes the Q value of each action in the current state, and the target Q network is used to update the prediction Q network.
Further, in step (4), the action variable a_t in the constructed Markov chain is selected by an e-greedy algorithm defined over the prediction Q network, where Q_p is the prediction Q network, θ_p are the prediction Q network parameters and e is the greedy-algorithm parameter;
In step (4), the Markov chains are saved into the experience pool, and the prediction Q network of the deep reinforcement learning shifting strategy is then updated from the data in the experience pool; the prediction Q network is used to compute the Q values of the gear set A in driving state s_t, and the output of the prediction Q network is Q_p(s, A, θ_p).
Further, in step (5), while the construction machine or vehicle is driving it selects the gear according to the shift-strategy controller, and the shift controller selects the appropriate gear a* according to the prediction Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the prediction Q network and θ_p are the prediction Q network parameters.
Further, the driving data collected in step (6) include vehicle speed, throttle opening, acceleration, road gradient and ground friction resistance coefficient;
There are two methods for updating the prediction Q network in step (6): in method 1, the transfer function of step (2) is reconstructed from the driving data of the construction machine or vehicle and the prediction Q network is then updated according to steps (3) and (4); in method 2, the prediction Q network is updated directly according to the prediction-Q-network update method of step (4);
In method 1, the transfer function of step (2) is reconstructed from the collected driving data, either by recomputing the parameters of the transfer function to obtain a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting or a Fourier-transform method;
In method 2, the driving data of the construction machine or vehicle are collected and the prediction Q network is then updated according to the update method of step (4), which is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) − Q_p(s, a, θ_p))²
where γ is the reward discount factor, α is the neural-network learning rate, Q_t is the target Q network and θ_t are the target Q network parameters.
Compared with the prior art, the present invention has the following significant advantages:
(1) By adopting a deep reinforcement learning method, and because Markov chains containing the Markov decision process and the reward function can be constructed from the driving process of the construction machine or vehicle to update the prediction Q network, the present application solves the problem of solving and dynamically optimizing the shifting strategy and has strong adaptability;
(2) Because the algorithm uses the same steps for solving and for dynamic optimization and is not tied to the controlled object, the method applies to different vehicle types such as passenger cars, construction machinery and vehicles, special-purpose vehicles and electric vehicles; the transfer function can be fitted with a neural network, linear fitting or a Fourier-transform method regardless of the object to which it is applied, so the method is highly general;
(3) Because the algorithm is independent of the controlled object and can dynamically optimize the shifting strategy, solving and dynamically optimizing the shifting strategy with deep reinforcement learning is highly intelligent;
(4) Gear selection is carried out by the prediction Q network of deep reinforcement learning instead of the tabular form of traditional methods; because the neural network has strong fitting ability and can handle shifting strategies with high-dimensional state variables, the Bellman curse of dimensionality is avoided.
Description of the Drawings
FIG. 1 is a schematic diagram of the method of the present invention for dynamically optimizing a shifting strategy based on deep reinforcement learning.
FIG. 2 is a flow chart of solving the deep reinforcement learning shifting strategy in the present invention.
FIG. 3 is a diagram of the neural-network structure model used in the present invention.
FIG. 4 is a schematic diagram of the dynamic optimization process of the shifting strategy of the present invention.
Detailed Description
The present invention provides a method for dynamically optimizing a shifting strategy based on deep reinforcement learning. The invention constructs the Markov decision process and the reward function of the shifting strategy, and then solves the shifting strategy with a deep reinforcement learning method. The prediction Q network obtained by the deep reinforcement learning method is then loaded into the shift-strategy controller to carry out gear selection. Meanwhile, driving data are collected during everyday driving and the shifting strategy is updated with the deep reinforcement learning method, realizing dynamic optimization of the shifting strategy.
A method for dynamically optimizing a shifting strategy based on deep reinforcement learning comprises the following steps:
Step 1: determine the state variables and action variables of the shifting strategy.
Step 2: determine the Markov decision process of the shifting strategy from the state input variables and action output variables.
Step 3: establish the reward function of the reinforcement-learning shifting strategy according to the shifting-strategy optimization objective.
Step 4: solve the deep reinforcement learning shifting strategy from the Markov decision process of step 2 and the reward function of step 3. First compute Markov chains through the established Markov decision process and the reward function and save them into the experience pool, then update the prediction Q network of the deep reinforcement learning shifting strategy from the data in the experience pool.
Step 5: load the prediction Q network computed in step 4 into the shift-strategy controller; while the construction machine or vehicle is driving, it selects the gear according to the shift-strategy controller.
Step 6: during driving, collect the driving data of the construction machine or vehicle and save them into the experience pool, and periodically update the prediction Q network; after each update, load the prediction Q network into the shift-strategy controller, realizing dynamic optimization of the shifting strategy.
Further, in step 1, the state variables of the shifting strategy are driving-state variables of the construction machine or vehicle or external environment variables. The action variables are gear operations or shift operations. A gear operation is an upshift, a downshift or holding the current gear; a shift operation is the selected gear.
In step 2, the Markov decision process of the shifting strategy is expressed as a transfer function T that maps the current state and the selected action to the state at the next time step. The transfer function has the form:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next time step, s_t is the current state variable and a_t is the selected action variable, with s ∈ S and a ∈ A; S is the set of state variables and A is the set of action variables. In the shifting strategy, the state variables are driving-state variables of the construction machine or vehicle or external environment variables, including vehicle speed, throttle opening, acceleration, road gradient and ground friction resistance coefficient. The action variables include gear operations or shift operations.
In step 3, the established shifting-strategy reward function is positively correlated with the shifting objective.
In step 3, the shifting objectives include dynamic performance, economy and comfort.
In step 4, Markov chains are computed through the established Markov decision process and the reward function. A Markov chain has the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward computed according to the reward objective.
In step 4, the action a_t in the constructed Markov chain is selected by an e-greedy algorithm defined over the prediction Q network, where Q_p is the prediction Q network, θ_p are the prediction Q network parameters and e is the greedy-algorithm parameter.
In step 4, the Markov chains are saved into the experience pool, and the prediction Q network of the deep reinforcement learning shifting strategy is then updated from the data in the experience pool. The prediction Q network is used to compute the Q values of the gear set A in driving state s_t. Its output is Q_p(s, A, θ_p), and its update method is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) − Q_p(s, a, θ_p))²
where γ is the reward discount factor, α is the neural-network learning rate, Q_t is the target Q network and θ_t are the target Q network parameters.
In step 5, while the construction machine or vehicle is driving, it selects the gear according to the shift-strategy controller. The shift controller selects the appropriate gear a* according to the prediction Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the prediction Q network and θ_p are the prediction Q network parameters.
In step 6, the collected driving data include vehicle speed, throttle opening, acceleration, road gradient and ground friction resistance coefficient.
In step 6, there are two methods for updating the prediction Q network. Method 1 reconstructs the transfer function of step 2 from the driving data of the construction machine or vehicle and then updates the prediction Q network according to steps 3 and 4. Method 2 directly updates the prediction Q network according to the prediction-Q-network update method of step 4.
In step 6, method 1 reconstructs the transfer function of step 2 from the collected driving data, either by recomputing the parameters of the transfer function to obtain a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting, a Fourier-transform method, etc.
In step 6, method 2 collects the driving data of the construction machine or vehicle and then updates the prediction Q network according to the update method of step 4, which is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) − Q_p(s, a, θ_p))²
In step 6, the dynamic optimization of the shifting strategy is realized through the update of the prediction Q network in deep reinforcement learning.
Example
The present invention provides a method for dynamically optimizing a shifting strategy based on deep reinforcement learning. The invention constructs the Markov decision process of the shifting strategy and then solves the shifting strategy with a deep reinforcement learning method. After solving, the prediction Q network trained by deep reinforcement learning is loaded into the shift-strategy controller to carry out gear selection. Then, during driving, the prediction Q network is updated from the collected driving data of the construction machine or vehicle to realize dynamic optimization of the shifting strategy. The update methods of the prediction Q network include: reconstructing the shift-strategy transfer function from the driving data and then updating the prediction Q network, and updating the prediction Q network directly with the deep reinforcement learning method. The principle of the method for dynamically optimizing a shifting strategy based on deep reinforcement learning is shown in FIG. 1 and comprises the following steps:
Step 1: determine the state variables and action variables of the shifting strategy.
Step 2: determine the Markov decision process of the shifting strategy from the state input variables and action output variables.
Step 3: establish the reward function of the reinforcement-learning shifting strategy according to the shifting-strategy optimization objective.
Step 4: solve the deep reinforcement learning shifting strategy from the Markov decision process of step 2 and the reward function of step 3. First compute Markov chains through the established Markov decision process and the reward function and save them into the experience pool, then update the prediction Q network of the deep reinforcement learning shifting strategy from the data in the experience pool.
Step 5: load the prediction Q network computed in step 4 into the shift-strategy controller; while the construction machine or vehicle is driving, it selects the gear according to the shift-strategy controller.
Step 6: during driving, collect the driving data of the construction machine or vehicle and save them into the experience pool, and periodically update the prediction Q network; after each update, load the prediction Q network into the shift-strategy controller, realizing dynamic optimization of the shifting strategy.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and the embodiment.
Step 1: determine the state variables and action variables of the shifting strategy. In this embodiment, the state variables of the shifting strategy are vehicle speed v, acceleration and throttle opening α_t. The action variable is the gear n_g.
In this embodiment, the Markov decision process of the shifting strategy is determined from the state variables (vehicle speed, acceleration, throttle opening) and the action variable (gear). The state transfer function T of the Markov decision process is a longitudinal-dynamics expression in which T_e is the engine output torque; i_g is the transmission ratio corresponding to gear n_g; i_0 is the final-drive ratio; η_t is the driveline efficiency; m is the total vehicle mass; β is the equivalent slope-resistance coefficient; C_d is the air-drag coefficient; A is the frontal area of the vehicle; F_b is the braking force; R is the effective rolling radius of the tire; and ρ is the air density.
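The transfer-function equation itself appears only as an image in the original publication, so it is not reproduced above. As an illustration, a minimal sketch of a longitudinal-dynamics transition consistent with the listed symbols is given below; the specific force balance, the parameter values, the gear-ratio table and the placeholder engine map are assumptions for illustration, not the patent's exact formula.

```python
import numpy as np

# Illustrative vehicle parameters (assumed values, not taken from the patent)
M = 15000.0            # total vehicle mass m [kg]
I0 = 4.0               # final-drive ratio i_0
ETA_T = 0.9            # driveline efficiency eta_t
CD, AREA = 0.8, 6.0    # air-drag coefficient C_d, frontal area A [m^2]
RHO = 1.206            # air density rho [kg/m^3]
R_WHEEL = 0.6          # effective tire rolling radius R [m]
G = 9.81
GEAR_RATIOS = [6.0, 4.0, 2.8, 1.9, 1.3, 1.0]   # i_g for gears n_g = 1..6

def engine_torque(throttle, engine_rpm):
    """Placeholder engine map T_e(alpha_t, n_e); a real application would use a lookup table."""
    return throttle * 900.0 * max(0.0, 1.0 - engine_rpm / 2400.0)

def transition(state, gear, beta=0.02, f_brake=0.0, dt=0.1):
    """One step of s_{t+1} = T(s_t, a_t) under an assumed longitudinal force balance."""
    v, _, throttle = state                                   # state = (speed, acceleration, throttle)
    i_g = GEAR_RATIOS[gear - 1]
    n_e = v / R_WHEEL * i_g * I0 * 60.0 / (2.0 * np.pi)      # engine speed [rpm]
    f_drive = engine_torque(throttle, n_e) * i_g * I0 * ETA_T / R_WHEEL
    f_resist = M * G * beta + 0.5 * RHO * CD * AREA * v ** 2 + f_brake
    acc = (f_drive - f_resist) / M
    v_next = max(0.0, v + acc * dt)
    return np.array([v_next, acc, throttle])
```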
Step 3: establish the reinforcement-learning shifting-strategy reward function according to the shifting objective. In this embodiment, the learning objective is a dynamic-performance shifting strategy, described as the construction machine or vehicle reaching the maximum vehicle speed in the shortest time t subject to a comfort constraint. The reward and penalty mechanism is given by a piecewise expression in which r is the reward computed by the reward and penalty mechanism; r_t is the temporary reward, r_t = -0.001·||v_{Tmax} − v||; v_{Tmax} is the maximum vehicle speed at the current throttle opening α_t; J is the jerk of the construction machine or vehicle; and J_{max} is the designed maximum allowable jerk.
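The piecewise reward expression is likewise shown only as an image in the original, so only the symbol definitions survive in the text. The sketch below is one plausible reading of the mechanism described here — the temporary reward r_t = -0.001·||v_{Tmax} − v|| while accelerating, a penalty when the jerk J exceeds J_max, and a bonus once the maximum speed is reached; the penalty and bonus magnitudes are assumptions.

```python
def reward(v, v_tmax, jerk, j_max, reached_top_speed):
    """Reward/penalty sketch for the dynamic-performance shifting objective."""
    r_t = -0.001 * abs(v_tmax - v)   # temporary reward from the description
    if jerk > j_max:                 # comfort (jerk) constraint violated
        return r_t - 1.0             # penalty magnitude is an assumption
    if reached_top_speed:
        return r_t + 10.0            # terminal bonus magnitude is an assumption
    return r_t
```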
Step 4: solve the deep reinforcement learning shifting strategy from the Markov decision process of step 2 and the reward function of step 3. First compute Markov chains through the established Markov decision process and the reward function, save them into the experience pool, then update the prediction Q network of the deep reinforcement learning shifting strategy from the data in the experience pool. The flow of step 4 is shown in FIG. 2. The specific steps are as follows.
First step: initialize the state variables and action variables, and compute the state at the next time step with the established transfer function of the Markov decision process.
Second step: compute the reward through the designed reward and penalty mechanism.
Third step: express the above state, action, next-time-step state and reward in the form of a Markov chain and save it into the experience pool.
Fourth step: take the state at the next time step as the current state; the prediction Q network computes the Q value of each action in the current state, and the greedy algorithm then selects the actually applied gear in the current state from these Q values. Then return to the first step and repeat the loop.
In the above steps, when the number of Markov chains in the experience pool reaches a predetermined number, the update of the prediction Q network begins.
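A compact sketch of this collection loop, assuming the transition and reward sketches above and a network of the kind sketched after the structure description below: e-greedy gear selection with the prediction Q network, storage of the Markov chains <s_t, a_t, r_t, s_{t+1}> in the experience pool, and a check on the pool size before updates begin. The hyper-parameter values (e, pool sizes, v_Tmax, J_max) are illustrative.

```python
import random
from collections import deque

import torch

EPSILON = 0.1          # greedy-algorithm parameter e (illustrative)
MIN_POOL_SIZE = 1000   # predetermined number of chains before updates begin

experience_pool = deque(maxlen=100_000)

def select_gear(q_net, state, n_gears):
    """e-greedy selection over the gear set A using the prediction Q network."""
    if random.random() < EPSILON:
        return random.randrange(1, n_gears + 1)
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item()) + 1

def collect(q_net, initial_state, n_gears=6, steps=10_000, dt=0.1):
    state = initial_state
    for _ in range(steps):
        gear = select_gear(q_net, state, n_gears)
        next_state = transition(state, gear, dt=dt)         # transfer function of step 2
        r = reward(next_state[0], v_tmax=25.0,               # reward mechanism of step 3
                   jerk=abs(next_state[1] - state[1]) / dt,
                   j_max=10.0, reached_top_speed=next_state[0] >= 25.0)
        experience_pool.append((state, gear, r, next_state))  # Markov chain
        state = next_state
        if len(experience_pool) >= MIN_POOL_SIZE:
            # enough chains collected: the prediction-Q-network update (next sketch) may run here
            pass
```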
The update of the prediction Q network is carried out jointly by the prediction Q network and the target Q network; the update method of the prediction Q network is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) − Q_p(s, a, θ_p))²
where γ is the reward discount factor, α is the neural-network learning rate, Q_t is the target Q network and θ_t are the target Q network parameters.
During the update of the prediction Q network, the parameters of the prediction Q network must be copied into the target Q network at regular intervals to update the target Q network.
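A sketch of this update, reusing the experience_pool from the collection sketch: sample saved chains, move Q_p(s, a, θ_p) toward r + γ·max_a Q_t(s', a, θ_t) by minimizing the squared difference, and copy θ_p into the target network every fixed number of steps. Batch size, γ and the synchronization interval are illustrative, the learning rate α lives in the optimizer, and the squared term of the update above is realized as the usual squared-error loss.

```python
import random

import numpy as np
import torch
import torch.nn.functional as F

GAMMA = 0.9        # reward discount factor gamma (illustrative)
BATCH = 64
SYNC_EVERY = 500   # interval at which theta_p is copied into theta_t (assumption)

def update_prediction_q(q_pred, q_target, optimizer, step):
    batch = random.sample(experience_pool, BATCH)
    s, a, r, s_next = zip(*batch)
    s = torch.as_tensor(np.array(s), dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64) - 1                 # gears 1..N -> indices 0..N-1
    r = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(np.array(s_next), dtype=torch.float32)

    q_sa = q_pred(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q_p(s, a, theta_p)
    with torch.no_grad():
        target = r + GAMMA * q_target(s_next).max(dim=1).values   # r + gamma * max_a Q_t
    loss = F.mse_loss(q_sa, target)                               # squared TD error

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % SYNC_EVERY == 0:                                    # target-network update
        q_target.load_state_dict(q_pred.state_dict())
```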
The prediction Q network and the target Q network have the same neural-network structure. In this embodiment, the neural-network structure model used by the prediction Q network and the target Q network is shown in FIG. 3. The model has five fully connected layers as hidden layers and uses the rectified linear unit (ReLU) as the activation function. The ReLU function is expressed as:
ReLU(x) = max(0, Wx + b)
where W is the weight of the neural network, b is the bias of the neural network and x is the input of the neural network.
In this embodiment, the input of the neural network is the state variables (vehicle speed, throttle opening, acceleration). The output layer outputs the Q values corresponding to all gears n_g. A larger Q value means that, in the current state, selecting the gear corresponding to that Q value yields a larger maximized discounted cumulative reward.
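A sketch of this structure in PyTorch, assuming the hidden width and the number of gears (neither is specified in the text; both are chosen here for illustration): three state inputs, five fully connected hidden layers with ReLU, and one output Q value per gear.

```python
import torch
import torch.nn as nn

class GearQNetwork(nn.Module):
    """Prediction/target Q network: 3 state inputs -> one Q value per gear n_g."""

    def __init__(self, n_states: int = 3, n_gears: int = 6, hidden: int = 128):
        super().__init__()
        layers = []
        width = n_states
        for _ in range(5):                        # five fully connected hidden layers
            layers += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        layers.append(nn.Linear(width, n_gears))  # output layer: Q values of all gears
        self.net = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Prediction and target networks share the structure but keep separate parameters.
q_pred = GearQNetwork()
q_target = GearQNetwork()
q_target.load_state_dict(q_pred.state_dict())
optimizer = torch.optim.Adam(q_pred.parameters(), lr=1e-3)   # learning rate alpha (illustrative)
```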
Step 5: load the prediction Q network computed in step 4 into the shift-strategy controller; while the construction machine or vehicle is driving, it selects the gear according to the shift-strategy controller.
In step 5, the construction machine or vehicle selects the gear according to the shift-strategy controller, specifically:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the prediction Q network and θ_p are the prediction Q network parameters.
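In code form, the controller-side choice is simply a greedy evaluation of the prediction Q network on the measured state (no exploration); a minimal sketch using the GearQNetwork assumed above:

```python
import torch

def controller_select_gear(q_pred, state):
    """Shift-strategy controller: a*(s) = argmax_a Q_p(s, a, theta_p), a in A."""
    with torch.no_grad():
        q_values = q_pred(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item()) + 1   # map index 0..N-1 back to gear 1..N

# Example: state = (vehicle speed [m/s], acceleration [m/s^2], throttle opening)
gear = controller_select_gear(q_pred, [12.3, 0.4, 0.55])
```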
Step 6: during driving, collect the driving data of the construction machine or vehicle and save them into the experience pool, and periodically update the prediction Q network; after each update, load the prediction Q network into the shift-strategy controller, realizing dynamic optimization of the shifting strategy.
In step 6, there are two methods for updating the prediction Q network. Method 1 reconstructs the transfer function of step 2 from the driving data of the construction machine or vehicle and then updates the prediction Q network according to steps 3 and 4. Method 2 directly updates the prediction Q network according to the prediction-Q-network update method of step 4.
In step 6, method 1 reconstructs the transfer function of step 2 from the collected driving data; the reconstruction either recomputes the parameters of the transfer function to obtain a transfer function with the same structure but different parameters, or fits the transfer function with a neural network, linear fitting, a Fourier-transform method, etc.
In this embodiment, the parameters of the transfer function can be recomputed, or the transfer function can be fitted with a neural network, linear fitting or a Fourier-transform method. Whichever form of reconstruction is used, the reconstructed transfer function can be uniformly expressed as:
s_{t+1} = T_new(s_t, a_t, Θ)
where Θ are the parameters of the transfer function.
After the reconstruction, steps 4 and 5 must be carried out again to obtain a new prediction Q network.
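For method 1, one possible realization of the neural-network variant of the reconstruction is to regress s_{t+1} on (s_t, a_t) from the logged driving data; the sketch below shows that variant under assumed layer sizes and training settings. Whether the physical parameters are re-identified or a black-box fit such as this is used is a design choice.

```python
import torch
import torch.nn as nn

def refit_transfer_function(states, gears, next_states, epochs=200):
    """Fit s_{t+1} = T_new(s_t, a_t, Theta) from logged (s_t, a_t, s_{t+1}) tuples."""
    x = torch.cat([torch.as_tensor(states, dtype=torch.float32),
                   torch.as_tensor(gears, dtype=torch.float32).unsqueeze(1)], dim=1)
    y = torch.as_tensor(next_states, dtype=torch.float32)

    model = nn.Sequential(nn.Linear(x.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, y.shape[1]))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return model   # Theta corresponds to the fitted network weights
```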
In step 6, method 2 collects the driving data of the construction machine or vehicle and then updates the prediction Q network according to the update method of step 4, realizing the dynamic optimization process of the shifting strategy; the dynamic optimization process of the shifting strategy is shown in FIG. 4. The specific process is as follows:
First step: collect the driving data of the construction machine or vehicle.
Second step: process the collected driving data; the processed data are expressed in the form of Markov chains, which can be written as:
<s_t, a_t, r_t, s_{t+1}>
Third step: update the prediction Q network; the update is carried out jointly by the prediction Q network and the target Q network, with the method:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) − Q_p(s, a, θ_p))²
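For method 2, the logged driving signals only need to be converted into the same <s_t, a_t, r_t, s_{t+1}> chains and appended to the experience pool, after which the update sketched above is applied unchanged. A minimal conversion, reusing the reward sketch and assuming the signal names and sampling interval, might look like this:

```python
def log_to_chains(speeds, accels, throttles, gears, v_tmax, j_max, dt=0.1):
    """Turn logged driving signals into Markov chains <s_t, a_t, r_t, s_{t+1}>."""
    chains = []
    for t in range(len(speeds) - 1):
        s_t = (speeds[t], accels[t], throttles[t])
        s_next = (speeds[t + 1], accels[t + 1], throttles[t + 1])
        jerk = abs(accels[t + 1] - accels[t]) / dt
        r_t = reward(speeds[t + 1], v_tmax, jerk, j_max,
                     reached_top_speed=speeds[t + 1] >= v_tmax)
        chains.append((s_t, gears[t], r_t, s_next))
    return chains

# The resulting chains are appended to the experience pool, and the prediction
# Q network is then updated exactly as in the update sketch above.
```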
In addition to the above embodiment, the present invention may have other implementations. Any technical solution formed by equivalent substitution or equivalent transformation falls within the scope of protection claimed by the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911076016.XA CN110716550B (en) | 2019-11-06 | 2019-11-06 | Gear shifting strategy dynamic optimization method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911076016.XA CN110716550B (en) | 2019-11-06 | 2019-11-06 | Gear shifting strategy dynamic optimization method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110716550A CN110716550A (en) | 2020-01-21 |
CN110716550B true CN110716550B (en) | 2022-07-22 |
Family
ID=69213797
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911076016.XA Active CN110716550B (en) | 2019-11-06 | 2019-11-06 | Gear shifting strategy dynamic optimization method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110716550B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111487863B (en) * | 2020-04-14 | 2022-06-17 | 东南大学 | A Reinforcement Learning Control Method for Active Suspension Based on Deep Q Neural Network |
CN111882030B (en) * | 2020-06-29 | 2023-12-05 | 武汉钢铁有限公司 | Ingot adding strategy method based on deep reinforcement learning |
CN111965981B (en) * | 2020-09-07 | 2022-02-22 | 厦门大学 | Aeroengine reinforcement learning control method and system |
CN112395690A (en) * | 2020-11-24 | 2021-02-23 | 中国人民解放军海军航空大学 | Reinforced learning-based shipboard aircraft surface guarantee flow optimization method |
CN112861269B (en) * | 2021-03-11 | 2022-08-30 | 合肥工业大学 | Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction |
CN114662982B (en) * | 2022-04-15 | 2023-07-14 | 四川大学 | A Multilevel Dynamic Reconfiguration Method of Urban Distribution Network Based on Machine Learning |
CN116069014B (en) * | 2022-11-16 | 2023-10-10 | 北京理工大学 | Vehicle automatic control method based on improved deep reinforcement learning |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107797534A (en) * | 2017-09-30 | 2018-03-13 | 安徽江淮汽车集团股份有限公司 | A kind of pure electronic automated driving system |
CN108407797A (en) * | 2018-01-19 | 2018-08-17 | 洛阳中科龙网创新科技有限公司 | A method of the realization agricultural machinery self shifter based on deep learning |
CN109325624A (en) * | 2018-09-28 | 2019-02-12 | 国网福建省电力有限公司 | A method for forecasting monthly electricity demand based on deep learning |
CN109991856A (en) * | 2019-04-25 | 2019-07-09 | 南京理工大学 | An Integrated Coordinated Control Method for Robotic Driving Vehicles |
CN110136481A (en) * | 2018-09-20 | 2019-08-16 | 初速度(苏州)科技有限公司 | A kind of parking strategy based on deeply study |
CN110244701A (en) * | 2018-03-08 | 2019-09-17 | 通用汽车环球科技运作有限责任公司 | Method and apparatus for reinforcement learning of autonomous vehicles based on automatically generated sequence of lessons |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6442455B1 (en) * | 2000-12-21 | 2002-08-27 | Ford Global Technologies, Inc. | Adaptive fuel strategy for a hybrid electric vehicle |
US20180018757A1 (en) * | 2016-07-13 | 2018-01-18 | Kenji Suzuki | Transforming projection data in tomography by means of machine learning |
- 2019-11-06 CN CN201911076016.XA patent/CN110716550B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107797534A (en) * | 2017-09-30 | 2018-03-13 | 安徽江淮汽车集团股份有限公司 | A kind of pure electronic automated driving system |
CN108407797A (en) * | 2018-01-19 | 2018-08-17 | 洛阳中科龙网创新科技有限公司 | A method of the realization agricultural machinery self shifter based on deep learning |
CN110244701A (en) * | 2018-03-08 | 2019-09-17 | 通用汽车环球科技运作有限责任公司 | Method and apparatus for reinforcement learning of autonomous vehicles based on automatically generated sequence of lessons |
CN110136481A (en) * | 2018-09-20 | 2019-08-16 | 初速度(苏州)科技有限公司 | A kind of parking strategy based on deeply study |
CN109325624A (en) * | 2018-09-28 | 2019-02-12 | 国网福建省电力有限公司 | A method for forecasting monthly electricity demand based on deep learning |
CN109991856A (en) * | 2019-04-25 | 2019-07-09 | 南京理工大学 | An Integrated Coordinated Control Method for Robotic Driving Vehicles |
Non-Patent Citations (1)
Title |
---|
Design of a Tractor-Driving Robot and Research on Human-Machine Cooperation Methods; Lu Wei; Journal of Nanjing University of Information Science & Technology; 2019-03-28; pp. 165-173 *
Also Published As
Publication number | Publication date |
---|---|
CN110716550A (en) | 2020-01-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110716550B (en) | Gear shifting strategy dynamic optimization method based on deep reinforcement learning | |
CN108087541B (en) | Multi-performance comprehensive optimal gear decision system of automobile stepped automatic transmission | |
CN110936824B (en) | A dual-motor control method for electric vehicles based on adaptive dynamic programming | |
Kwon et al. | Multi-objective gear ratio and shifting pattern optimization of multi-speed transmissions for electric vehicles considering variable transmission efficiency | |
CN109591659B (en) | Intelligent learning pure electric vehicle energy management control method | |
CN111267831A (en) | An intelligent variable time domain model prediction energy management method for hybrid electric vehicles | |
CN110985566B (en) | Vehicle starting control method and device, vehicle and storage medium | |
CN112943914B (en) | Vehicle gear shifting line determining method and device, computer equipment and storage medium | |
CN106080585B (en) | A Nonlinear Model Predictive Control Method for Double Planetary Hybrid Electric Vehicle | |
CN115793445B (en) | Hybrid electric vehicle control method based on multi-agent deep reinforcement learning | |
CN111619545A (en) | Hybrid electric vehicle energy management method based on traffic information | |
CN110792762B (en) | A prospective shift control method for commercial vehicles in cruise mode | |
CN110550034A (en) | two-gear AMT comprehensive gear shifting method for pure electric vehicle | |
You et al. | Shift strategy of a new continuously variable transmission based wheel loader | |
CN109733406A (en) | Policy control method is travelled based on the pure electric automobile of fuzzy control and Dynamic Programming | |
CN110594317A (en) | A launch control strategy based on dual-clutch automatic transmission | |
CN111597750A (en) | Hybrid electric vehicle energy management method based on BP neural network | |
CN113104023A (en) | Distributed MPC-based networked hybrid electric vehicle energy management system and method | |
CN113581163B (en) | Multimode PHEV mode switching optimization and energy management method based on LSTM | |
CN113110052B (en) | A Hybrid Energy Management Approach Based on Neural Networks and Reinforcement Learning | |
CN112765723A (en) | Curiosity-driven hybrid power system deep reinforcement learning energy management method | |
CN106828500A (en) | Electric automobile geared automatic transmission schedule optimization method | |
CN113997926A (en) | Parallel hybrid electric vehicle energy management method based on layered reinforcement learning | |
Montazeri-Gh et al. | Simultaneous design of the gear ratio and gearshift strategy for a parallel hybrid electric vehicle equipped with AMT | |
Zou et al. | Research on shifting process control of automatic transmission |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |