CN110716550A - Gear shifting strategy dynamic optimization method based on deep reinforcement learning - Google Patents

Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Info

Publication number
CN110716550A
CN110716550A (Application No. CN201911076016.XA)
Authority
CN
China
Prior art keywords
network
gear shifting
shifting strategy
predicted
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911076016.XA
Other languages
Chinese (zh)
Other versions
CN110716550B (en)
Inventor
陈刚
袁靖
张介
顾爱博
周楠
王和荣
苏树华
陈守宝
王良模
王陶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN201911076016.XA priority Critical patent/CN110716550B/en
Publication of CN110716550A publication Critical patent/CN110716550A/en
Application granted granted Critical
Publication of CN110716550B publication Critical patent/CN110716550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Abstract

The invention belongs to the field of engineering machinery and vehicle engineering, and particularly relates to a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method comprises the following steps: (1): determining a gear shifting strategy state input variable and an action output variable; (2): determining a Markov decision process of a gear shifting strategy according to the state input variable and the action output variable; (3): establishing a reinforcement learning gear shifting strategy reward function according to a gear shifting strategy target; (4): solving a deep reinforcement learning gear shifting strategy according to a Markov decision process and a reward function; (5): putting the predicted Q network calculated in the step (4) into a gear shifting strategy controller, and selecting gears of the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle; (6): the predictive Q network is updated periodically during travel. According to the invention, the gear shifting strategy is updated by a deep reinforcement learning method, so that the dynamic optimization of the gear shifting strategy is realized.

Description

Gear shifting strategy dynamic optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the field of engineering machinery and vehicle engineering, and particularly relates to a gear shifting strategy dynamic optimization method based on deep reinforcement learning.
Background
The gear shifting strategy is one of the core technologies of existing engineering machinery and vehicle control, and refers to the rule by which gears change with selected parameters while the engineering machinery and the vehicle are driving. Establishing a gear shifting strategy mainly comes down to choosing a solving method; the solving methods of the gear shifting strategy include the graphical method, the analytical method, genetic algorithms, dynamic programming and the like. Solving and optimizing the gear shifting strategy are the core directions of gear shifting strategy research, and the dynamic optimization of the gear shifting strategy in particular is one of its difficulties.
In "Correction of the optimal dynamic AMT gear shifting schedule based on variable load" (Li Hao, Control Engineering of China, Vol. 22, No. 1, pp. 50-54, January 2015), acceleration is introduced as a gear shifting parameter on the basis of a two-parameter gear shifting strategy, realizing dynamic three-parameter gear shifting that takes acceleration into account. The solving method is analytical: an acceleration-speed curve has to be fitted for every accelerator opening, so the solution is complex and computationally heavy; moreover, it can only be solved for a single performance index and cannot be dynamically optimized for the actual running condition.
"Performance Evaluation Application for Individualized Gearshift Schedule Optimization" (Yin X., May 2016) optimizes the gear shifting strategy with a genetic algorithm, improving its comprehensive performance and overcoming the analytical method's restriction to a single performance index, but it still cannot be dynamically optimized for the actual driving condition.
"Optimal gear shift strategies for fuel economy and driveability" (Viet Dac Ngo, Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering, Vol. 227, No. 10, pp. 1398-1413, October 2013) solves the gear shifting strategy for a specific driving cycle by dynamic programming. The disadvantages are that solving the shift schedule by dynamic programming requires constructing a complex state diagram expressed in table form; the complexity of the state diagram depends on the degree of discretization in the dynamic programming algorithm, and an overly complex state diagram may converge slowly, or fail to converge, because of the Bellman curse of dimensionality. In addition, because the optimization is carried out for a specific driving cycle, dynamic optimization during driving is impossible.
Among existing patents, patent application No. 201710887558.X discloses a method for optimizing the shift schedule of an automobile with a dynamic programming algorithm; in its embodiment, economy-based and dynamics-based shift schedules are established respectively. Solving the shift schedule by dynamic programming requires constructing a complex state diagram whose complexity depends on the degree of discretization in the dynamic programming algorithm, and an overly complex state diagram may converge slowly, or fail to converge, because of the Bellman curse of dimensionality. It also cannot be dynamically optimized according to the actual running condition.
Patent application No. 201811306659.4 discloses a shift strategy correction method and system based on driving intention. The current gear shifting correction coefficient and compensation offset are updated according to the driver's driving process, the original gear shifting strategy is corrected, and dynamic updating of the gear shifting strategy is realized. However, the dynamic updating rule of the gear shifting strategy has to be established manually, the optimization effect is strongly affected by that manual design, and the method is not universal and can only be used for a single vehicle type, so its degree of intelligence is low.
In general, most existing gear shifting strategy solving or optimizing methods cannot perform dynamic optimization for the actual driving conditions and have poor self-adaptive capacity. Those gear shifting strategies that can be dynamically optimized require manually established dynamic updating rules, and their intelligence and universality are low.
Disclosure of Invention
The invention aims to provide a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method constructs a Markov decision process and a reward function of the gear shifting strategy, solves the gear shifting strategy by a deep reinforcement learning method, and puts the resulting predicted Q network into a gear shifting strategy controller to realize gear selection; at the same time, driving data are collected during daily driving and the gear shifting strategy is updated by the deep reinforcement learning method, realizing dynamic optimization of the gear shifting strategy.
The technical solution for realizing the purpose of the invention is as follows: a gear shifting strategy dynamic optimization method based on deep reinforcement learning comprises the following steps:
step (1): determining a gear shifting strategy state input variable and an action output variable;
step (2): determining a Markov decision process of a gear shifting strategy according to the state input variable and the action output variable in the step (1);
step (3): establishing a reinforcement learning gear shifting strategy reward function according to a gear shifting strategy target;
step (4): solving the deep reinforcement learning gear shifting strategy according to the Markov decision process in step (2) and the reward function in step (3); firstly, Markov chains are calculated through the Markov decision process and the reward function and stored in an experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated according to the data in the experience pool;
step (5): putting the predicted Q network calculated in step (4) into a gear shifting strategy controller; during driving, the engineering machinery and the vehicle select gears according to the gear shifting strategy controller;
step (6): during driving, collecting the driving data of the engineering machinery and the vehicle and storing them in the experience pool, periodically updating the predicted Q network, and putting the updated predicted Q network back into the gear shifting strategy controller, so as to realize dynamic optimization of the gear shifting strategy.
Further, the state input variables in step (1) comprise the vehicle speed v, the acceleration, the accelerator opening α_t, the running gradient and the ground friction resistance coefficient; the action output variables comprise a gear operation or a shift operation, where a gear operation is an upshift, a downshift or holding the current gear, and a shift operation is the selected gear n_g.
Further, the Markov decision process of the gear shifting strategy in step (2) is expressed as a transfer function that maps the current state and the selected action to the state at the next moment; the transfer function has the form:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next moment, s_t is the current state variable and a_t is the selected action variable, with s ∈ S and a ∈ A, S being the set of state variables and A being the set of action variables.
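For illustration, the following minimal Python sketch (not part of the original disclosure) shows the interface implied by such a transfer function: a state built from the variables of step (1), an action taken from the gear set, and a function returning the next state. The names (State, transfer) and the simple kinematic update inside the function are assumptions; the actual T of the invention is the vehicle model given in the embodiment.

    from dataclasses import dataclass

    @dataclass
    class State:
        v: float       # vehicle speed
        dv: float      # acceleration
        alpha: float   # accelerator opening
        grade: float   # running gradient
        mu: float      # ground friction resistance coefficient

    def transfer(s: State, gear: int, dt: float = 0.1) -> State:
        """Placeholder for s_{t+1} = T(s_t, a_t): advance the state by one control step."""
        v_next = max(0.0, s.v + s.dv * dt)   # simple Euler step, illustration only
        return State(v=v_next, dv=s.dv, alpha=s.alpha, grade=s.grade, mu=s.mu)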
Further, the shift strategy reward function in step (3) is positively correlated with shift strategy objectives, including power, economy and comfort.
Further, the gear shifting strategy target is a dynamic gear shifting strategy, described as the engineering machinery and the vehicle reaching the highest speed in the shortest time t under the comfort constraint; the reward and punishment mechanism is as follows:
[reward and punishment mechanism equation, given in the original as an image]
where r is the reward calculated by the reward and punishment mechanism; r_t is the temporary reward, r_t = -0.001·||v_Tmax - v||; v_Tmax is the maximum vehicle speed at the current accelerator opening α_t; j is the impact degree of the engineering machinery and the vehicle; and j_max is the designed maximum allowable impact degree.
Further, the Markov chain in step (4) has the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward calculated from the reward target.
Further, the deep reinforcement learning method in step (4) includes two neural networks with the same structure but different parameters, which are called a predicted Q network and a target Q network, wherein the predicted Q network is used for calculating a Q value of each action in the current state, and the target Q network is used for updating the predicted Q network.
Further, when the Markov chain is established in step (4), the action variable a_t is selected by a greedy algorithm, which is expressed as:
[greedy action-selection rule, given in the original as an image]
where Q_p is the predicted Q network, θ_p are the predicted Q network parameters, and e is the greedy algorithm parameter;
in step (4), the Markov chain is saved into an experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated according to the data in the experience pool; the predicted Q network is used to calculate the Q values of the gear set A in the driving state s_t, and the output of the predicted Q network is Q_p(s, A, θ_p).
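Because the greedy-selection formula appears in the original only as an image, the Python sketch below assumes the standard ε-greedy form: with probability 1 - e the gear with the largest predicted Q value is exploited, otherwise a random gear is explored, and the resulting Markov chains are kept in a bounded experience pool. The function name q_predict and the pool capacity are illustrative assumptions.

    import random
    from collections import deque

    experience_pool = deque(maxlen=10000)   # stores Markov chains <s_t, a_t, r_t, s_{t+1}>

    def epsilon_greedy(q_predict, state_vec, gears, e):
        """Assumed epsilon-greedy rule: exploit argmax_a Q_p(s, a, theta_p) with
        probability 1 - e, explore a random gear from A with probability e."""
        if random.random() < e:
            return random.choice(gears)
        q_values = q_predict(state_vec)              # one Q value per gear in A
        best = max(range(len(gears)), key=lambda i: q_values[i])
        return gears[best]

    def store_transition(s_t, a_t, r_t, s_next):
        """Save one Markov chain into the experience pool."""
        experience_pool.append((s_t, a_t, r_t, s_next))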
Further, in step (5), during driving the engineering machinery and the vehicle select a gear according to the gear shifting strategy controller, and the controller selects the appropriate gear a* according to the predicted Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
Further, the driving data collected in step (6) comprise: vehicle speed, accelerator opening, acceleration, running gradient and ground friction resistance coefficient;
two methods are available for updating the predicted Q network in step (6): the first is to reconstruct the transfer function of step (2) from the driving data of the engineering machinery and the vehicle and then update the predicted Q network according to steps (3) and (4); the second is to update the predicted Q network directly according to the predicted Q network updating method of step (4);
in the first method, the transfer function of step (2) is reconstructed from the collected driving data either by recalculating the parameters of the transfer function, giving a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting or a Fourier transform method;
in the second method, the collected driving data are used directly with the predicted Q network updating method of step (4), which is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))²
where γ is the reward discount value; α is the neural network learning rate; Q_t is the target Q network and θ_t are the target Q network parameters.
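Although the patent does not prescribe a particular implementation, the update can be sketched as follows (an assumed PyTorch realization, for illustration only): the squared term corresponds to the squared TD error that is minimized by gradient descent with learning rate α, with the target Q network supplying the TD target.

    import torch
    import torch.nn.functional as F

    def update_predicted_q(q_p, q_t, optimizer, batch, gamma):
        """One update of the predicted Q network Q_p from a minibatch of Markov chains.
        The TD target r + gamma * max_a Q_t(s', a, theta_t) comes from the target
        network; the squared TD error is the quantity in the update formula above."""
        s, a, r, s_next = batch                      # tensors: states, gear indices, rewards, next states
        q_sa = q_p(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            td_target = r + gamma * q_t(s_next).max(dim=1).values
        loss = F.mse_loss(q_sa, td_target)           # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()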
Compared with the prior art, the invention has the remarkable advantages that:
(1) by adopting the deep reinforcement learning method, the predicted Q network is updated through Markov chains built, from the Markov decision process and the reward function, during the driving of the engineering machinery and the vehicle; this covers both the solving and the dynamic optimization of the gear shifting strategy and gives the method strong self-adaptive capacity;
(2) with the deep reinforcement learning method, the solving and dynamic-optimization steps share a uniform algorithm that does not depend on the controlled object itself, so the method is applicable to different vehicle types such as passenger vehicles, engineering machinery and vehicles, special vehicles and electric vehicles; the transfer function can be fitted with a neural network, linear fitting or a Fourier transform method, again independently of the object to which the method is applied, so the method has strong universality;
(3) the gear shifting strategy is solved and dynamically optimized by the deep reinforcement learning method; the algorithm is not influenced by the controlled object and at the same time realizes dynamic optimization of the gear shifting strategy, so the method has strong intelligence;
(4) gears are selected with the predicted Q network of deep reinforcement learning, replacing the table form of traditional methods; because the neural network has strong fitting capacity and suits gear shifting strategies with high-dimensional state variables, the problem of the Bellman curse of dimensionality is avoided.
Drawings
FIG. 1 is a schematic diagram of a shift strategy dynamic optimization method based on deep reinforcement learning.
FIG. 2 is a flow chart for solving a deep reinforcement learning shift strategy according to the present invention.
FIG. 3 is a diagram of a neural network architecture model employed in the present invention.
FIG. 4 is a process diagram for dynamic optimization of the shift strategy of the present invention.
Detailed Description
The invention provides a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method constructs a Markov decision process and a reward function of the gear shifting strategy and then solves the gear shifting strategy by a deep reinforcement learning method. The predicted Q network obtained in this way is put into a gear shifting strategy controller to realize gear selection. At the same time, driving data are collected during daily driving and the gear shifting strategy is updated by the deep reinforcement learning method, realizing dynamic optimization of the gear shifting strategy.
A gear shifting strategy dynamic optimization method based on deep reinforcement learning comprises the following steps:
step one, determining a gear shifting strategy state variable and an action variable.
And step two, determining a Markov decision process of the gear shifting strategy according to the state input variable and the action output variable.
And step three, establishing a reinforcement learning gear shifting strategy reward function according to the gear shifting strategy optimization target.
And step four, solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step two and the reward function in the step three. Firstly, a Markov chain is calculated through an established Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool.
And step five, putting the predicted Q network calculated in the step four into a gear shifting strategy controller, and selecting gears by the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle.
And step six, in the driving process, collecting the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished so as to realize dynamic optimization of the gear shifting strategy.
Further, in the step one, the shift strategy state variables are engineering machinery and vehicle running state variables or external environment variables. The action variable is a gear operation or a shift operation. The gear operation comprises an upshift, a downshift or a gear holding; the shift operation is the selected gear.
In the second step, the Markov decision process of the gear shifting strategy is expressed as a transfer function T that maps the current state and the selected action to the state at the next moment. The transfer function is of the form:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next moment, s_t is the current state variable and a_t is the selected action variable, with s ∈ S and a ∈ A; S is the set of state variables and A is the set of action variables. In the gear shifting strategy, the state variables are the running state variables of the engineering machinery and the vehicle or external environment variables, including vehicle speed, accelerator opening, acceleration, running gradient and ground friction resistance coefficient. The action variable is a gear operation or a shift operation.
In the third step, the established gear shifting strategy reward function is positively correlated with the gear shifting target.
In the third step, the shift target includes power, economy and comfort.
In the fourth step, a Markov chain is calculated through the established Markov decision process and the reward function. The Markov chain is of the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward calculated from the reward target.
In the fourth step, when the Markov chain is established, the action a_t is selected by a greedy algorithm, which is expressed as:
[greedy action-selection rule, given in the original as an image]
where Q_p is the predicted Q network, θ_p are the predicted Q network parameters, and e is the greedy algorithm parameter.
In the fourth step, the Markov chain is stored into an experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated according to the data in the experience pool. The predicted Q network calculates the Q values of the gear set A in the driving state s_t; the output of the predicted Q network is Q_p(s, A, θ_p). The updating method of the predicted Q network is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))²
where γ is the reward discount value; α is the neural network learning rate; Q_t is the target Q network and θ_t are the target Q network parameters.
In the fifth step, the engineering machinery and the vehicle select gears according to the gear shifting strategy controller during driving. The gear shift controller selects the appropriate gear a* based on the predicted Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
In the sixth step, the collecting the driving data includes: vehicle speed, accelerator opening, acceleration, travel grade and ground friction resistance coefficient.
In the sixth step, two methods are available for updating the predicted Q network. The first method is to reconstruct the transfer function of the second step from the driving data of the engineering machinery and the vehicle, and then update the predicted Q network according to the third step and the fourth step. The second method is to update the predicted Q network directly according to the predicted Q network updating method of the fourth step.
In the first method, the transfer function of the second step is reconstructed from the collected driving data of the engineering machinery and the vehicle; the reconstruction either recalculates the parameters of the transfer function, giving a transfer function with the same structure but different parameters, or fits the transfer function with a neural network, linear fitting, a Fourier transform method and the like.
In the second method, the collected driving data of the engineering machinery and the vehicle are used directly with the predicted Q network updating method of the fourth step, which is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))²
in the sixth step, the dynamic optimization of the shift strategy is realized by predicting the update of the Q network in the deep reinforcement learning.
Examples
The invention provides a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method constructs a Markov decision process of the gear shifting strategy and then solves the gear shifting strategy by a deep reinforcement learning method. After solving is completed, the predicted Q network trained by deep reinforcement learning is put into a gear shifting strategy controller to realize gear selection. Then, during driving, the predicted Q network is updated from collected driving data of the engineering machinery and the vehicle, so as to realize dynamic optimization of the gear shifting strategy. Two updating methods are used for the predicted Q network: reconstructing the gear shifting strategy transfer function from the driving data of the engineering machinery and the vehicle and then updating the predicted Q network, or directly updating the predicted Q network by the deep reinforcement learning method. The principle of the gear shifting strategy dynamic optimization method based on deep reinforcement learning is shown in fig. 1, and the method comprises the following steps:
step one, determining a gear shifting strategy state variable and an action variable.
And step two, determining a Markov decision process of the gear shifting strategy according to the state input variable and the action output variable.
And step three, establishing a reinforcement learning gear shifting strategy reward function according to the gear shifting strategy optimization target.
And step four, solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step two and the reward function in the step three. Firstly, a Markov chain is calculated through an established Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool.
And step five, putting the predicted Q network calculated in the step four into a gear shifting strategy controller, and selecting gears by the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle.
And step six, in the driving process, collecting the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished so as to realize dynamic optimization of the gear shifting strategy.
The technical solution of the present invention is described below with reference to the accompanying drawings and examples.
Step one, determining a gear shifting strategy state variable and an action variable. In the embodiment, the gear shifting strategy state variables are the vehicle speed v, the acceleration and the accelerator opening α_t; the action variable is the gear n_g.
Step two, determining the Markov decision process of the gear shifting strategy. In the embodiment, the Markov decision process is determined from the state variables (vehicle speed, acceleration, accelerator opening) and the action variable (gear). The Markov decision process state transfer function T is:
[state transfer function equation, given in the original as an image]
where T_e is the engine output torque; i_g is the transmission ratio corresponding to gear n_g; i_0 is the final drive ratio; η_t is the driveline efficiency; m is the total vehicle mass; β is the equivalent gradient resistance coefficient; C_d is the air resistance coefficient; A is the vehicle frontal area; F_b is the braking force; r is the effective rolling radius of the tire; and ρ is the air density.
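Since the embodiment's transfer function appears only as an image, the Python sketch below reconstructs a plausible longitudinal-dynamics form from the listed symbols: traction force from the engine torque through the gearbox and final drive, minus braking force, equivalent gradient resistance and aerodynamic drag, divided by the vehicle mass. The exact equation of the patent may differ (for example in how rotating masses are handled), so this is an assumption for illustration only.

    def acceleration(v, T_e, i_g, i_0, eta_t, F_b, m, beta, C_d, A, rho, r, g=9.81):
        """Assumed longitudinal dynamics built from the listed symbols: returns dv/dt."""
        F_drive = T_e * i_g * i_0 * eta_t / r      # traction force at the wheels
        F_grade = m * g * beta                     # equivalent gradient resistance
        F_air = 0.5 * rho * C_d * A * v ** 2       # aerodynamic drag
        return (F_drive - F_b - F_grade - F_air) / m

    def next_speed(v, dv_dt, dt=0.1):
        """Euler step of the speed over one control interval (dt is an assumed value)."""
        return max(0.0, v + dv_dt * dt)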
Step three, establishing the reinforcement learning gear shifting strategy reward function according to the gear shifting target. In the present embodiment, the learning objective is a dynamic gear shifting strategy, which requires the engineering machinery and the vehicle to reach the highest speed in the shortest time t under the comfort constraint. The reward and punishment mechanism is as follows:
[reward and punishment mechanism equation, given in the original as an image]
where r is the reward calculated by the reward and punishment mechanism; r_t is the temporary reward, r_t = -0.001·||v_Tmax - v||; v_Tmax is the maximum vehicle speed at the current accelerator opening α_t; j is the impact degree of the engineering machinery and the vehicle; and j_max is the designed maximum allowable impact degree.
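The reward and punishment mechanism itself is given in the original only as an image; the sketch below therefore only combines the pieces stated in the text: the temporary reward r_t = -0.001·||v_Tmax - v|| and a punishment when the impact degree j exceeds the allowable j_max. The branch structure and the penalty value are assumptions.

    def reward(v, v_Tmax, j, j_max, penalty=-1.0):
        """Assumed reward/punishment: pull the speed toward v_Tmax, punish excessive jerk."""
        r_t = -0.001 * abs(v_Tmax - v)     # temporary reward as stated in the text
        if abs(j) > j_max:                 # comfort constraint violated
            return r_t + penalty           # penalty value is an assumption
        return r_t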
And step four, solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step two and the reward function in the step three. Firstly, a Markov chain is calculated through an established Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool. The flow of step four is shown in fig. 2. The specific steps are as follows.
The first step: initialize the state variable and the action variable, and calculate the state at the next moment according to the established Markov decision process transfer function.
The second step: calculate the reward through the designed reward and punishment mechanism.
The third step: save the above state, action, reward and next-moment state into the experience pool in the form of a Markov chain.
The fourth step: take the state at the next moment as the current state, let the predicted Q network calculate the Q value of each action for the current state, and let the greedy algorithm determine the actually selected gear in the current state from these Q values; then return to the first step and repeat.
In the above steps, when the number of markov chains in the experience pool reaches a predetermined number, updating of the predictive Q network is started.
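A compact Python sketch of this four-step loop of fig. 2 follows (names such as select_action and the threshold batch_ready are assumptions; the transfer, reward and Q-network-update routines stand for the ones sketched elsewhere in this description):

    def collect_experience(select_action, transfer, reward_fn, s0, steps, pool, batch_ready=500):
        """Roll the Markov decision process forward, store Markov chains, and flag when
        the experience pool is full enough to start updating the predicted Q network."""
        s = s0
        for _ in range(steps):
            a = select_action(s)                     # greedy-algorithm gear choice
            s_next = transfer(s, a)                  # first step: next state from T
            r = reward_fn(s, a, s_next)              # second step: reward/punishment
            pool.append((s, a, r, s_next))           # third step: save <s_t, a_t, r_t, s_{t+1}>
            if len(pool) >= batch_ready:
                pass  # predetermined number reached: sample a minibatch and update Q_p here
            s = s_next                               # fourth step: next state becomes current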
The updating process of the predicted Q network is completed by the predicted Q network and the target Q network together, and the updating method of the predicted Q network comprises the following steps:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))²
where γ is the reward discount value; α is the neural network learning rate; Q_t is the target Q network and θ_t are the target Q network parameters.
During the updating of the predicted Q network, the parameters of the predicted Q network are periodically copied into the target Q network, which realizes the updating of the target Q network.
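Assuming the two networks are PyTorch modules, the periodic copy of the predicted Q network parameters θ_p into the target Q network θ_t can be sketched as:

    def sync_target_network(q_p, q_t):
        """Copy theta_p into theta_t; called every fixed number of updates (the interval is a design choice)."""
        q_t.load_state_dict(q_p.state_dict())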
The predicted Q network and the target Q network have the same neural network structure. In the present embodiment, the neural network structure model adopted by the predicted Q network and the target Q network is shown in fig. 3. The neural network structure model has five fully connected layers as intermediate layers and adopts the linear rectification function ReLU as the activation function. The output of a fully connected layer with ReLU activation is expressed as:
ReLU(x) = max(0, Wx + b)
where W is the weight matrix of the neural network; b is the bias vector of the neural network; x is the neural network input.
In the present embodiment, the neural network input is the state variable (vehicle speed, accelerator opening, acceleration), and the output layer gives the Q value corresponding to every gear n_g. The larger the Q value, the larger the maximum discounted cumulative reward that can be obtained by selecting the corresponding gear in the current state.
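A sketch of the described network structure (an assumed PyTorch implementation; the hidden-layer width is not stated in the patent and is chosen here only for illustration): three state inputs, five fully connected intermediate layers with ReLU activations, and one output Q value per gear.

    import torch.nn as nn

    class ShiftQNetwork(nn.Module):
        """Q network: input (vehicle speed, accelerator opening, acceleration),
        five fully connected hidden layers with ReLU, output one Q value per gear n_g."""
        def __init__(self, n_gears: int, hidden: int = 64):
            super().__init__()
            layers, width = [], 3                    # three state inputs
            for _ in range(5):                       # five fully connected intermediate layers
                layers += [nn.Linear(width, hidden), nn.ReLU()]
                width = hidden
            layers.append(nn.Linear(width, n_gears)) # Q value for every gear
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)

The predicted Q network and the target Q network would then be two instances of such a structure holding separate parameters θ_p and θ_t.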
And step five, putting the predicted Q network calculated in the step four into a gear shifting strategy controller, and selecting gears by the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle.
In the fifth step, the engineering machinery and the vehicle select gears according to the gear shifting strategy controller. The concrete expression is:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
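In deployed form, the controller's gear choice a*(s) can be sketched as follows (q_predict stands for the trained predicted Q network; the name is an assumption):

    def select_gear(q_predict, state_vec, gears):
        """Deterministic controller choice: the gear with the largest predicted Q value."""
        q_values = q_predict(state_vec)
        best = max(range(len(gears)), key=lambda i: q_values[i])
        return gears[best]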
And step six, in the driving process, collecting the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished so as to realize dynamic optimization of the gear shifting strategy.
In step six, two methods are available for updating the predicted Q network. The first method is to reconstruct the transfer function of the second step from the driving data of the engineering machinery and the vehicle, and then update the predicted Q network according to the third step and the fourth step. The second method is to update the predicted Q network directly according to the predicted Q network updating method of the fourth step.
In the sixth step, the first method for updating the predicted Q network reconstructs the transfer function of the second step from the collected driving data of the engineering machinery and the vehicle; the reconstruction either recalculates the parameters of the transfer function, giving a transfer function with the same structure but different parameters, or fits the transfer function with a neural network, linear fitting, a Fourier transform method and the like. In this embodiment, the reconstructed transfer function is:
[reconstructed transfer function equation, given in the original as an image]
Depending on the reconstruction method, in this embodiment the parameters of the transfer function may be recalculated, or the transfer function may be fitted with a neural network, linear fitting or a Fourier transform. Whichever form of reconstruction is used, the reconstructed transfer function can be uniformly expressed as:
s_{t+1} = T_new(s_t, a_t, Θ)
where Θ is the set of transfer function parameters.
After the reconstruction is finished, the fourth step and the fifth step need to be carried out again to obtain a new prediction Q network.
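One concrete way to carry out such a reconstruction is sketched below: a linear least-squares fit of the next state from the logged driving data, standing in for the recalculated-parameter or neural-network/Fourier fits mentioned above. The linear form is only an illustrative assumption, not the patent's prescribed choice.

    import numpy as np

    def fit_transfer_function(states, actions, next_states):
        """Fit T_new(s_t, a_t, Theta) from logged driving data by linear least squares.
        states: (N, d_s) array, actions: (N, d_a) array, next_states: (N, d_s) array."""
        X = np.hstack([states, actions, np.ones((len(states), 1))])   # [s_t, a_t, 1]
        Theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)

        def T_new(s, a):
            x = np.concatenate([np.atleast_1d(s), np.atleast_1d(a), [1.0]])
            return x @ Theta

        return T_new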
In the sixth step, the second method for updating the predicted Q network is to collect the driving data of the engineering machinery and the vehicle and then update the predicted Q network according to the updating method of the fourth step, realizing the dynamic optimization of the gear shifting strategy; the dynamic optimization process is shown in fig. 4. The specific process is as follows:
the first step is as follows: collecting engineering machinery and vehicle running data
The second step is that: the collected driving data of the engineering machinery and the vehicle are processed, and the processed data are expressed in a Markov chain form and can be expressed as follows:
<st,at,rt,st+1>
the third step: the updating of the predicted Q network is completed by the predicted Q network and the target Q network together, and the method comprises the following steps:
Qp(s,a,θp)=Qp(s,a,θp)+α(r+γmaxaQt(s,a,θt)-Qp(s,a,θp))2
in addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the claims of the present invention.

Claims (10)

1. A gear shifting strategy dynamic optimization method based on deep reinforcement learning is characterized by comprising the following steps:
step (1): determining a gear shifting strategy state input variable and an action output variable;
step (2): determining a Markov decision process of a gear shifting strategy according to the state input variable and the action output variable in the step (1);
step (3): establishing a reinforcement learning gear shifting strategy reward function according to a gear shifting strategy target;
step (4): solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in step (2) and the reward function in step (3); firstly, Markov chains are calculated through the Markov decision process and the reward function and stored in an experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated according to the data in the experience pool;
step (5): putting the predicted Q network calculated in step (4) into a gear shifting strategy controller; during driving, the engineering machinery and the vehicle select gears according to the gear shifting strategy controller;
step (6): during driving, collecting the driving data of the engineering machinery and the vehicle and storing them in the experience pool, periodically updating the predicted Q network, and putting the updated predicted Q network back into the gear shifting strategy controller, so as to realize dynamic optimization of the gear shifting strategy.
2. The method of claim 1, wherein the state input variables in step (1) include the vehicle speed v, the acceleration, the accelerator opening α_t, the running gradient and the ground friction resistance coefficient; the action output variables include a gear operation or a shift operation, wherein a gear operation is upshifting, downshifting or holding the current gear, and a shift operation is the selected gear n_g.
3. The method of claim 2, wherein the Markov decision process of the gear shifting strategy in step (2) is expressed as a transfer function that maps the current state and the selected action to the state at the next moment, the transfer function being of the form:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next moment, s_t is the current state variable and a_t is the selected action variable, with s ∈ S and a ∈ A, S being the set of state variables and A being the set of action variables.
4. The method of claim 3, wherein the shift strategy reward function of step (3) is positively correlated with shift strategy objectives, including power, economy, and comfort.
5. The method according to claim 4, wherein the gear shifting strategy target is a dynamic gear shifting strategy, described as the engineering machinery and the vehicle reaching the highest speed in the shortest time t under the comfort constraint, and the reward and punishment mechanism is as follows:
[reward and punishment mechanism equation, given in the original as an image]
where r is the reward calculated by the reward and punishment mechanism; r_t is the temporary reward, r_t = -0.001·||v_Tmax - v||; v_Tmax is the maximum vehicle speed at the current accelerator opening α_t; j is the impact degree of the engineering machinery and the vehicle; and j_max is the designed maximum allowable impact degree.
6. The method of claim 4, wherein the Markov chain of step (4) is of the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward calculated from the reward target.
7. The method according to claim 6, wherein the deep reinforcement learning method in step (4) comprises two neural networks with the same structure but different parameters, namely a predicted Q network and a target Q network, wherein the predicted Q network is used for calculating Q values of actions in the current state, and the target Q network is used for updating the predicted Q network.
8. The method of claim 7, wherein, when the Markov chain is established in step (4), the action variable a_t is selected by a greedy algorithm, which is expressed as:
[greedy action-selection rule, given in the original as an image]
where Q_p is the predicted Q network, θ_p are the predicted Q network parameters, and e is the greedy algorithm parameter;
in step (4), the Markov chain is saved into an experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated according to the data in the experience pool; the predicted Q network is used to calculate the Q values of the gear set A in the driving state s_t, and the output of the predicted Q network is Q_p(s, A, θ_p).
9. The method according to claim 8, wherein in step (5), during driving the engineering machinery and the vehicle select gears according to the gear shifting strategy controller, and the controller selects the appropriate gear a* according to the predicted Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
10. The method of claim 9, wherein the driving data collected in step (6) comprise: vehicle speed, accelerator opening, acceleration, running gradient and ground friction resistance coefficient;
two methods are available for updating the predicted Q network in step (6): the first is to reconstruct the transfer function of step (2) from the driving data of the engineering machinery and the vehicle and then update the predicted Q network according to steps (3) and (4); the second is to update the predicted Q network directly according to the predicted Q network updating method of step (4);
in the first method, the transfer function of step (2) is reconstructed from the collected driving data either by recalculating the parameters of the transfer function, giving a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting or a Fourier transform method;
in the second method, the collected driving data are used directly with the predicted Q network updating method of step (4), which is:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))²
where γ is the reward discount value; α is the neural network learning rate; Q_t is the target Q network and θ_t are the target Q network parameters.
CN201911076016.XA 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning Active CN110716550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911076016.XA CN110716550B (en) 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911076016.XA CN110716550B (en) 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN110716550A true CN110716550A (en) 2020-01-21
CN110716550B CN110716550B (en) 2022-07-22

Family

ID=69213797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911076016.XA Active CN110716550B (en) 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN110716550B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111487863A (en) * 2020-04-14 2020-08-04 东南大学 Active suspension reinforcement learning control method based on deep Q neural network
CN111882030A (en) * 2020-06-29 2020-11-03 武汉钢铁有限公司 Ingot adding strategy method based on deep reinforcement learning
CN111965981A (en) * 2020-09-07 2020-11-20 厦门大学 Aeroengine reinforcement learning control method and system
CN112395690A (en) * 2020-11-24 2021-02-23 中国人民解放军海军航空大学 Reinforced learning-based shipboard aircraft surface guarantee flow optimization method
CN112861269A (en) * 2021-03-11 2021-05-28 合肥工业大学 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN114662982A (en) * 2022-04-15 2022-06-24 四川大学 Urban power distribution network multi-stage dynamic reconstruction method based on machine learning
CN116069014A (en) * 2022-11-16 2023-05-05 北京理工大学 Vehicle automatic control method based on improved deep reinforcement learning


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020079149A1 (en) * 2000-12-21 2002-06-27 Kotre Stephen John Adaptive fuel strategy for a hybrid electric vehicle
US20180018757A1 (en) * 2016-07-13 2018-01-18 Kenji Suzuki Transforming projection data in tomography by means of machine learning
CN107797534A (en) * 2017-09-30 2018-03-13 安徽江淮汽车集团股份有限公司 A kind of pure electronic automated driving system
CN108407797A (en) * 2018-01-19 2018-08-17 洛阳中科龙网创新科技有限公司 A method of the realization agricultural machinery self shifter based on deep learning
CN110244701A (en) * 2018-03-08 2019-09-17 通用汽车环球科技运作有限责任公司 The method and apparatus of intensified learning for the autonomous vehicle based on the course sequence automatically generated
CN110136481A (en) * 2018-09-20 2019-08-16 初速度(苏州)科技有限公司 A kind of parking strategy based on deeply study
CN109325624A (en) * 2018-09-28 2019-02-12 国网福建省电力有限公司 A kind of monthly electric power demand forecasting method based on deep learning
CN109991856A (en) * 2019-04-25 2019-07-09 南京理工大学 A kind of integrated control method for coordinating of robot driver vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lu Wei: "Design of a tractor-driving robot and research on the human-machine cooperation method", Journal of Nanjing University of Information Science & Technology *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111487863B (en) * 2020-04-14 2022-06-17 东南大学 Active suspension reinforcement learning control method based on deep Q neural network
CN111487863A (en) * 2020-04-14 2020-08-04 东南大学 Active suspension reinforcement learning control method based on deep Q neural network
CN111882030A (en) * 2020-06-29 2020-11-03 武汉钢铁有限公司 Ingot adding strategy method based on deep reinforcement learning
CN111882030B (en) * 2020-06-29 2023-12-05 武汉钢铁有限公司 Ingot adding strategy method based on deep reinforcement learning
CN111965981A (en) * 2020-09-07 2020-11-20 厦门大学 Aeroengine reinforcement learning control method and system
CN111965981B (en) * 2020-09-07 2022-02-22 厦门大学 Aeroengine reinforcement learning control method and system
CN112395690A (en) * 2020-11-24 2021-02-23 中国人民解放军海军航空大学 Reinforced learning-based shipboard aircraft surface guarantee flow optimization method
CN112861269B (en) * 2021-03-11 2022-08-30 合肥工业大学 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN112861269A (en) * 2021-03-11 2021-05-28 合肥工业大学 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN114662982A (en) * 2022-04-15 2022-06-24 四川大学 Urban power distribution network multi-stage dynamic reconstruction method based on machine learning
CN114662982B (en) * 2022-04-15 2023-07-14 四川大学 Multistage dynamic reconstruction method for urban power distribution network based on machine learning
CN116069014A (en) * 2022-11-16 2023-05-05 北京理工大学 Vehicle automatic control method based on improved deep reinforcement learning
CN116069014B (en) * 2022-11-16 2023-10-10 北京理工大学 Vehicle automatic control method based on improved deep reinforcement learning

Also Published As

Publication number Publication date
CN110716550B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
CN110716550B (en) Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN108087541B (en) Multi-performance comprehensive optimal gear decision system of automobile stepped automatic transmission
CN111731303B (en) HEV energy management method based on deep reinforcement learning A3C algorithm
CN112943914B (en) Vehicle gear shifting line determining method and device, computer equipment and storage medium
DE112011104930T5 (en) Control equipment for vehicle drive system
CN110550034A (en) two-gear AMT comprehensive gear shifting method for pure electric vehicle
CN110985566B (en) Vehicle starting control method and device, vehicle and storage medium
CN110792762B (en) Method for controlling prospective gear shifting of commercial vehicle in cruise mode
CN106080155A (en) A kind of optimization integrated system driving motor and automatic transmission and shift control method
You et al. Shift strategy of a new continuously variable transmission based wheel loader
CN109733406A (en) Policy control method is travelled based on the pure electric automobile of fuzzy control and Dynamic Programming
CN106114492A (en) New-energy automobile automatic transmission power gear-shifting control system and control method
Mashadi et al. An automatic gear-shifting strategy for manual transmissions
CN113104023B (en) Distributed MPC network-connected hybrid electric vehicle energy management system and method
DE102010011018A1 (en) Vehicular power transmission control device
CN115805840A (en) Energy consumption control method and system for range-extending type electric loader
Zhao et al. Fuzzy determination of target shifting time and torque control of shifting phase for dry dual clutch transmission
CN103206524A (en) Gear-shifting control method of automatic gear box
CN109849897B (en) Hybrid power energy management method considering dynamic efficiency of coupling transmission system
Zou et al. Research on shifting process control of automatic transmission
CN106347373A (en) Dynamic planning method based on battery SOC (state of charge) prediction
CN107869579B (en) Fuzzy logic-based gear shifting rule control method and device and vehicle
Yin et al. Multi-performance optimal gearshift schedule of stepped automatic transmissions adaptive to road slope
Yin et al. Shift quality improvement through integrated control of dual clutches pressure and engine speed for DCT
EP3225885B1 (en) Method and control device for selecting a gear of a gearbox in a drivetrain of a motor vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant