CN110716550B - Gear shifting strategy dynamic optimization method based on deep reinforcement learning - Google Patents

Gear shifting strategy dynamic optimization method based on deep reinforcement learning Download PDF

Info

Publication number
CN110716550B
CN110716550B (application number CN201911076016.XA)
Authority
CN
China
Prior art keywords
network
gear shifting
shifting strategy
predicted
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911076016.XA
Other languages
Chinese (zh)
Other versions
CN110716550A (en)
Inventor
陈刚
袁靖
张介
顾爱博
周楠
王和荣
苏树华
陈守宝
王良模
王陶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN201911076016.XA priority Critical patent/CN110716550B/en
Publication of CN110716550A publication Critical patent/CN110716550A/en
Application granted granted Critical
Publication of CN110716550B publication Critical patent/CN110716550B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0223Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Transmission Device (AREA)

Abstract

The invention belongs to the field of engineering machinery and vehicle engineering, and particularly relates to a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method comprises the following steps: (1) determining the gear shifting strategy state input variables and action output variables; (2) determining the gear shifting strategy Markov decision process from the state input and action output variables; (3) establishing a reinforcement learning gear shifting strategy reward function according to the gear shifting strategy target; (4) solving the deep reinforcement learning gear shifting strategy from the Markov decision process and the reward function; (5) placing the predicted Q network calculated in step (4) into a gear shifting strategy controller, which selects the gear while the engineering machinery or vehicle is driving; (6) periodically updating the predicted Q network during driving. The gear shifting strategy is updated by a deep reinforcement learning method, realizing dynamic optimization of the gear shifting strategy.

Description

Gear shifting strategy dynamic optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the field of engineering machinery and vehicle engineering, and particularly relates to a gear shifting strategy dynamic optimization method based on deep reinforcement learning.
Background
The gear shifting strategy is one of the core technologies of existing engineering machinery and vehicle control, and refers to the rule by which the gear changes with selected parameters while the engineering machinery or vehicle is driving. Establishing a gear shifting strategy mainly comes down to choosing a solving method; available methods include the graphical method, the analytical method, genetic algorithms, dynamic programming and the like. Solving and optimizing the gear shifting strategy are the core directions of gear shifting strategy research, and dynamic optimization of the gear shifting strategy in particular is one of its difficulties.
"Correction of the optimal dynamic AMT gear shifting schedule based on variable load", Li Hao, Control Engineering, Vol. 22, No. 1, pp. 50-54, January 2015, introduces acceleration as a gear shifting parameter on the basis of a two-parameter gear shifting strategy and realizes dynamic three-parameter gear shifting that takes acceleration into account. The solving method is analytical: an acceleration-speed curve must be fitted for every accelerator opening, so the solution is complex and computationally heavy; moreover, only a single performance index can be solved for, and no dynamic optimization can be performed for the actual driving condition.
"Performance Evaluation Approach for induced geographic scheduling Optimization", Yin X, year 2016, month 05. The gear shifting strategy is optimized by using the genetic algorithm, the comprehensive performance of the gear shifting strategy is improved, and the problem that only a single performance index can be solved by an analytical method is solved, but dynamic optimization cannot be performed on the actual driving condition.
"Optimal gear shift strategies for fuel economy and drive availability", VietDacNgo, Proceedings of the institute of Mechanical Engineers, Part D: Journal of automatic Engineering, Vol.227, No. 10, p.1398 to 1413, p.2013, month 10. And solving the gear shifting strategy aiming at the specific driving cycle working condition through a dynamic programming method. The disadvantages are that: when the shift schedule is solved by dynamic programming, a complex state diagram needs to be constructed, and the state diagram is expressed in a table form. The complexity of the state diagram depends on the degree of dispersion in the dynamic programming algorithm. An overly complex state diagram may have a reduced rate of convergence or an inability to converge due to a bellman latitude disaster. Meanwhile, due to the fact that optimization is carried out aiming at a specific driving cycle, dynamic optimization cannot be carried out in the driving process.
Among existing patents, application No. 201710887558.X discloses a method for optimizing the shift schedule of an automobile with a dynamic programming algorithm; in its embodiment, shift schedules based on economy and on dynamics are established respectively. Solving the shift schedule by dynamic programming requires constructing a complex state diagram whose complexity depends on the discretization used in the dynamic programming algorithm, and an overly complex state diagram may converge slowly or fail to converge because of the Bellman curse of dimensionality. In addition, dynamic optimization according to the actual driving condition is not possible.
Patent application No. 201811306659.4 discloses a gear shifting strategy correction method and system based on driving intention. The current gear shifting correction coefficient and compensation offset are updated according to the driver's driving behaviour, and the original gear shifting strategy is corrected, so the gear shifting strategy is dynamically updated. However, the dynamic updating rule has to be established manually, the optimization effect depends heavily on that manual design, the method is not universal and applies only to a single vehicle type, and its degree of intelligence is low.
In general, most existing gear shifting strategy solving or optimization methods cannot perform dynamic optimization for the actual driving condition and have poor adaptive capability. Those gear shifting strategies that can be dynamically optimized require manually established updating rules, so their intelligence and universality are low.
Disclosure of Invention
The invention aims to provide a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method constructs a gear shifting strategy Markov decision process and a reward function, solves the gear shifting strategy with a deep reinforcement learning method, and places the predicted Q network obtained by that method into a gear shifting strategy controller to realize gear selection; meanwhile, driving data are collected during daily driving and the gear shifting strategy is updated with the deep reinforcement learning method, realizing dynamic optimization of the gear shifting strategy.
The technical solution for realizing the purpose of the invention is as follows: a gear shifting strategy dynamic optimization method based on deep reinforcement learning comprises the following steps:
step (1): determining a gear shifting strategy state input variable and an action output variable;
step (2): determining a Markov decision process of a gear shifting strategy according to the state input variable and the action output variable in the step (1);
and (3): establishing a reinforcement learning gear shifting strategy reward function according to a gear shifting strategy target;
and (4): solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step (2) and the reward function in the step (3); firstly, a Markov chain is calculated through a Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool;
and (5): putting the predicted Q network calculated in the step (4) into a gear shifting strategy controller, and selecting gears of the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle;
and (6): during the driving process, collecting the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the forecast Q network, and putting the forecast Q network into a gear shifting strategy controller after the updating is finished so as to realize the dynamic optimization of the gear shifting strategy.
Further, the state input variables in step (1) comprise the vehicle speed v, the acceleration dv/dt, the accelerator opening α_t, the road gradient and the ground friction resistance coefficient; the action output variables include a gear operation, namely an upshift, a downshift or holding the gear, or a shift operation, namely the selected gear n_g.
Further, the Markov decision process of the gear shifting strategy in step (2) is expressed as a transfer function that maps the current state and the selected action to the state at the next moment:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next moment, s_t is the current state variable, and a_t is the selected action variable, with s ∈ S and a ∈ A; S is the set of state variables and A is the set of action variables.
Further, the shift strategy reward function in step (3) is positively correlated with shift strategy objectives, including power, economy and comfort.
Further, the gear shifting strategy target is a dynamic gear shifting strategy, described as the engineering machinery or vehicle reaching the highest speed in the shortest time t under the comfort constraint; the reward and punishment mechanism is as follows:
[reward and punishment mechanism equation; rendered as an image in the original]
where r is the reward calculated by the reward and punishment mechanism; r_t is the temporary reward, r_t = -0.001·||v_Tmax - v||; v_Tmax is the maximum vehicle speed at the current accelerator opening α_t; j is the impact degree (jerk) of the engineering machinery or vehicle; and j_max is the designed maximum allowable impact degree.
Further, the Markov chain in step (4) has the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward calculated from the reward target.
Further, the deep reinforcement learning method in step (4) includes two neural networks with the same structure but different parameters, which are called a predicted Q network and a target Q network, wherein the predicted Q network is used to calculate Q values of actions in the current state, and the target Q network is used to update the predicted Q network.
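The Markov chains generated in this way are accumulated in an experience pool from which minibatches are drawn to update the predicted Q network. For illustration only, the experience pool can be sketched as a simple replay buffer in Python; the class name, capacity and sampling routine below are assumptions of the sketch, not details fixed by the patent:

```python
import random
from collections import deque

class ExperiencePool:
    """Stores Markov chains <s_t, a_t, r_t, s_{t+1}> for updating the predicted Q network."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest chains are discarded once full

    def add(self, s_t, a_t, r_t, s_next):
        self.buffer.append((s_t, a_t, r_t, s_next))

    def sample(self, batch_size):
        # random minibatch of Markov chains used for one update of the predicted Q network
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```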
Further, in step (4), when the Markov chain is established the action variable a_t is selected by a greedy algorithm:
[greedy selection equation; rendered as an image in the original]
where Q_p is the predicted Q network, θ_p are the predicted Q network parameters, and e is the greedy algorithm parameter;
in step (4), the Markov chain is saved into the experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated from the data in the experience pool; the predicted Q network is used to calculate the Q value of each gear in the gear set A for the driving state s_t, and its output is Q_p(s, A, θ_p).
Further, in step (5), during driving the engineering machinery or vehicle selects its gear according to the gear shifting strategy controller, and the controller selects the appropriate gear a from the predicted Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
Further, the driving data collected in step (6) comprise the vehicle speed, accelerator opening, acceleration, road gradient and ground friction resistance coefficient;
the predicted Q network in step (6) can be updated in two ways: the first is to reconstruct the transfer function of step (2) from the driving data of the engineering machinery and the vehicle and then update the predicted Q network according to steps (3) and (4); the second is to update the predicted Q network directly according to the updating method of step (4);
in the first way, the transfer function of step (2) is reconstructed from the collected driving data either by recalculating its parameters, giving a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting or a Fourier transform method;
in the second way, the collected driving data are used to update the predicted Q network according to the updating method of step (4):
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))^2
where γ is the reward discount value, α is the neural network learning rate, Q_t is the target Q network, and θ_t are the target Q network parameters.
Compared with the prior art, the invention has the remarkable advantages that:
(1) By adopting the deep reinforcement learning method, the predicted Q network is updated from Markov chains, built from the Markov decision process and the reward function, that are generated during the driving of the engineering machinery and the vehicle; this addresses both the solving and the dynamic optimization of the gear shifting strategy and gives the method strong adaptive capability.
(2) With the deep reinforcement learning method, the solving and dynamic optimization steps use a unified algorithm that does not depend on the controlled object itself, so the method can be applied to different vehicle types such as passenger vehicles, engineering machinery, special vehicles and electric vehicles; the transfer function can be fitted by a neural network, linear fitting or a Fourier transform method, independent of the object to which the method is applied, so the method has strong universality.
(3) The gear shifting strategy is solved and dynamically optimized by the deep reinforcement learning method; the algorithm does not depend on the controlled object and at the same time realizes dynamic optimization of the gear shifting strategy, so the method has strong intelligence.
(4) Gear selection is realized with the predicted Q network of deep reinforcement learning, replacing the table form used in traditional methods; because the neural network has strong fitting capability and suits gear shifting strategies with high-dimensional state variables, the Bellman curse of dimensionality is avoided.
Drawings
FIG. 1 is a schematic diagram of a gear shifting strategy dynamic optimization method based on deep reinforcement learning.
FIG. 2 is a flow chart for solving a deep reinforcement learning shift strategy according to the present invention.
FIG. 3 is a diagram of a neural network architecture model employed in the present invention.
FIG. 4 is a process diagram for dynamic optimization of the shift strategy of the present invention.
Detailed Description
The invention provides a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method constructs the Markov decision process and reward function of the gear shifting strategy and then solves the gear shifting strategy with a deep reinforcement learning method. The predicted Q network obtained in this way is placed into the gear shifting strategy controller to realize gear selection. Meanwhile, driving data are collected during daily driving and the gear shifting strategy is updated by the deep reinforcement learning method, realizing dynamic optimization of the gear shifting strategy.
A gear shifting strategy dynamic optimization method based on deep reinforcement learning comprises the following steps:
step one, determining a gear shifting strategy state variable and an action variable.
And step two, determining a Markov decision process of the gear shifting strategy according to the state input variable and the action output variable.
And step three, establishing a reinforcement learning gear shifting strategy reward function according to the gear shifting strategy optimization target.
And step four, solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step two and the reward function in the step three. Firstly, a Markov chain is calculated through an established Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool.
And step five, putting the predicted Q network calculated in the step four into a gear shifting strategy controller, and selecting gears by the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle.
And step six, in the driving process, acquiring the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished so as to realize dynamic optimization of the gear shifting strategy.
Further, in step one, the gear shifting strategy state variables are running state variables of the engineering machinery and the vehicle or external environment variables. The action variable is either a gear operation or a shift operation: the gear operation is an upshift, a downshift or holding the gear; the shift operation is the selected gear.
In the second step, the Markov decision process of the gear shifting strategy is expressed as a transfer function T that maps the current state and the selected action to the state at the next moment:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next moment, s_t is the current state variable, and a_t is the selected action variable, with s ∈ S and a ∈ A; S is the set of state variables and A is the set of action variables. In the gear shifting strategy, the state variables are running state variables of the engineering machinery and the vehicle or external environment variables, including vehicle speed, accelerator opening, acceleration, road gradient and ground friction resistance coefficient. The action variable is a gear operation or a shift operation.
In step three, the established gear shift strategy reward function is positively correlated with the gear shift target.
In the third step, the shift targets include power, economy and comfort.
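For illustration, a reward of this kind can be sketched in Python. The sketch below follows the dynamics-oriented target used later in this description (temporary reward r_t = -0.001·||v_Tmax - v|| with a comfort constraint on the impact degree); the fixed penalty applied when the constraint is violated is an assumption, since the exact reward equation appears only as an image in the original document:

```python
def shift_reward(v, v_t_max, jerk, jerk_max, violation_penalty=-1.0):
    """Reward that grows as the vehicle approaches the maximum speed reachable at the
    current accelerator opening, with a punishment when the comfort (impact degree)
    constraint is violated. violation_penalty is an assumption, not taken from the patent."""
    r_t = -0.001 * abs(v_t_max - v)        # temporary reward r_t
    if abs(jerk) > jerk_max:               # impact degree exceeds the allowed maximum
        return r_t + violation_penalty
    return r_t
```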
In the fourth step, a Markov chain is calculated from the established Markov decision process and the reward function. The Markov chain has the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward calculated from the reward target.
In the fourth step, when the Markov chain is established the action a_t is selected by a greedy algorithm:
[greedy selection equation; rendered as an image in the original]
where Q_p is the predicted Q network, θ_p are the predicted Q network parameters, and e is the greedy algorithm parameter.
In the fourth step, the Markov chain is stored in the experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated from the data in the experience pool. The predicted Q network calculates the Q value of each gear in the gear set A for the driving state s_t; its output is Q_p(s, A, θ_p). The predicted Q network is updated as follows:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))^2
where γ is the reward discount value, α is the neural network learning rate, Q_t is the target Q network, and θ_t are the target Q network parameters.
In the fifth step, the engineering machinery and the vehicle select gears during driving according to the gear shifting strategy controller. The controller selects the appropriate gear a from the predicted Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
In the sixth step, the collected driving data include: vehicle speed, accelerator opening, acceleration, road gradient and ground friction resistance coefficient.
In the sixth step, the predicted Q network can be updated in two ways. The first is to reconstruct the transfer function of step two from the driving data of the engineering machinery and the vehicle and then update the predicted Q network according to steps three and four. The second is to update the predicted Q network directly according to the updating method of step four.
In the first way, the transfer function of step two is reconstructed from the collected driving data either by recalculating its parameters, giving a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting, a Fourier transform method or the like.
In the second way, the driving data of the engineering machinery and the vehicle are collected and the predicted Q network is updated according to the updating method of step four:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))^2
in the sixth step, the dynamic optimization of the gear shifting strategy is realized through the updating of the predicted Q network in the deep reinforcement learning.
Examples
The invention provides a gear shifting strategy dynamic optimization method based on deep reinforcement learning. The method constructs the Markov decision process of the gear shifting strategy and then solves the gear shifting strategy with a deep reinforcement learning method. After solving is complete, the predicted Q network trained by deep reinforcement learning is placed into the gear shifting strategy controller to realize gear selection. During driving, the predicted Q network is then updated from collected driving data of the engineering machinery and the vehicle to realize dynamic optimization of the gear shifting strategy. The predicted Q network can be updated either by reconstructing the gear shifting strategy transfer function from the driving data and retraining, or by updating the predicted Q network directly with the deep reinforcement learning method. The principle of the gear shifting strategy dynamic optimization method based on deep reinforcement learning is shown in fig. 1, and the method comprises the following steps:
step one, determining a gear shifting strategy state variable and an action variable.
And step two, determining a Markov decision process of the gear shifting strategy according to the state input variable and the action output variable.
And step three, establishing a reinforcement learning gear shifting strategy reward function according to the gear shifting strategy optimization target.
And step four, solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step two and the reward function in the step three. Firstly, a Markov chain is calculated through an established Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool.
And step five, putting the predicted Q network calculated in the step four into a gear shifting strategy controller, and selecting gears of the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle.
And step six, in the driving process, collecting the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished so as to realize dynamic optimization of the gear shifting strategy.
The technical solution of the present invention is described below with reference to the accompanying drawings and examples.
Step one, determining the gear shifting strategy state variables and action variable. In the embodiment, the state variables of the gear shifting strategy are the vehicle speed v, the acceleration dv/dt and the accelerator opening α_t. The action variable is the gear n_g.
In the embodiment, the gear shifting strategy Markov decision process is determined from the state variables (vehicle speed, acceleration, accelerator opening) and the action variable (gear). The Markov decision process state transfer function T is:
[state transfer function equation; rendered as an image in the original]
where T_e is the engine output torque; i_g is the transmission ratio corresponding to gear n_g; i_0 is the final drive ratio; η_t is the drivetrain efficiency; m is the total vehicle mass; β is the equivalent gradient drag coefficient; C_d is the air resistance coefficient; A is the frontal area of the vehicle; F_b is the braking force; r is the effective rolling radius of the tire; and ρ is the air density.
Step three, establishing the reinforcement learning gear shifting strategy reward function according to the gear shifting target. In the embodiment, the learning objective is a dynamic gear shifting strategy, described as the engineering machinery or vehicle reaching the highest speed in the shortest time t under the comfort constraint. The reward and punishment mechanism is:
[reward and punishment mechanism equation; rendered as an image in the original]
where r is the reward calculated by the reward and punishment mechanism; r_t is the temporary reward, r_t = -0.001·||v_Tmax - v||; v_Tmax is the maximum vehicle speed at the current accelerator opening α_t; j is the impact degree (jerk) of the engineering machinery or vehicle; and j_max is the designed maximum allowable impact degree.
And step four, solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step two and the reward function in the step three. Firstly, a Markov chain is calculated through an established Markov decision process and a reward function, the Markov chain is stored in an experience pool, and then a prediction Q network in a deep reinforcement learning gear shifting strategy is updated according to data in the experience pool. The flow of step four is shown in fig. 2. The specific steps are as follows.
The first step is as follows: firstly, initializing a state variable and an action variable, and calculating the state of the next moment according to the established Markov decision process transfer function.
The second step: the reward is calculated through a designed reward and punishment mechanism.
The third step: the state-action-next-time state and reward representations described above are saved into the experience pool in the form of markov chains.
The fourth step: and taking the state at the next moment as the current state, calculating the Q value under each action by the prediction Q network according to the current state, and then calculating the actually selected gear under the current state according to the Q value under each action through a greedy algorithm. Then returning to the first step and circulating.
In the above steps, when the number of markov chains in the experience pool reaches a predetermined number, the update of the predictive Q network is started.
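The four sub-steps above can be combined into a training-loop sketch. It is illustrative only: `transfer_function`, `reward_fn`, `pred_net`, `target_net`, `optimizer`, `ExperiencePool` and `update_predicted_q` stand for the transfer function, reward mechanism, the two Q networks, their optimizer and the helpers sketched elsewhere in this description, and the pool threshold, batch size and exploration rate are assumed values:

```python
import random
import torch

def select_gear_greedy(q_predict, state, gears, epsilon):
    """Greedy algorithm: explore a random gear with probability epsilon,
    otherwise exploit the gear with the largest predicted Q value."""
    if random.random() < epsilon:
        return random.choice(gears)
    q_values = q_predict(state)                    # one Q value per gear in the set A
    return max(gears, key=lambda g: q_values[g])

pool = ExperiencePool()
gears = list(range(5))                             # gear indices for an assumed 5-speed transmission
MIN_POOL_SIZE, BATCH_SIZE, EPSILON = 1000, 64, 0.1

# wrap the predicted Q network so it returns plain Q values for a single state
q_predict = lambda s: pred_net(torch.as_tensor(s, dtype=torch.float32)).tolist()

state = (0.0, 0.0, 0.2)                            # initial state: (v, dv/dt, accelerator opening)
for step in range(100_000):
    gear = select_gear_greedy(q_predict, state, gears, EPSILON)   # fourth sub-step: greedy choice
    next_state = transfer_function(state, gear)                   # first sub-step: s_{t+1} = T(s_t, a_t)
    reward = reward_fn(state, gear, next_state)                   # second sub-step: reward mechanism
    pool.add(state, gear, reward, next_state)                     # third sub-step: store the Markov chain
    state = next_state                                            # next-time state becomes current state
    if len(pool) >= MIN_POOL_SIZE:                                # pool large enough: update the predicted Q network
        update_predicted_q(pred_net, target_net, optimizer, pool.sample(BATCH_SIZE))
```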
The updating process of the predicted Q network is carried out jointly by the predicted Q network and the target Q network, and the predicted Q network is updated as follows:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))^2
where γ is the reward discount value; α is the neural network learning rate; Q_t is the target Q network; and θ_t are the target Q network parameters.
In the updating process of the predicted Q network, the parameters of the predicted Q network need to be imported and copied into the target Q network periodically to update the target Q network.
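Read as a learning rule, the update drives Q_p(s, a, θ_p) toward the target r + γ·max_a Q_t computed with the target Q network, and the target network is refreshed by periodically copying the predicted network's parameters. A hedged PyTorch sketch under that reading (the target Q network is evaluated at the next-time state, which is the usual way a target network is applied; function names are illustrative):

```python
import torch
import torch.nn.functional as F

def update_predicted_q(pred_net, target_net, optimizer, transitions, gamma=0.9):
    """One gradient step on the predicted Q network from a minibatch of
    Markov chains <s_t, a_t, r_t, s_{t+1}> sampled from the experience pool."""
    s, a, r, s_next = zip(*transitions)
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    q_pred = pred_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q_p(s, a, theta_p)
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next).max(dim=1).values    # r + gamma * max_a Q_t
    loss = F.mse_loss(q_pred, q_target)      # squared temporal-difference error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target_network(pred_net, target_net):
    """Periodic copy of the predicted Q network parameters into the target Q network."""
    target_net.load_state_dict(pred_net.state_dict())
```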
The predicted Q network and the target Q network have the same neural network structure. In this embodiment, the neural network structure model used by both is shown in fig. 3. It has five fully connected layers as intermediate layers and uses the linear rectification function ReLU as the activation function; each layer computes:
ReLU(x) = max(0, Wx + b)
where W is the layer weight matrix, b is the layer bias vector, and x is the layer input.
In this embodiment, the neural network inputs are the state variables (vehicle speed, accelerator opening, acceleration), and the output layer outputs the Q value corresponding to every gear n_g. The larger the Q value, the larger the discounted cumulative reward obtained by selecting the corresponding gear in the current state.
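A PyTorch sketch of a network with this structure follows: three state inputs, five fully connected intermediate layers with ReLU activations, and one Q value per gear at the output; the hidden-layer width is an assumption, since the patent does not specify it. The small helper at the end shows how the gear shifting strategy controller of step five reads a gear from these outputs (a*(s) is the argmax over the Q values):

```python
import torch
import torch.nn as nn

class ShiftQNetwork(nn.Module):
    """Q network for the gear shifting strategy: input (v, dv/dt, accelerator opening),
    output one Q value per gear n_g; five fully connected hidden layers with ReLU."""

    def __init__(self, n_gears, hidden=64):
        super().__init__()
        layers, width = [], 3                        # 3 state inputs
        for _ in range(5):                           # five fully connected intermediate layers
            layers += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        layers.append(nn.Linear(width, n_gears))     # output layer: Q value for every gear
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

def controller_select_gear(pred_net, state, gears):
    """Step five: the controller picks the gear with the largest predicted Q value."""
    with torch.no_grad():
        q_values = pred_net(torch.as_tensor(state, dtype=torch.float32))
    return gears[int(torch.argmax(q_values))]
```

The predicted Q network and the target Q network can both be instances of such a class; the periodic parameter copy described above then amounts to loading the predicted network's state dict into the target network.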
And step five, putting the predicted Q network calculated in the step four into a gear shifting strategy controller, and selecting gears of the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process of the engineering machinery and the vehicle.
In the fifth step, the engineering machinery and the vehicle select gears according to the gear shifting strategy controller, expressed as:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
And step six, in the driving process, acquiring the driving data of the engineering machinery and the vehicle, storing the driving data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished so as to realize dynamic optimization of the gear shifting strategy.
In step six, the method for updating the predicted Q network includes two methods. And the first method is to reconstruct the transfer function in the second step through the driving data of the engineering machinery and the vehicle, and then update the predicted Q network according to the third step and the fourth step. The second method is to directly update the predicted Q network according to the predicted Q network updating method in the fourth step.
In the sixth step, the first updating method reconstructs the transfer function of step two from the collected driving data of the engineering machinery and the vehicle, either by recalculating the parameters of the transfer function, giving a transfer function with the same structure but different parameters, or by fitting the transfer function with a neural network, linear fitting, a Fourier transform method or the like. In this embodiment, the reconstructed transfer function is:
[reconstructed transfer function equation; rendered as an image in the original]
Depending on the reconstruction method chosen in this embodiment, the parameters of the transfer function are either recalculated or the transfer function is fitted by a neural network, linear fitting or a Fourier transform. Whichever form of reconstruction is performed, the reconstructed transfer function can be written uniformly as:
s_{t+1} = T_new(s_t, a_t, Θ)
where Θ is the set of transfer function parameters.
After reconstruction is finished, the fourth step and the fifth step need to be carried out again to obtain a new predicted Q network.
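As an illustration of the neural-network option for reconstructing the transfer function, the sketch below regresses s_{t+1} against (s_t, a_t) from collected driving data; the network shape, optimizer and number of epochs are assumptions, since the patent only states that a neural network, linear fitting or a Fourier transform method may be used:

```python
import torch
import torch.nn as nn

def fit_transfer_function(states, gears, next_states, epochs=200, lr=1e-3):
    """Fit T_new(s_t, a_t) -> s_{t+1} from logged driving data by regression."""
    x = torch.cat([torch.as_tensor(states, dtype=torch.float32),
                   torch.as_tensor(gears, dtype=torch.float32).unsqueeze(1)], dim=1)
    y = torch.as_tensor(next_states, dtype=torch.float32)
    model = nn.Sequential(nn.Linear(x.shape[1], 32), nn.ReLU(), nn.Linear(32, y.shape[1]))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)   # fit the reconstructed transfer function
        loss.backward()
        opt.step()
    return model   # plays the role of T_new(s_t, a_t, Theta)
```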
In the sixth step, the second updating method collects the driving data of the engineering machinery and the vehicle and updates the predicted Q network directly according to the updating method of step four, realizing the dynamic optimization of the gear shifting strategy; the dynamic optimization process is shown in fig. 4 and proceeds as follows:
the first step is as follows: collecting engineering machinery and vehicle running data
The second step: the collected driving data of the engineering machinery and the vehicle is processed, and the processed data is expressed in a Markov chain form and can be expressed as follows:
<st,at,rt,st+1>
the third step: updating the predicted Q network, wherein the updating process of the predicted Q network is completed by the predicted Q network and the target Q network together, and the method comprises the following steps:
Qp(s,a,θp)=Qp(s,a,θp)+α(r+γmaxaQt(s,a,θt)-Qp(s,a,θp))2
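A sketch of this second updating method (Python; `reward_fn`, `ExperiencePool` and `update_predicted_q` stand for the reward mechanism, experience pool and update routine sketched earlier, and the number of update iterations is an assumed value):

```python
def update_from_driving_log(log, pool, pred_net, target_net, optimizer,
                            reward_fn, n_updates=100, batch_size=64):
    """log: records (state, gear, next_state) collected while the machinery or vehicle drives.
    Each record becomes a Markov chain in the experience pool, after which the
    predicted Q network is updated for a fixed number of minibatch steps."""
    for state, gear, next_state in log:
        r_t = reward_fn(state, gear, next_state)     # temporary reward from the reward target
        pool.add(state, gear, r_t, next_state)       # <s_t, a_t, r_t, s_{t+1}>
    for _ in range(n_updates):                        # periodic update of the predicted Q network
        update_predicted_q(pred_net, target_net, optimizer, pool.sample(batch_size))
    target_net.load_state_dict(pred_net.state_dict())  # refresh the target Q network afterwards
```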
in addition to the above embodiments, the present invention may have other embodiments. All technical solutions formed by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the present invention.

Claims (9)

1. A gear shifting strategy dynamic optimization method based on deep reinforcement learning is characterized by comprising the following steps:
step (1): determining a gear shifting strategy state input variable and an action output variable;
step (2): determining a gear shifting strategy Markov decision process according to the state input variable and the action output variable in the step (1);
and (3): establishing a reinforcement learning gear shifting strategy reward function according to a gear shifting strategy target;
and (4): solving a deep reinforcement learning gear shifting strategy according to the Markov decision process in the step (2) and the reward function in the step (3); firstly, calculating a Markov chain through a Markov decision process and a reward function, storing the Markov chain into an experience pool, and then updating a prediction Q network in a deep reinforcement learning gear shifting strategy according to data in the experience pool;
and (5): putting the predicted Q network calculated in the step (4) into a gear shifting strategy controller, and selecting gears of the engineering machinery and the vehicle according to the gear shifting strategy controller in the driving process;
and (6): in the running process, collecting the running data of the engineering machinery and the vehicle, storing the running data into an experience pool, periodically updating the predicted Q network, and putting the predicted Q network into a gear shifting strategy controller after the updating is finished to realize dynamic optimization of the gear shifting strategy;
step (1), determining a gear shifting strategy state variable and an action output variable, wherein the gear shifting strategy state variable is vehicle speed v and acceleration
dv/dt, and accelerator opening α_t; the action output variable is the gear n_g;
Determining the Markov decision process of the gear shifting strategy according to the state variables, namely the vehicle speed, the acceleration and the accelerator opening, and the action output variable, namely the gear, wherein the state transfer function T of the Markov decision process is:
[state transfer function equation; rendered as an image in the original]
where T_e is the engine output torque; i_g is the transmission ratio corresponding to gear n_g; i_0 is the final drive ratio; η_t is the drivetrain efficiency; m is the total vehicle mass; β is the equivalent gradient resistance coefficient; C_d is the air resistance coefficient; A is the frontal area of the vehicle; F_b is the braking force; r is the effective rolling radius of the tire; ρ is the air density;
and (3) the gear shifting strategy target is a dynamic gear shifting strategy, which is described as that the engineering machinery and the vehicle reach the highest speed in the shortest time t under the comfort degree constraint condition, and the reward and punishment mechanism is as follows:
[reward and punishment mechanism equation; rendered as an image in the original]
where r is the reward calculated by the reward and punishment mechanism; r_t is the temporary reward, r_t = -0.001·||v_Tmax - v||; v_Tmax is the maximum vehicle speed at the current accelerator opening α_t; j is the impact degree of the engineering machinery and the vehicle; j_max is the designed maximum allowable impact degree;
the specific steps of the step (4) are as follows;
the first step is as follows: firstly, initializing a state variable and an action variable, and calculating the state of the next moment according to an established Markov decision process transfer function;
the second step is that: calculating rewards through a designed reward and punishment mechanism;
the third step: storing the state-action-next-time state and reward representation in a Markov chain into an experience pool;
the fourth step: taking the state at the next moment as the current state, calculating the Q value under each action according to the current state by the prediction Q network, and then calculating the actually selected gear under the current state according to the Q value under each action by a greedy algorithm; then returning to the first step, and circulating and reciprocating;
in the above steps, when the number of Markov chains in the experience pool reaches a preset number, the predicted Q network is updated;
the updating process of the predicted Q network is completed by the predicted Q network and the target Q network together, and the updating method of the predicted Q network comprises the following steps:
Q_p(s, a, θ_p) = Q_p(s, a, θ_p) + α(r + γ·max_a Q_t(s, a, θ_t) - Q_p(s, a, θ_p))^2
where γ is the reward discount value; α is the neural network learning rate; Q_t is the target Q network; θ_t are the target Q network parameters; Q_p is the predicted Q network; θ_p are the predicted Q network parameters.
2. The method according to claim 1, wherein the state input variables in step (1) include the vehicle speed v, the acceleration dv/dt, the accelerator opening α_t, the road gradient and the ground friction resistance coefficient; the action output variables include a gear operation, namely an upshift, a downshift or holding the gear, or a shift operation, namely the selected gear n_g.
3. The method according to claim 2, wherein the Markov decision process of the gear shifting strategy in step (2) is expressed as a transfer function that maps the current state and the selected action to the state at the next moment:
s_{t+1} = T(s_t, a_t)
where s_{t+1} is the state variable at the next moment, s_t is the current state variable, and a_t is the selected action variable, with s ∈ S and a ∈ A; S is the set of state variables and A is the set of action variables.
4. The method of claim 3, wherein the shift strategy reward function in step (3) is positively correlated with shift strategy objectives, including power, economy, and comfort.
5. The method of claim 4, wherein the Markov chain in step (4) is of the form:
<s_t, a_t, r_t, s_{t+1}>
where r_t is the temporary reward calculated from the reward target.
6. The method according to claim 5, wherein the deep reinforcement learning method in step (4) comprises two neural networks with the same structure but different parameters, namely a predicted Q network and a target Q network, wherein the predicted Q network is used for calculating Q values of actions in the current state, and the target Q network is used for updating the predicted Q network.
7. The method of claim 6, wherein in step (4), when the Markov chain is established the action variable a_t is selected by a greedy algorithm:
[greedy selection equation; rendered as an image in the original]
where Q_p is the predicted Q network, θ_p are the predicted Q network parameters, and e is the greedy algorithm parameter;
in step (4), the Markov chain is saved into the experience pool, and the predicted Q network in the deep reinforcement learning gear shifting strategy is then updated from the data in the experience pool; the predicted Q network is used to calculate the Q value of each gear in the gear set A for the driving state s_t, and its output is Q_p(s, A, θ_p).
8. The method according to claim 7, wherein in step (5), the engineering machinery and the vehicle select gears during driving according to the gear shifting strategy controller, and the controller selects the appropriate gear a from the predicted Q network:
a*(s) = argmax_a [Q_p(s, a, θ_p) | a ∈ A]
where Q_p is the predicted Q network and θ_p are the predicted Q network parameters.
9. The method of claim 8, wherein the driving data collected in step (6) comprise the vehicle speed, accelerator opening, acceleration, road gradient and ground friction resistance coefficient;
the method for updating the predicted Q network in the step (6) comprises two methods: reconstructing the transfer function in the step (2) through the driving data of the engineering machinery and the vehicle, and then updating the prediction Q network according to the step (3) and the step (4); the second method is to directly update the predicted Q network according to the predicted Q network updating method in the step (4);
reconstructing the transfer function in the step (2) by acquiring the driving data of the engineering machinery and the vehicle, wherein the reconstruction method is to recalculate parameters in the transfer function to form the transfer function with the same structure but different parameters, or to fit the transfer function by adopting a neural network, linear fitting and a Fourier transform method;
and (4) acquiring the driving data of the engineering machinery and the vehicle, and updating according to the predicted Q network updating method in the step (4).
CN201911076016.XA 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning Active CN110716550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911076016.XA CN110716550B (en) 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911076016.XA CN110716550B (en) 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN110716550A CN110716550A (en) 2020-01-21
CN110716550B true CN110716550B (en) 2022-07-22

Family

ID=69213797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911076016.XA Active CN110716550B (en) 2019-11-06 2019-11-06 Gear shifting strategy dynamic optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN110716550B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111487863B (en) * 2020-04-14 2022-06-17 东南大学 Active suspension reinforcement learning control method based on deep Q neural network
CN111882030B (en) * 2020-06-29 2023-12-05 武汉钢铁有限公司 Ingot adding strategy method based on deep reinforcement learning
CN111965981B (en) * 2020-09-07 2022-02-22 厦门大学 Aeroengine reinforcement learning control method and system
CN112395690A (en) * 2020-11-24 2021-02-23 中国人民解放军海军航空大学 Reinforced learning-based shipboard aircraft surface guarantee flow optimization method
CN112861269B (en) * 2021-03-11 2022-08-30 合肥工业大学 Automobile longitudinal multi-state control method based on deep reinforcement learning preferential extraction
CN114662982B (en) * 2022-04-15 2023-07-14 四川大学 Multistage dynamic reconstruction method for urban power distribution network based on machine learning
CN116069014B (en) * 2022-11-16 2023-10-10 北京理工大学 Vehicle automatic control method based on improved deep reinforcement learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107797534A (en) * 2017-09-30 2018-03-13 安徽江淮汽车集团股份有限公司 A kind of pure electronic automated driving system
CN108407797A (en) * 2018-01-19 2018-08-17 洛阳中科龙网创新科技有限公司 A method of the realization agricultural machinery self shifter based on deep learning
CN109325624A (en) * 2018-09-28 2019-02-12 国网福建省电力有限公司 A kind of monthly electric power demand forecasting method based on deep learning
CN109991856A (en) * 2019-04-25 2019-07-09 南京理工大学 A kind of integrated control method for coordinating of robot driver vehicle
CN110136481A (en) * 2018-09-20 2019-08-16 初速度(苏州)科技有限公司 A kind of parking strategy based on deeply study
CN110244701A (en) * 2018-03-08 2019-09-17 通用汽车环球科技运作有限责任公司 The method and apparatus of intensified learning for the autonomous vehicle based on the course sequence automatically generated

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6442455B1 (en) * 2000-12-21 2002-08-27 Ford Global Technologies, Inc. Adaptive fuel strategy for a hybrid electric vehicle
US20180018757A1 (en) * 2016-07-13 2018-01-18 Kenji Suzuki Transforming projection data in tomography by means of machine learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107797534A (en) * 2017-09-30 2018-03-13 安徽江淮汽车集团股份有限公司 A kind of pure electronic automated driving system
CN108407797A (en) * 2018-01-19 2018-08-17 洛阳中科龙网创新科技有限公司 A method of the realization agricultural machinery self shifter based on deep learning
CN110244701A (en) * 2018-03-08 2019-09-17 通用汽车环球科技运作有限责任公司 The method and apparatus of intensified learning for the autonomous vehicle based on the course sequence automatically generated
CN110136481A (en) * 2018-09-20 2019-08-16 初速度(苏州)科技有限公司 A kind of parking strategy based on deeply study
CN109325624A (en) * 2018-09-28 2019-02-12 国网福建省电力有限公司 A kind of monthly electric power demand forecasting method based on deep learning
CN109991856A (en) * 2019-04-25 2019-07-09 南京理工大学 A kind of integrated control method for coordinating of robot driver vehicle

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a tractor driving robot and research on human-machine cooperation methods; Lu Wei; Journal of Nanjing University of Information Science & Technology; 2019-03-28; pp. 165-173 *

Also Published As

Publication number Publication date
CN110716550A (en) 2020-01-21

Similar Documents

Publication Publication Date Title
CN110716550B (en) Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN108087541B (en) Multi-performance comprehensive optimal gear decision system of automobile stepped automatic transmission
CN111731303B (en) HEV energy management method based on deep reinforcement learning A3C algorithm
WO2021114742A1 (en) Comprehensive energy prediction and management method for hybrid electric vehicle
CN110936824B (en) Electric automobile double-motor control method based on self-adaptive dynamic planning
CN111267831A (en) Hybrid vehicle intelligent time-domain-variable model prediction energy management method
CN109591659B (en) Intelligent learning pure electric vehicle energy management control method
CN110550034A (en) two-gear AMT comprehensive gear shifting method for pure electric vehicle
CN112943914B (en) Vehicle gear shifting line determining method and device, computer equipment and storage medium
Zhao et al. Torque coordinating robust control of shifting process for dry dual clutch transmission equipped in a hybrid car
CN109334654A (en) A kind of parallel hybrid electric vehicle energy management method with gearbox-gear control
CN110985566B (en) Vehicle starting control method and device, vehicle and storage medium
CN113104023B (en) Distributed MPC network-connected hybrid electric vehicle energy management system and method
CN110792762A (en) Method for controlling prospective gear shifting of commercial vehicle in cruise mode
CN112009456A (en) Energy management method for network-connected hybrid electric vehicle
CN113682293B (en) Multi-system dynamic coordination control system and method for intelligent network-connected hybrid electric vehicle
CN112765723B (en) Curiosity-driven hybrid power system deep reinforcement learning energy management method
He et al. MPC-based longitudinal control strategy considering energy consumption for a dual-motor electric vehicle
Shen et al. Two-level energy control strategy based on ADP and A-ECMS for series hybrid electric vehicles
Mashadi et al. An automatic gear-shifting strategy for manual transmissions
CN115805840A (en) Energy consumption control method and system for range-extending type electric loader
CN113997926A (en) Parallel hybrid electric vehicle energy management method based on layered reinforcement learning
Zou et al. Research on shifting process control of automatic transmission
CN106347373A (en) Dynamic planning method based on battery SOC (state of charge) prediction
Wang et al. Predictive ramp shift strategy with dual clutch automatic transmission combined with GPS and electronic database

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant