CN115476841A - Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG - Google Patents

Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Info

Publication number
CN115476841A
CN115476841A (application CN202211235470.7A)
Authority
CN
China
Prior art keywords
energy management
battery
plug
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211235470.7A
Other languages
Chinese (zh)
Inventor
孙希雷
付建勤
刘琦
袁硕
吴跃
许东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University Chongqing Research Institute
Original Assignee
Hunan University Chongqing Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University Chongqing Research Institute filed Critical Hunan University Chongqing Research Institute
Priority to CN202211235470.7A priority Critical patent/CN115476841A/en
Publication of CN115476841A publication Critical patent/CN115476841A/en
Pending legal-status Critical Current

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W20/00Control systems specially adapted for hybrid vehicles
    • B60W20/10Controlling the power contribution of each of the prime movers to meet required power demand
    • B60W20/11Controlling the power contribution of each of the prime movers to meet required power demand using model predictive control [MPC] strategies, i.e. control methods based on models predicting performance

Abstract

The invention discloses a plug-in hybrid electric vehicle energy management method based on an improved multi-target DDPG, which comprises the following steps: establishing an energy management system model of the plug-in hybrid electric vehicle, comprising a vehicle longitudinal dynamics model, an engine fuel consumption model, a battery equivalent circuit model, a battery life model and a drive/generator model; acquiring state information of the plug-in hybrid electric vehicle in actual driving and inputting it into the energy management system model; taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model; and inputting initial state information and the driving cycle into the reinforcement learning model to obtain the energy management strategy during driving. The invention comprehensively considers fuel economy and battery life, and improves the optimality and universality of the strategy while ensuring real-time performance.

Description

Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG
Technical Field
The invention relates to the field of hybrid electric vehicle energy management, in particular to a plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG.
Background
The plug-in hybrid electric vehicle (PHEV) combines the advantages of a pure electric vehicle and a hybrid electric vehicle: it overcomes the shortcomings of the pure electric vehicle, such as short driving range and immature charging infrastructure, and can further exploit the potential of the engine as a drive unit, so it has become the mainstream direction of research and development for traditional vehicle manufacturers during their transition to electrification. Because the PHEV has a complex structure and multiple driving modes, reasonably distributing the power demand between the engine and the motor, switching modes and gears appropriately, and achieving optimal energy management are key technologies for the PHEV. At present, PHEV energy management strategies can be divided into rule-based strategies and optimization-based strategies. However, existing energy management strategies generally suffer from poor real-time performance, complex computation, poor adaptability and non-ideal optimization performance, and battery life is also a key factor restricting the development of the PHEV. Therefore, research on a PHEV energy management strategy that offers real-time performance, adaptability and optimality while comprehensively considering fuel economy and battery life has important research and application value.
With the popularization and development of artificial intelligence, energy management strategies based on deep reinforcement learning algorithms offer real-time performance, adaptability and other desirable characteristics, and have therefore attracted extensive attention from researchers. However, existing deep-reinforcement-learning energy management strategies for PHEVs generally merge multiple objectives (such as fuel consumption and SOC deviation) into a single objective through weighting factors to form the reward function, and battery life is mostly not taken into account.
Patent CN114801897A discloses a fuel cell hybrid power system energy management method based on the DDPG algorithm: a DDPG algorithm model is established for a hybrid power system consisting of a dual-stack fuel cell and a lithium battery, and the algorithm model is parameter-matched with the dynamic model; the state, action and reward of the algorithm model are set; and an optimization objective function based on operating cost is established for the hybrid power system so as to reduce the degradation of the fuel cell and the lithium battery and prolong their service life. The method has the following defects:
(1) The operating costs of the freight truck are coupled into a single objective function, which reduces the diversity of the objectives.
(2) Four objectives (hydrogen consumption, fuel cell degradation, lithium battery degradation and SOC variation) are coupled into one objective through weighting factors, which increases the computational cost of determining the weighting factors. Moreover, the weighting factors must be formulated again for different vehicle models and different objective functions, which is inefficient and reduces the adaptability of the algorithm.
(3) The optimal solution of a multi-objective optimization problem is usually not unique; it is a solution set composed of multiple mutually non-dominated optimal solutions. Coupling multiple objectives into a single objective therefore cannot achieve global multi-objective optimization and reduces the optimality of the solution.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a PHEV energy management strategy that, to a certain extent, integrates real-time performance, adaptability and optimality, operates on continuous actions, achieves optimal fuel economy without large SOC deviation, and at the same time takes battery life into account, thereby realizing multi-objective PHEV energy management.
Aiming at the above technical problems in the prior art, the invention provides a plug-in hybrid electric vehicle energy management method based on an improved multi-target DDPG, which comprehensively considers fuel economy and battery life and improves the optimality and universality of the strategy while ensuring real-time performance.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG comprises the following steps:
establishing an energy management system model of the plug-in hybrid electric vehicle, wherein the energy management system model comprises a longitudinal dynamic model of the whole vehicle, an engine fuel consumption model, a battery equivalent circuit model, a battery service life model and a driving/power generator model;
acquiring state information of the plug-in hybrid electric vehicle in actual running, and inputting the state information into the energy management system model;
taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, taking the vehicle speed, acceleration and battery SOC of the plug-in hybrid electric vehicle as the state variables of the IMDDPG, taking the required power of the plug-in hybrid electric vehicle as the action variable of the IMDDPG, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model;
and inputting the initial state information and the driving working condition into the reinforcement learning model to obtain the energy management strategy in driving.
Further, the multi-target optimization of the energy management system model by the IMDDPG comprises the following steps:
obtaining state information s_t of the energy management system model, inputting it into the Actor evaluation network to obtain the corresponding action a_t, inputting a_t into the energy management system model to obtain, under the influence of the environment, the corresponding reward r(s_t, a_t) and the next state information s_{t+1}, storing the current sample e_t = (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool, and repeating this step until the number of samples in the experience pool meets the requirement;
randomly selecting samples from the experience pool, inputting the state information s_t of a selected sample into the Actor evaluation network to obtain the corresponding action a_t, inputting the state information s_t and the corresponding action a_t into the Critic evaluation network, solving the Pareto optimal front to obtain the cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample, calculating the expectation of the cumulative reward Q(s_t, a_t|θ) to obtain a loss function, performing back propagation, and updating the parameters of the Actor evaluation network by gradient ascent;
inputting the state information s_t of the selected sample into the updated Actor evaluation network to obtain the updated corresponding action a_t, inputting the state information s_t and the updated corresponding action a_t into the Critic evaluation network, and solving the Pareto optimal front to obtain the updated cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample;
inputting the next state information s_{t+1} of the selected sample into the Actor target network to obtain the corresponding action a_{t+1}, inputting the state information s_{t+1} and the corresponding action a_{t+1} into the Critic target network, solving the Pareto optimal front to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1} of the selected sample, calculating the mean square error with respect to the cumulative reward Q(s_{t+1}, a_{t+1}|θ') to obtain a loss function, performing back propagation, and updating the parameters of the Critic evaluation network by gradient descent;
updating the parameters of the Actor evaluation network and the Critic evaluation network into the Actor target network and the Critic target network, and returning to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement.
Further, before the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network, it is also judged whether the difference between the current cycle number and the cycle number at the last update reaches a preset step size; if so, the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network; otherwise, execution returns to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network.
Further, the step of solving the Pareto optimal front comprises: solving the optimal action value function Q*(s_t, a_t) according to the state information s_t and the corresponding action a_t, and, on the Pareto optimal front, randomly selecting an optimal action value function as the maximum cumulative reward, or selecting the optimal action value function that minimizes a given objective as the maximum cumulative reward.
Further, the expression of the optimal action value function Q*(s_t, a_t) is as follows:
Q*(s_t, a_t) = max_π E[R_t | s_t, a_t, π]
where R_t is the discounted cumulative reward,
R_t = Σ_{i=t}^{T} γ^(i−t) r_i,
γ is the discount factor, γ ∈ [0,1], r_i is the reward function at time i, i ∈ [t, T], and T is the termination time.
Further, the expression of the vehicle longitudinal dynamics model is as follows:
F_D = F_R + F_A + F_G + F_j
F_R = c_R m g cosθ
F_A = (1/2) ρ C_D A v²
F_G = m g sinθ
F_j = δ m (dv/dt)
P_D = F_D v,  T_D = F_D r
where F_D is the driving force, P_D is the driving power, T_D is the driving torque, v is the vehicle speed, F_R, F_A, F_G and F_j are respectively the rolling resistance, air resistance, gradient resistance and acceleration resistance during vehicle running, A is the frontal area of the vehicle, C_D is the air drag coefficient, ρ is the air density, c_R is the rolling resistance coefficient, m is the total vehicle mass, g is the gravitational acceleration, θ is the road gradient, δ is the rotating-mass conversion coefficient, dv/dt is the running acceleration, and r is the wheel radius;
[torque and speed coupling relations of the engine, the EM1 and EM2 motors, the planetary gear sets PG1 and PG2 and the final drive: given as an equation image in the original]
where T_EN, T_EM1 and T_EM2 are respectively the torques of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, ω_EN, ω_EM1 and ω_EM2 are respectively the rotational speeds of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, T_Out is the output torque of the transmission, K_1 and K_2 are respectively the ring-gear to sun-gear tooth ratios of PG1 and PG2 of the plug-in hybrid electric vehicle, and i is the gear ratio of the final drive.
Further, the expression of the engine fuel consumption model is as follows:
ṁ_f = f(T_EN, ω_EN),  m_f = ∫ ṁ_f dt
where ṁ_f is the instantaneous fuel consumption of the engine, m_f is the cumulative fuel consumption of the engine, and T_EN and ω_EN are respectively the torque and rotational speed of the engine of the plug-in hybrid electric vehicle.
Further, the expression of the battery equivalent circuit model is as follows:
U_Bat = U_OC − I_Bat R_Bat
I_Bat = (U_OC − sqrt(U_OC² − 4 R_Bat P_Bat)) / (2 R_Bat)
SOC(t) = SOC(0) − (1/Q_Bat) ∫ I_Bat dt
where U_Bat is the terminal voltage of the battery, I_Bat is the battery current, U_OC is the open-circuit voltage, R_Bat is the internal resistance of the battery, P_Bat is the battery power, SOC(0) is the initial value of the SOC, and Q_Bat is the battery capacity.
Further, the expression of the battery life model is as follows:
Q_Loss = α exp((−E_A + β |I_Bat|/Q_Bat) / (R T_K)) Ah^z,  Ah = (1/3600) ∫ |I_Bat| dt
[relations between the capacity fade Q_Loss, the end of life EOL, the total cycle number N and the battery SOH: given as an equation image in the original]
where Q_Loss is the battery capacity fade, α and β are constant terms, E_A is the activation energy, R is the molar gas constant, T_K is the thermodynamic temperature of the environment, Ah is the ampere-hour throughput, z is the power exponent factor, Q_Bat is the battery capacity, I_Bat is the battery current, EOL is the end of life of the battery, N is the total number of cycles, SOH is the state of health of the battery, and SOC is the state of charge of the battery.
Further, the expression of the drive/generator model is as follows:
η_EM = f(T_EM, ω_EM)
where T_EM is the motor torque, ω_EM is the motor rotational speed, and η_EM is the corresponding motor efficiency;
P_EM = T_EM ω_EM, with P_Bat,EM = P_EM / η_EM when the motor drives (P_EM ≥ 0) and P_Bat,EM = P_EM η_EM when the motor generates (P_EM < 0)
where P_EM is the mechanical power of the motor and P_Bat,EM is the electrical power exchanged between the battery and the motor (delivered by the battery when driving, returned to the battery when generating).
Compared with the prior art, the invention has the advantages that:
(1) The invention comprehensively considers the economy of the plug-in hybrid electric vehicle and the service life of the battery, and realizes multi-target optimization of the energy management strategy of the plug-in hybrid electric vehicle.
(2) The invention uses the IMDDPG to optimize the energy management strategy of the plug-in hybrid electric vehicle on the basis of the continuous actions that occur in actual driving, overcoming the difficulty of optimizing energy management strategies based on discrete actions and better matching the characteristics of actual driving.
(3) The IMDDPG-based energy management strategy improves the optimality and universality of the strategy through continuous learning while ensuring real-time performance; it removes the dependence of previous energy management strategies on the driving cycle, guarantees optimality both under standard test cycles and under actual driving conditions, and improves the adaptability of the strategy.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a power system structure of the plug-in hybrid electric vehicle.
FIG. 3 is a flow chart of IMDDPG.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
DDPG is a single-objective reinforcement learning algorithm: multiple objectives can only be made equivalent to a single objective through weighting factors. Determining the weighting factors requires prior knowledge or a large amount of work, and global optimization cannot be achieved in theory, so the DDPG algorithm has many problems when solving practical multi-objective optimization problems. An improved multi-objective DDPG algorithm, namely the IMDDPG, is therefore proposed.
Assume that the continuous state set of the agent is S and the continuous action set is A. When the current state of the agent is s_t ∈ S and the action taken is a_t ∈ A, the state of the agent transitions, under the influence of the environment, to a new state s_{t+1} ∈ S, and the instant reward generated is r(s_t, a_t).
Deep reinforcement learning selects the agent's actions by maximizing the cumulative reward, i.e. it comprehensively considers the instant reward and future rewards and continuously improves the policy π so that the obtained cumulative reward is maximized; the policy corresponding to the maximum cumulative reward is the optimal policy π*(a|s). Here the policy π is the series of actions taken by the agent from start to finish.
When the state of the agent is s_t and the action taken is a_t, the optimal action value function Q*(s_t, a_t) is:
Q*(s_t, a_t) = max_π E[R_t | s_t, a_t, π] (1)
where R_t is the discounted cumulative reward,
R_t = Σ_{i=t}^{T} γ^(i−t) r_i,
γ is the discount factor, γ ∈ [0,1], r_i is the reward at time i, i ∈ [t, T], and T is the termination time. The optimal action value function Q*(s_t, a_t) obeys the Bellman equation:
Q*(s_t, a_t) = E[r(s_t, a_t) + γ Q*(s_{t+1}, a_{t+1}) | s_t, a_t] (2)
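To make equations (1) and (2) concrete in the multi-objective setting used later (one reward component per objective), the following Python sketch computes a vector-valued discounted return and a one-step Bellman target; the function names and example numbers are illustrative and not taken from the patent.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Vector-valued discounted return R_t = sum_i gamma^(i-t) * r_i.

    rewards: array of shape (T, n_objectives), one reward vector per step.
    Returns the return vector for t = 0.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return (discounts[:, None] * rewards).sum(axis=0)

def bellman_target(r_t, q_next, gamma=0.99):
    """One-step Bellman target r(s_t, a_t) + gamma * Q(s_{t+1}, a_{t+1}), per objective."""
    return np.asarray(r_t, dtype=float) + gamma * np.asarray(q_next, dtype=float)

# Example: three objectives (fuel, SOC deviation, battery health), two steps.
rews = [(-0.1, -0.01, -1e-5), (-0.2, -0.02, -2e-5)]
print(discounted_return(rews))
print(bellman_target(rews[0], q_next=(-1.0, -0.1, -1e-4)))
```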
as shown in fig. 1, the main flow of the IMDDPG algorithm is as follows:
(1) As shown by the dashed box (1) in FIG. 1, the current state information s of the agent is determined t The input is input into the Actor evaluation network, and the output is a corresponding action (i.e. utilization), or one action (i.e. search) is randomly generated and is denoted by a t Will act a t Input into the environment, and obtain the reward r(s) through the action with the environment t ,a t ) And next state information s t+1 . The current state information s t Selected action a t The prize r(s) earned t ,a t ) And a next state s t+1 And storing the experience to an experience pool U. Then the status information s t+1 Inputting the data into the Actor evaluation network, and circulating the step (1) until a certain number e is stored t =(s t ,a t ,r(s t ,a t ),s t+1 ) During this process, the Actor evaluates the parameters in the network and does not update. Wherein, adopt epsilon-greedy algorithm to realize the equilibrium between exploration and the utilization when selecting the action, guarantee abundant exploration and reasonable utilization:
Figure BDA0003883422220000063
wherein epsilon is [0,1] as an exploration rate, the epsilon-greedy algorithm selects and explores according to the probability of epsilon, and an Actor is selected according to the probability of 1-epsilon to evaluate the action of network output. Therefore, in order to ensure the performance of the deep reinforcement learning algorithm and prevent the deep reinforcement learning algorithm from falling into local optimum, epsilon is generally set with a larger initial value to ensure sufficient exploration capacity, and as iteration progresses, epsilon value is gradually reduced to ensure full utilization and accelerate convergence of the algorithm.
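The exploration/exploitation balance and experience storage of step (1) can be sketched as follows; the buffer capacity, decay schedule and actor interface are illustrative assumptions rather than values from the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool U storing e_t = (s_t, a_t, r_t, s_{t+1}); r_t is a reward vector."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def select_action(actor, state, epsilon, action_low, action_high):
    """epsilon-greedy over a continuous action: explore with probability epsilon, else exploit the actor."""
    if random.random() < epsilon:
        return random.uniform(action_low, action_high)   # exploration
    return actor(state)                                   # exploitation, a = a(s|omega)

# epsilon is typically annealed from a large initial value towards a small floor,
# e.g. epsilon = max(0.05, epsilon * 0.995) after each episode (illustrative schedule).
```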
(2) As shown by the dashed box (2) in fig. 3, a batch of samples e is randomly drawn from the experience pool U. First, the state information s_t in e is input into the Actor evaluation network to obtain the corresponding action a_t; then the state information s_t and the corresponding action a_t are jointly input into the Critic evaluation network to obtain the cumulative reward Q(s_t, a_t|θ) of the state information s_t and action a_t, where θ denotes the parameters of the Critic evaluation network. The loss function is obtained by computing the expectation of Q(s_t, a_t|θ) and is back-propagated; the calculation formula is:
L_ω = E_{e~U}[Q(s, a|θ)] (4)
where Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, and ω denotes the parameters of the Actor evaluation network;
The parameters of the Actor evaluation network are updated by gradient ascent, using the loss function to compute the gradient, i.e. back propagation finds the action that maximizes the expected value of Q(s_t, a_t|θ) in state s_t. The gradient is computed first:
∇_ω L_ω = E_{e~U}[ ∇_a Q(s, a|θ)|_{a=a(s|ω)} ∇_ω a(s|ω) ]
where Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, ∇_a Q(s, a|θ) is the gradient of the cumulative reward function Q(s, a|θ) with respect to a, ∇_ω a(s|ω) is the gradient of the action function with respect to ω, and a(s|ω) is the action output when the parameters of the Actor evaluation network are ω and the input is s;
The parameters of the Actor evaluation network are then updated by gradient ascent:
ω ← ω + α_l ∇_ω L_ω
where α_l is the learning rate for updating the parameters of the Actor evaluation network. In this step only the parameters of the Actor evaluation network are updated; the parameters of the Critic evaluation network remain unchanged.
Since the reward function r in the single-objective DDPG algorithm is a scalar, the cumulative reward Q is also a scalar, and likewise the expectation of the cumulative reward and the loss function are scalars. When the single-objective DDPG algorithm is improved into the multi-objective DDPG algorithm, the reward function is extended from a single objective to multiple objectives by combining the Pareto theory, i.e. a scalar is extended into an array, and each value in the array corresponds to one objective of the reward function. This improvement raises two problems to be solved: one is how to select the maximum cumulative reward Q, and the other is how to compute the loss function and carry out back propagation.
First, regarding how to select the maximum cumulative reward Q, i.e. how to compare the magnitudes of cumulative rewards Q, the Pareto optimal front is introduced to select the maximum cumulative reward Q.
Pareto optimality means the following: suppose the multi-objective problem has i objective functions and A and B are two feasible solutions. If all objective function values of solution A are better than those of solution B, solution A is said to be superior to solution B, i.e. solution A dominates solution B; if only part of the objective functions of solution A are better than those of solution B, solutions A and B are said to be indifferent, i.e. solution A does not dominate solution B. If the objective function values of solution A are better than those of any other solution in the feasible space, solution A is called the optimal solution; if no other solution in the feasible space is better than solution A, solution A is called a Pareto optimal solution. For a multi-objective optimization problem, a single optimal solution generally does not exist; instead there are multiple Pareto optimal solutions, and all Pareto optimal solutions form the Pareto optimal front.
Therefore, when selecting the maximum cumulative reward Q, the Pareto optimal front of the optimal action value function is first solved on the basis of formula (1), and a rule is formulated on the Pareto optimal front to select the maximum cumulative reward Q, for example randomly selecting one Q as the maximum cumulative reward, or selecting the Q that minimizes a certain objective as the maximum cumulative reward. It should be noted that solving the Pareto optimal front from the objective functions is a method commonly used by those skilled in the art, and this scheme does not involve an improvement of the specific calculation process, so the specific calculation process is not described again.
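A minimal sketch of the non-dominated (Pareto) selection over a batch of vector-valued cumulative rewards is given below; it assumes the convention that every component is a negated cost (larger is better), and the two tie-breaking rules shown correspond to the examples mentioned above rather than to a rule prescribed by the patent.

```python
import random
import numpy as np

def dominates(qa, qb):
    """qa dominates qb if it is >= in every objective and > in at least one."""
    qa, qb = np.asarray(qa), np.asarray(qb)
    return np.all(qa >= qb) and np.any(qa > qb)

def pareto_front(q_values):
    """Return the indices of the non-dominated vectors in q_values (shape: [n, n_objectives])."""
    front = []
    for i, qi in enumerate(q_values):
        if not any(dominates(qj, qi) for j, qj in enumerate(q_values) if j != i):
            front.append(i)
    return front

def select_max_q(q_values, rule="random", objective=0):
    """Pick one 'maximum' cumulative reward on the Pareto front according to a simple rule."""
    front = pareto_front(q_values)
    if rule == "random":
        return front[random.randrange(len(front))]
    # otherwise prefer the front member with the best value of one chosen objective
    return max(front, key=lambda i: q_values[i][objective])

# Example: three candidate Q vectors for the objectives (-fuel, -SOC deviation, -dSOH)
qs = np.array([[-1.0, -0.2, -1e-4], [-0.8, -0.3, -2e-4], [-1.2, -0.1, -1e-4]])
print(pareto_front(qs), select_max_q(qs, rule="objective", objective=0))
```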
Second, the loss function must be calculated and back-propagated. Since multiple objectives generate multiple loss functions, the gradients are computed from the multiple loss functions and back propagation is then performed.
(3) As shown by the dashed box (3) in fig. 3, the state information s_t in the samples e drawn in (2) is input into the updated Actor evaluation network to obtain the corresponding action a_t after the Actor evaluation network update; then the state information s_t and the corresponding action a_t are jointly input into the Critic evaluation network to obtain the cumulative reward Q(s_t, a_t|θ) of the state information s_t and action a_t after the Actor evaluation network update.
(4) As shown by the dashed box (4) in fig. 3, the next-step state information s_{t+1} in the samples e drawn in (2) is input into the Actor target network to obtain the corresponding action a_{t+1}; then the state information s_{t+1} and the corresponding action a_{t+1} are jointly input into the Critic target network, and the Pareto optimal front is solved according to the method in step (2) to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1}, where θ' denotes the parameters of the Critic target network.
(5) As shown by the dashed box (5) in fig. 3, the Loss function is back-propagated; the Loss function is the mean square error (MSE), and the calculation formula is:
Loss = (1/E) Σ_e [ r(s_t, a_t) + γ Q(s_{t+1}, a_{t+1}|θ') − Q(s_t, a_t|θ) ]²
where E is the number of samples e drawn from the experience pool U, Q(s_{t+1}, a_{t+1}|θ') is the cumulative reward of the state information s_{t+1} and action a_{t+1}, Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, r(s_t, a_t) is the reward of the state information s_t, and γ is the discount factor.
Then, using gradient descent, the loss function is used to compute the gradient and update the parameters of the Critic evaluation network. The gradient is computed as:
∇_θ Loss = −(2/E) Σ_e [ r(s_t, a_t) + γ Q(s_{t+1}, a_{t+1}|θ') − Q(s_t, a_t|θ) ] ∇_θ Q(s_t, a_t|θ)
where E is the number of samples e drawn from the experience pool U, Q(s_{t+1}, a_{t+1}|θ') is the cumulative reward of the state information s_{t+1} and action a_{t+1}, Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, r(s_t, a_t) is the reward of the state information s_t, γ is the discount factor, ∇_θ Q(s, a|θ) is the gradient of the cumulative reward function Q(s, a|θ) with respect to θ, and Q(s, a|θ) is the cumulative reward output by the Critic evaluation network when its parameters are θ, the input state is s and the input action is a;
The parameters of the Critic evaluation network are then updated by gradient descent:
θ ← θ − α_L ∇_θ Loss
where α_L is the learning rate for updating the parameters of the Critic evaluation network.
(6) Steps (1) to (5) are repeated in a loop, and every C steps the parameters of the Actor evaluation network and the Critic evaluation network are copied into the Actor target network and the Critic target network.
In summary, the IMDDPG algorithm combines the Pareto theory to improve the reward function and thereby achieves multi-objective learning.
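For orientation, steps (1) to (6) can be condensed into the following PyTorch-style training-step sketch. The network objects, the optimizers and the pick_on_front scalarization hook (which applies the Pareto-front selection rule to the vector-valued critic output) are illustrative assumptions; the sketch shows the update order rather than the patent's implementation.

```python
import torch

def imddpg_update(actor, critic, actor_t, critic_t, opt_actor, opt_critic,
                  batch, pick_on_front, gamma=0.99):
    """One IMDDPG update on a sampled mini-batch (steps (2) to (5)).

    The critic outputs one Q value per objective; pick_on_front maps each Q vector,
    via the Pareto-front rule, to the scalar actually used for the gradient step.
    """
    s, a, r, s_next = batch                                    # r: shape (B, n_objectives)

    # Critic update: mean square error against the per-objective Bellman target.
    with torch.no_grad():
        target = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = torch.mean((target - critic(s, a)) ** 2)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor update: gradient ascent on the expectation of the selected cumulative reward (cf. eq. (4)).
    q_vec = critic(s, actor(s))                                # (B, n_objectives)
    actor_loss = -pick_on_front(q_vec).mean()                  # negate so the optimizer ascends Q
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

def sync_targets(actor, critic, actor_t, critic_t):
    """Step (6): copy the evaluation-network parameters into the target networks every C steps."""
    actor_t.load_state_dict(actor.state_dict())
    critic_t.load_state_dict(critic.state_dict())
```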
On this basis, this embodiment provides a plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG, which, as shown in fig. 1, comprises the following steps:
S1) establishing an energy management system model of the plug-in hybrid electric vehicle, wherein the energy management system model comprises a vehicle longitudinal dynamics model, an engine fuel consumption model, a battery equivalent circuit model, a battery life model and a drive/generator model;
S2) acquiring state information of the plug-in hybrid electric vehicle in actual driving, and inputting the state information into the energy management system model;
S3) taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, taking the vehicle speed, acceleration and battery SOC of the plug-in hybrid electric vehicle as the state variables of the IMDDPG, taking the required power of the plug-in hybrid electric vehicle as the action variable of the IMDDPG, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model; and inputting initial state information and the driving cycle into the reinforcement learning model to obtain the energy management strategy during driving.
For step S1, the power system structure of the plug-in hybrid electric vehicle is shown in fig. 2: the engine drives the generator EM2 and the drive motor EM1 through the transmission mechanism, the generator EM2 charges the battery, the battery supplies power to the drive motor EM1, and the transmission mechanism delivers power to the driving wheels through the clutches C1 and C2, the gear sets PG1 and PG2, the transmission and the final drive.
According to this structure, the energy management system model is constructed as follows:
(a) Establish the vehicle longitudinal dynamics model, whose expression is:
F_D = F_R + F_A + F_G + F_j
F_R = c_R m g cosθ
F_A = (1/2) ρ C_D A v²
F_G = m g sinθ
F_j = δ m (dv/dt)
P_D = F_D v,  T_D = F_D r
where F_D is the driving force, P_D is the driving power, T_D is the driving torque, v is the vehicle speed, F_R, F_A, F_G and F_j are respectively the rolling resistance, air resistance, gradient resistance and acceleration resistance during vehicle running, A is the frontal area of the vehicle, C_D is the air drag coefficient, ρ is the air density, c_R is the rolling resistance coefficient, m is the total vehicle mass, g is the gravitational acceleration, θ is the road gradient, δ is the rotating-mass conversion coefficient, dv/dt is the running acceleration, and r is the wheel radius;
[torque and speed coupling relations of the engine, the EM1 and EM2 motors, the planetary gear sets PG1 and PG2 and the final drive: given as an equation image in the original]
where T_EN, T_EM1 and T_EM2 are respectively the torques of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, ω_EN, ω_EM1 and ω_EM2 are respectively the rotational speeds of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, T_Out is the output torque of the transmission, K_1 and K_2 are respectively the ring-gear to sun-gear tooth ratios of PG1 and PG2 of the plug-in hybrid electric vehicle, and i is the gear ratio of the final drive.
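To illustrate how the required power P_D used later as the action variable can be evaluated from the longitudinal dynamics above, the following sketch computes the resistance terms for an assumed set of vehicle parameters; all numerical values are placeholders, not parameters from the patent.

```python
import math

def required_power(v, acc, grade_rad=0.0,
                   m=1800.0, A=2.3, C_D=0.30, rho=1.206, c_R=0.012,
                   delta=1.05, g=9.81):
    """Driving force F_D and driving power P_D from the longitudinal dynamics model.

    v: vehicle speed [m/s], acc: acceleration [m/s^2], grade_rad: road gradient [rad].
    All vehicle parameters are illustrative placeholders.
    """
    F_R = c_R * m * g * math.cos(grade_rad)        # rolling resistance
    F_A = 0.5 * rho * C_D * A * v ** 2             # aerodynamic resistance
    F_G = m * g * math.sin(grade_rad)              # gradient resistance
    F_j = delta * m * acc                          # acceleration resistance
    F_D = F_R + F_A + F_G + F_j
    return F_D, F_D * v                            # (driving force [N], driving power [W])

print(required_power(v=20.0, acc=0.5))
```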
(b) Establish the engine fuel consumption model. The engine fuel consumption model is obtained by look-up and correction of test data; the instantaneous fuel consumption of the engine can be regarded as a function of the engine torque and rotational speed, and the expression is:
ṁ_f = f(T_EN, ω_EN),  m_f = ∫ ṁ_f dt
where ṁ_f is the instantaneous fuel consumption of the engine, m_f is the cumulative fuel consumption of the engine, and T_EN and ω_EN are respectively the torque and rotational speed of the engine of the plug-in hybrid electric vehicle.
(c) Establish the battery equivalent circuit model. An internal-resistance model is selected, so that the battery is equivalent to a circuit in which an ideal voltage source is connected in series with a resistor; the expression is:
U_Bat = U_OC − I_Bat R_Bat
I_Bat = (U_OC − sqrt(U_OC² − 4 R_Bat P_Bat)) / (2 R_Bat)
SOC(t) = SOC(0) − (1/Q_Bat) ∫ I_Bat dt
where U_Bat is the terminal voltage of the battery, I_Bat is the battery current, U_OC is the open-circuit voltage, R_Bat is the internal resistance of the battery, P_Bat is the battery power, SOC(0) is the initial value of the SOC, and Q_Bat is the battery capacity.
(d) Establish the battery life model. The battery life model is built with a semi-empirical model, assuming that there is no difference between the cells in the battery pack and that the operating temperature of the battery remains essentially constant; the expression of the battery life model is:
Q_Loss = α exp((−E_A + β |I_Bat|/Q_Bat) / (R T_K)) Ah^z,  Ah = (1/3600) ∫ |I_Bat| dt
[relations between the capacity fade Q_Loss, the end of life EOL, the total cycle number N and the battery SOH: given as an equation image in the original]
where Q_Loss is the battery capacity fade, α and β are constant terms, E_A is the activation energy, R is the molar gas constant, T_K is the thermodynamic temperature of the environment, Ah is the ampere-hour throughput, z is the power exponent factor, Q_Bat is the battery capacity, I_Bat is the battery current, EOL is the end of life of the battery, N is the total number of cycles, SOH is the state of health of the battery, and SOC is the state of charge of the battery.
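A small sketch of the battery bookkeeping implied by models (c) and (d) follows: SOC is integrated from the battery current, and the Arrhenius-type capacity-fade term reconstructed above is evaluated from the accumulated ampere-hour throughput. All numerical parameters (open-circuit voltage, internal resistance, capacity, fade coefficients) are placeholders for illustration only.

```python
import math

def battery_step(soc, ah, P_bat, dt,
                 U_oc=350.0, R_bat=0.1, Q_bat=37.0 * 3600.0):
    """One time step of the internal-resistance battery model (illustrative parameters).

    soc: state of charge [-], ah: accumulated Ah throughput, P_bat: battery power [W],
    dt: step [s], Q_bat: capacity [A*s]. Returns the updated (soc, ah, I_bat).
    """
    I_bat = (U_oc - math.sqrt(U_oc ** 2 - 4.0 * R_bat * P_bat)) / (2.0 * R_bat)
    soc = soc - I_bat * dt / Q_bat
    ah = ah + abs(I_bat) * dt / 3600.0
    return soc, ah, I_bat

def capacity_fade(ah, I_bat, Q_bat_Ah=37.0, T_K=298.15,
                  alpha=3.0e4, beta=370.0, E_A=31500.0, R=8.314, z=0.55):
    """Semi-empirical capacity fade Q_Loss = alpha*exp((-E_A + beta*|I|/Q)/(R*T_K))*Ah^z.

    alpha, beta, E_A and z are placeholder constants, not values from the patent.
    """
    c_rate = abs(I_bat) / Q_bat_Ah
    return alpha * math.exp((-E_A + beta * c_rate) / (R * T_K)) * ah ** z
```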
(e) Establish the drive/generator model. The drive motor EM1 and the generator EM2 are both permanent magnet synchronous motors; the combined efficiency of the motor and the inverter can be expressed as a function of the motor torque and rotational speed, and the expression is:
η_EM = f(T_EM, ω_EM) (13)
where T_EM is the motor torque, ω_EM is the motor rotational speed, and η_EM is the corresponding motor efficiency.
P_EM = T_EM ω_EM, with P_Bat,EM = P_EM / η_EM when the motor drives (P_EM ≥ 0) and P_Bat,EM = P_EM η_EM when the motor generates (P_EM < 0)
where P_EM is the mechanical power of the motor and P_Bat,EM is the electrical power exchanged between the battery and the motor (delivered by the battery when driving, returned to the battery when generating).
For step S2, the state information mainly includes two parts, vehicle state information and battery state information, wherein:
the vehicle state information mainly comprises the mass of the whole vehicle, the windward area, the road gradient, the ambient temperature, the instantaneous vehicle speed, the motor rotating speed, the motor efficiency and the like.
The battery state information mainly includes the battery current, battery voltage, open-circuit voltage, internal resistance, SOC (state of charge of the battery), battery end of life, and so on.
For step S3, the configuration of IMDDPG is as follows:
the reward function: in the embodiment, the economy and the battery life of the plug-in hybrid electric vehicle are taken as optimization targets, the IMDDPG algorithm optimizes the maximum accumulated reward, and the starting value and the ending value of the SOC of the plug-in hybrid electric vehicle are kept equal and are SOC set values, namely SOC start =SOC end =SOC Target The economic index is the accumulated oil consumption m of the engine f And degree D = (SOC) of deviation of battery SOC from set value after driving is finished end -SOC Target ) 2 The battery life index is the change in battery health Δ SOH.
Cumulative fuel consumption m of engine in energy management of plug-in hybrid vehicle f Degree of deviation of battery SOC D = (SOC) end -SOC Target ) 2 And the change in the state of health of the battery Δ SOH are both as small as possible, and therefore, the reward function is r = (-m) f ,-D,-ΔSOH)。
And (3) state variable: the vehicle speed, acceleration and battery SOC of the plug-in hybrid vehicle are taken as state variables, i.e., s = { v, acc, SOC }.
The action variables are as follows: the purpose of the energy management strategy of the plug-in hybrid electric vehicle is to realize reasonable mode switching and gear switching according to the required power, and the key point is to determine the required power of the plug-in hybrid electric vehicle, so that the required power P is obtained D As an action variable, i.e., a = { P D }。
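A compact sketch of the state, action and vector-valued reward defined above follows; the StepInfo structure and the choice to charge the SOC deviation term only at the end of the driving cycle are illustrative assumptions, one possible reading of the description.

```python
from dataclasses import dataclass

@dataclass
class StepInfo:
    fuel_increment: float   # increment of m_f over the step
    soc: float              # battery SOC after the step
    soh_drop: float         # decrease of SOH over the step
    done: bool              # end of the driving cycle

SOC_TARGET = 0.5            # assumed SOC set value (SOC_start = SOC_end = SOC_Target)

def reward_vector(info: StepInfo):
    """Multi-objective reward r = (-m_f, -D, -dSOH).

    The SOC deviation term D = (SOC_end - SOC_Target)^2 is only charged when the
    driving cycle ends, which is one possible reading of the description above.
    """
    d = (info.soc - SOC_TARGET) ** 2 if info.done else 0.0
    return (-info.fuel_increment, -d, -info.soh_drop)

def state_vector(v, acc, soc):
    """State s = {v, acc, SOC}; the action is the scalar required power a = {P_D}."""
    return (v, acc, soc)
```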
Based on this configuration, the multi-objective optimization of the energy management system model with the IMDDPG comprises the following steps:
S31) As in step (1) of the IMDDPG algorithm, obtain the state information s_t of the energy management system model, input it into the Actor evaluation network to obtain the corresponding action a_t, input a_t into the energy management system model to obtain, under the influence of the environment, the corresponding reward r(s_t, a_t) and the next state information s_{t+1}, store the current sample e_t = (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool, and repeat this step until the number of samples in the experience pool meets the requirement;
S32) As in step (2) of the IMDDPG algorithm, randomly select samples from the experience pool, input the state information s_t of a selected sample into the Actor evaluation network to obtain the corresponding action a_t, input the state information s_t and the corresponding action a_t into the Critic evaluation network, solve the Pareto optimal front to obtain the cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample, calculate the expectation of the cumulative reward Q(s_t, a_t|θ) to obtain the loss function, perform back propagation, and update the parameters of the Actor evaluation network by gradient ascent;
S33) As in step (3) of the IMDDPG algorithm, input the state information s_t of the selected sample into the updated Actor evaluation network to obtain the updated corresponding action a_t, input the state information s_t and the updated corresponding action a_t into the Critic evaluation network, and solve the Pareto optimal front to obtain the updated cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample;
S34) As in step (4) of the IMDDPG algorithm, input the next state information s_{t+1} of the selected sample into the Actor target network to obtain the corresponding action a_{t+1}, input the state information s_{t+1} and the corresponding action a_{t+1} into the Critic target network, and solve the Pareto optimal front to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1} of the selected sample; as in step (5) of the IMDDPG algorithm, calculate the mean square error with respect to the cumulative reward Q(s_{t+1}, a_{t+1}|θ') to obtain the loss function, perform back propagation, and update the parameters of the Critic evaluation network by gradient descent;
S35) As in step (6) of the IMDDPG algorithm, judge whether the difference between the current cycle number and the cycle number at the last update reaches the preset step size;
if so, update the parameters of the Actor evaluation network and the Critic evaluation network into the Actor target network and the Critic target network, and then return to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement;
otherwise, return directly to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement.
For the trained reinforcement learning model, after the initial state information and the driving cycle are input, the model outputs a series of action information during driving, i.e. a series of corresponding required powers, so that reasonable mode switching and gear shifting are carried out according to the required power and multi-objective optimization of the PHEV energy management strategy is realized; this series of action information is the energy management strategy during driving.
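The deployment step just described can be sketched as a simple rollout in which the trained actor maps each state to a required power; the environment object and its step interface are assumptions for illustration, not the patent's implementation.

```python
def run_energy_management(actor, env, initial_state):
    """Roll the trained actor over a driving cycle; the resulting list of required
    powers P_D is the energy management strategy during driving."""
    strategy, state, done = [], initial_state, False
    while not done:
        p_demand = actor(state)                   # a = {P_D}
        state, reward, done = env.step(p_demand)  # vehicle model applies mode/gear logic
        strategy.append(p_demand)
    return strategy
```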
In summary, the plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG of this embodiment establishes an energy management system model of the plug-in hybrid electric vehicle, modelling both the fuel economy of the plug-in hybrid electric vehicle and the battery life. The IMDDPG algorithm is used to perform multi-objective optimization of the plug-in hybrid electric vehicle energy management strategy on the basis of the continuous actions that occur in actual driving, which removes the dependence of previous energy management strategies on the driving cycle, achieves adaptability to different operating conditions through continuous learning of the agent, and realizes optimality of the strategy while ensuring its real-time performance.
The foregoing is only a description of the preferred embodiments of the present invention and should not be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention shall fall within the protection scope of the technical solution of the present invention, provided that it does not depart from the content of the technical solution of the present invention.

Claims (10)

1. A plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG is characterized by comprising the following steps:
establishing an energy management system model of the plug-in hybrid electric vehicle, which comprises a longitudinal dynamics model of the whole vehicle, an engine fuel consumption model, a battery equivalent circuit model, a battery service life model and a driving/generator model;
acquiring state information of the plug-in hybrid electric vehicle in actual running, and inputting the state information into the energy management system model;
taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, taking the vehicle speed, acceleration and battery SOC of the plug-in hybrid electric vehicle as the state variables of the IMDDPG, taking the required power of the plug-in hybrid electric vehicle as the action variable of the IMDDPG, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model;
and inputting initial state information and driving conditions into the reinforcement learning model to obtain an energy management strategy in driving.
2. The method of claim 1, wherein the multi-objective optimization of the energy management system model using IMDDPG comprises the steps of:
obtaining state information s_t of the energy management system model, inputting it into the Actor evaluation network to obtain the corresponding action a_t, inputting a_t into the energy management system model to obtain, under the influence of the environment, the corresponding reward r(s_t, a_t) and the next state information s_{t+1}, storing the current sample e_t = (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool, and repeating this step until the number of samples in the experience pool meets the requirement;
randomly selecting samples from the experience pool, inputting the state information s_t of a selected sample into the Actor evaluation network to obtain the corresponding action a_t, inputting the state information s_t and the corresponding action a_t into the Critic evaluation network, solving the Pareto optimal front to obtain the cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample, calculating the expectation of the cumulative reward Q(s_t, a_t|θ) to obtain a loss function, performing back propagation, and updating the parameters of the Actor evaluation network by gradient ascent;
inputting the state information s_t of the selected sample into the updated Actor evaluation network to obtain the updated corresponding action a_t, inputting the state information s_t and the updated corresponding action a_t into the Critic evaluation network, and solving the Pareto optimal front to obtain the updated cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample;
inputting the next state information s_{t+1} of the selected sample into the Actor target network to obtain the corresponding action a_{t+1}, inputting the state information s_{t+1} and the corresponding action a_{t+1} into the Critic target network, solving the Pareto optimal front to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1} of the selected sample, calculating the mean square error with respect to the cumulative reward Q(s_{t+1}, a_{t+1}|θ') to obtain a loss function, performing back propagation, and updating the parameters of the Critic evaluation network by gradient descent;
updating the parameters of the Actor evaluation network and the Critic evaluation network into the Actor target network and the Critic target network, and returning to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement.
3. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 2, characterized in that, before the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network, it is also judged whether the difference between the current cycle number and the cycle number at the last update reaches a preset step size; if so, the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network; otherwise, execution returns to obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network.
4. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 2, characterized in that the step of solving the Pareto optimal front comprises: solving the optimal action value function Q*(s_t, a_t) according to the state information s_t and the corresponding action a_t, and, on the Pareto optimal front, randomly selecting an optimal action value function as the maximum cumulative reward, or selecting the optimal action value function that minimizes a given objective as the maximum cumulative reward.
5. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 4, characterized in that the expression of the optimal action value function Q*(s_t, a_t) is as follows:
Q*(s_t, a_t) = max_π E[R_t | s_t, a_t, π]
where R_t is the discounted cumulative reward,
R_t = Σ_{i=t}^{T} γ^(i−t) r_i,
γ is the discount factor, γ ∈ [0,1], r_i is the reward function at time i, i ∈ [t, T], and T is the termination time.
6. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the vehicle longitudinal dynamics model is as follows:
F_D = F_R + F_A + F_G + F_j
F_R = c_R m g cosθ
F_A = (1/2) ρ C_D A v²
F_G = m g sinθ
F_j = δ m (dv/dt)
P_D = F_D v,  T_D = F_D r
where F_D is the driving force, P_D is the driving power, T_D is the driving torque, v is the vehicle speed, F_R, F_A, F_G and F_j are respectively the rolling resistance, air resistance, gradient resistance and acceleration resistance during vehicle running, A is the frontal area of the vehicle, C_D is the air drag coefficient, ρ is the air density, c_R is the rolling resistance coefficient, m is the total vehicle mass, g is the gravitational acceleration, θ is the road gradient, δ is the rotating-mass conversion coefficient, dv/dt is the running acceleration, and r is the wheel radius;
[torque and speed coupling relations of the engine, the EM1 and EM2 motors, the planetary gear sets PG1 and PG2 and the final drive: given as an equation image in the original]
where T_EN, T_EM1 and T_EM2 are respectively the torques of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, ω_EN, ω_EM1 and ω_EM2 are respectively the rotational speeds of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, T_Out is the output torque of the transmission, K_1 and K_2 are respectively the ring-gear to sun-gear tooth ratios of PG1 and PG2 of the plug-in hybrid electric vehicle, and i is the gear ratio of the final drive.
7. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the engine fuel consumption model is as follows:
ṁ_f = f(T_EN, ω_EN),  m_f = ∫ ṁ_f dt
where ṁ_f is the instantaneous fuel consumption of the engine, m_f is the cumulative fuel consumption of the engine, and T_EN and ω_EN are respectively the torque and rotational speed of the engine of the plug-in hybrid electric vehicle.
8. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the battery equivalent circuit model is as follows:
U_Bat = U_OC − I_Bat R_Bat
I_Bat = (U_OC − sqrt(U_OC² − 4 R_Bat P_Bat)) / (2 R_Bat)
SOC(t) = SOC(0) − (1/Q_Bat) ∫ I_Bat dt
where U_Bat is the terminal voltage of the battery, I_Bat is the battery current, U_OC is the open-circuit voltage, R_Bat is the internal resistance of the battery, P_Bat is the battery power, SOC(0) is the initial value of the SOC, and Q_Bat is the battery capacity.
9. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the battery life model is as follows:
Q_Loss = α exp((−E_A + β |I_Bat|/Q_Bat) / (R T_K)) Ah^z,  Ah = (1/3600) ∫ |I_Bat| dt
[relations between the capacity fade Q_Loss, the end of life EOL, the total cycle number N and the battery SOH: given as an equation image in the original]
where Q_Loss is the battery capacity fade, α and β are constant terms, E_A is the activation energy, R is the molar gas constant, T_K is the thermodynamic temperature of the environment, Ah is the ampere-hour throughput, z is the power exponent factor, Q_Bat is the battery capacity, I_Bat is the battery current, EOL is the end of life of the battery, N is the total number of cycles, SOH is the state of health of the battery, and SOC is the state of charge of the battery.
10. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the drive/generator model is as follows:
η_EM = f(T_EM, ω_EM)
where T_EM is the motor torque, ω_EM is the motor rotational speed, and η_EM is the corresponding motor efficiency;
P_EM = T_EM ω_EM, with P_Bat,EM = P_EM / η_EM when the motor drives (P_EM ≥ 0) and P_Bat,EM = P_EM η_EM when the motor generates (P_EM < 0)
where P_EM is the mechanical power of the motor and P_Bat,EM is the electrical power exchanged between the battery and the motor.
CN202211235470.7A 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG Pending CN115476841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211235470.7A CN115476841A (en) 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211235470.7A CN115476841A (en) 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Publications (1)

Publication Number Publication Date
CN115476841A true CN115476841A (en) 2022-12-16

Family

ID=84393216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211235470.7A Pending CN115476841A (en) 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Country Status (1)

Country Link
CN (1) CN115476841A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935524A (en) * 2023-03-03 2023-04-07 北京航空航天大学 Parameter matching optimization method for hybrid transmission systems with different configurations
CN115935524B (en) * 2023-03-03 2023-05-02 北京航空航天大学 Optimizing method for parameter matching of hybrid transmission system with different configurations
CN116639135A (en) * 2023-05-26 2023-08-25 中国第一汽车股份有限公司 Cooperative control method and device for vehicle and vehicle
CN117184095A (en) * 2023-10-20 2023-12-08 燕山大学 Hybrid electric vehicle system control method based on deep reinforcement learning
CN117184095B (en) * 2023-10-20 2024-05-14 燕山大学 Hybrid electric vehicle system control method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
Han et al. Energy management based on reinforcement learning with double deep Q-learning for a hybrid electric tracked vehicle
CN111731303B (en) HEV energy management method based on deep reinforcement learning A3C algorithm
Liu et al. Modeling and control of a power-split hybrid vehicle
CN112287463B (en) Fuel cell automobile energy management method based on deep reinforcement learning algorithm
CN115476841A (en) Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG
Pisu et al. A comparative study of supervisory control strategies for hybrid electric vehicles
Li et al. Rule-based control strategy with novel parameters optimization using NSGA-II for power-split PHEV operation cost minimization
CN108528436A (en) A kind of ECMS multiple target dual blank-holders of ectonexine nesting
CN113554337B (en) Plug-in hybrid electric vehicle energy management strategy construction method integrating traffic information
CN115495997B (en) New energy automobile ecological driving method based on heterogeneous multi-agent deep reinforcement learning
CN113085665A (en) Fuel cell automobile energy management method based on TD3 algorithm
Jawale et al. Energy management in electric vehicles using improved swarm optimized deep reinforcement learning algorithm
CN115284973A (en) Fuel cell automobile energy management method based on improved multi-target Double DQN
CN116461391A (en) Energy management method for fuel cell hybrid electric vehicle
CN113815437B (en) Predictive energy management method for fuel cell hybrid electric vehicle
Dong et al. Rapid assessment of series–parallel hybrid transmission comprehensive performance: A near-global optimal method
Chang et al. An energy management strategy of deep reinforcement learning based on multi-agent architecture under self-generating conditions
Lee et al. An adaptive energy management strategy for extended-range electric vehicles based on Pontryagin's minimum principle
CN113581163B (en) Multimode PHEV mode switching optimization and energy management method based on LSTM
Gozukucuk et al. Design and simulation of an optimal energy management strategy for plug-in electric vehicles
CN114291067A (en) Hybrid electric vehicle convex optimization energy control method and system based on prediction
Ding Energy management system design for plug-in hybrid electric vehicle based on the battery management system applications
Özden Modeling and optimization of hybrid electric vehicles
Janulin et al. Energy Minimization in City Electric Vehicle using Optimized Multi-Speed Transmission
Sun et al. A Dynamic Programming based Fuzzy Logic Energy Management Strategy for Series-parallel Hybrid Electric Vehicles.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination