CN115476841A - Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG - Google Patents

Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Info

Publication number
CN115476841A
CN115476841A (application CN202211235470.7A)
Authority
CN
China
Prior art keywords
energy management
battery
plug
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211235470.7A
Other languages
Chinese (zh)
Inventor
孙希雷
付建勤
刘琦
袁硕
吴跃
许东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University Chongqing Research Institute
Original Assignee
Hunan University Chongqing Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University Chongqing Research Institute filed Critical Hunan University Chongqing Research Institute
Priority to CN202211235470.7A priority Critical patent/CN115476841A/en
Publication of CN115476841A publication Critical patent/CN115476841A/en
Pending legal-status Critical Current

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W20/00Control systems specially adapted for hybrid vehicles
    • B60W20/10Controlling the power contribution of each of the prime movers to meet required power demand
    • B60W20/11Controlling the power contribution of each of the prime movers to meet required power demand using model predictive control [MPC] strategies, i.e. control methods based on models predicting performance

Abstract

The invention discloses a plug-in hybrid electric vehicle energy management method based on an improved multi-target DDPG, which comprises the following steps: establishing an energy management system model of the plug-in hybrid electric vehicle, comprising a vehicle longitudinal dynamics model, an engine fuel consumption model, a battery equivalent circuit model, a battery life model and a drive/generator model; acquiring state information of the plug-in hybrid electric vehicle in actual driving and inputting it into the energy management system model; taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model; and inputting initial state information and the driving cycle into the reinforcement learning model to obtain the energy management strategy during driving. The invention comprehensively considers fuel economy and battery life, and improves the optimality and universality of the strategy while ensuring real-time performance.

Description

Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG
Technical Field
The invention relates to the field of hybrid electric vehicle energy management, in particular to a plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG.
Background
The plug-in hybrid electric vehicle (PHEV) combines the advantages of a pure electric vehicle and a hybrid electric vehicle: it overcomes the shortcomings of the pure electric vehicle, such as short driving range and immature charging infrastructure, and can further exploit the potential of the engine as a drive unit, so it has become the mainstream direction of research and development for traditional vehicle manufacturers during their transition to electrification. Because the PHEV has a complex structure and multiple driving modes, reasonably distributing the power demand between the engine and the motor, switching modes and gears appropriately, and achieving optimal energy management are key technologies for the PHEV. At present, PHEV energy management strategies can be divided into rule-based strategies and optimization-based strategies. However, existing energy management strategies generally suffer from poor real-time performance, complex computation, poor adaptability and non-ideal optimization performance, and battery life is also a key factor restricting the development of the PHEV. Therefore, research on a PHEV energy management strategy that offers real-time performance, adaptability and optimality while comprehensively considering fuel economy and battery life has important research and application value.
With the popularization and development of artificial intelligence, energy management strategies based on deep reinforcement learning algorithms offer real-time performance, adaptability and other desirable characteristics, and have therefore attracted extensive attention from researchers. However, existing deep-reinforcement-learning energy management strategies for PHEVs generally merge multiple objectives (such as fuel consumption and SOC deviation) into a single objective through weighting factors to form the reward function, and battery life is mostly not taken into account.
Patent CN114801897A discloses a fuel cell hybrid power system energy management method based on the DDPG algorithm: a DDPG algorithm model is established for a hybrid power system consisting of a dual-stack fuel cell and a lithium battery, and the algorithm model is parameter-matched with the dynamic model; the state, action and reward of the algorithm model are set; and an optimization objective function based on operating cost is established for the hybrid power system so as to reduce the degradation of the fuel cell and the lithium battery and prolong their service life. The method has the following defects:
(1) The operating costs of the freight truck are coupled into a single objective function, which reduces the diversity of the objectives.
(2) Four objectives (hydrogen consumption, fuel cell degradation, lithium battery degradation and SOC variation) are coupled into one objective through weighting factors, which increases the computational cost of determining the weighting factors. Moreover, the weighting factors must be formulated again for different vehicle models and different objective functions, which is inefficient and reduces the adaptability of the algorithm.
(3) The optimal solution of a multi-objective optimization problem is usually not unique; it is a solution set composed of multiple mutually non-dominated optimal solutions. Coupling multiple objectives into a single objective therefore cannot achieve global multi-objective optimization and reduces the optimality of the solution.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a PHEV energy management strategy that, to a certain extent, integrates real-time performance, adaptability and optimality, operates on continuous actions, achieves optimal fuel economy without large SOC deviation, and at the same time takes battery life into account, thereby realizing multi-objective PHEV energy management.
Aiming at the above technical problems in the prior art, the invention provides a plug-in hybrid electric vehicle energy management method based on an improved multi-target DDPG, which comprehensively considers fuel economy and battery life and improves the optimality and universality of the strategy while ensuring real-time performance.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG comprises the following steps:
establishing an energy management system model of the plug-in hybrid electric vehicle, wherein the energy management system model comprises a longitudinal dynamic model of the whole vehicle, an engine fuel consumption model, a battery equivalent circuit model, a battery service life model and a driving/power generator model;
acquiring state information of the plug-in hybrid electric vehicle in actual running, and inputting the state information into the energy management system model;
taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, taking the vehicle speed, acceleration and battery SOC of the plug-in hybrid electric vehicle as the state variables of the IMDDPG, taking the required power of the plug-in hybrid electric vehicle as the action variable of the IMDDPG, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model;
and inputting the initial state information and the driving working condition into the reinforcement learning model to obtain the energy management strategy in driving.
Further, the multi-target optimization of the energy management system model by the IMDDPG comprises the following steps:
obtaining state information s_t of the energy management system model, inputting it into the Actor evaluation network to obtain the corresponding action a_t, inputting a_t into the energy management system model to obtain, under the influence of the environment, the corresponding reward r(s_t, a_t) and the next state information s_{t+1}, storing the current sample e_t = (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool, and repeating this step until the number of samples in the experience pool meets the requirement;
randomly selecting samples from the experience pool, inputting the state information s_t of a selected sample into the Actor evaluation network to obtain the corresponding action a_t, inputting the state information s_t and the corresponding action a_t into the Critic evaluation network, solving the Pareto optimal front to obtain the cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample, calculating the expectation of the cumulative reward Q(s_t, a_t|θ) to obtain a loss function, performing back propagation, and updating the parameters of the Actor evaluation network by gradient ascent;
inputting the state information s_t of the selected sample into the updated Actor evaluation network to obtain the updated corresponding action a_t, inputting the state information s_t and the updated corresponding action a_t into the Critic evaluation network, and solving the Pareto optimal front to obtain the updated cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample;
inputting the next state information s_{t+1} of the selected sample into the Actor target network to obtain the corresponding action a_{t+1}, inputting the state information s_{t+1} and the corresponding action a_{t+1} into the Critic target network, solving the Pareto optimal front to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1} of the selected sample, calculating the mean square error with respect to the cumulative reward Q(s_{t+1}, a_{t+1}|θ') to obtain a loss function, performing back propagation, and updating the parameters of the Critic evaluation network by gradient descent;
updating the parameters of the Actor evaluation network and the Critic evaluation network into the Actor target network and the Critic target network, and returning to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement.
Further, before the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network, it is also judged whether the difference between the current cycle number and the cycle number at the last update reaches a preset step size; if so, the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network; otherwise, execution returns to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network.
Further, the step of solving the Pareto optimal front comprises: solving the optimal action value function Q*(s_t, a_t) according to the state information s_t and the corresponding action a_t, and, on the Pareto optimal front, randomly selecting an optimal action value function as the maximum cumulative reward, or selecting the optimal action value function that minimizes a given objective as the maximum cumulative reward.
Further, the expression of the optimal action value function Q*(s_t, a_t) is as follows:
Q*(s_t, a_t) = max_π E[R_t | s_t, a_t, π]
where R_t is the discounted cumulative reward,
R_t = Σ_{i=t}^{T} γ^(i−t) r_i,
γ is the discount factor, γ ∈ [0,1], r_i is the reward function at time i, i ∈ [t, T], and T is the termination time.
Further, the expression of the vehicle longitudinal dynamics model is as follows:
F_D = F_R + F_A + F_G + F_j
F_R = c_R m g cosθ
F_A = (1/2) ρ C_D A v²
F_G = m g sinθ
F_j = δ m (dv/dt)
P_D = F_D v,  T_D = F_D r
where F_D is the driving force, P_D is the driving power, T_D is the driving torque, v is the vehicle speed, F_R, F_A, F_G and F_j are respectively the rolling resistance, air resistance, gradient resistance and acceleration resistance during vehicle running, A is the frontal area of the vehicle, C_D is the air drag coefficient, ρ is the air density, c_R is the rolling resistance coefficient, m is the total vehicle mass, g is the gravitational acceleration, θ is the road gradient, δ is the rotating-mass conversion coefficient, dv/dt is the running acceleration, and r is the wheel radius;
[torque and speed coupling relations of the engine, the EM1 and EM2 motors, the planetary gear sets PG1 and PG2 and the final drive: given as an equation image in the original]
where T_EN, T_EM1 and T_EM2 are respectively the torques of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, ω_EN, ω_EM1 and ω_EM2 are respectively the rotational speeds of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, T_Out is the output torque of the transmission, K_1 and K_2 are respectively the ring-gear to sun-gear tooth ratios of PG1 and PG2 of the plug-in hybrid electric vehicle, and i is the gear ratio of the final drive.
Further, the expression of the engine fuel consumption model is as follows:
ṁ_f = f(T_EN, ω_EN),  m_f = ∫ ṁ_f dt
where ṁ_f is the instantaneous fuel consumption of the engine, m_f is the cumulative fuel consumption of the engine, and T_EN and ω_EN are respectively the torque and rotational speed of the engine of the plug-in hybrid electric vehicle.
Further, the expression of the battery equivalent circuit model is as follows:
U_Bat = U_OC − I_Bat R_Bat
I_Bat = (U_OC − sqrt(U_OC² − 4 R_Bat P_Bat)) / (2 R_Bat)
SOC(t) = SOC(0) − (1/Q_Bat) ∫ I_Bat dt
where U_Bat is the terminal voltage of the battery, I_Bat is the battery current, U_OC is the open-circuit voltage, R_Bat is the internal resistance of the battery, P_Bat is the battery power, SOC(0) is the initial value of the SOC, and Q_Bat is the battery capacity.
Further, the expression of the battery life model is as follows:
Q_Loss = α exp((−E_A + β |I_Bat|/Q_Bat) / (R T_K)) Ah^z,  Ah = (1/3600) ∫ |I_Bat| dt
[relations between the capacity fade Q_Loss, the end of life EOL, the total cycle number N and the battery SOH: given as an equation image in the original]
where Q_Loss is the battery capacity fade, α and β are constant terms, E_A is the activation energy, R is the molar gas constant, T_K is the thermodynamic temperature of the environment, Ah is the ampere-hour throughput, z is the power exponent factor, Q_Bat is the battery capacity, I_Bat is the battery current, EOL is the end of life of the battery, N is the total number of cycles, SOH is the state of health of the battery, and SOC is the state of charge of the battery.
Further, the expression of the drive/generator model is as follows:
η_EM = f(T_EM, ω_EM)
where T_EM is the motor torque, ω_EM is the motor rotational speed, and η_EM is the corresponding motor efficiency;
P_EM = T_EM ω_EM, with P_Bat,EM = P_EM / η_EM when the motor drives (P_EM ≥ 0) and P_Bat,EM = P_EM η_EM when the motor generates (P_EM < 0)
where P_EM is the mechanical power of the motor and P_Bat,EM is the electrical power exchanged between the battery and the motor (delivered by the battery when driving, returned to the battery when generating).
Compared with the prior art, the invention has the advantages that:
(1) The invention comprehensively considers the economy of the plug-in hybrid electric vehicle and the service life of the battery, and realizes multi-target optimization of the energy management strategy of the plug-in hybrid electric vehicle.
(2) The invention uses the IMDDPG to optimize the energy management strategy of the plug-in hybrid electric vehicle on the basis of the continuous actions that occur in actual driving, overcoming the difficulty of optimizing energy management strategies based on discrete actions and better matching the characteristics of actual driving.
(3) The IMDDPG-based energy management strategy improves the optimality and universality of the strategy through continuous learning while ensuring real-time performance; it removes the dependence of previous energy management strategies on the driving cycle, guarantees optimality both under standard test cycles and under actual driving conditions, and improves the adaptability of the strategy.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a schematic diagram of a power system structure of the plug-in hybrid electric vehicle.
FIG. 3 is a flow chart of IMDDPG.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
DDPG is a single-objective reinforcement learning algorithm: multiple objectives can only be made equivalent to a single objective through weighting factors. Determining the weighting factors requires prior knowledge or a large amount of work, and global optimization cannot be achieved in theory, so the DDPG algorithm has many problems when solving practical multi-objective optimization problems. An improved multi-objective DDPG algorithm, namely the IMDDPG, is therefore proposed.
Assume that the continuous state set of the agent is S and the continuous action set is A. When the current state of the agent is s_t ∈ S and the action taken is a_t ∈ A, the state of the agent transitions, under the influence of the environment, to a new state s_{t+1} ∈ S, and the instant reward generated is r(s_t, a_t).
Deep reinforcement learning selects the agent's actions by maximizing the cumulative reward, i.e. it comprehensively considers the instant reward and future rewards and continuously improves the policy π so that the obtained cumulative reward is maximized; the policy corresponding to the maximum cumulative reward is the optimal policy π*(a|s). Here the policy π is the series of actions taken by the agent from start to finish.
When the state of the agent is s_t and the action taken is a_t, the optimal action value function Q*(s_t, a_t) is:
Q*(s_t, a_t) = max_π E[R_t | s_t, a_t, π] (1)
where R_t is the discounted cumulative reward,
R_t = Σ_{i=t}^{T} γ^(i−t) r_i,
γ is the discount factor, γ ∈ [0,1], r_i is the reward at time i, i ∈ [t, T], and T is the termination time. The optimal action value function Q*(s_t, a_t) obeys the Bellman equation:
Q*(s_t, a_t) = E[r(s_t, a_t) + γ Q*(s_{t+1}, a_{t+1}) | s_t, a_t] (2)
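To make equations (1) and (2) concrete in the multi-objective setting used later (one reward component per objective), the following Python sketch computes a vector-valued discounted return and a one-step Bellman target; the function names and example numbers are illustrative and not taken from the patent.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Vector-valued discounted return R_t = sum_i gamma^(i-t) * r_i.

    rewards: array of shape (T, n_objectives), one reward vector per step.
    Returns the return vector for t = 0.
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return (discounts[:, None] * rewards).sum(axis=0)

def bellman_target(r_t, q_next, gamma=0.99):
    """One-step Bellman target r(s_t, a_t) + gamma * Q(s_{t+1}, a_{t+1}), per objective."""
    return np.asarray(r_t, dtype=float) + gamma * np.asarray(q_next, dtype=float)

# Example: three objectives (fuel, SOC deviation, battery health), two steps.
rews = [(-0.1, -0.01, -1e-5), (-0.2, -0.02, -2e-5)]
print(discounted_return(rews))
print(bellman_target(rews[0], q_next=(-1.0, -0.1, -1e-4)))
```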
as shown in fig. 1, the main flow of the IMDDPG algorithm is as follows:
(1) As shown by the dashed box (1) in FIG. 1, the current state information s of the agent is determined t The input is input into the Actor evaluation network, and the output is a corresponding action (i.e. utilization), or one action (i.e. search) is randomly generated and is denoted by a t Will act a t Input into the environment, and obtain the reward r(s) through the action with the environment t ,a t ) And next state information s t+1 . The current state information s t Selected action a t The prize r(s) earned t ,a t ) And a next state s t+1 And storing the experience to an experience pool U. Then the status information s t+1 Inputting the data into the Actor evaluation network, and circulating the step (1) until a certain number e is stored t =(s t ,a t ,r(s t ,a t ),s t+1 ) During this process, the Actor evaluates the parameters in the network and does not update. Wherein, adopt epsilon-greedy algorithm to realize the equilibrium between exploration and the utilization when selecting the action, guarantee abundant exploration and reasonable utilization:
Figure BDA0003883422220000063
wherein epsilon is [0,1] as an exploration rate, the epsilon-greedy algorithm selects and explores according to the probability of epsilon, and an Actor is selected according to the probability of 1-epsilon to evaluate the action of network output. Therefore, in order to ensure the performance of the deep reinforcement learning algorithm and prevent the deep reinforcement learning algorithm from falling into local optimum, epsilon is generally set with a larger initial value to ensure sufficient exploration capacity, and as iteration progresses, epsilon value is gradually reduced to ensure full utilization and accelerate convergence of the algorithm.
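The exploration/exploitation balance and experience storage of step (1) can be sketched as follows; the buffer capacity, decay schedule and actor interface are illustrative assumptions rather than values from the patent.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool U storing e_t = (s_t, a_t, r_t, s_{t+1}); r_t is a reward vector."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def select_action(actor, state, epsilon, action_low, action_high):
    """epsilon-greedy over a continuous action: explore with probability epsilon, else exploit the actor."""
    if random.random() < epsilon:
        return random.uniform(action_low, action_high)   # exploration
    return actor(state)                                   # exploitation, a = a(s|omega)

# epsilon is typically annealed from a large initial value towards a small floor,
# e.g. epsilon = max(0.05, epsilon * 0.995) after each episode (illustrative schedule).
```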
(2) As shown by the dashed box (2) in fig. 3, a batch of samples e is randomly drawn from the experience pool U. First, the state information s_t in e is input into the Actor evaluation network to obtain the corresponding action a_t; then the state information s_t and the corresponding action a_t are jointly input into the Critic evaluation network to obtain the cumulative reward Q(s_t, a_t|θ) of the state information s_t and action a_t, where θ denotes the parameters of the Critic evaluation network. The loss function is obtained by computing the expectation of Q(s_t, a_t|θ) and is back-propagated; the calculation formula is:
L_ω = E_{e~U}[Q(s, a|θ)] (4)
where Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, and ω denotes the parameters of the Actor evaluation network;
The parameters of the Actor evaluation network are updated by gradient ascent, using the loss function to compute the gradient, i.e. back propagation finds the action that maximizes the expected value of Q(s_t, a_t|θ) in state s_t. The gradient is computed first:
∇_ω L_ω = E_{e~U}[ ∇_a Q(s, a|θ)|_{a=a(s|ω)} ∇_ω a(s|ω) ]
where Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, ∇_a Q(s, a|θ) is the gradient of the cumulative reward function Q(s, a|θ) with respect to a, ∇_ω a(s|ω) is the gradient of the action function with respect to ω, and a(s|ω) is the action output when the parameters of the Actor evaluation network are ω and the input is s;
The parameters of the Actor evaluation network are then updated by gradient ascent:
ω ← ω + α_l ∇_ω L_ω
where α_l is the learning rate for updating the parameters of the Actor evaluation network. In this step only the parameters of the Actor evaluation network are updated; the parameters of the Critic evaluation network remain unchanged.
Since the reward function r in the single-objective DDPG algorithm is a scalar, the cumulative reward Q is also a scalar, and likewise the expectation of the cumulative reward and the loss function are scalars. When the single-objective DDPG algorithm is improved into the multi-objective DDPG algorithm, the reward function is extended from a single objective to multiple objectives by combining the Pareto theory, i.e. a scalar is extended into an array, and each value in the array corresponds to one objective of the reward function. This improvement raises two problems to be solved: one is how to select the maximum cumulative reward Q, and the other is how to compute the loss function and carry out back propagation.
First, regarding how to select the maximum cumulative reward Q, i.e. how to compare the magnitudes of cumulative rewards Q, the Pareto optimal front is introduced to select the maximum cumulative reward Q.
Pareto optimality means the following: suppose the multi-objective problem has i objective functions and A and B are two feasible solutions. If all objective function values of solution A are better than those of solution B, solution A is said to be superior to solution B, i.e. solution A dominates solution B; if only part of the objective functions of solution A are better than those of solution B, solutions A and B are said to be indifferent, i.e. solution A does not dominate solution B. If the objective function values of solution A are better than those of any other solution in the feasible space, solution A is called the optimal solution; if no other solution in the feasible space is better than solution A, solution A is called a Pareto optimal solution. For a multi-objective optimization problem, a single optimal solution generally does not exist; instead there are multiple Pareto optimal solutions, and all Pareto optimal solutions form the Pareto optimal front.
Therefore, when selecting the maximum cumulative reward Q, the Pareto optimal front of the optimal action value function is first solved on the basis of formula (1), and a rule is formulated on the Pareto optimal front to select the maximum cumulative reward Q, for example randomly selecting one Q as the maximum cumulative reward, or selecting the Q that minimizes a certain objective as the maximum cumulative reward. It should be noted that solving the Pareto optimal front from the objective functions is a method commonly used by those skilled in the art, and this scheme does not involve an improvement of the specific calculation process, so the specific calculation process is not described again.
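A minimal sketch of the non-dominated (Pareto) selection over a batch of vector-valued cumulative rewards is given below; it assumes the convention that every component is a negated cost (larger is better), and the two tie-breaking rules shown correspond to the examples mentioned above rather than to a rule prescribed by the patent.

```python
import random
import numpy as np

def dominates(qa, qb):
    """qa dominates qb if it is >= in every objective and > in at least one."""
    qa, qb = np.asarray(qa), np.asarray(qb)
    return np.all(qa >= qb) and np.any(qa > qb)

def pareto_front(q_values):
    """Return the indices of the non-dominated vectors in q_values (shape: [n, n_objectives])."""
    front = []
    for i, qi in enumerate(q_values):
        if not any(dominates(qj, qi) for j, qj in enumerate(q_values) if j != i):
            front.append(i)
    return front

def select_max_q(q_values, rule="random", objective=0):
    """Pick one 'maximum' cumulative reward on the Pareto front according to a simple rule."""
    front = pareto_front(q_values)
    if rule == "random":
        return front[random.randrange(len(front))]
    # otherwise prefer the front member with the best value of one chosen objective
    return max(front, key=lambda i: q_values[i][objective])

# Example: three candidate Q vectors for the objectives (-fuel, -SOC deviation, -dSOH)
qs = np.array([[-1.0, -0.2, -1e-4], [-0.8, -0.3, -2e-4], [-1.2, -0.1, -1e-4]])
print(pareto_front(qs), select_max_q(qs, rule="objective", objective=0))
```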
Second, the loss function must be calculated and back-propagated. Since multiple objectives generate multiple loss functions, the gradients are computed from the multiple loss functions and back propagation is then performed.
(3) As shown by the dashed box (3) in fig. 3, the state information s_t in the samples e drawn in (2) is input into the updated Actor evaluation network to obtain the corresponding action a_t after the Actor evaluation network update; then the state information s_t and the corresponding action a_t are jointly input into the Critic evaluation network to obtain the cumulative reward Q(s_t, a_t|θ) of the state information s_t and action a_t after the Actor evaluation network update.
(4) As shown by the dashed box (4) in fig. 3, the next-step state information s_{t+1} in the samples e drawn in (2) is input into the Actor target network to obtain the corresponding action a_{t+1}; then the state information s_{t+1} and the corresponding action a_{t+1} are jointly input into the Critic target network, and the Pareto optimal front is solved according to the method in step (2) to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1}, where θ' denotes the parameters of the Critic target network.
(5) As shown by the dashed box (5) in fig. 3, the Loss function is back-propagated; the Loss function is the mean square error (MSE), and the calculation formula is:
Loss = (1/E) Σ_e [ r(s_t, a_t) + γ Q(s_{t+1}, a_{t+1}|θ') − Q(s_t, a_t|θ) ]²
where E is the number of samples e drawn from the experience pool U, Q(s_{t+1}, a_{t+1}|θ') is the cumulative reward of the state information s_{t+1} and action a_{t+1}, Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, r(s_t, a_t) is the reward of the state information s_t, and γ is the discount factor.
Then, using gradient descent, the loss function is used to compute the gradient and update the parameters of the Critic evaluation network. The gradient is computed as:
∇_θ Loss = −(2/E) Σ_e [ r(s_t, a_t) + γ Q(s_{t+1}, a_{t+1}|θ') − Q(s_t, a_t|θ) ] ∇_θ Q(s_t, a_t|θ)
where E is the number of samples e drawn from the experience pool U, Q(s_{t+1}, a_{t+1}|θ') is the cumulative reward of the state information s_{t+1} and action a_{t+1}, Q(s_t, a_t|θ) is the cumulative reward of the state information s_t and action a_t, r(s_t, a_t) is the reward of the state information s_t, γ is the discount factor, ∇_θ Q(s, a|θ) is the gradient of the cumulative reward function Q(s, a|θ) with respect to θ, and Q(s, a|θ) is the cumulative reward output by the Critic evaluation network when its parameters are θ, the input state is s and the input action is a;
The parameters of the Critic evaluation network are then updated by gradient descent:
θ ← θ − α_L ∇_θ Loss
where α_L is the learning rate for updating the parameters of the Critic evaluation network.
(6) Steps (1) to (5) are repeated in a loop, and every C steps the parameters of the Actor evaluation network and the Critic evaluation network are copied into the Actor target network and the Critic target network.
In summary, the IMDDPG algorithm combines the Pareto theory to improve the reward function and thereby achieves multi-objective learning.
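For orientation, steps (1) to (6) can be condensed into the following PyTorch-style training-step sketch. The network objects, the optimizers and the pick_on_front scalarization hook (which applies the Pareto-front selection rule to the vector-valued critic output) are illustrative assumptions; the sketch shows the update order rather than the patent's implementation.

```python
import torch

def imddpg_update(actor, critic, actor_t, critic_t, opt_actor, opt_critic,
                  batch, pick_on_front, gamma=0.99):
    """One IMDDPG update on a sampled mini-batch (steps (2) to (5)).

    The critic outputs one Q value per objective; pick_on_front maps each Q vector,
    via the Pareto-front rule, to the scalar actually used for the gradient step.
    """
    s, a, r, s_next = batch                                    # r: shape (B, n_objectives)

    # Critic update: mean square error against the per-objective Bellman target.
    with torch.no_grad():
        target = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = torch.mean((target - critic(s, a)) ** 2)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor update: gradient ascent on the expectation of the selected cumulative reward (cf. eq. (4)).
    q_vec = critic(s, actor(s))                                # (B, n_objectives)
    actor_loss = -pick_on_front(q_vec).mean()                  # negate so the optimizer ascends Q
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

def sync_targets(actor, critic, actor_t, critic_t):
    """Step (6): copy the evaluation-network parameters into the target networks every C steps."""
    actor_t.load_state_dict(actor.state_dict())
    critic_t.load_state_dict(critic.state_dict())
```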
On this basis, this embodiment provides a plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG, which, as shown in fig. 1, comprises the following steps:
S1) establishing an energy management system model of the plug-in hybrid electric vehicle, wherein the energy management system model comprises a vehicle longitudinal dynamics model, an engine fuel consumption model, a battery equivalent circuit model, a battery life model and a drive/generator model;
S2) acquiring state information of the plug-in hybrid electric vehicle in actual driving, and inputting the state information into the energy management system model;
S3) taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, taking the vehicle speed, acceleration and battery SOC of the plug-in hybrid electric vehicle as the state variables of the IMDDPG, taking the required power of the plug-in hybrid electric vehicle as the action variable of the IMDDPG, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model; and inputting initial state information and the driving cycle into the reinforcement learning model to obtain the energy management strategy during driving.
For step S1, the power system structure of the plug-in hybrid electric vehicle is shown in fig. 2: the engine drives the generator EM2 and the drive motor EM1 through the transmission mechanism, the generator EM2 charges the battery, the battery supplies power to the drive motor EM1, and the transmission mechanism delivers power to the driving wheels through the clutches C1 and C2, the gear sets PG1 and PG2, the transmission and the final drive.
According to this structure, the energy management system model is constructed as follows:
(a) Establish the vehicle longitudinal dynamics model, whose expression is:
F_D = F_R + F_A + F_G + F_j
F_R = c_R m g cosθ
F_A = (1/2) ρ C_D A v²
F_G = m g sinθ
F_j = δ m (dv/dt)
P_D = F_D v,  T_D = F_D r
where F_D is the driving force, P_D is the driving power, T_D is the driving torque, v is the vehicle speed, F_R, F_A, F_G and F_j are respectively the rolling resistance, air resistance, gradient resistance and acceleration resistance during vehicle running, A is the frontal area of the vehicle, C_D is the air drag coefficient, ρ is the air density, c_R is the rolling resistance coefficient, m is the total vehicle mass, g is the gravitational acceleration, θ is the road gradient, δ is the rotating-mass conversion coefficient, dv/dt is the running acceleration, and r is the wheel radius;
[torque and speed coupling relations of the engine, the EM1 and EM2 motors, the planetary gear sets PG1 and PG2 and the final drive: given as an equation image in the original]
where T_EN, T_EM1 and T_EM2 are respectively the torques of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, ω_EN, ω_EM1 and ω_EM2 are respectively the rotational speeds of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, T_Out is the output torque of the transmission, K_1 and K_2 are respectively the ring-gear to sun-gear tooth ratios of PG1 and PG2 of the plug-in hybrid electric vehicle, and i is the gear ratio of the final drive.
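To illustrate how the required power P_D used later as the action variable can be evaluated from the longitudinal dynamics above, the following sketch computes the resistance terms for an assumed set of vehicle parameters; all numerical values are placeholders, not parameters from the patent.

```python
import math

def required_power(v, acc, grade_rad=0.0,
                   m=1800.0, A=2.3, C_D=0.30, rho=1.206, c_R=0.012,
                   delta=1.05, g=9.81):
    """Driving force F_D and driving power P_D from the longitudinal dynamics model.

    v: vehicle speed [m/s], acc: acceleration [m/s^2], grade_rad: road gradient [rad].
    All vehicle parameters are illustrative placeholders.
    """
    F_R = c_R * m * g * math.cos(grade_rad)        # rolling resistance
    F_A = 0.5 * rho * C_D * A * v ** 2             # aerodynamic resistance
    F_G = m * g * math.sin(grade_rad)              # gradient resistance
    F_j = delta * m * acc                          # acceleration resistance
    F_D = F_R + F_A + F_G + F_j
    return F_D, F_D * v                            # (driving force [N], driving power [W])

print(required_power(v=20.0, acc=0.5))
```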
(b) Establish the engine fuel consumption model. The engine fuel consumption model is obtained by look-up and correction of test data; the instantaneous fuel consumption of the engine can be regarded as a function of the engine torque and rotational speed, and the expression is:
ṁ_f = f(T_EN, ω_EN),  m_f = ∫ ṁ_f dt
where ṁ_f is the instantaneous fuel consumption of the engine, m_f is the cumulative fuel consumption of the engine, and T_EN and ω_EN are respectively the torque and rotational speed of the engine of the plug-in hybrid electric vehicle.
(c) Establish the battery equivalent circuit model. An internal-resistance model is selected, so that the battery is equivalent to a circuit in which an ideal voltage source is connected in series with a resistor; the expression is:
U_Bat = U_OC − I_Bat R_Bat
I_Bat = (U_OC − sqrt(U_OC² − 4 R_Bat P_Bat)) / (2 R_Bat)
SOC(t) = SOC(0) − (1/Q_Bat) ∫ I_Bat dt
where U_Bat is the terminal voltage of the battery, I_Bat is the battery current, U_OC is the open-circuit voltage, R_Bat is the internal resistance of the battery, P_Bat is the battery power, SOC(0) is the initial value of the SOC, and Q_Bat is the battery capacity.
(d) Establish the battery life model. The battery life model is built with a semi-empirical model, assuming that there is no difference between the cells in the battery pack and that the operating temperature of the battery remains essentially constant; the expression of the battery life model is:
Q_Loss = α exp((−E_A + β |I_Bat|/Q_Bat) / (R T_K)) Ah^z,  Ah = (1/3600) ∫ |I_Bat| dt
[relations between the capacity fade Q_Loss, the end of life EOL, the total cycle number N and the battery SOH: given as an equation image in the original]
where Q_Loss is the battery capacity fade, α and β are constant terms, E_A is the activation energy, R is the molar gas constant, T_K is the thermodynamic temperature of the environment, Ah is the ampere-hour throughput, z is the power exponent factor, Q_Bat is the battery capacity, I_Bat is the battery current, EOL is the end of life of the battery, N is the total number of cycles, SOH is the state of health of the battery, and SOC is the state of charge of the battery.
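A small sketch of the battery bookkeeping implied by models (c) and (d) follows: SOC is integrated from the battery current, and the Arrhenius-type capacity-fade term reconstructed above is evaluated from the accumulated ampere-hour throughput. All numerical parameters (open-circuit voltage, internal resistance, capacity, fade coefficients) are placeholders for illustration only.

```python
import math

def battery_step(soc, ah, P_bat, dt,
                 U_oc=350.0, R_bat=0.1, Q_bat=37.0 * 3600.0):
    """One time step of the internal-resistance battery model (illustrative parameters).

    soc: state of charge [-], ah: accumulated Ah throughput, P_bat: battery power [W],
    dt: step [s], Q_bat: capacity [A*s]. Returns the updated (soc, ah, I_bat).
    """
    I_bat = (U_oc - math.sqrt(U_oc ** 2 - 4.0 * R_bat * P_bat)) / (2.0 * R_bat)
    soc = soc - I_bat * dt / Q_bat
    ah = ah + abs(I_bat) * dt / 3600.0
    return soc, ah, I_bat

def capacity_fade(ah, I_bat, Q_bat_Ah=37.0, T_K=298.15,
                  alpha=3.0e4, beta=370.0, E_A=31500.0, R=8.314, z=0.55):
    """Semi-empirical capacity fade Q_Loss = alpha*exp((-E_A + beta*|I|/Q)/(R*T_K))*Ah^z.

    alpha, beta, E_A and z are placeholder constants, not values from the patent.
    """
    c_rate = abs(I_bat) / Q_bat_Ah
    return alpha * math.exp((-E_A + beta * c_rate) / (R * T_K)) * ah ** z
```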
(e) Establish the drive/generator model. The drive motor EM1 and the generator EM2 are both permanent magnet synchronous motors; the combined efficiency of the motor and the inverter can be expressed as a function of the motor torque and rotational speed, and the expression is:
η_EM = f(T_EM, ω_EM) (13)
where T_EM is the motor torque, ω_EM is the motor rotational speed, and η_EM is the corresponding motor efficiency.
P_EM = T_EM ω_EM, with P_Bat,EM = P_EM / η_EM when the motor drives (P_EM ≥ 0) and P_Bat,EM = P_EM η_EM when the motor generates (P_EM < 0)
where P_EM is the mechanical power of the motor and P_Bat,EM is the electrical power exchanged between the battery and the motor (delivered by the battery when driving, returned to the battery when generating).
For step S2, the state information mainly includes two parts, vehicle state information and battery state information, wherein:
the vehicle state information mainly comprises the mass of the whole vehicle, the windward area, the road gradient, the ambient temperature, the instantaneous vehicle speed, the motor rotating speed, the motor efficiency and the like.
The battery state information mainly includes the battery current, battery voltage, open-circuit voltage, internal resistance, SOC (state of charge of the battery), battery end of life, and so on.
For step S3, the configuration of IMDDPG is as follows:
the reward function: in the embodiment, the economy and the battery life of the plug-in hybrid electric vehicle are taken as optimization targets, the IMDDPG algorithm optimizes the maximum accumulated reward, and the starting value and the ending value of the SOC of the plug-in hybrid electric vehicle are kept equal and are SOC set values, namely SOC start =SOC end =SOC Target The economic index is the accumulated oil consumption m of the engine f And degree D = (SOC) of deviation of battery SOC from set value after driving is finished end -SOC Target ) 2 The battery life index is the change in battery health Δ SOH.
Cumulative fuel consumption m of engine in energy management of plug-in hybrid vehicle f Degree of deviation of battery SOC D = (SOC) end -SOC Target ) 2 And the change in the state of health of the battery Δ SOH are both as small as possible, and therefore, the reward function is r = (-m) f ,-D,-ΔSOH)。
And (3) state variable: the vehicle speed, acceleration and battery SOC of the plug-in hybrid vehicle are taken as state variables, i.e., s = { v, acc, SOC }.
The action variables are as follows: the purpose of the energy management strategy of the plug-in hybrid electric vehicle is to realize reasonable mode switching and gear switching according to the required power, and the key point is to determine the required power of the plug-in hybrid electric vehicle, so that the required power P is obtained D As an action variable, i.e., a = { P D }。
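A compact sketch of the state, action and vector-valued reward defined above follows; the StepInfo structure and the choice to charge the SOC deviation term only at the end of the driving cycle are illustrative assumptions, one possible reading of the description.

```python
from dataclasses import dataclass

@dataclass
class StepInfo:
    fuel_increment: float   # increment of m_f over the step
    soc: float              # battery SOC after the step
    soh_drop: float         # decrease of SOH over the step
    done: bool              # end of the driving cycle

SOC_TARGET = 0.5            # assumed SOC set value (SOC_start = SOC_end = SOC_Target)

def reward_vector(info: StepInfo):
    """Multi-objective reward r = (-m_f, -D, -dSOH).

    The SOC deviation term D = (SOC_end - SOC_Target)^2 is only charged when the
    driving cycle ends, which is one possible reading of the description above.
    """
    d = (info.soc - SOC_TARGET) ** 2 if info.done else 0.0
    return (-info.fuel_increment, -d, -info.soh_drop)

def state_vector(v, acc, soc):
    """State s = {v, acc, SOC}; the action is the scalar required power a = {P_D}."""
    return (v, acc, soc)
```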
Based on this configuration, the multi-objective optimization of the energy management system model with the IMDDPG comprises the following steps:
S31) As in step (1) of the IMDDPG algorithm, obtain the state information s_t of the energy management system model, input it into the Actor evaluation network to obtain the corresponding action a_t, input a_t into the energy management system model to obtain, under the influence of the environment, the corresponding reward r(s_t, a_t) and the next state information s_{t+1}, store the current sample e_t = (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool, and repeat this step until the number of samples in the experience pool meets the requirement;
S32) As in step (2) of the IMDDPG algorithm, randomly select samples from the experience pool, input the state information s_t of a selected sample into the Actor evaluation network to obtain the corresponding action a_t, input the state information s_t and the corresponding action a_t into the Critic evaluation network, solve the Pareto optimal front to obtain the cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample, calculate the expectation of the cumulative reward Q(s_t, a_t|θ) to obtain the loss function, perform back propagation, and update the parameters of the Actor evaluation network by gradient ascent;
S33) As in step (3) of the IMDDPG algorithm, input the state information s_t of the selected sample into the updated Actor evaluation network to obtain the updated corresponding action a_t, input the state information s_t and the updated corresponding action a_t into the Critic evaluation network, and solve the Pareto optimal front to obtain the updated cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample;
S34) As in step (4) of the IMDDPG algorithm, input the next state information s_{t+1} of the selected sample into the Actor target network to obtain the corresponding action a_{t+1}, input the state information s_{t+1} and the corresponding action a_{t+1} into the Critic target network, and solve the Pareto optimal front to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1} of the selected sample; as in step (5) of the IMDDPG algorithm, calculate the mean square error with respect to the cumulative reward Q(s_{t+1}, a_{t+1}|θ') to obtain the loss function, perform back propagation, and update the parameters of the Critic evaluation network by gradient descent;
S35) As in step (6) of the IMDDPG algorithm, judge whether the difference between the current cycle number and the cycle number at the last update reaches the preset step size;
if so, update the parameters of the Actor evaluation network and the Critic evaluation network into the Actor target network and the Critic target network, and then return to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement;
otherwise, return directly to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement.
For the trained reinforcement learning model, after the initial state information and the driving cycle are input, the model outputs a series of action information during driving, i.e. a series of corresponding required powers, so that reasonable mode switching and gear shifting are carried out according to the required power and multi-objective optimization of the PHEV energy management strategy is realized; this series of action information is the energy management strategy during driving.
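The deployment step just described can be sketched as a simple rollout in which the trained actor maps each state to a required power; the environment object and its step interface are assumptions for illustration, not the patent's implementation.

```python
def run_energy_management(actor, env, initial_state):
    """Roll the trained actor over a driving cycle; the resulting list of required
    powers P_D is the energy management strategy during driving."""
    strategy, state, done = [], initial_state, False
    while not done:
        p_demand = actor(state)                   # a = {P_D}
        state, reward, done = env.step(p_demand)  # vehicle model applies mode/gear logic
        strategy.append(p_demand)
    return strategy
```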
In summary, the plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG of this embodiment establishes an energy management system model of the plug-in hybrid electric vehicle, modelling both the fuel economy of the plug-in hybrid electric vehicle and the battery life. The IMDDPG algorithm is used to perform multi-objective optimization of the plug-in hybrid electric vehicle energy management strategy on the basis of the continuous actions that occur in actual driving, which removes the dependence of previous energy management strategies on the driving cycle, achieves adaptability to different operating conditions through continuous learning of the agent, and realizes optimality of the strategy while ensuring its real-time performance.
The foregoing is only a description of the preferred embodiments of the present invention and should not be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention shall fall within the protection scope of the technical solution of the present invention, provided that it does not depart from the content of the technical solution of the present invention.

Claims (10)

1. A plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG is characterized by comprising the following steps:
establishing an energy management system model of the plug-in hybrid electric vehicle, which comprises a longitudinal dynamics model of the whole vehicle, an engine fuel consumption model, a battery equivalent circuit model, a battery service life model and a driving/generator model;
acquiring state information of the plug-in hybrid electric vehicle in actual running, and inputting the state information into the energy management system model;
taking the energy management system model as the agent of the IMDDPG, configuring the reward function of the IMDDPG according to the cumulative fuel consumption of the engine, the deviation degree of the battery SOC and the change of the battery state of health, taking the vehicle speed, acceleration and battery SOC of the plug-in hybrid electric vehicle as the state variables of the IMDDPG, taking the required power of the plug-in hybrid electric vehicle as the action variable of the IMDDPG, and performing multi-objective optimization on the energy management system model with the IMDDPG to obtain a trained reinforcement learning model;
and inputting initial state information and driving conditions into the reinforcement learning model to obtain an energy management strategy in driving.
2. The method of claim 1, wherein the multi-objective optimization of the energy management system model using IMDDPG comprises the steps of:
obtaining state information s_t of the energy management system model, inputting it into the Actor evaluation network to obtain the corresponding action a_t, inputting a_t into the energy management system model to obtain, under the influence of the environment, the corresponding reward r(s_t, a_t) and the next state information s_{t+1}, storing the current sample e_t = (s_t, a_t, r(s_t, a_t), s_{t+1}) in the experience pool, and repeating this step until the number of samples in the experience pool meets the requirement;
randomly selecting samples from the experience pool, inputting the state information s_t of a selected sample into the Actor evaluation network to obtain the corresponding action a_t, inputting the state information s_t and the corresponding action a_t into the Critic evaluation network, solving the Pareto optimal front to obtain the cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample, calculating the expectation of the cumulative reward Q(s_t, a_t|θ) to obtain a loss function, performing back propagation, and updating the parameters of the Actor evaluation network by gradient ascent;
inputting the state information s_t of the selected sample into the updated Actor evaluation network to obtain the updated corresponding action a_t, inputting the state information s_t and the updated corresponding action a_t into the Critic evaluation network, and solving the Pareto optimal front to obtain the updated cumulative reward Q(s_t, a_t|θ) corresponding to the selected sample;
inputting the next state information s_{t+1} of the selected sample into the Actor target network to obtain the corresponding action a_{t+1}, inputting the state information s_{t+1} and the corresponding action a_{t+1} into the Critic target network, solving the Pareto optimal front to obtain the cumulative reward Q(s_{t+1}, a_{t+1}|θ') of the state information s_{t+1} and action a_{t+1} of the selected sample, calculating the mean square error with respect to the cumulative reward Q(s_{t+1}, a_{t+1}|θ') to obtain a loss function, performing back propagation, and updating the parameters of the Critic evaluation network by gradient descent;
updating the parameters of the Actor evaluation network and the Critic evaluation network into the Actor target network and the Critic target network, and returning to the step of obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network, until the number of cycles meets the requirement.
3. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 2, characterized in that, before the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network, it is also judged whether the difference between the current cycle number and the cycle number at the last update reaches a preset step size; if so, the parameters of the Actor evaluation network and the Critic evaluation network are updated into the Actor target network and the Critic target network; otherwise, execution returns to obtaining the state information s_t of the energy management system model and inputting it into the Actor evaluation network.
4. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 2, characterized in that the step of solving the Pareto optimal front comprises: solving the optimal action value function Q*(s_t, a_t) according to the state information s_t and the corresponding action a_t, and, on the Pareto optimal front, randomly selecting an optimal action value function as the maximum cumulative reward, or selecting the optimal action value function that minimizes a given objective as the maximum cumulative reward.
5. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 4, characterized in that the expression of the optimal action value function Q*(s_t, a_t) is as follows:
Q*(s_t, a_t) = max_π E[R_t | s_t, a_t, π]
where R_t is the discounted cumulative reward,
R_t = Σ_{i=t}^{T} γ^(i−t) r_i,
γ is the discount factor, γ ∈ [0,1], r_i is the reward function at time i, i ∈ [t, T], and T is the termination time.
6. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the vehicle longitudinal dynamics model is as follows:
F_D = F_R + F_A + F_G + F_j
F_R = c_R m g cosθ
F_A = (1/2) ρ C_D A v²
F_G = m g sinθ
F_j = δ m (dv/dt)
P_D = F_D v,  T_D = F_D r
where F_D is the driving force, P_D is the driving power, T_D is the driving torque, v is the vehicle speed, F_R, F_A, F_G and F_j are respectively the rolling resistance, air resistance, gradient resistance and acceleration resistance during vehicle running, A is the frontal area of the vehicle, C_D is the air drag coefficient, ρ is the air density, c_R is the rolling resistance coefficient, m is the total vehicle mass, g is the gravitational acceleration, θ is the road gradient, δ is the rotating-mass conversion coefficient, dv/dt is the running acceleration, and r is the wheel radius;
[torque and speed coupling relations of the engine, the EM1 and EM2 motors, the planetary gear sets PG1 and PG2 and the final drive: given as an equation image in the original]
where T_EN, T_EM1 and T_EM2 are respectively the torques of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, ω_EN, ω_EM1 and ω_EM2 are respectively the rotational speeds of the engine, the EM1 motor and the EM2 motor of the plug-in hybrid electric vehicle, T_Out is the output torque of the transmission, K_1 and K_2 are respectively the ring-gear to sun-gear tooth ratios of PG1 and PG2 of the plug-in hybrid electric vehicle, and i is the gear ratio of the final drive.
7. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the engine fuel consumption model is as follows:
ṁ_f = f(T_EN, ω_EN),  m_f = ∫ ṁ_f dt
where ṁ_f is the instantaneous fuel consumption of the engine, m_f is the cumulative fuel consumption of the engine, and T_EN and ω_EN are respectively the torque and rotational speed of the engine of the plug-in hybrid electric vehicle.
8. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the battery equivalent circuit model is as follows:
U_Bat = U_OC − I_Bat R_Bat
I_Bat = (U_OC − sqrt(U_OC² − 4 R_Bat P_Bat)) / (2 R_Bat)
SOC(t) = SOC(0) − (1/Q_Bat) ∫ I_Bat dt
where U_Bat is the terminal voltage of the battery, I_Bat is the battery current, U_OC is the open-circuit voltage, R_Bat is the internal resistance of the battery, P_Bat is the battery power, SOC(0) is the initial value of the SOC, and Q_Bat is the battery capacity.
9. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the battery life model is as follows:
Q_Loss = α exp((−E_A + β |I_Bat|/Q_Bat) / (R T_K)) Ah^z,  Ah = (1/3600) ∫ |I_Bat| dt
[relations between the capacity fade Q_Loss, the end of life EOL, the total cycle number N and the battery SOH: given as an equation image in the original]
where Q_Loss is the battery capacity fade, α and β are constant terms, E_A is the activation energy, R is the molar gas constant, T_K is the thermodynamic temperature of the environment, Ah is the ampere-hour throughput, z is the power exponent factor, Q_Bat is the battery capacity, I_Bat is the battery current, EOL is the end of life of the battery, N is the total number of cycles, SOH is the state of health of the battery, and SOC is the state of charge of the battery.
10. The plug-in hybrid electric vehicle energy management method based on the improved multi-target DDPG according to claim 1, wherein the expression of the drive/generator model is as follows:
η_EM = f(T_EM, ω_EM)
where T_EM is the motor torque, ω_EM is the motor rotational speed, and η_EM is the corresponding motor efficiency;
P_EM = T_EM ω_EM, with P_Bat,EM = P_EM / η_EM when the motor drives (P_EM ≥ 0) and P_Bat,EM = P_EM η_EM when the motor generates (P_EM < 0)
where P_EM is the mechanical power of the motor and P_Bat,EM is the electrical power exchanged between the battery and the motor.
CN202211235470.7A 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG Pending CN115476841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211235470.7A CN115476841A (en) 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211235470.7A CN115476841A (en) 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Publications (1)

Publication Number Publication Date
CN115476841A true CN115476841A (en) 2022-12-16

Family

ID=84393216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211235470.7A Pending CN115476841A (en) 2022-10-10 2022-10-10 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG

Country Status (1)

Country Link
CN (1) CN115476841A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115935524A (en) * 2023-03-03 2023-04-07 北京航空航天大学 Parameter matching optimization method for hybrid transmission systems with different configurations
CN115935524B (en) * 2023-03-03 2023-05-02 北京航空航天大学 Optimizing method for parameter matching of hybrid transmission system with different configurations
CN116639135A (en) * 2023-05-26 2023-08-25 中国第一汽车股份有限公司 Cooperative control method and device for vehicle and vehicle
CN117184095A (en) * 2023-10-20 2023-12-08 燕山大学 Hybrid electric vehicle system control method based on deep reinforcement learning
CN117184095B (en) * 2023-10-20 2024-05-14 燕山大学 Hybrid electric vehicle system control method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
Han et al. Energy management based on reinforcement learning with double deep Q-learning for a hybrid electric tracked vehicle
CN111731303B (en) HEV energy management method based on deep reinforcement learning A3C algorithm
Liu et al. Modeling and control of a power-split hybrid vehicle
CN112287463B (en) Fuel cell automobile energy management method based on deep reinforcement learning algorithm
CN115476841A (en) Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG
Pisu et al. A comparative study of supervisory control strategies for hybrid electric vehicles
Li et al. Rule-based control strategy with novel parameters optimization using NSGA-II for power-split PHEV operation cost minimization
CN108528436A (en) A kind of ECMS multiple target dual blank-holders of ectonexine nesting
CN113554337B (en) Plug-in hybrid electric vehicle energy management strategy construction method integrating traffic information
CN115495997B (en) New energy automobile ecological driving method based on heterogeneous multi-agent deep reinforcement learning
CN113085665A (en) Fuel cell automobile energy management method based on TD3 algorithm
Jawale et al. Energy management in electric vehicles using improved swarm optimized deep reinforcement learning algorithm
CN115284973A (en) Fuel cell automobile energy management method based on improved multi-target Double DQN
CN116461391A (en) Energy management method for fuel cell hybrid electric vehicle
CN113815437B (en) Predictive energy management method for fuel cell hybrid electric vehicle
Dong et al. Rapid assessment of series–parallel hybrid transmission comprehensive performance: A near-global optimal method
Chang et al. An energy management strategy of deep reinforcement learning based on multi-agent architecture under self-generating conditions
Lee et al. An adaptive energy management strategy for extended-range electric vehicles based on Pontryagin's minimum principle
CN113581163B (en) Multimode PHEV mode switching optimization and energy management method based on LSTM
Gozukucuk et al. Design and simulation of an optimal energy management strategy for plug-in electric vehicles
CN114291067A (en) Hybrid electric vehicle convex optimization energy control method and system based on prediction
Ding Energy management system design for plug-in hybrid electric vehicle based on the battery management system applications
Özden Modeling and optimization of hybrid electric vehicles
Janulin et al. Energy Minimization in City Electric Vehicle using Optimized Multi-Speed Transmission
Sun et al. A Dynamic Programming based Fuzzy Logic Energy Management Strategy for Series-parallel Hybrid Electric Vehicles.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination