CN111367172B

CN111367172B - Hybrid system energy management strategy based on reverse deep reinforcement learning

Info

Publication number: CN111367172B
Application number: CN202010131644.XA
Authority: CN
Inventors: 李梓棋; 赵克刚
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2021-09-21
Anticipated expiration: 2040-02-28
Also published as: CN111367172A

Abstract

The invention discloses a hybrid system energy management strategy based on reverse deep reinforcement learning. The strategy comprises the following steps: calculating a global optimization SOC result by using an optimization solution method as expert knowledge; creating a reward neural network; obtaining parameters of a reward neural network by using the expert knowledge of the reverse reinforcement learning; creating an action neural network and evaluating the neural network; setting an SOC value before vehicle interaction; inputting the acquired SOC value before interaction into a reward neural network to obtain a reward value; inputting the acquired SOC value before interaction into an action neural network to obtain a mode distribution ratio; interacting with the environment by using a mode distribution ratio to obtain an interacted SOC value; inputting the SOC value before interaction, the mode distribution ratio, the reward value and the SOC value after interaction into an evaluation neural network to obtain an evaluation value; and the intelligent agent respectively calculates the gradient of each network and reversely transmits and updates the network parameters until the training is finished. The invention can learn the optimal reward function from expert knowledge, so that the deep reinforcement learning effect is better.

Description

Hybrid system energy management strategy based on reverse deep reinforcement learning

Technical Field

The invention relates to the field of hybrid system energy management, in particular to a hybrid system energy management strategy based on reverse deep reinforcement learning.

Background

The Hybrid Electric power coupling device couples power of a plurality of power sources such as an internal combustion engine and an Electric motor in a Hybrid Electric Vehicle (HEV), reasonably distributes power, and transmits the power to a drive axle to drive a Vehicle. It can be seen as a complex system combining mechanical, electrical, chemical and thermodynamic considerations, by integrating one or more electric motors into a transmission to form a set of automatic transmission systems with electric motors. China's automobile oil consumption regulation puts high requirements on energy conservation and emission reduction of whole automobile enterprises in the next 10-15 years. With the further increase of the energy saving and emission reduction pressure of the automobile, more automobile types (such as medium and large SUVs and MPVs) select a hybrid power scheme, and higher comprehensive requirements on the aspects of dynamic property, energy consumption, pure electric driving capability and the like are provided. Through the optimized arrangement of the mode execution sequence, the potential of energy conservation and emission reduction of the hybrid electric vehicle can be exerted to a greater extent, and the traditional limitation is broken for further improving the dynamic property, the energy consumption and the pure electric drive comprehensive performance of the whole vehicle. And the global optimization energy management strategy can well explore the energy-saving and emission-reducing potential of the hybrid power assembly configuration. However, the global optimization energy management strategy is only aimed at solving known working conditions and cannot be applied online, while the traditional reinforcement learning method is good at solving tasks with limited states and action spaces, but has no energy in solving the problem with high dimensionality of the states and the action spaces (cheng-he-li, cao-shi-ho-ming, li-chen-xi, xu-xiong-deep reverse reinforcement learning research reviews [ J ]. computer engineering and application, 2018,54(05): 24-35.).

Disclosure of Invention

In order to accelerate the solving speed of the global optimization energy management strategy, the invention provides a hybrid system energy management strategy based on reverse deep reinforcement learning.

The purpose of the invention is realized by at least one of the following technical solutions.

A hybrid system energy management strategy based on reverse deep reinforcement learning comprises the following steps:

s1: calculating a global hybrid mode distribution ratio and a global optimized SOC result by using an optimization solving method under one complete working condition, and forming an expert state-action pair as expert knowledge of reverse reinforcement learning;

s2: creating a reward function neural network and initializing parameters;

s3: learning to obtain parameters of the neural network of the reward function by utilizing reverse reinforcement learning;

s4: establishing an action neural network, evaluating the neural network, and initializing parameters of each network; the action neural network comprises an execution network and a target network, and the execution network and the target network are the same in structure; the evaluation neural network comprises an execution network and a target network, and the execution network and the target network are the same in structure;

s5: under one random working condition, acquiring an SOC value s before vehicle interaction;

s6: inputting the acquired SOC value s before interaction into a reward function neural network to obtain a reward value r;

s7: inputting the acquired SOC value s before interaction into an action neural network, and then, outputting a mixed mode distribution ratio a by the action neural network through processing of a plurality of hidden layers;

s8: controlling the vehicle to interact with the environment by using the hybrid mode distribution ratio a obtained in the step S7, and acquiring an SOC value S' after interaction;

s9: combining the SOC value s before interaction, the hybrid mode distribution ratio a, the reward value r and the SOC value s 'after interaction to obtain an experience vector (s, a, r, s'), and then storing the experience vector in a memory buffer;

s10: when the number of the experience vectors in the memory buffer reaches the maximum capacity, randomly extracting a set number of experience vectors from the memory buffer as the input of an evaluation neural network, and then outputting an evaluation value by the evaluation neural network according to a Bellman equation through the processing of a plurality of hidden layers;

s11: the reward function neural network, the action neural network and the evaluation neural network calculate respective weight gradients, and then parameters of the reward function neural network, the action neural network execution network and the evaluation neural network execution network are updated and executed through back propagation;

s12: updating parameters of an execution network of an action neural network and parameters of an execution network of the evaluation neural network by using soft replacement rules, and finishing the updating of the parameters of the action neural network and the evaluation neural network;

s13: after the parameters of the action neural network and the evaluation neural network are updated, repeating the steps S5-S12 until the set maximum step number is reached or the set convergence target is reached, finishing the training of the reward function neural network, the action neural network and the evaluation neural network, and storing the parameters of the neural network after the training is finished;

s14: the method comprises the steps of controlling a controlled object by using a trained reward function neural network, an action neural network and an evaluation neural network, firstly reading parameters of the neural networks after training is finished, then obtaining an SOC value s before interaction, inputting the trained action neural network, and outputting a distribution ratio a of a hybrid power system as a control quantity to control a vehicle.

Further, in step S1, the optimization solution method includes pseudo-spectrum method, dynamic programming method, and genetic algorithm.

Further, in step S2, the reward function neural network is composed of a fully-connected neural network, a convolutional neural network, and a long-short term memory neural network stacked in the order of the convolutional neural network, the long-short term memory neural network, and the fully-connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.

Further, in step S3, the reverse reinforcement learning method specifically includes the following steps:

s3.1, selecting a strategy pi obtained through random generation, and then operating the strategy to obtain a series of state-action pair sequences zeta { (S)_t,a_t)}，s_tRepresenting the t-th state, a, in a state-action pair_tRepresenting the t-th action in the state-action pair, t representing the serial number of the state-action pair, and then calculating the action value Q through an evaluation neural network^π；

S3.2, using the expert State-action pairs obtained in step S1, according to the formula

Calculating expert action values

Wherein, theta^TParameter, mu, representing a network of reward functions^πSymbols representing strategies, E representing the mathematical expectation, gamma^tIndicating the discount rate, S_t0Indicating the tth of an expert State-action pair₀A state; a. the_t0Indicating the tth of an expert State-action pair₀Step action, and a is the t th dotted finger in the formula₀Step expert knowledge of the actions actually performed;

s3.3, by

Performing gradient descent update on the parameter theta for the target function; wherein the content of the first and second substances,

indicating the state of the first summation to the t-th summation and the second summation to the i-th summation,

representing the action of summing the first time to the t-th time, summing the second time to the i-th time,_Nindicates the total number of times of the second summation,_Lrepresenting the total number of first summations; lambda [ alpha ]₁The representation is an empirical constant used for balancing punishment and expectation, epsilon represents an artificially set precision threshold, and i represents a counting serial number of the second summation; if the learned state action pairs are consistent with the expert strategy, the loss function

Otherwise

Further, in step S4, the action neural network and the evaluation neural network are all composed of a fully connected neural network, a convolutional neural network, and a long-short term memory neural network stacked in the order of the convolutional neural network, the long-short term memory neural network, and the fully connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.

Further, in step S5, the SOC value S before vehicle interaction is obtained through a vehicle model or preset.

Further, in step S7, the hidden layer includes a full-connection layer, a convolutional layer, a long-short term memory layer, a sigmoid layer, and a tanh layer, that is, the hidden layer is one of the full-connection layer, the convolutional layer, the long-short term memory layer, the sigmoid layer, and the tanh layer.

Further, in step S10, the bellman equation is:

wherein r is_tIndicating a prize value, s_t+1E represents the state obeying distribution E,

denotes the reward as rt, the state obeys the mathematical expectation of the distribution E, r(s)_t,a_t) Represents a state of s_tThe action is a_tThe value of the prize of the time of day,

representing the mathematical expectation when the action execution policy is pi.

Further, in step S11, the reward function neural network, the action neural network, and the evaluation neural network calculate respective weight gradients through gradient chain equations (Learning representations by back-providing errors, David E, Nature vol323, 1986).

Compared with the prior art, the invention has the beneficial effects that:

according to the invention, the optimal reward function can be learned from expert knowledge by utilizing reverse deep reinforcement learning, and the reward function does not need to be artificially designed, so that the deep reinforcement learning effect is better. Compared with a simple optimization solving method, the method has the advantages of high calculating speed and good real-time performance, and is suitable for online application on a real vehicle.

Drawings

FIG. 1 is a schematic overall flow chart of a hybrid system energy management strategy based on reverse deep reinforcement learning according to the present invention;

fig. 2 is a flowchart of a hybrid system energy management strategy method based on reverse deep reinforcement learning according to an embodiment of the present invention.

Detailed Description

For a better understanding of the objects, aspects and advantages of the present invention, reference is made to the following detailed description of the invention taken in conjunction with the accompanying drawings.

Example (b):

as shown in fig. 1 and fig. 2, a hybrid system energy management strategy based on reverse deep reinforcement learning includes the following steps:

s1: calculating a global hybrid mode distribution ratio and a global optimized SOC result by using an optimization solving method under one complete working condition, and forming an expert state-action pair as expert knowledge of reverse reinforcement learning; the optimization solving method comprises a pseudo-spectrum method, a dynamic programming method and a genetic algorithm.

S2: creating a reward function neural network and initializing parameters;

the reward function neural network is formed by stacking a fully-connected neural network, a convolutional neural network and a long-short term memory neural network in the order of the convolutional neural network, the long-short term memory neural network and the fully-connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.

the reverse reinforcement learning method specifically comprises the following steps:

s3.1, selecting a strategy obtained by random generationPi, then run the strategy to get a series of state-action pair sequences ζ {(s)_t,a_t)}，s_tRepresenting the t-th state, a, in a state-action pair_tRepresenting the t-th action in the state-action pair, t representing the serial number of the state-action pair, and then calculating the action value Q through an evaluation neural network^π；

Calculating expert action values

s3.3, by

Otherwise

the action neural network and the evaluation neural network are all formed by stacking a fully-connected neural network, a convolutional neural network and a long-short term memory neural network in the order of the convolutional neural network, the long-short term memory neural network and the fully-connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.

and the SOC value s before vehicle interaction is obtained through a vehicle model or is obtained through presetting.

the hidden layer comprises a full connection layer, a convolution layer, a long short-term memory layer, a sigmoid layer and a tanh layer, namely the hidden layer is one of the full connection layer, the convolution layer, the long short-term memory layer, the sigmoid layer and the tanh layer.

the Bellman equation is:

the rewarding function neural network, the action neural network and the evaluation neural network calculate respective weight gradients through gradient chain formulas (Learning representation by back-amplifying errors, David E, Nature vol323, 1986).

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The embodiments of the present invention are not limited to the above-mentioned embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and they are included in the scope of the present invention.

Claims

1. A hybrid system energy management strategy based on reverse deep reinforcement learning is characterized by comprising the following steps:

s2: creating a reward function neural network and initializing parameters;

s3: the method for learning and obtaining the parameters of the reward function neural network by utilizing the reverse reinforcement learning method specifically comprises the following steps:

Calculating expert action values

Wherein, theta^TParameter, mu, representing a network of reward functions^πSymbol representing strategy, E representing mathematical expectation, gamma representing discount rate, S_t0Indicating the tth of an expert State-action pair₀A state; a. the_t0Indicating the tth of an expert State-action pair₀Step action, and a is the t th dotted finger in the formula₀Step expert knowledge of the actions actually performed;

s3.3, by

representing the action from the first summation to the t-th summation and the second summation to the i-th summation, N representing the total number of summations of the second summation, and L tableShowing the total number of first summations; lambda [ alpha ]₁The representation is an empirical constant used to balance penalties with expectations, i represents the count number of the second summation; if the learned state action pairs are consistent with the expert strategy, the loss function

Otherwise

s12: updating parameters of an execution network of an action neural network and parameters of an execution network of the evaluation neural network by using softreplacement rules, and completing parameter updating of the action neural network and the evaluation neural network;

2. The hybrid dynamic system energy management strategy based on the inverse deep reinforcement learning of claim 1, wherein in step S1, the optimization solution method includes pseudo-spectral method, dynamic programming method, and genetic algorithm.

3. The hybrid system energy management strategy based on the reverse deep reinforcement learning of claim 1, wherein: in step S2, the reward function neural network is composed of a fully-connected neural network, a convolutional neural network, and a long-short term memory neural network stacked in the order of the convolutional neural network, the long-short term memory neural network, and the fully-connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.

4. The hybrid dynamic system energy management strategy based on the inverse deep reinforcement learning of claim 1, wherein in step S4, the action neural network and the evaluation neural network are composed of a fully connected neural network, a convolutional neural network, and a long-short term memory neural network stacked in the order of the convolutional neural network, the long-short term memory neural network, and the fully connected neural network; the fully-connected neural network is formed by stacking a plurality of fully-connected layers, the convolutional neural network is formed by stacking a plurality of convolutional layers, and the long-short term memory neural network is formed by stacking a plurality of long-short term memory layers.

5. The hybrid system energy management strategy based on the reverse deep reinforcement learning of claim 1, wherein: in step S5, the SOC value S before vehicle interaction is obtained by a vehicle model or is obtained through presetting.

6. The hybrid system energy management strategy based on the reverse deep reinforcement learning of claim 1, wherein: in step S7, the hidden layer includes a full-connection layer, a convolutional layer, a long-short term memory layer, a sigmoid layer, and a tanh layer, that is, the hidden layer is one of the full-connection layer, the convolutional layer, the long-short term memory layer, the sigmoid layer, and the tanh layer.

7. The hybrid dynamic system energy management strategy based on the reverse deep reinforcement learning of claim 1, wherein in step S10, the bellman equation is:

8. The hybrid power system energy management strategy based on the reverse deep reinforcement learning of claim 1, wherein in step S11, the reward function neural network, the action neural network, and the evaluation neural network calculate their respective weight gradients through gradient chain formulas.