CN115563716A - New energy automobile energy management and adaptive cruise cooperative optimization method - Google Patents

New energy automobile energy management and adaptive cruise cooperative optimization method

Info

Publication number
CN115563716A
Authority
CN
China
Prior art keywords
network
energy management
vehicle
actor
target
Prior art date
Legal status
Pending
Application number
CN202211253311.XA
Other languages
Chinese (zh)
Inventor
彭剑坤
范毅
陈伟琪
殷国栋
庄伟超
江如海
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202211253311.XA priority Critical patent/CN115563716A/en
Publication of CN115563716A publication Critical patent/CN115563716A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/10 Geometric CAD
    • G06F30/15 Vehicle, aircraft or watercraft design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Geometry (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Electric Propulsion And Braking For Vehicles (AREA)

Abstract

The invention discloses a cooperative optimization method for a hybrid electric vehicle energy management strategy and adaptive cruise control. Taking a hybrid electric vehicle as the research object, it fuses a car-following model and a power battery energy management strategy based on the deep deterministic policy gradient (DDPG) algorithm, develops an ecological driving energy management strategy based on deep reinforcement learning, and improves fuel economy while achieving optimal following performance. The method mainly comprises: constructing a simulation environment and loading training data; constructing Actor and Critic training networks based on the DDPG algorithm; training the energy management strategy through the DDPG algorithm to obtain inheritable neural network parameters; and downloading the trained network parameters to the vehicle control unit of the hybrid electric vehicle to realize real-time online application.

Description

New energy automobile energy management and adaptive cruise cooperative optimization method
Technical Field
The invention relates to a new energy automobile energy management and adaptive cruise cooperative optimization method which is mainly applied to ecological driving energy management strategy development based on deep reinforcement learning.
Background
Global warming caused by the large amount of greenhouse gases, mainly carbon dioxide (CO2), is intensifying, and controlling carbon emissions to slow global warming has become a broad consensus among countries worldwide. A significant proportion of the CO2 emitted into the air comes from the use of fossil fuels by vehicles.
The energy of a hybrid electric vehicle comes from two sources: heat energy generated from fossil fuel and electric energy stored in the battery. Compared with a traditional fuel vehicle, a hybrid electric vehicle emits less carbon and has better fuel economy. The energy management strategy aims to improve fuel economy and maintain the battery state of charge during vehicle operation. Adaptive cruise control is used in car-following scenarios on urban roads and expressways and aims to improve the driving efficiency and fuel economy of the following vehicle. At present, deep reinforcement learning is applied separately to the optimization of the energy management strategy and to the control of the car-following model, but these are two independent models of the same problem and cannot achieve global optimality.
To achieve globally optimal performance of the energy management strategy and the car-following model, it becomes feasible to integrate energy management and adaptive cruise control into one model and to develop an ecological driving energy management strategy based on deep reinforcement learning.
Disclosure of Invention
Aiming at the technical problems in the field, the invention provides a framework combining an energy management strategy based on deep reinforcement learning and an adaptive cruise control algorithm on the basis of a deep reinforcement learning algorithm, and the framework is named as an ecological driving energy management strategy based on deep reinforcement learning.
The invention adopts the following technical scheme:
compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
(1) Hybrid energy management and adaptive cruise control of the new energy vehicle are co-optimized within a single algorithm architecture; compared with the traditional layered architecture, this reduces the development difficulty of each subsystem;
(2) The hybrid energy management and adaptive cruise systems of the new energy vehicle are no longer limited to a simple upload-and-dispatch relationship; multi-parameter interaction is realized through shared input states, reward functions, control actions and other aspects.
Drawings
FIG. 1 is an ecological driving energy management strategy algorithm framework based on deep reinforcement learning;
FIG. 2 is a graph of optimal fuel consumption for an engine;
FIG. 3 is a graph of battery characteristics;
FIG. 4 is the DDPG algorithm flow chart.
Detailed Description
The technical solutions of the present application are further elaborated below with reference to the drawings; the described embodiments are only a part of the embodiments related to this patent. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of this patent.
The invention designs a new energy automobile energy management and adaptive cruise cooperative optimization method, which comprises the following specific steps as shown in figure 1:
step one, building a following model simulation environment, and preloading a battery characteristic curve and an optimal fuel economy curve as prior knowledge; and inputting vehicle running data under a mixed working condition, and using the vehicle running data as training data of a pilot vehicle in a following model.
And step two, creating an Actor network and a Critic network based on the DDPG algorithm and the neural network structure, creating a target network for each of the Actor network and the Critic network respectively, and constructing a training network and a total reward function of the energy management strategy of the hybrid electric vehicle.
Step three, the agent interacts with the simulation environment, and offline training of the energy management strategy of the hybrid electric vehicle is carried out through the DDPG algorithm based on the constructed Actor and Critic networks and reward function, to obtain inheritable neural network parameters;
and step four, downloading the inheritable network parameters obtained by the offline training into a vehicle control unit of the hybrid electric vehicle, and realizing real-time online application.
According to the ecological driving energy management strategy based on deep reinforcement learning, in step one the car-following simulation environment is built with SUMO software, and the speed and acceleration of the vehicles in the simulation scene are obtained and controlled through the TraCI interface. The prior knowledge comprises a battery characteristic curve and an optimal fuel economy curve: the battery characteristic curve is used to construct the functional relation between the internal resistance, the open-circuit voltage and the SoC value, and the optimal fuel economy curve is used to construct the functional relation between the engine power and the speed and torque. The mixed driving cycle comprises a highway cycle and an urban road cycle and covers most car-following scenes, so that the training result can be applied to various roads.
The internal resistance and the open-circuit voltage of the battery are functions of its SoC value. Three groups of test data are input: the relation between the internal resistance and the SoC value in the charging state, the relation between the internal resistance and the SoC value in the discharging state, and the relation between the open-circuit voltage and the SoC value. Each relation is expressed explicitly by univariate linear interpolation fitting, so that the SoC value of the battery at any time and in any state can be solved from these functional relations.
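A minimal Python sketch of this univariate interpolation step is given below; the SoC grid and the resistance/voltage samples are illustrative placeholders rather than the bench-test data referred to above:

```python
import numpy as np

# Illustrative test samples (placeholders, not the patent's bench data):
# SoC grid with charging resistance, discharging resistance and open-circuit voltage.
soc_grid = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
r_chg    = np.array([0.012, 0.010, 0.009, 0.009, 0.010, 0.011])  # ohm
r_dis    = np.array([0.013, 0.011, 0.010, 0.010, 0.011, 0.012])  # ohm
v_oc     = np.array([3.20, 3.45, 3.60, 3.75, 3.95, 4.15])        # volt

def battery_curves(soc, charging):
    """Piecewise-linear fit of internal resistance and open-circuit voltage versus SoC."""
    r0  = np.interp(soc, soc_grid, r_chg if charging else r_dis)
    voc = np.interp(soc, soc_grid, v_oc)
    return r0, voc

r0, voc = battery_curves(0.55, charging=False)  # query the fitted curves at SoC = 0.55
```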
The engine and motor operating data obtained from bench tests are input as prior knowledge, and an optimal fuel economy curve model is constructed to represent the functional relation among the engine speed, torque and equivalent fuel consumption rate. Bivariate interpolation fitting expresses this relation explicitly, so that the engine output power, equal to the product of speed and torque, can be solved at any time and in any state.
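The bivariate fit can be sketched with SciPy's grid interpolator; the speed/torque grid and fuel-rate table below are placeholders standing in for the bench-test map, and the helper converts rpm and Nm into kW:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Placeholder engine map: speed grid (rpm) x torque grid (Nm) -> equivalent fuel rate (g/kWh).
speed_grid  = np.array([1000.0, 2000.0, 3000.0, 4000.0])
torque_grid = np.array([20.0, 60.0, 100.0, 140.0])
bsfc_table  = np.array([[420, 330, 300, 295],
                        [400, 310, 280, 275],
                        [410, 315, 285, 280],
                        [430, 330, 300, 298]], dtype=float)

bsfc = RegularGridInterpolator((speed_grid, torque_grid), bsfc_table,
                               bounds_error=False, fill_value=None)

def engine_power_and_fuel(speed_rpm, torque_nm):
    """Engine output power (kW) and equivalent fuel consumption rate at an operating point."""
    power_kw  = speed_rpm * 2.0 * np.pi / 60.0 * torque_nm / 1000.0  # P = omega * T
    fuel_rate = float(bsfc(np.array([speed_rpm, torque_nm])))
    return power_kw, fuel_rate
```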
According to the ecological driving energy management strategy based on deep reinforcement learning, in the second step, the inertial navigation system and the global positioning system are used for obtaining real-time speed and acceleration data of the hybrid vehicle, and the SoC value of the hybrid vehicle at any moment is obtained through the following equation:
SoC = (Q_0 - ∫ I dt) / Q,   I = (V_OC - √(V_OC^2 - 4 R_0 P_b)) / (2 R_0)

where SoC is the state of charge, V_OC is the open-circuit voltage, R_0 is the internal resistance, P_b is the battery output power in the charging and discharging phases, Q_0 is the initial capacity of the battery, Q is the nominal capacity of the battery, and I is the current of the battery at the present time.
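A short Python sketch of this SoC bookkeeping, assuming the standard internal-resistance battery model implied by the variables above (the exact expression in the patent is an equation image, and the numbers in the example are illustrative):

```python
def battery_current(p_b, voc, r0):
    """Battery current from the internal-resistance model: P_b = V_oc*I - R_0*I^2."""
    return (voc - (voc ** 2 - 4.0 * r0 * p_b) ** 0.5) / (2.0 * r0)

def soc_step(soc, p_b, dt, q_nominal_as, voc, r0):
    """One-step SoC update: SoC(t+dt) = SoC(t) - I*dt / Q, with Q in ampere-seconds."""
    return soc - battery_current(p_b, voc, r0) * dt / q_nominal_as

# Illustrative values: a 3 kW discharge for 1 s on a 25 Ah pack at SoC = 0.6.
soc_next = soc_step(soc=0.6, p_b=3000.0, dt=1.0, q_nominal_as=25.0 * 3600.0,
                    voc=350.0, r0=0.1)
```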
Combining the inter-vehicle distance, speed, acceleration and engine power in the car-following model, the state vector and the action vector are respectively defined as follows:
[equation image: definitions of the state vector (inter-vehicle distance, vehicle speeds and accelerations) and the action vector action = {a_h, e_h}]

where v_h and a_h are respectively the speed and acceleration of the target vehicle (the rear vehicle), L is the inter-vehicle distance, i.e. the distance from the front of the target vehicle to the rear of the pilot vehicle, v_p and a_p are respectively the speed and acceleration of the pilot vehicle, and e_h is the engine power of the target vehicle. a_h is the control action of the car-following model, and e_h is the control action of the energy management strategy.
In order to ensure the safety of the target vehicle during the following process and simultaneously take the riding comfort into consideration, the reward function of the following model is defined as follows:
r_follow = r_follow1 + r_follow2

[equation image: r_follow1, a penalty term based on the inter-vehicle distance limits L_min, L_max and the time to collision TTC]

[equation image: r_follow2, a penalty term based on the jerk of the target vehicle]

where L_min and L_max are the minimum and maximum values of the inter-vehicle distance and TTC is the time to collision. The reward term r_follow1 constrains the vehicle to run within the maximum and minimum following distances and describes safety during the following process; jerk is the rate of change of the target vehicle's acceleration at the sampling instant and describes ride comfort during following, and the reward term r_follow2 improves the ride experience of the driver and passengers.
In order to reduce engine fuel consumption and keep the battery SoC within an acceptable range, the instantaneous fuel consumption of the engine and the charge-sustaining cost of the battery must both be considered, so the reward function of the energy management strategy is defined as follows:
r_energy = -[fuel + 250(SoC_ref - SoC)^2]

where fuel is the fuel consumption of the target vehicle at the sampling instant and SoC_ref is the nominal SoC value of the battery.
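The energy-management term is given explicitly above; the Python sketch below combines it with a placeholder car-following term, since the exact r_follow1 and r_follow2 expressions are equation images in the patent (the thresholds and coefficients in follow_reward are assumptions):

```python
def energy_reward(fuel, soc, soc_ref=0.6):
    """r_energy = -[fuel + 250*(SoC_ref - SoC)^2], as defined in the text."""
    return -(fuel + 250.0 * (soc_ref - soc) ** 2)

def follow_reward(gap, ttc, jerk, l_min=5.0, l_max=100.0, ttc_min=2.5):
    """Placeholder shape for r_follow1 + r_follow2: penalize leaving the [l_min, l_max]
    gap band or a small time to collision, and penalize jerk for ride comfort."""
    r1 = -1.0 if (gap < l_min or gap > l_max or ttc < ttc_min) else 0.0
    r2 = -0.01 * jerk ** 2
    return r1 + r2

def total_reward(gap, ttc, jerk, fuel, soc):
    """reward = r_follow + r_energy."""
    return follow_reward(gap, ttc, jerk) + energy_reward(fuel, soc)
```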
In the ecological driving energy management strategy based on deep reinforcement learning provided by the invention, the adaptive cruise car-following model and the hybrid electric vehicle energy management strategy are innovatively fused together through the DDPG algorithm; the total reward function comprises the reward of the car-following model and the reward of the energy management strategy, and is defined as follows:
reward = r_follow + r_energy
Next, the training networks are constructed. An Actor network is constructed and denoted μ(s|θ^μ), where θ^μ are the network parameters; the input of the Actor network is the current state s and the output is the deterministic action a. A Critic network is constructed and denoted Q(s,a|θ^Q), where θ^Q are the network parameters; the inputs of the Critic network are the current state s and the deterministic action a output by the Actor network, and the outputs of the Critic network are the value function and gradient information.
Target networks μ′(s|θ^μ′) and Q′(s,a|θ^Q′) are established for the Actor network and the Critic network respectively; the network structure and parameter layout of each target network are the same as those of the corresponding network, θ^μ′ denotes the parameters of the target network of the Actor network, and θ^Q′ denotes the parameters of the target network of the Critic network. The constructed Actor and Critic networks and their target networks are applied to train the energy management strategy of the hybrid electric vehicle.
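A minimal PyTorch sketch of the Actor, Critic and target networks described here; the layer sizes, the state/action dimensions, and the tanh-bounded action output are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps the state s to the deterministic action a = {a_h, e_h}."""
    def __init__(self, state_dim=5, action_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions normalized to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a | theta_Q): maps a state-action pair to a scalar value."""
    def __init__(self, state_dim=5, action_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
actor_target  = copy.deepcopy(actor)   # same structure and parameters as the online networks
critic_target = copy.deepcopy(critic)
```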
In step three of the ecological driving energy management strategy based on deep reinforcement learning, the agent in the DDPG framework interacts with the simulation environment: it acquires the current environment state information, selects and executes an action according to the strategy, enters a new environment state, obtains the reward fed back by the environment, and simultaneously stores the state, action, reward and related information; training of the energy management strategy is realized through an experience replay pool in this loop. To make the model converge faster and achieve a better training effect, prioritized experience replay is adopted in the algorithm: each group of experience data is assigned the absolute value |δ_t| of its temporal-difference error, and samples with higher values have a higher probability of being sampled. The training steps are as follows:
step 1, initializing an Actor network, a Critic network and a target network thereof; a storage space R is defined as an experience replay pool and initialized.
Step 2, introducing action noise drawn from a Laplace random distribution to explore potentially better strategies.
Step 3, combining the state s_t at the current time t with Laplace random noise according to the action strategy to obtain the action vector a_t = {a_h, e_h}, i.e. a_t = μ(s_t|θ^μ) + Z_t. Action a_t is executed to obtain the reward r_t at the current time t and the state vector s_{t+1} at time t+1. Whether the current episode has ended is then checked: if the Boolean flag is true, the current episode ends and step 2 is executed; if it is false, step 4 is executed.
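Step 3's action selection can be sketched as follows, assuming the Actor network above and normalized actions; the Laplace noise scale is an illustrative choice:

```python
import numpy as np
import torch

def select_action(actor, state, noise_scale=0.1):
    """a_t = mu(s_t | theta_mu) + Z_t, with Z_t drawn from a Laplace distribution."""
    with torch.no_grad():
        a = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    z = np.random.laplace(loc=0.0, scale=noise_scale, size=a.shape)
    return np.clip(a + z, -1.0, 1.0)  # a_t = {a_h, e_h}, kept inside the action bounds
```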
Step 4, according to the absolute value |δ_t| of the temporal-difference error, calculating the sampling probability P(t) and the importance weight ω_t:

δ_t = y_t - Q(s_t, a_t|θ^Q)

wherein:

y_t = r_t + γQ′[s_{t+1}, μ′(s_{t+1}|θ^μ′)|θ^Q′]

where γ is the discount factor and y_t is the target Q value.
The absolute values |δ_t| of the temporal-difference errors are sorted in descending order and rank(t) denotes the resulting rank, from which the priority of experience t is defined:

p_t = 1 / rank(t)

The sampling probability of experience t is defined accordingly:

P(t) = p_t^α / Σ_k p_k^α
where n is the size of the experience replay pool (the sum over k runs over the n stored experiences) and α controls the degree to which priority is used; it takes a value between 0 and 1, and α = 0 corresponds to uniform sampling.
In order to increase the diversity of the experience pool and avoid the network from falling into an overfitting state, a sampling importance weight is defined:
[equation image: the sampling importance weight ω_t, expressed in terms of the replay pool size n, the sampling probability P(t), the minimum priority p_min, and the annealing exponent β]

where p_min represents the minimum value of p_t; β is the annealing exponent, whose initial value β_0 lies between 0 and 1 and which is annealed linearly to 1.
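A compact sketch of this rank-based prioritized replay, assuming priority p_t = 1/rank(t) and the usual importance-weight normalization; a plain list stands in for the binary-tree (sum-tree) structure mentioned in step 5:

```python
import numpy as np

class RankPrioritizedReplay:
    """Rank-based prioritized experience replay (list-backed sketch)."""
    def __init__(self, capacity, alpha=0.7, beta0=0.5):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta0
        self.data, self.td_abs = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:       # drop the oldest experience
            self.data.pop(0)
            self.td_abs.pop(0)
        self.data.append(transition)
        self.td_abs.append(abs(td_error))

    def sample(self, batch_size):
        order = np.argsort(self.td_abs)[::-1]     # sort |delta_t| in descending order
        rank = np.empty(len(order))
        rank[order] = np.arange(1, len(order) + 1)
        p = (1.0 / rank) ** self.alpha            # priority p_t = 1/rank(t), raised to alpha
        prob = p / p.sum()                        # sampling probability P(t)
        idx = np.random.choice(len(self.data), batch_size, p=prob)
        w = (len(self.data) * prob[idx]) ** (-self.beta)
        w /= w.max()                              # normalized importance weights omega_t
        return [self.data[i] for i in idx], idx, w
```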
Step 5, the experience replay pool adopts a binary-tree data structure; the information generated in the interaction is stored into its leaf nodes in the form of the tuple T_t = (s_t, a_t, r_t, s_{t+1}, bool), and T_t is simultaneously kept as the training data set for the Actor and Critic networks.
Step 6, sampling from the experience replay pool R according to the sampling probability by prioritized experience replay to obtain a mini-batch of samples S (the number of samples is recorded as N) for training the Actor and Critic networks.
Step 7, calculating the gradient of the Critic network through the chain rule, and calculating the loss function L(θ^Q) of the Critic network:

L(θ^Q) = (1/N) Σ_i ω_i [y_i - Q(s_i, a_i|θ^Q)]^2
Step 8, updating the parameters θ^Q of the Critic network by using the adaptive moment estimation algorithm (Adam), and calculating the gradient of the Actor network:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}

where ∇ is the gradient operator, J is the objective function of the DDPG algorithm, a denotes the action, and s denotes the state.
Step 9, updating the parameters θ^μ of the Actor network by using the adaptive moment estimation algorithm (Adam), and updating the target network parameters of the Actor and Critic networks by a soft update, namely updating the Critic and Actor target networks by a small amount at each time step:

θ^Q′ ← τθ^Q + (1 - τ)θ^Q′
θ^μ′ ← τθ^μ + (1 - τ)θ^μ′

where τ is the update amplitude, with a default value of 0.001.
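Steps 7 to 9 can be sketched in PyTorch as below, reusing the Actor/Critic networks sketched earlier; weighting the critic loss by the importance weights is an assumption about the exact weighting, and the learning rates are illustrative:

```python
import torch

def ddpg_update(batch, weights, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One training iteration covering steps 7-9: critic loss, actor gradient, soft update."""
    s, a, r, s_next, done = batch  # float tensors; r and done have shape [N, 1]
    w = torch.as_tensor(weights, dtype=torch.float32).unsqueeze(-1)

    # Step 7: critic loss with target y = r + gamma * Q'(s', mu'(s'))
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_target(s_next, actor_target(s_next))
    critic_loss = (w * (y - critic(s, a)) ** 2).mean()

    # Step 8: update theta_Q with Adam, then form the deterministic policy gradient
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    actor_loss = -critic(s, actor(s)).mean()

    # Step 9: update theta_mu with Adam and soft-update both target networks
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    with torch.no_grad():
        for net, tgt in ((critic, critic_target), (actor, actor_target)):
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)  # theta' <- tau*theta + (1-tau)*theta'

# Usage with the networks above:
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)   # Adam, as in steps 8-9
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```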
Step 10, repeating steps 2 to 9 until training is finished, and then saving and downloading the neural network parameters.
In a preferred embodiment of the present invention, the step one specifically includes the following steps:
step 1, pre-writing road network and vehicle files in the SUMO, and calling the files in the python program through a Traci interactive interface.
Step 2, inputting prior knowledge, and obtaining an explicit functional relationship by an interpolation fitting method, wherein the explicit functional relationship comprises four groups of functional relationships: (1) The functional relation between the engine speed, the torque and the equivalent fuel consumption rate; (2) a functional relation between the internal resistance and the SoC value in a charging state; (3) functional relation between internal resistance and SoC value in discharge state; and (4) the functional relation between the open-circuit voltage and the SoC value. The images are drawn as shown in fig. 2 and 3. The functional relation is used for solving the battery SoC value and the engine output power at any time and in any state.
Step 3, inputting the mixed driving-cycle data as the driving information of the pilot vehicle; the mixed cycle consists of a highway cycle and an urban road cycle and covers car-following scenes under most road conditions. The average speed of the data set is 44 km/h, the maximum speed is 116 km/h, and the duration is 1858 s.
Step 4, the SoC value of the hybrid vehicle at any time is obtained through the following equation:
SoC = (Q_0 - ∫ I dt) / Q,   I = (V_OC - √(V_OC^2 - 4 R_0 P_b)) / (2 R_0)

where SoC is the state of charge, V_OC is the open-circuit voltage, R_0 is the internal resistance, P_b is the battery output power in the charging and discharging phases, Q_0 is the initial capacity of the battery, Q is the nominal capacity of the battery, and I is the current of the battery at the present time.
In a preferred embodiment of the present invention, the second step specifically includes the following steps:
step 1, defining a state set and a behavior set as follows:
[equation image: definitions of the state set (inter-vehicle distance, vehicle speeds and accelerations) and the action set action = {a_h, e_h}]

where v_h and a_h are respectively the speed and acceleration of the target vehicle (the rear vehicle), L is the inter-vehicle distance, i.e. the distance from the front of the target vehicle to the rear of the pilot vehicle, v_p and a_p are respectively the speed and acceleration of the pilot vehicle, and e_h is the engine power of the target vehicle. a_h is the control action of the car-following model, and e_h is the control action of the energy management strategy.
The reward function is defined as follows:
reward = r_follow1 + r_follow2 + r_energy

[equation image: r_follow1, a penalty term based on the inter-vehicle distance limits and the time to collision TTC]

[equation image: r_follow2, a penalty term based on the jerk of the target vehicle]

r_energy = -[fuel + 250(SoC_ref - SoC)^2]

where TTC is the time to collision, jerk is the rate of change of the acceleration of the target vehicle at the sampling instant, fuel is the fuel consumption of the target vehicle at the sampling instant, and SoC_ref is the nominal SoC value of the battery.
Step 2, constructing an Actor network, denoted μ(s|θ^μ), where θ^μ are the network parameters; the input of the Actor network is the current state s and the output is the deterministic action a.

Step 3, constructing a Critic network, denoted Q(s,a|θ^Q), where θ^Q are the network parameters; the inputs of the Critic network are the current state s and the deterministic action a output by the Actor network, and the outputs are the value function and gradient information.
Step 4, respectively establishing target networks for the Actor network and the Critic network, where the network structure and parameter layout of each target network are the same as those of the corresponding network; θ^μ′ is recorded as the parameters of the target network of the Actor network, and θ^Q′ as the parameters of the target network of the Critic network.
In a preferred embodiment of the present invention, the DDPG algorithm flow is shown in FIG. 4.
In a preferred embodiment of the present invention, the step three specifically includes the following steps:
step 1, initializing Actor network mu (s | theta) μ ) And Critic network Q (s, a | θ) Q ) And its target network mu' (s | theta) μ ′)、Q′(s,a|θ Q ') to a host; defining a storage space R as an experience playback pool, and setting the capacity to be N; initializing hyper-parameters alpha and beta; and setting the maximum circulation times M of the intelligent agent.
Step 2, initializing the simulation environment and the training data of the pilot vehicle to obtain the initial state s_t.
Step 3, according to the initial state s_t, selecting an action from the action strategy plus Laplace random noise, namely a_t = μ(s_t|θ^μ) + Z_t. Action a_t is executed to obtain the reward r_t at the current time and the state vector s_{t+1} at the next time. Whether the current episode has ended is then judged: if the Boolean flag is true, the current episode ends and step 2 is executed; if it is false, step 4 is executed.
Step 4, calculating the absolute value |δ_t| of the temporal-difference error, then the sampling probability P(t) and the importance weight ω_t.
Step 5, storing the information generated in the interaction into the experience replay pool in the form of the tuple T_t = (s_t, a_t, r_t, s_{t+1}, bool), and simultaneously keeping T_t as the training data set for the Actor and Critic networks.
Step 6, sampling from the experience replay pool R according to the sampling probability by prioritized experience replay to obtain a mini-batch of samples S for training the Actor and Critic networks.
Step 7, calculating the gradient of the Critic network, and calculating the loss function L(θ^Q) of the Critic network:

L(θ^Q) = (1/N) Σ_i ω_i [y_i - Q(s_i, a_i|θ^Q)]^2
Step 8, updating the parameters θ^Q of the Critic network by using the adaptive moment estimation algorithm (Adam), and calculating the gradient of the Actor network:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}

where ∇ is the gradient operator, J is the objective function of the algorithm, a denotes the action, and s denotes the state.
Step 9, updating the parameters θ^μ of the Actor network by using the adaptive moment estimation algorithm (Adam), and updating the target network parameters of the Actor and Critic networks by a soft update, namely updating the Critic and Actor target networks by a small amount at each time step:

θ^Q′ ← τθ^Q + (1 - τ)θ^Q′
θ^μ′ ← τθ^μ + (1 - τ)θ^μ′

where τ is the update amplitude, with a default value of 0.001.
Step 10, repeating steps 2 to 9 until the maximum number of episodes M is reached; training is then finished, and the neural network parameters are saved and downloaded.
In a preferred embodiment of the present invention, the step four specifically is: and downloading the network parameters obtained by off-line training into a vehicle control unit of the hybrid electric vehicle to realize real-time on-line application.
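Saving the trained parameters for download to the vehicle control unit can be sketched as follows, reusing the Actor class from the earlier sketch; the file name and the example state are placeholders, and on the control unit only the Actor forward pass is needed for real-time inference:

```python
import torch

# After offline training: persist the Actor parameters.
torch.save(actor.state_dict(), "actor_trained.pt")

# On the vehicle control unit (or its rapid-prototyping host): load and run inference only.
deployed_actor = Actor()
deployed_actor.load_state_dict(torch.load("actor_trained.pt", map_location="cpu"))
deployed_actor.eval()

with torch.no_grad():
    # Illustrative normalized state at the current sampling instant.
    state = torch.tensor([0.5, 0.1, -0.2, 0.4, 0.0])
    a_h, e_h = deployed_actor(state).tolist()  # normalized acceleration and engine power commands
```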
The above description is only one embodiment of the present invention, and the scope of the invention is not limited thereto; any modification or substitution readily conceived by a person skilled in the art within the technical scope disclosed herein falls within the scope of the invention, which should therefore be defined by the protection scope of the claims.

Claims (9)

1. A new energy automobile energy management and adaptive cruise cooperative optimization method is characterized by comprising the following steps:
step one, building a following model simulation environment, and preloading a battery characteristic curve and an optimal fuel economy curve as prior knowledge; inputting vehicle running data under a mixed working condition, and using the vehicle running data as training data of a pilot vehicle in a following model;
creating an Actor network and a Critic network based on a DDPG algorithm and a neural network structure, respectively creating a target network for the Actor network and the Critic network, constructing a training network of the energy management strategy of the hybrid electric vehicle, and constructing a total reward function of the energy management strategy of the hybrid electric vehicle;

step three, the agent interacts with the simulation environment, and based on the established Actor network, Critic network and reward function, the energy management strategy of the hybrid electric vehicle is trained offline through a DDPG algorithm to obtain inheritable neural network parameters;
and step four, downloading the inheritable network parameters obtained by the offline training into a vehicle control unit of the hybrid electric vehicle, and realizing real-time online application.
2. The new energy automobile energy management and adaptive cruise cooperative optimization method according to claim 1, characterized in that in the first step, SUMO software is used for building a car following model simulation environment, and the speed and the acceleration of a vehicle in a simulation scene are obtained and controlled through a Traci interaction interface.
3. The method for collaborative optimization of new energy vehicle energy management and adaptive cruise according to claim 1, wherein the prior knowledge includes: the battery characteristic curve is used for constructing a functional relation among the internal resistance, the open-circuit voltage and the SoC value, so that the SoC value of the battery at any time and in any state is solved; the optimal fuel economy curve is used for constructing a functional relation between the engine power and the rotating speed and the torque, so that the engine output power at any time and any state can be solved.
4. The method for collaborative optimization of new energy vehicle energy management and adaptive cruise according to claim 1, characterized in that the hybrid condition is composed of a highway condition and an urban road condition.
5. The method for energy management and adaptive cruise cooperative optimization of a new energy vehicle according to claim 3, wherein the SoC value of the hybrid vehicle at any time is obtained through the following equation:
SoC = (Q_0 - ∫ I dt) / Q,   I = (V_OC - √(V_OC^2 - 4 R_0 P_b)) / (2 R_0)

wherein SoC is the state of charge, V_OC is the open-circuit voltage, R_0 is the internal resistance, P_b is the battery output power in the charge and discharge phases, Q_0 is the initial capacity of the battery, Q is the nominal capacity of the battery, and I is the current of the battery at the present time.
6. The method for collaborative optimization of new energy automobile energy management and adaptive cruise control according to claim 1, characterized in that, in combination with distance between two vehicles, speed, acceleration and engine power in a follow-up model, state vector state and action vector action are respectively defined as follows:
[equation image: definitions of the state vector state and the action vector action = {a_h, e_h}]

wherein v_h is the speed of the target vehicle; L is the inter-vehicle distance, i.e. the distance from the front of the target vehicle to the rear of the pilot vehicle; v_p and a_p are respectively the speed and acceleration of the pilot vehicle; a_h is the control action of the car-following model, i.e. the acceleration of the target vehicle, and e_h is the control action of the energy management strategy, i.e. the engine power of the target vehicle;
the reward function defining the following model is as follows:
r_follow = r_follow1 + r_follow2

[equation image: r_follow1, a penalty term based on the inter-vehicle distance limits L_min, L_max and the time to collision TTC]

[equation image: r_follow2, a penalty term based on the jerk of the target vehicle]

wherein L_min and L_max are the minimum and maximum values of the inter-vehicle distance, TTC is the time to collision, and jerk is the rate of change of the acceleration of the target vehicle at the sampling instant;

the reward function that defines the energy management policy is as follows:

r_energy = -[fuel + 250(SoC_ref - SoC)^2]

wherein fuel is the fuel consumption of the target vehicle at the sampling instant and SoC_ref is the nominal SoC value of the battery;

the total reward function defining the energy management strategy of the hybrid electric vehicle is as follows:

reward = r_follow + r_energy
7. The method for collaborative optimization of new energy automobile energy management and adaptive cruise control according to claim 1, characterized in that an Actor network is constructed and denoted μ(s|θ^μ), where θ^μ are the network parameters; the input of the Actor network is the current state s and the output is the deterministic action a; a Critic network is constructed and denoted Q(s,a|θ^Q), where θ^Q are the network parameters; the inputs of the Critic network are the current state s and the deterministic action a output by the Actor network, and the outputs are the value function and gradient information;

target networks μ′(s|θ^μ′) and Q′(s,a|θ^Q′) are established for the Actor network and the Critic network respectively; the target networks μ′(s|θ^μ′) and Q′(s,a|θ^Q′) have the same structure as the corresponding networks μ(s|θ^μ) and Q(s,a|θ^Q); θ^μ′ is recorded as the parameters of the target network of the Actor network and θ^Q′ as the parameters of the target network of the Critic network; the constructed target networks of the Actor and Critic networks are applied to train the hybrid electric vehicle energy management strategy.
8. The method for collaborative optimization of new energy vehicle energy management and adaptive cruise according to claim 1, characterized by: in the third step, the agent interacts with the simulation environment, acquires the current environment state information, selects and executes actions according to the strategy, enters a new environment state, acquires the reward fed back by the simulation environment, stores the state, action and reward information at the same time, and realizes the training of the hybrid power energy management strategy through an experience replay pool.
9. The method for collaborative optimization of new energy vehicle energy management and adaptive cruise according to claim 1, characterized by: in the third step, the offline training of the energy management strategy of the hybrid electric vehicle adopts a prioritized experience replay technique, and the specific training steps are as follows:
step 1, initializing an Actor network, a Critic network and their target networks; defining a storage space R as an experience replay pool and initializing it;
step 2, introducing action noise Z_t at time t drawn from a Laplace random distribution to explore potentially better strategies;
step 3, combining the state s_t at time t with Laplace random noise according to the action strategy to obtain the action vector a_t = {a_h, e_h} at time t, namely a_t = μ(s_t|θ^μ) + Z_t; executing the action vector a_t to obtain the total reward r_t at the current time and the state s_{t+1} at time t+1; judging whether the current episode has ended: if the Boolean flag is true, ending the current episode and returning to step 2; if the Boolean flag is false, continuing with step 4;
step 4, according to the absolute value |δ_t| of the temporal-difference error, calculating the sampling probability P(t) and the importance weight ω_t:

δ_t = y_t - Q(s_t, a_t|θ^Q)

wherein:

y_t = r_t + γQ′[s_{t+1}, μ′(s_{t+1}|θ^μ′)|θ^Q′]

where γ is the discount factor and y_t is the target Q value at time t;
the absolute values |δ_t| of the temporal-difference errors are sorted in descending order, rank(t) is recorded as the resulting rank, and the priority of experience t is defined accordingly:

p_t = 1 / rank(t)

the sampling probability of experience t is defined accordingly:

P(t) = p_t^α / Σ_k p_k^α
where n is the size of the experience replay pool (the sum over k runs over the n stored experiences) and α represents the degree to which priority is used;
defining sampling importance weights:
[equation image: the sampling importance weight ω_t, expressed in terms of the sampling probability P(t), the minimum priority p_min, and the annealing exponent β]

wherein p_min represents the minimum value of p_t, and β is the annealing exponent;
step 5, the experience replay pool adopts a binary-tree data structure; the information generated in the interaction is stored into its leaf nodes in the form of the tuple T_t = (s_t, a_t, r_t, s_{t+1}, bool), and T_t is kept as the training data set of the Actor and Critic networks;
step 6, sampling from the experience replay pool R according to the sampling probability by prioritized experience replay to obtain a mini-batch of samples S, the number of which is recorded as N, for training the Actor and Critic networks;
step 7, calculating the gradient of the Critic network, and calculating the loss function L(θ^Q) of the Critic network:

L(θ^Q) = (1/N) Σ_i ω_i [y_i - Q(s_i, a_i|θ^Q)]^2
step 8, updating the parameters θ^Q of the Critic network by using the adaptive moment estimation algorithm Adam, and calculating the gradient of the Actor network:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}

where ∇ is the gradient operator, J is the objective function of the DDPG algorithm, a denotes the action, and s denotes the state;
step 9, updating the parameters θ^μ of the Actor network by using the adaptive moment estimation algorithm Adam, and updating the target network parameters of the Actor and Critic networks by a soft update, namely updating the Critic and Actor target networks by the set amplitude τ at each time step:

θ^Q′ ← τθ^Q + (1 - τ)θ^Q′
θ^μ′ ← τθ^μ + (1 - τ)θ^μ′
and step 10, repeating steps 2 to 9 until the preset maximum number of iterations is reached, finishing the training, and then saving and downloading the neural network parameters.
CN202211253311.XA 2022-10-13 2022-10-13 New energy automobile energy management and adaptive cruise cooperative optimization method Pending CN115563716A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211253311.XA CN115563716A (en) 2022-10-13 2022-10-13 New energy automobile energy management and adaptive cruise cooperative optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211253311.XA CN115563716A (en) 2022-10-13 2022-10-13 New energy automobile energy management and adaptive cruise cooperative optimization method

Publications (1)

Publication Number Publication Date
CN115563716A true CN115563716A (en) 2023-01-03

Family

ID=84744500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211253311.XA Pending CN115563716A (en) 2022-10-13 2022-10-13 New energy automobile energy management and adaptive cruise cooperative optimization method

Country Status (1)

Country Link
CN (1) CN115563716A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12030657B1 (en) 2023-10-27 2024-07-09 Rtx Corporation System and methods for power split algorithm design for aircraft hybrid electric propulsion based on combined actor-critic RL agent and control barrier function filter
CN117807714A (en) * 2024-01-05 2024-04-02 重庆大学 Adaptive online lifting method for deep reinforcement learning type control strategy
CN117708999A (en) * 2024-02-06 2024-03-15 北京航空航天大学 Scene-oriented hybrid electric vehicle energy management strategy evaluation method
CN117708999B (en) * 2024-02-06 2024-04-09 北京航空航天大学 Scene-oriented hybrid electric vehicle energy management strategy evaluation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination