CN117184095B - Hybrid electric vehicle system control method based on deep reinforcement learning - Google Patents

Hybrid electric vehicle system control method based on deep reinforcement learning

Info

Publication number
CN117184095B
Authority
CN
China
Prior art keywords
mode
power
critic
reinforcement learning
actor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311359313.1A
Other languages
Chinese (zh)
Other versions
CN117184095A (en)
Inventor
张亚辉
王子萌
王众
田阳
焦晓红
文桂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202311359313.1A priority Critical patent/CN117184095B/en
Publication of CN117184095A publication Critical patent/CN117184095A/en
Application granted granted Critical
Publication of CN117184095B publication Critical patent/CN117184095B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electric Propulsion And Braking For Vehicles (AREA)
  • Hybrid Electric Vehicles (AREA)

Abstract

The invention provides a control method for a hybrid electric vehicle system based on deep reinforcement learning, which comprises the following steps: S1, acquiring multidimensional road condition information of a plug-in hybrid logistics light truck over its historical driving process; S2, establishing a whole-vehicle powertrain model; S3, pre-optimizing the two motors to reduce the optimization dimension; S4, performing dynamic programming calculation to generate a state transition data set; S5, determining the state variables, action variables and reward function required by the reinforcement learning algorithm; S6, pre-training the critic and actor networks with the state transition data set generated in step S4; S7, building an environment-agent model and iteratively training the energy management strategy with a deep reinforcement learning algorithm; and S8, applying the model. The model is trained with the DDPG algorithm to obtain a trained deep reinforcement learning agent, so that the logistics light truck adapts to random working conditions on a fixed route while fuel economy is ensured.

Description

Hybrid electric vehicle system control method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of new energy automobiles, in particular to a hybrid electric vehicle system control method based on deep reinforcement learning.
Background
In recent years, both industry and academia have been researching and developing new types of automotive driveline, as increasingly stringent requirements call for continued reduction of fuel consumption and pollutant emissions. Although electric vehicles (EV) can achieve zero emission, their popularity is limited by shortcomings such as driving range, charging infrastructure and battery cost. Plug-in hybrid electric vehicles (PHEV) combine the advantages of the conventional internal combustion engine and the electric motor, and are receiving increasing attention from automobile manufacturers.
Currently, energy management strategies are key to improving the fuel economy of hybrid vehicles. Existing energy management strategies (EMS) generally fall into three broad categories: rule-based, optimization-based and learning-based strategies. Rule-based strategies, although effective and easy to implement, have control performance tied to particular vehicle types and configurations and are therefore not suited to diverse driving needs. Optimization-based strategies include dynamic programming (DP), Pontryagin's minimum principle (PMP), etc., and in practical applications they rely largely on global driving-condition information. Strategies based on reinforcement learning (RL) can address the deficiencies of the above conventional control techniques. Compared with dynamic programming, a reinforcement-learning-based energy management strategy can run online, saves a great deal of computation time and cost, and obtains a near-globally-optimal result. These characteristics make it an efficient and robust development approach for hybrid system energy management strategies. However, existing methods suffer from inaccurate results and low computation speed and are not suitable for wide application, so an efficient and accurate energy management method is urgently needed.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a control method of a hybrid electric vehicle system based on deep reinforcement learning that can be applied in real time while keeping the optimization result close to optimal, processes efficiently and quickly with accurate results, exploits the fuel-saving potential of the configuration under various driving modes, and improves the overall economy of the vehicle. Meanwhile, experience generated by dynamic programming of historical working conditions is introduced into the optimization process, which accelerates the reinforcement learning algorithm and further improves the accuracy of the calculation result.
Specifically, the invention provides a control method of a hybrid electric vehicle system based on deep reinforcement learning, which specifically comprises the following steps:
S1, acquiring fixed-route multidimensional road condition information in the historical running process of a plug-in hybrid electric vehicle;
s2, establishing a whole vehicle power system model and a power battery model of the plug-in hybrid electric vehicle; the method specifically comprises the following substeps:
s21, a whole vehicle power system model is as follows:
P_dem = P_m12 + P_eng·η_AMT
where P_dem is the required power, P_m12 is the equivalent motor power, P_eng is the engine power, and η_AMT is the gearbox efficiency;
P_m12 = P_EM + P_ISG·η_AMT
where P_EM is the EM motor power and P_ISG is the ISG motor power;
S22, building a power battery model:
where I_bat is the power battery current, U_OC is the power battery open-circuit voltage, and R_bat is the power battery internal resistance;
The rate of change of the power battery state of charge is dSOC/dt = -I_bat/Q_bat, where Q_bat is the power battery capacity;
s3, pre-optimizing the two motors to reduce the optimized dimension, and specifically comprises the following substeps:
S31, taking total energy conversion efficiency of two motors as a target, and establishing a pre-optimized objective function as follows:
where P_EM is the EM motor power, η_EM is the EM motor efficiency, P_ISG is the ISG motor power, η_ISG is the ISG motor efficiency, and i_g is the gear;
s32, discretizing the vehicle speed and the equivalent motor power to form grid points to be traversed;
S33, under each mode, calculating the optimal power distribution of the two motors under each grid point formed in the step S32 according to the objective function established in the step S31;
S4, carrying out dynamic programming calculation on the multidimensional road condition information of the plug-in hybrid electric vehicle acquired in the step S1 by combining the pre-optimized result in the step S3 to generate a state transition data set;
S5, determining the state variables, action variables and reward function required by the reinforcement learning algorithm; the state variable s, the action variable a and the reward function r are respectively:
s = (P_dem, SOC, d, θ, v, v_1, v_2)
where P_dem is the power required by the whole vehicle, SOC is the battery state of charge, d is the driving distance, θ is the current road gradient, v is the current vehicle speed, v_1 is the vehicle speed one second earlier, and v_2 is the vehicle speed two seconds earlier;
a = (mode, P_eng)
where mode is the selected mode and P_eng is the engine power;
r = -(C_fuel·price_fuel + ΔSOC·Q_bat·price_electric)
where C_fuel is the one-step fuel consumption, price_fuel is the fuel price per unit of consumption, ΔSOC is the one-step state-of-charge change, Q_bat is the power battery capacity, and price_electric is the electricity price per kilowatt-hour;
S6, pre-training critic and actor networks by using the state transition data set generated in the step S4;
S7, building an environment-agent model, migrating the state transition data set generated in step S4 into an experience pool, taking the two neural networks trained in step S6 as the initial network models, and iteratively training the energy management strategy with a deep reinforcement learning algorithm until the algorithm converges to obtain the final environment-agent model;
And S8, adding the converged environment intelligent agent model into the HCU controller, and performing online application on a fixed line.
Preferably, the fixed-route multidimensional road condition information obtained in the step S1 includes a plurality of sets of historical vehicle speed curves and road gradient curves collected by the electric vehicle driving on the fixed route.
Preferably, the modes in step S33 include a series mode, a pure electric mode, a parallel first gear mode, and a parallel second gear mode.
Preferably, step S4 specifically comprises the following sub-steps:
s41, selecting a plurality of groups of working conditions as road condition multidimensional information for the history working conditions under the same road section acquired in the step S1;
S42, respectively calculating the minimum intervention vehicle speed in each mode, and respectively traversing the executable modes under the grid points formed by time and SOC according to the minimum intervention vehicle speed condition;
And S43, when the pure electric mode, the first-gear series-parallel mode and the second-gear series-parallel mode are traversed, the power of the two motors is distributed according to the pre-optimized result in the step S3.
Preferably, the step S6 specifically comprises the following substeps;
S61, collecting a state S, an optimal action a, a generated return r, a next state S' and a value function V of the state under each grid point of the data set generated in the step S4;
S62, constructing the critic and actor neural network structures, each comprising an input layer, a hidden layer and an output layer, and determining the number of neurons in each layer;
S63, constructing the output sample set of the critic network as follows:
Q(s,a) = r + γ·V(s')
where r is the return in the data set generated in step S4 and V(s') is the value function of the next state s' in the data set generated in step S4;
S64, taking (s, a) from step S61 and Q(s, a) from step S63 as the input and output samples of the critic network, respectively, with s and a as the input and output samples of the actor network, and pre-training the critic and actor networks by the gradient descent method.
Preferably, in step S62, the critic and actor neural network structures output the values of the four modes and the engine power, and the actual action is selected as follows:
a = (argmax(V_1, V_2, V_3, V_4), P_eng)
where V_1, V_2, V_3 and V_4 are the values of the series mode, the pure electric mode, the parallel first-gear mode and the parallel second-gear mode, respectively.
Preferably, the step S7 specifically includes the following substeps:
S71, combining a whole vehicle environment module and a DDPG algorithm module to construct an interactive algorithm;
S72, migrating the state transition data set generated in the step S4 into an experience pool of an interactive algorithm, taking the two neural network models pre-trained in the step S6 as initial critic and actor neural network models, and completing the establishment of an intelligent agent module;
S73, defining real-time state parameters of the whole vehicle and corresponding rewarding values as input parameters of a neural network in an intelligent agent module in each training, taking control variables output by the neural network as input parameters of a whole vehicle model in an environment module, generating new rewarding values after a vehicle executes a control command, and storing obtained experiences in an experience pool;
s74, the agent updates through the strategy gradient to realize the learning updating step of the neural network;
and S75, repeatedly iterating until the algorithm converges to obtain a final environment intelligent agent model.
Preferably, the specific policy gradient update formula in step S74 is as follows:
where r is the single-step reward, s and s' are the current state quantity and the state quantity at the next moment, a is the current action quantity, θ_Q and θ_μ are the critic and actor network parameters at the current moment, θ_Q' and θ_μ' are the corresponding target critic and actor network parameters, γ and μ are weight parameters, and α and τ are the learning rate and the target network update rate.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention adopts the deep reinforcement learning algorithm to manage the energy, realizes the instantaneity and optimality of the energy management strategy, and ensures the maximization of energy utilization.
(2) In the invention, for the condition where the two motors work simultaneously, a pre-optimization calculation is adopted so that the simultaneous optimization of three power sources becomes a layered optimization problem: first, the power distribution between the two motors in the powertrain is treated as a static optimization problem for power-split and gear pre-optimization; second, deep reinforcement learning is used for mode selection and for the power distribution between the engine and the equivalent motor, which reduces the optimization difficulty.
(3) According to the invention, the results of dynamic planning of a plurality of groups of working conditions acquired on a fixed route are used as expert knowledge to be transferred to the reinforcement learning agent, so that the convergence speed of the reinforcement learning algorithm is greatly increased, and the calculation efficiency can be greatly improved.
(4) According to the invention, the historical data and the topographic information are fully utilized aiming at the scene of the electric vehicle delivering goods on the fixed route, so that the method can effectively learn the driving characteristics of the electric vehicle under the fixed route, and is more applicable than a general energy management strategy under the fixed route delivering scene.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a flow chart of an energy management strategy algorithm in embodiment 1 of the present invention;
Fig. 3 is a schematic structural diagram of the P2P3 plug-in hybrid light truck and an illustration of its mode division in embodiment 1 of the present invention;
FIG. 4 is a pre-optimization flow chart in embodiment 1 of the present invention;
fig. 5 is a schematic diagram of critic and actor networks in embodiment 1 of the present invention;
FIG. 6 is a schematic illustration of the fixed route driven by the light truck, part of the road segment used in the 6th IFAC E-COSM 2021 challenge, in embodiment 2 of the present invention;
FIGS. 7 a-7 b are maps of two motor efficiencies of the light truck of example 2 of the present invention;
FIG. 8 is a graph showing the results after mode two pre-optimization in example 2 of the present invention;
FIG. 9 is a graph showing the results after mode three pre-optimization in example 2 of the present invention;
FIG. 10 is a schematic diagram of the results after mode four pre-optimization in example 2 of the present invention;
FIG. 11 is a schematic diagram of the history collected under the road section shown in FIG. 6 in embodiment 2 of the present invention;
FIG. 12 is a schematic diagram of the energy distribution results of the dynamic programming energy management strategy in embodiment 2 of the present invention under a certain working condition;
fig. 13 is a diagram showing convergence characteristics of the training process in embodiment 2 of the present invention.
Detailed Description
The invention will be described in further detail in the following detailed description of specific embodiments thereof with reference to the drawings. The following examples or figures are illustrative of the invention and are not intended to limit the scope of the invention.
The invention provides a control method of a hybrid electric vehicle system based on deep reinforcement learning, as shown in fig. 1, the method specifically comprises the following steps:
s1, acquiring fixed-route multidimensional road condition information in the history running process of the plug-in hybrid electric vehicle.
S2, establishing a whole vehicle power system model and a power battery model of the plug-in hybrid electric vehicle; the method specifically comprises the following substeps:
s21, a whole vehicle power system model is as follows:
P_dem = P_m12 + P_eng·η_AMT
where P_dem is the required power, P_m12 is the equivalent motor power, P_eng is the engine power, and η_AMT is the gearbox efficiency;
P_m12 = P_EM + P_ISG·η_AMT
where P_EM is the EM motor power and P_ISG is the ISG motor power.
S22, building a power battery model:
where I_bat is the power battery current, U_OC is the power battery open-circuit voltage, and R_bat is the power battery internal resistance;
The rate of change of the power battery state of charge is dSOC/dt = -I_bat/Q_bat, where Q_bat is the power battery capacity.
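For illustration, the power balance of step S21 and a battery model of the kind described in step S22 can be evaluated as in the following minimal Python sketch; the Rint equivalent-circuit equations, the assumption that the battery power equals the equivalent motor electrical power, and all parameter values are illustrative and not taken from the patent.

```python
import math

# Illustrative parameters (assumed, not from the patent)
ETA_AMT = 0.95           # gearbox efficiency
U_OC = 350.0             # battery open-circuit voltage [V]
R_BAT = 0.1              # battery internal resistance [ohm]
Q_BAT_AS = 90.0 * 3600   # battery capacity [A*s] (90 Ah)

def equivalent_motor_power(p_dem_w, p_eng_w):
    """P_dem = P_m12 + P_eng * eta_AMT, solved for the equivalent motor power."""
    return p_dem_w - p_eng_w * ETA_AMT

def soc_rate(p_bat_w):
    """Rint model: solve U_oc*I - R_bat*I^2 = P_bat for the battery current,
    then dSOC/dt = -I_bat / Q_bat (current positive when discharging)."""
    disc = U_OC ** 2 - 4.0 * R_BAT * p_bat_w
    i_bat = (U_OC - math.sqrt(max(disc, 0.0))) / (2.0 * R_BAT)
    return -i_bat / Q_BAT_AS

# Example: 60 kW demand with the engine contributing 30 kW through the gearbox
p_m12 = equivalent_motor_power(60e3, 30e3)
print(p_m12, soc_rate(p_m12))
```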
S3, pre-optimizing the two motors to reduce the optimized dimension, and specifically comprises the following substeps:
S31, taking total energy conversion efficiency of two motors as a target, and establishing a pre-optimized objective function as follows:
where P_EM is the EM motor power, η_EM is the EM motor efficiency, P_ISG is the ISG motor power, η_ISG is the ISG motor efficiency, and i_g is the gear.
S32, discretizing the vehicle speed and the equivalent motor power to form grid points needing to be traversed.
S33, under each mode, calculating the optimal power distribution of the two motors at each grid point formed in step S32 according to the objective function established in step S31; the modes include a series mode, a pure electric mode, a parallel first-gear mode and a parallel second-gear mode.
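As a concrete illustration of steps S31 to S33, the sketch below traverses a (vehicle speed, equivalent motor power) grid and, at each point, searches the EM/ISG power split that maximizes the combined motor conversion efficiency. The efficiency functions are smooth stand-ins for the measured EM and ISG maps, the combined-efficiency objective is one plausible form (the patent's exact objective is given only by reference), and the gear variable i_g is omitted for brevity.

```python
import numpy as np

ETA_AMT = 0.95  # assumed gearbox efficiency (P_m12 = P_EM + P_ISG * eta_AMT)

def em_eff(p_w):
    """Stand-in for the EM efficiency map; replace with map interpolation."""
    return 0.80 + 0.12 * np.exp(-((abs(p_w) - 30e3) / 40e3) ** 2)

def isg_eff(p_w):
    """Stand-in for the ISG efficiency map; replace with map interpolation."""
    return 0.78 + 0.14 * np.exp(-((abs(p_w) - 20e3) / 30e3) ** 2)

def pre_optimize(v_grid, p_m12_grid, n_split=41):
    """For every (speed, equivalent motor power) grid point, pick the EM power
    share that maximizes the combined conversion efficiency of the two motors."""
    best_p_em = np.zeros((len(v_grid), len(p_m12_grid)))
    for i, _v in enumerate(v_grid):
        for j, p_m12 in enumerate(p_m12_grid):
            best_eff = -1.0
            for frac in np.linspace(0.0, 1.0, n_split):
                p_em = frac * p_m12
                p_isg = (p_m12 - p_em) / ETA_AMT      # ISG share upstream of the AMT
                elec_in = abs(p_em) / em_eff(p_em) + abs(p_isg) / isg_eff(p_isg) + 1e-9
                eff = (abs(p_em) + abs(p_isg)) / elec_in   # output over electrical input
                if eff > best_eff:
                    best_eff, best_p_em[i, j] = eff, p_em
    return best_p_em   # looked up later by the DP of step S4 and the agent of step S7

split_table = pre_optimize(np.linspace(0.0, 30.0, 7), np.linspace(0.0, 120e3, 13))
```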
S4, carrying out dynamic programming calculation on the multidimensional road condition information of the plug-in hybrid electric vehicle acquired in the step S1 by combining the pre-optimized result in the step S3 to generate a state transition data set.
The step S4 specifically includes the following substeps:
S41, selecting a plurality of groups of working conditions as road condition multidimensional information for the history working conditions under the same road section acquired in the step S1.
S42, calculating the minimum intervention vehicle speed in each mode, and traversing the executable modes according to the minimum intervention vehicle speed condition under the grid points formed by time and SOC.
And S43, when the pure electric mode, the first-gear series-parallel mode and the second-gear series-parallel mode are traversed, the power of the two motors is distributed according to the pre-optimized result in the step S3.
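Steps S41 to S43 amount to a backward dynamic programming sweep over a time × SOC grid; a generic sketch is given below. The `step_cost` and `next_soc` callbacks are assumed wrappers around the pre-optimized powertrain model (fuel plus electricity cost, and SOC propagation), and the minimum-intervention-speed feasibility check is only indicated by a comment.

```python
import numpy as np

def dp_backward(n_steps, soc_grid, modes, p_eng_grid, step_cost, next_soc):
    """Backward value iteration over a time x SOC grid.
    step_cost(t, soc, mode, p_eng) and next_soc(t, soc, mode, p_eng) are
    assumed callbacks wrapping the pre-optimized powertrain model."""
    n_soc = len(soc_grid)
    V = np.zeros((n_steps + 1, n_soc))                 # terminal value = 0
    policy = np.empty((n_steps, n_soc), dtype=object)
    for t in range(n_steps - 1, -1, -1):
        for k, soc in enumerate(soc_grid):
            best_q, best_a = np.inf, None
            for mode in modes:                         # in practice, keep only modes whose
                for p_eng in p_eng_grid:               # minimum intervention speed is met
                    c = step_cost(t, soc, mode, p_eng)
                    s_next = next_soc(t, soc, mode, p_eng)
                    if not soc_grid[0] <= s_next <= soc_grid[-1]:
                        continue                       # SOC constraint violated
                    kn = int(np.argmin(np.abs(soc_grid - s_next)))   # nearest SOC node
                    q = c + V[t + 1, kn]
                    if q < best_q:
                        best_q, best_a = q, (mode, p_eng)
            V[t, k] = best_q
            policy[t, k] = best_a
    # tuples (s, a, r, s', V(s')) read off these arrays form the state transition data set
    return V, policy
```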
S5, determining the state variables, action variables and reward function required by the reinforcement learning algorithm; the state variable s, the action variable a and the reward function r are respectively:
s = (P_dem, SOC, d, θ, v, v_1, v_2)
where P_dem is the power required by the whole vehicle, SOC is the battery state of charge, d is the driving distance, θ is the current road gradient, v is the current vehicle speed, v_1 is the vehicle speed one second earlier, and v_2 is the vehicle speed two seconds earlier;
a = (mode, P_eng)
where mode is the selected mode and P_eng is the engine power;
r = -(C_fuel·price_fuel + ΔSOC·Q_bat·price_electric)
where C_fuel is the one-step fuel consumption, price_fuel is the fuel price per unit of consumption, ΔSOC is the one-step state-of-charge change, Q_bat is the power battery capacity, and price_electric is the electricity price per kilowatt-hour.
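In code, the reward of step S5 is a direct transcription of the expression above; the prices and the battery capacity below are illustrative assumptions.

```python
PRICE_FUEL = 7.6        # assumed fuel price per litre
PRICE_ELECTRIC = 1.0    # assumed electricity price per kWh
Q_BAT_KWH = 31.5        # assumed battery capacity expressed in kWh

def reward(c_fuel_litre, delta_soc):
    """r = -(C_fuel * price_fuel + dSOC * Q_bat * price_electric);
    a positive delta_soc means net battery energy consumed in the step."""
    return -(c_fuel_litre * PRICE_FUEL + delta_soc * Q_BAT_KWH * PRICE_ELECTRIC)

print(reward(0.01, 0.0005))   # e.g. 0.01 L of fuel and 0.05 % SOC drawn in one step
```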
S6, pre-training the critic and actor networks with the state transition data set generated in step S4; S61, collecting, at each grid point of the data set generated in step S4, the state s, the optimal action a in that state, the generated return r, the next state s' transitioned to, and the value function V of that state.
S62, constructing the critic and actor neural network structures, each comprising an input layer, a hidden layer and an output layer, and determining the number of neurons in each layer; the critic and actor neural network structures output the values of the four modes and the engine power, and the actual action is selected as follows:
a = (argmax(V_1, V_2, V_3, V_4), P_eng)
where V_1, V_2, V_3 and V_4 are the values of the series mode, the pure electric mode, the parallel first-gear mode and the parallel second-gear mode, respectively.
S63, constructing the output sample set of the critic network as follows:
Q(s,a) = r + γ·V(s')
where r is the return in the data set generated in step S4 and V(s') is the value function of the next state s' in the data set generated in step S4;
S64, taking (s, a) from step S61 and Q(s, a) from step S63 as the input and output samples of the critic network, respectively, with s and a as the input and output samples of the actor network, and pre-training the critic and actor networks by the gradient descent method.
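One way to carry out the supervised pre-training of steps S61 to S64 with TensorFlow/Keras is sketched below. The layer sizes, optimizer settings, discount factor and the encoding of the actor target (the DP-optimal mode expressed as the four mode values plus the engine power) are assumptions, not values taken from the patent.

```python
import numpy as np
import tensorflow as tf

def build_mlp(n_in, n_out, hidden=(64, 64)):
    """Input / hidden / output multilayer perceptron used for both networks."""
    layers = [tf.keras.Input(shape=(n_in,))]
    layers += [tf.keras.layers.Dense(h, activation="relu") for h in hidden]
    layers += [tf.keras.layers.Dense(n_out)]
    return tf.keras.Sequential(layers)

# 7 state variables; the actor outputs (V1..V4, P_eng); the critic maps (s, a) -> Q,
# with the action fed to the critic as (mode index, P_eng)
actor = build_mlp(7, 5)
critic = build_mlp(7 + 2, 1)

def pretrain(s, a, r, v_next, a_target, gamma=0.99, epochs=50):
    """Fit the critic to Q(s,a) = r + gamma * V(s') and the actor to the
    DP-optimal action targets, both by gradient descent on an MSE loss."""
    q_target = r + gamma * v_next
    critic.compile(optimizer="adam", loss="mse")
    actor.compile(optimizer="adam", loss="mse")
    critic.fit(np.concatenate([s, a], axis=1), q_target, epochs=epochs, verbose=0)
    actor.fit(s, a_target, epochs=epochs, verbose=0)
```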
S7, building an environment-agent model, migrating the state transition data set generated in step S4 into an experience pool, taking the two neural networks trained in step S6 as the initial network models, and iteratively training the energy management strategy with a deep reinforcement learning algorithm until the algorithm converges to obtain the final environment-agent model; step S7 specifically comprises the following substeps:
S71, combining a whole vehicle environment module and a DDPG algorithm module to construct an interactive algorithm;
S72, migrating the state transition data set generated in the step S4 into an experience pool of an interactive algorithm, taking the two neural network models pre-trained in the step S6 as initial critic and actor neural network models, and completing the establishment of an intelligent agent module;
S73, defining real-time state parameters of the whole vehicle and corresponding rewarding values as input parameters of a neural network in an intelligent agent module in each training, taking control variables output by the neural network as input parameters of a whole vehicle model in an environment module, generating new rewarding values after a vehicle executes a control command, and storing obtained experiences in an experience pool;
s74, the agent updates through the strategy gradient to realize the learning updating step of the neural network;
the specific strategy gradient update formula is as follows:
where r is the single-step reward, s and s' are the current state quantity and the state quantity at the next moment, a is the current action quantity, θ_Q and θ_μ are the critic and actor network parameters at the current moment, θ_Q' and θ_μ' are the corresponding target critic and actor network parameters, γ and μ are weight parameters, and α and τ are the learning rate and the target network update rate.
And S75, repeatedly iterating until the algorithm converges to obtain a final environment intelligent agent model.
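For reference, a generic DDPG update step of the kind described in step S74 is sketched below with TensorFlow. It implements the standard critic regression towards r + γ·Q'(s', μ'(s')), the deterministic policy gradient for the actor, and the soft update of the two target networks; it feeds the raw actor output to the critic as the action vector, and the discount factor, learning rates and update rate are assumed values, so this is not claimed to reproduce the patent's exact formula.

```python
import tensorflow as tf

GAMMA, TAU = 0.99, 0.005                      # assumed discount and target update rate
critic_opt = tf.keras.optimizers.Adam(1e-3)   # assumed learning rates
actor_opt = tf.keras.optimizers.Adam(1e-4)

def ddpg_update(critic, actor, target_critic, target_actor, s, a, r, s_next):
    """One DDPG step on a minibatch sampled from the experience pool."""
    # 1) critic update: regression towards the target Q value
    with tf.GradientTape() as tape:
        y = r + GAMMA * target_critic(tf.concat([s_next, target_actor(s_next)], axis=1))
        q = critic(tf.concat([s, a], axis=1))
        critic_loss = tf.reduce_mean(tf.square(y - q))
    grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))
    # 2) actor update: deterministic policy gradient (maximize Q(s, mu(s)))
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic(tf.concat([s, actor(s)], axis=1)))
    grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))
    # 3) soft target update: theta' <- tau * theta + (1 - tau) * theta'
    for t_var, var in zip(target_critic.variables + target_actor.variables,
                          critic.variables + actor.variables):
        t_var.assign(TAU * var + (1.0 - TAU) * t_var)
```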
And S8, adding the converged environment intelligent agent model into the HCU controller, and performing online application on a fixed line.
Example 1
In this embodiment, the method of the present invention is applied to a P2P3 series-parallel light truck; the procedure is shown in fig. 2 and specifically comprises the following steps:
s1, acquiring fixed-route multidimensional road condition information of a plug-in hybrid power logistics light truck in a historical driving process, wherein the method specifically comprises the following steps of:
the logistics light truck runs and collects a plurality of groups of historical vehicle speed curves and road gradient curves on a fixed route.
S2, a power system structure is shown in FIG 3, a plug-in hybrid power logistics light truck whole vehicle power system model is built based on the power system structure, and the method specifically comprises the following steps:
s21, a built whole vehicle dynamics model is as follows:
where P_dem is the required power, v_h is the vehicle speed, m is the vehicle mass, ρ is the air density, A is the frontal area of the vehicle, C_d is the air resistance coefficient, g is the gravitational acceleration, μ_r is the rolling resistance coefficient, and θ is the road gradient.
P_dem = P_m12 + P_eng·η_AMT
where P_m12 is the equivalent motor power, P_eng is the engine power, and η_AMT is the transmission efficiency.
P_m12 = P_EM + P_ISG·η_AMT
where P_EM is the EM motor power and P_ISG is the ISG motor power.
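The required power in step S21 follows from standard longitudinal vehicle dynamics; the sketch below uses the textbook road-load expression (rolling, grade, aerodynamic and inertial resistance) with illustrative parameter values, since the patent's own expression and the values of Table 1 are given only by reference.

```python
import math

def demand_power(v, acc, grade_rad, m=7000.0, rho=1.206, area=5.5,
                 c_d=0.6, mu_r=0.008, g=9.81):
    """Road-load demand power at the wheels [W]; all parameters are illustrative."""
    f_roll = m * g * mu_r * math.cos(grade_rad)      # rolling resistance
    f_grade = m * g * math.sin(grade_rad)            # grade resistance
    f_aero = 0.5 * rho * c_d * area * v ** 2         # aerodynamic drag
    f_inertia = m * acc                              # acceleration resistance
    return (f_roll + f_grade + f_aero + f_inertia) * v

# e.g. 15 m/s, accelerating at 0.3 m/s^2 on a 2 % grade
print(demand_power(15.0, 0.3, math.atan(0.02)))
```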
S22, a built power battery model is as follows:
where I_bat is the power battery current, U_OC is the power battery open-circuit voltage, and R_bat is the power battery internal resistance.
The rate of change of the power battery state of charge is dSOC/dt = -I_bat/Q_bat, where Q_bat is the power battery capacity.
S3, pre-optimizing logic is shown in FIG. 4, and pre-optimizing processing is carried out on the coupled modes of the two motors in an off-line mode, wherein the pre-optimizing logic specifically comprises the following steps:
S31, taking the total energy conversion efficiency of the two motors as a target, and establishing a pre-optimized objective function as follows:
where P_EM is the EM motor power, η_EM is the EM motor efficiency, P_ISG is the ISG motor power, η_ISG is the ISG motor efficiency, and i_g is the gear.
S32, discretizing the vehicle speed and the equivalent motor power to form grid points needing to be traversed.
And S33, in each mode, calculating the optimal power distribution of the two motors under each grid point formed in the step S32 according to the objective function established in the step S31.
S4, carrying out dynamic programming calculation on the history working conditions under the same road section collected in the step S1 by combining the pre-optimized result in the step S3, wherein the method specifically comprises the following steps:
S41, selecting a plurality of groups of working conditions for the history working conditions under the same road section acquired in the step S1.
S42, respectively calculating the minimum intervention vehicle speed under each mode according to the characteristic that the minimum rotation speed of the engine is required to be larger than the idle speed, and respectively traversing the executable modes under the grid points formed by time and SOC according to the minimum intervention vehicle speed condition.
And S43, when the pure electric mode (mode 2), the first-gear series-parallel mode (mode 3) and the second-gear series-parallel mode (mode 4) are traversed, the power of the two motors is distributed according to the pre-optimized result of step S3.
S5, determining the state variables, action variables and reward function required by the reinforcement learning algorithm, which are respectively:
s = (P_dem, SOC, d, θ, v, v_1, v_2)
where P_dem is the power required by the whole vehicle, SOC is the battery state of charge, d is the driving distance, θ is the current road gradient, v is the current vehicle speed, v_1 is the vehicle speed one second earlier, and v_2 is the vehicle speed two seconds earlier.
a = (mode, P_eng)
where mode is the selected mode and P_eng is the engine power.
r = -(C_fuel·price_fuel + ΔSOC·Q_bat·price_electric)
where C_fuel is the one-step fuel consumption, price_fuel is the fuel price per unit of consumption, ΔSOC is the one-step state-of-charge change, Q_bat is the power battery capacity, and price_electric is the electricity price per kilowatt-hour.
S6, pretraining critic and actor networks by using the state transition data set generated in the step S4, wherein the pretraining comprises the following steps;
S61, collecting the data set generated in the step S4, wherein the data set comprises: the state s at each grid point, the optimal action a at that state, the generated return r, the next state s' to transition to, the value function V of the state.
S62, as shown in FIG. 5, the critic and actor network structures are constructed, each comprising an input layer, a hidden layer and an output layer, and the number of neurons in each layer is determined.
Further, the actor network outputs the values of the four modes and the engine power, and the actual action is selected as follows:
a = (argmax(V_1, V_2, V_3, V_4), P_eng)
where V_1, V_2, V_3 and V_4 are the values of the series mode, the pure electric mode, the parallel first-gear mode and the parallel second-gear mode, respectively.
That is, the actor network outputs five values V_1, V_2, V_3, V_4 and P_eng; the first four are the values of the modes, the mode with the largest value is selected as the chosen mode, and together with P_eng this forms the action variable a = (mode, P_eng).
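In code form, the mapping from the five actor outputs to the action variable could look as follows (the mode labels are illustrative):

```python
import numpy as np

MODES = ("series", "pure_electric", "parallel_1st", "parallel_2nd")

def select_action(actor_output):
    """actor_output = (V1, V2, V3, V4, P_eng): pick the mode with the largest
    value and pair it with the engine power."""
    mode = MODES[int(np.argmax(actor_output[:4]))]
    return mode, float(actor_output[4])

print(select_action(np.array([0.2, 0.8, 0.5, 0.1, 25e3])))
```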
S63, constructing the output sample set of the critic network as follows:
Q(s,a) = r + γ·V(s')
where r is the return in the data set generated in step S4 and V(s') is the value function of the next state s' in the data set generated in step S4.
S64, using (s, a) from step S61 and Q(s, a) from step S63 as the input and output samples of the critic network, with s and a as the input and output samples of the actor network, and pre-training the critic and actor networks by the gradient descent method.
S7, building an environment-agent model, and continuously and iteratively training an energy management strategy by using a deep reinforcement learning algorithm, wherein the method comprises the following steps of:
s71, combining the whole vehicle environment module with the DDPG algorithm module to construct an interactive algorithm.
And S72, migrating the state transition data set generated in the step S4 into an experience pool, and taking the two neural networks pre-trained in the step S6 as initial critic and actor networks to complete the construction of the intelligent agent module.
S73, dividing the plurality of groups of historical vehicle speed curves acquired in the step S1 into a training set and a testing set.
And S74, in each training, defining the real-time state parameter of the whole vehicle and the corresponding rewarding value as input parameters of a neural network in the intelligent agent module, taking the control variable output by the neural network as the input parameters of the whole vehicle model in the environment module, generating a new rewarding value after the vehicle executes a control command, and storing the experience obtained in the step in an experience pool.
S75, the agent updates through the strategy gradient, so that the learning updating step of the neural network is realized, and a specific strategy gradient updating formula is as follows:
where r is the single-step reward, s and s' are the current state quantity and the state quantity at the next moment, a is the current action quantity, θ_Q and θ_μ are the critic and actor network parameters at the current moment, θ_Q' and θ_μ' are the corresponding target critic and actor network parameters, γ and μ are weight parameters, and α and τ are the learning rate and the target network update rate.
And S76, repeatedly iterating and verifying on the test set until the desired performance is learned, and saving the trained global neural network model after training is finished.
And S8, flashing the converged actor neural network model into the light truck HCU controller and applying it online on the fixed route.
In the hardware-in-the-loop experiment scenario, after the actor network built with TensorFlow has converged, the network structure is reproduced on the Simulink platform and the trained parameters are extracted into the Simulink-built network. The Simulink-built network is then encapsulated into an HCU module, the HCU module is added to a Speedgoat controller that outputs the control quantities in real time as the controller for the hardware-in-the-loop test, and the in-the-loop experiments are performed by interacting with the light truck model.
Example 2
In this embodiment, part of the road segment used in the 6th IFAC E-COSM 2021 challenge is taken as the fixed route driven by the light truck, so as to optimize the plug-in hybrid logistics light truck. The whole method comprises the following steps:
S1, acquiring the fixed-route multidimensional road condition information of the plug-in hybrid logistics light truck over its historical driving process; part of the road segment used in the 6th IFAC E-COSM 2021 challenge is the fixed route driven by the light truck, as shown in FIG. 6. The road section is 10.16 km long in total, contains 11 signalized intersections, and the spacing between traffic lights ranges from 247 m to 1288 m. The phase cycle time of each traffic light is 120 seconds, with the red and green durations each half of the cycle and the yellow light not considered.
S2, a plug-in hybrid power logistics light truck whole vehicle power system model is built based on whole vehicle parameters and a power system structure, and main parameters of the light truck in the embodiment are shown in a table 1.
TABLE 1 main parameters of P2-P3 series-parallel hybrid light truck
S3, pre-optimizing the coupled modes of the two motors, wherein two motor efficiency map diagrams of the light truck in the example are shown in fig. 7a and 7b, and pre-optimizing results in different modes are shown in fig. 8, 9 and 10 respectively.
And S4, collecting the history working conditions under the same road section shown in FIG. 6, and carrying out dynamic programming calculation by combining the pre-optimized result in the step S3. The history of the present embodiment collected under the road section shown in fig. 6 is shown in fig. 11. The energy distribution result of the dynamic programming energy management strategy under a certain working condition is shown in fig. 12.
S5, determining the state variables, action variables and reward function required by the reinforcement learning algorithm, which are respectively:
s = (P_dem, SOC, d, θ, v, v_1, v_2)
where P_dem is the power required by the whole vehicle, SOC is the battery state of charge, d is the driving distance, θ is the current road gradient, v is the current vehicle speed, v_1 is the vehicle speed one second earlier, and v_2 is the vehicle speed two seconds earlier.
a = (mode, P_eng)
where mode is the selected mode and P_eng is the engine power.
r = -(C_fuel·price_fuel + ΔSOC·Q_bat·price_electric)
where C_fuel is the one-step fuel consumption, price_fuel is the fuel price per unit of consumption, ΔSOC is the one-step state-of-charge change, Q_bat is the power battery capacity, and price_electric is the electricity price per kilowatt-hour.
S6, pre-training critic and actor networks by using a state transition data set generated by dynamic programming calculation.
And S7, building an environment-agent model, and continuously and iteratively training an energy management strategy by using a deep reinforcement learning algorithm. The convergence characteristics of the training process in this example are shown in fig. 13.
And S8, burning the converged actor neural network model into a light truck HCU controller, and performing online application on a fixed line. The HCU controller employed in this example is Speedgoat real-time controller.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or equally substituted without departing from the spirit and scope of the present invention, and it should be covered in the scope of the appended claims.

Claims (5)

1. A hybrid electric vehicle system control method based on deep reinforcement learning is characterized in that: the method specifically comprises the following steps:
S1, acquiring fixed-route multidimensional road condition information in the historical running process of a plug-in hybrid electric vehicle;
s2, establishing a whole vehicle power system model and a power battery model of the plug-in hybrid electric vehicle; the method specifically comprises the following substeps:
s21, a whole vehicle power system model is as follows:
P_dem = P_m12 + P_eng·η_AMT
where P_dem is the required power, P_m12 is the equivalent motor power, P_eng is the engine power, and η_AMT is the gearbox efficiency;
P_m12 = P_EM + P_ISG·η_AMT
where P_EM is the EM motor power and P_ISG is the ISG motor power;
S22, building a power battery model:
where I_bat is the power battery current, U_OC is the power battery open-circuit voltage, and R_bat is the power battery internal resistance;
The rate of change of the power battery state of charge is dSOC/dt = -I_bat/Q_bat, where Q_bat is the power battery capacity;
s3, pre-optimizing the two motors to reduce the optimized dimension, and specifically comprises the following substeps:
S31, taking total energy conversion efficiency of two motors as a target, and establishing a pre-optimized objective function as follows:
where P_EM is the EM motor power, η_EM is the EM motor efficiency, P_ISG is the ISG motor power, η_ISG is the ISG motor efficiency, and i_g is the gear;
s32, discretizing the vehicle speed and the equivalent motor power to form grid points to be traversed;
S33, under each mode, calculating the optimal power distribution of the two motors under each grid point formed in the step S32 according to the objective function established in the step S31, wherein the modes comprise a series mode, a pure electric mode, a parallel first gear mode and a parallel second gear mode;
S4, carrying out dynamic programming calculation on the multidimensional road condition information of the plug-in hybrid electric vehicle acquired in the step S1 by combining the pre-optimized result in the step S3 to generate a state transition data set;
the step S4 specifically includes the following substeps:
s41, selecting a plurality of groups of working conditions as multidimensional road condition information for the history working conditions under the same road section acquired in the step S1;
S42, respectively calculating the minimum intervention vehicle speed in each mode, and respectively traversing the executable modes under the grid points formed by time and SOC according to the minimum intervention vehicle speed condition;
s43, when the pure electric mode, the first-gear series-parallel mode and the second-gear series-parallel mode are traversed, the power of the two motors is distributed according to the pre-optimized result of the step S3;
S5, determining the state variables, action variables and reward function required by the reinforcement learning algorithm; the state variable s, the action variable a and the reward function r are respectively:
s = (P_dem, SOC, d, θ, v, v_1, v_2)
where P_dem is the power required by the whole vehicle, SOC is the battery state of charge, d is the driving distance, θ is the current road gradient, v is the current vehicle speed, v_1 is the vehicle speed one second earlier, and v_2 is the vehicle speed two seconds earlier;
a = (mode, P_eng)
where mode is the selected mode and P_eng is the engine power;
r = -(C_fuel·price_fuel + ΔSOC·Q_bat·price_electric)
where C_fuel is the one-step fuel consumption, price_fuel is the fuel price per unit of consumption, ΔSOC is the one-step state-of-charge change, Q_bat is the power battery capacity, and price_electric is the electricity price per kilowatt-hour;
S6, pre-training critic and actor networks by using the state transition data set generated in the step S4;
S7, building an environment-agent model, migrating the state transition data set generated in step S4 into an experience pool, taking the two neural networks trained in step S6 as the initial network models, and iteratively training the energy management strategy with a deep reinforcement learning algorithm until the algorithm converges to obtain the final environment-agent model;
The step S7 specifically includes the following substeps:
S71, combining a whole vehicle environment module and a DDPG algorithm module to construct an interactive algorithm;
S72, migrating the state transition data set generated in the step S4 into an experience pool of an interactive algorithm, taking the two neural network models pre-trained in the step S6 as initial critic and actor neural network models, and completing the establishment of an intelligent agent module;
S73, defining real-time state parameters of the whole vehicle and corresponding rewarding values as input parameters of a neural network in an intelligent agent module in each training, taking control variables output by the neural network as input parameters of a whole vehicle model in an environment module, generating new rewarding values after a vehicle executes a control command, and storing obtained experiences in an experience pool;
s74, the agent updates through the strategy gradient to realize the learning updating step of the neural network;
S75, repeatedly iterating until the algorithm converges to obtain a final environment intelligent agent model;
And S8, adding the converged environment intelligent agent model into the HCU controller, and performing online application on a fixed line.
2. The deep reinforcement learning-based hybrid electric vehicle system control method according to claim 1, characterized in that: the fixed-route multidimensional road condition information obtained in the step S1 comprises a plurality of groups of historical vehicle speed curves and road gradient curves, wherein the historical vehicle speed curves and the road gradient curves are collected when the electric vehicle runs on the fixed route.
3. The deep reinforcement learning-based hybrid electric vehicle system control method according to claim 1, characterized in that: the step S6 specifically comprises the following substeps;
S61, acquiring a state variable S, an optimal action variable a, a reward function r and a value function V of a next state variable S' transferred to, which are generated in the step S4, at each grid point of the data set;
S62, constructing the critic and actor neural network structures, each comprising an input layer, a hidden layer and an output layer, and determining the number of neurons in each layer;
S63, constructing critic an output sample set of the network, wherein the output sample set is as follows:
Q(s,a)=r+γ·V(s′)
wherein r is a reward function in the data set generated in the step S4, and V (S ') is a value function of a next state variable S' in the data set generated in the step S4;
S64, taking (S, a) in the step S61 as an input sample of the critic network, taking Q (S, a) in the step S63 as an output sample of the critic network, wherein S is an input sample of the actor network, and a is an output sample of the actor network, and pretraining the critic and actor networks by using a gradient descent method.
4. The deep reinforcement learning-based hybrid electric vehicle system control method according to claim 3, characterized in that: in step S62, the neural network structures critic and actor output the values of the four modes and the engine power, and the actual action variables are selected as follows:
a = (argmax(V_1, V_2, V_3, V_4), P_eng)
where V_1, V_2, V_3 and V_4 are the values of the series mode, the pure electric mode, the parallel first-gear mode and the parallel second-gear mode, respectively.
5. The deep reinforcement learning-based hybrid electric vehicle system control method according to claim 1, characterized in that: the specific policy gradient update formula in step S74 is as follows:
where r is the reward function, s and s' are the current state variable and the next state variable, a is the action variable, θ_Q and θ_μ are the critic and actor network parameters at the current time, θ_Q' and θ_μ' are the corresponding target critic and actor network parameters, γ and μ are weight parameters, and α and τ are the learning rate and the target network update rate.
CN202311359313.1A 2023-10-20 2023-10-20 Hybrid electric vehicle system control method based on deep reinforcement learning Active CN117184095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311359313.1A CN117184095B (en) 2023-10-20 2023-10-20 Hybrid electric vehicle system control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311359313.1A CN117184095B (en) 2023-10-20 2023-10-20 Hybrid electric vehicle system control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN117184095A CN117184095A (en) 2023-12-08
CN117184095B true CN117184095B (en) 2024-05-14

Family

ID=88988839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311359313.1A Active CN117184095B (en) 2023-10-20 2023-10-20 Hybrid electric vehicle system control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117184095B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110341690A (en) * 2019-07-22 2019-10-18 北京理工大学 A kind of PHEV energy management method based on deterministic policy Gradient learning
CN111731303A (en) * 2020-07-09 2020-10-02 重庆大学 HEV energy management method based on deep reinforcement learning A3C algorithm
CN112287463A (en) * 2020-11-03 2021-01-29 重庆大学 Fuel cell automobile energy management method based on deep reinforcement learning algorithm
CN115476841A (en) * 2022-10-10 2022-12-16 湖南大学重庆研究院 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG
WO2023020083A1 (en) * 2021-08-20 2023-02-23 Ningbo Geely Automobile Research & Development Co., Ltd. A method for adaptative real-time optimization of a power or torque split in a vehicle
CN116424332A (en) * 2023-04-10 2023-07-14 重庆大学 Energy management strategy enhancement updating method for deep reinforcement learning type hybrid electric vehicle

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210076223A (en) * 2019-12-13 2021-06-24 현대자동차주식회사 Hybrid vehicle and method of controlling the same

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110341690A (en) * 2019-07-22 2019-10-18 北京理工大学 A kind of PHEV energy management method based on deterministic policy Gradient learning
CN111731303A (en) * 2020-07-09 2020-10-02 重庆大学 HEV energy management method based on deep reinforcement learning A3C algorithm
CN112287463A (en) * 2020-11-03 2021-01-29 重庆大学 Fuel cell automobile energy management method based on deep reinforcement learning algorithm
WO2023020083A1 (en) * 2021-08-20 2023-02-23 Ningbo Geely Automobile Research & Development Co., Ltd. A method for adaptative real-time optimization of a power or torque split in a vehicle
CN115476841A (en) * 2022-10-10 2022-12-16 湖南大学重庆研究院 Plug-in hybrid electric vehicle energy management method based on improved multi-target DDPG
CN116424332A (en) * 2023-04-10 2023-07-14 重庆大学 Energy management strategy enhancement updating method for deep reinforcement learning type hybrid electric vehicle

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Energy management strategy optimization for hybrid electric vehicles based on parallel deep reinforcement learning; 李家曦; 孙友长; 庞玉涵; 伍朝兵; 杨小青; 胡博; Journal of Chongqing University of Technology (Natural Science); 2020-09-15 (No. 09); full text *
胡恒杰. Research on energy management strategy of plug-in hybrid electric vehicles based on reinforcement learning. China Master's Theses Full-text Database, Engineering Science and Technology II. 2021, (No. 5), full text. *
陈伯杨. Research on energy management strategy of power-split hybrid electric vehicles based on deep reinforcement learning. China Master's Theses Full-text Database, Engineering Science and Technology II. 2022, (No. 10), full text. *

Also Published As

Publication number Publication date
CN117184095A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
Lian et al. Cross-type transfer for deep reinforcement learning based hybrid electric vehicle energy management
Wu et al. Continuous reinforcement learning of energy management with deep Q network for a power split hybrid electric bus
CN111731303B (en) HEV energy management method based on deep reinforcement learning A3C algorithm
CN110341690B (en) PHEV energy management method based on deterministic strategy gradient learning
CN106004865B (en) Mileage ADAPTIVE MIXED power vehicle energy management method based on operating mode's switch
CN110936949B (en) Energy control method, equipment, storage medium and device based on driving condition
Guo et al. Transfer deep reinforcement learning-enabled energy management strategy for hybrid tracked vehicle
CN111267831A (en) Hybrid vehicle intelligent time-domain-variable model prediction energy management method
Singh et al. Fuzzy logic and Elman neural network tuned energy management strategies for a power-split HEVs
Zhang et al. Route planning and power management for PHEVs with reinforcement learning
Li et al. Power management for a plug-in hybrid electric vehicle based on reinforcement learning with continuous state and action spaces
CN113554337B (en) Plug-in hybrid electric vehicle energy management strategy construction method integrating traffic information
CN106055830A (en) PHEV (Plug-in Hybrid Electric Vehicle) control threshold parameter optimization method based on dynamic programming
CN112765723A (en) Curiosity-driven hybrid power system deep reinforcement learning energy management method
CN115793445A (en) Hybrid electric vehicle control method based on multi-agent deep reinforcement learning
CN117131606A (en) Hybrid power tracked vehicle energy management method capable of transferring across motion dimension
Hu et al. Energy management optimization method of plug-in hybrid-electric bus based on incremental learning
CN115805840A (en) Energy consumption control method and system for range-extending type electric loader
Chen et al. Driving cycle recognition based adaptive equivalent consumption minimization strategy for hybrid electric vehicles
CN114969982A (en) Fuel cell automobile deep reinforcement learning energy management method based on strategy migration
Xia et al. A predictive energy management strategy for multi-mode plug-in hybrid electric vehicle based on long short-term memory neural network
Zhang et al. Uncertainty-Aware Energy Management Strategy for Hybrid Electric Vehicle Using Hybrid Deep Learning Method
He et al. Deep reinforcement learning based energy management strategies for electrified vehicles: Recent advances and perspectives
Zhang et al. A fuzzy neural network energy management strategy for parallel hybrid electric vehicle
CN117184095B (en) Hybrid electric vehicle system control method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant