CN118569350A - Deep reinforcement learning method for reducing vehicle hyperparameter adjustment
Publication number: CN118569350A (application number CN202411023989.8A)
Abstract
A deep reinforcement learning method for reducing vehicle hyperparameter adjustment relates to the technical field of reducing hyperparameter tuning for autonomous vehicles based on deep reinforcement learning. A nonlinear handling-stability reward learning architecture based on deep inverse reinforcement learning is provided, and on this basis an integrated autonomous-driving control strategy that combines the ego vehicle's nonlinear handling-stability reward with driving-rule rewards in the highway scenario is further provided, which greatly reduces the hyperparameter setting of the handling-stability reward function during training of the integrated control strategy. The method comprises the following steps: acquiring the expert policy required for deep inverse reinforcement learning training; obtaining the vehicle handling-stability reward using deep inverse reinforcement learning; accelerating deep inverse reinforcement learning training with a multi-process asynchronous method; and performing integrated vehicle control in the high-speed complex scenario. The invention realizes integrated control that maintains vehicle stability in high-speed scenarios without relying on a large number of stability hyperparameter settings.
Description
Technical Field
The invention provides a deep reinforcement learning method for reducing vehicle hyperparameter adjustment, and relates to the technical field of reducing hyperparameter tuning for autonomous vehicles based on deep reinforcement learning.
Background
With rapid economic globalization and urbanization, vehicle ownership has grown sharply, and the incidence of traffic accidents has risen accordingly. Autonomous driving is expected to make major contributions to reducing traffic accidents, improving road-use efficiency, and relieving congestion, yet the technology still faces many challenges. To guarantee the driving safety of an autonomous vehicle in a highway environment, an algorithm based on deep reinforcement learning must be constructed, and a reward function must be designed to evaluate, during training, the influence of vehicle hyperparameters such as vehicle speed, yaw angle, and steering-wheel angle on handling stability, so that the vehicle remains stable while also rapidly completing complex tasks such as lane keeping and lane-change overtaking. At present, most handling-stability reward functions rely on empirical assumptions and vehicle dynamics models, and setting their many hyperparameters is difficult; nonlinear reward functions of the vehicle hyperparameters under composite tire working conditions are especially hard to tune.
Disclosure of Invention
To solve these problems, the invention provides a nonlinear handling-stability reward learning framework based on deep inverse reinforcement learning, and further provides an integrated autonomous-driving control strategy that combines the vehicle's nonlinear handling-stability reward with driving-rule rewards in the highway scenario, greatly reducing the hyperparameter setting of the handling-stability reward function during training of the integrated control strategy.
According to the invention, the deep reinforcement learning method for reducing vehicle hyperparameter adjustment comprises the following steps:
S1, acquiring the expert policy required for deep inverse reinforcement learning training;
S2, obtaining the vehicle handling-stability reward using deep inverse reinforcement learning;
S3, accelerating deep inverse reinforcement learning training with a multi-process asynchronous method;
S4, performing integrated vehicle control in the high-speed complex scenario.
The step S1 specifically comprises the following steps:
Step S11: the raw expert data are the vehicle handling-stability evaluation values corresponding to different vehicle slip rates under different front-wheel steering angles, longitudinal vehicle speeds, and road friction coefficients; the larger the evaluation value, the better the handling stability. Each set of fixed front-wheel steering angle, longitudinal speed, and road friction coefficient defines one working condition, and each working condition corresponds to one evaluation-value curve over the slip rate. To obtain the expert policy in the form of a probability distribution function, the raw expert data are preprocessed: to prevent the evaluation values from being too large and causing numerical overflow, the minimum evaluation value under the current working condition is subtracted from every evaluation value of that working condition.
Step S12: using the softmax method, the evaluation values of each working condition are converted into a probability distribution over the slip rate whose probabilities sum to 1 (a numerical sketch of steps S11 and S12 follows).
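A minimal numerical sketch of steps S11 and S12 is given below; the array of evaluation values, the grid of candidate slip rates, and the function name are illustrative assumptions, not data from the patent.

```python
import numpy as np

def evaluations_to_distribution(q_values: np.ndarray) -> np.ndarray:
    """Convert the handling-stability evaluation values of one working
    condition into a probability distribution over slip rates (S11-S12)."""
    # S11: subtract the minimum value of the current working condition so
    # that the exponentials below stay in a safe numerical range.
    shifted = q_values - q_values.min()
    # S12: softmax conversion; the resulting probabilities sum to 1.
    exp_q = np.exp(shifted)
    return exp_q / exp_q.sum()

# Illustrative example: evaluation values on a grid of candidate slip rates.
slip_rates = np.linspace(0.0, 0.2, 5)
q = np.array([0.2, 1.5, 3.1, 2.4, 0.7])
p = evaluations_to_distribution(q)
print(dict(zip(slip_rates.round(2), p.round(3))), p.sum())
```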
Step S13: a parameterized normal distribution is used to approximate this probability distribution, i.e., the slip-rate distribution is taken to obey a normal distribution whose mean and variance are the parameters to be determined; the approximation is realized by minimizing the KL divergence between the two distributions.
Step S14: the mean and variance are computed with a gradient descent algorithm that iteratively minimizes the KL divergence, where the step size is the learning rate of the parameter-update process.
Step S15: when there is an error between the peak of the fitted normal distribution and the peak of the original probability distribution, i.e., in the optimal slip rate, the mean of the normal distribution is corrected to the optimal slip rate of the original probability distribution while the variance is kept unchanged; the corrected probability distribution is the final expert policy (a sketch covering steps S13 to S15 follows).
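A sketch of steps S13 to S15, assuming the discrete expert distribution is given on a grid of candidate slip rates; the SGD optimizer, the log-sigma parameterisation, and the unconditional peak correction are illustrative choices, not prescriptions from the patent.

```python
import torch

def fit_normal_to_distribution(slip_rates, probs, lr=0.05, steps=2000):
    """S13-S14: approximate a discrete slip-rate distribution with a
    parameterised normal density by minimising the KL divergence with
    gradient descent on the mean and (log) standard deviation."""
    s = torch.tensor(slip_rates, dtype=torch.float32)
    p = torch.tensor(probs, dtype=torch.float32)
    mu = torch.tensor(float((s * p).sum()), requires_grad=True)   # warm start at the mean
    log_sigma = torch.zeros((), requires_grad=True)
    opt = torch.optim.SGD([mu, log_sigma], lr=lr)
    for _ in range(steps):
        sigma = log_sigma.exp()
        q = torch.exp(-0.5 * ((s - mu) / sigma) ** 2)   # Gaussian evaluated on the grid
        q = q / q.sum()                                 # renormalise for comparison with p
        kl = (p * (torch.log(p + 1e-8) - torch.log(q + 1e-8))).sum()
        opt.zero_grad(); kl.backward(); opt.step()
    # S15: peak correction - move the mean to the slip rate with the highest
    # probability in the original distribution, keep the variance unchanged
    # (applied unconditionally here for simplicity).
    mu_corrected = float(s[int(torch.argmax(p))])
    return mu_corrected, float(log_sigma.exp())
```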
The step S2 specifically includes the following:
Step S21: the feature expectations of the expert policy and of the learning policy are calculated. For a group of expert trajectories, each trajectory is a sequence of vehicle states of fixed length, and the feature expectation of the expert policy is the discounted sum of the state features averaged over the trajectories, where the discount coefficient weights later states less and the state feature is an indicator that equals 1 when the vehicle state matches a specific state and 0 otherwise. The feature expectation of the learning policy, i.e., of the vehicle during deep reinforcement learning training, is calculated in the same way (a sketch follows).
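A sketch of the feature-expectation calculation of step S21; the trajectory format and the indicator feature function are assumptions made for illustration.

```python
import numpy as np

def feature_expectation(trajectories, feature_fn, gamma=0.99):
    """S21: discounted feature expectation averaged over a set of trajectories.
    `trajectories` is a list of state sequences and `feature_fn(state)` returns
    a feature vector, e.g. an indicator over discretised vehicle states."""
    total = None
    for traj in trajectories:
        for t, state in enumerate(traj):
            phi = (gamma ** t) * np.asarray(feature_fn(state), dtype=float)
            total = phi if total is None else total + phi
    return total / len(trajectories)

# The same function is applied to the expert trajectories and to trajectories
# collected by the learning policy during deep reinforcement learning training.
```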
Step S22: the vehicle handling-stability reward is obtained using deep inverse reinforcement learning. The handling-stability reward function is approximated by a neural network whose input is the vehicle state and whose parameters are to be solved; solving for the reward parameters is converted into an optimization problem. The objective is the posterior probability of the reward parameters, which is equivalent to a maximum a posteriori estimation problem and can be regarded as the combination of the likelihood of the expert policy data and a prior over the parameters. After simplification, the objective consists of the data likelihood term and a regularization function of the reward parameters, with the expert and learner quantities computed over the same time range.
Step S23: the reward parameters are solved with a gradient descent algorithm; by setting a suitable learning rate, the optimal handling-stability reward is obtained (a sketch of steps S22 and S23 follows).
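A sketch of steps S22 and S23, using the common deep maximum-entropy inverse-reinforcement-learning simplification in which expert-visited states are pushed towards higher reward and learner-visited states towards lower reward, with an L2 penalty standing in for the Gaussian parameter prior of the maximum-a-posteriori formulation; the network sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """S22: neural-network approximation of the handling-stability reward;
    the input is the vehicle state (layer sizes are illustrative)."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

def irl_update(reward_net, optimizer, expert_states, learner_states, l2=1e-3):
    """S23: one gradient-descent step on the (negated) MAP objective."""
    r_expert = reward_net(expert_states).mean()    # reward on expert-visited states
    r_learner = reward_net(learner_states).mean()  # reward on learner-visited states
    prior = sum((p ** 2).sum() for p in reward_net.parameters())  # L2 stands in for the prior
    loss = -(r_expert - r_learner) + l2 * prior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```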
The step S3 specifically comprises the following steps:
Step S31: multi-process asynchronous training makes full use of the multi-core CPU of a single computer and distributes the training task over several CPU cores, which increases training speed and reduces training time. The multi-process asynchronous training environment is divided into two parts: a main process and several sub-processes. Each sub-process holds two neural networks, an Actor_l policy network and a Critic_l value network; the single main process holds two neural networks, an Actor_g policy network and a Critic_g value network. These networks share the same structure but their parameters are updated differently; the policy networks map the vehicle state to the vehicle slip rate.
Step S32: the network parameters of Actor_g and Critic_g are initialized and copied to Actor_l and Critic_l respectively; the number of update steps of a sub-process, the maximum number of training rounds, and the maximum capacity of the experience pool are set, and the step counter and the training-round counter are initialized.
Step S33: in each sub-process, an action is selected under the current policy from the current vehicle state, the vehicle state at the next moment is computed, and the reward value of the vehicle is then calculated from that state. The transition data (state, action, reward, next state, termination flag) are put into the experience pool: each new entry is stored at the first position and the existing entries shift backwards in order. The termination flag indicates whether the training episode has finished (1 if finished, 0 otherwise), and the step counter and round counter of the sub-process are updated accordingly. A sketch of such an experience pool follows.
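A sketch of the experience pool described in step S33; new transitions are stored at the first position, older entries shift backwards, and the oldest entry is discarded once the maximum capacity is reached (class and method names are illustrative).

```python
import random
from collections import deque

class ExperiencePool:
    """S33: fixed-capacity experience pool for one sub-process."""
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)  # the oldest entry drops off the end

    def push(self, state, action, reward, next_state, done):
        # New data go to the first position; existing entries shift backwards.
        self.buffer.appendleft((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Random extraction used when computing the Actor_l/Critic_l losses (S34).
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```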
Step S34: in each sub-process, after every fixed number of update steps, data are randomly extracted from the experience pool to calculate the Actor_l and Critic_l network loss functions. The actor loss uses the probability ratio between the new and old policies, clipped by a clipping function with a clip coefficient, together with the entropy of the policy; the critic loss uses the discounted return with a discount factor (see the sketch after this step).
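A sketch of the loss calculation in step S34; the clipped probability-ratio form, the entropy bonus, and the squared-error critic loss follow the description above, but the coefficient values and tensor layout are assumptions.

```python
import torch

def actor_critic_losses(log_prob_new, log_prob_old, advantage, value, value_target,
                        entropy, clip_eps=0.2, entropy_coef=0.01):
    """S34: Actor_l loss (clipped surrogate plus entropy bonus) and Critic_l loss."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # probability ratio of new/old policy
    surrogate = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    actor_loss = -(torch.min(surrogate, clipped).mean() + entropy_coef * entropy.mean())
    critic_loss = (value_target - value).pow(2).mean()      # discounted return vs. value estimate
    return actor_loss, critic_loss
```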
Step S35: whenever the Actor_l and Critic_l losses have been computed, the gradients of the two local networks are calculated and used to update the parameters of Actor_g and Critic_g by gradient descent with their respective learning rates. Each sub-process can independently update the parameters of the two networks in the main process, which realizes asynchronous updating. Every time a sub-process updates the main-process parameters, its step counter is reset to 0 and the network parameters of the main process are assigned back to the sub-process networks that provided the gradient information.
Step S36: when the maximum number of training rounds is reached, the training process ends and the resulting policy is the optimal policy. This policy is then returned to step S22 and joins the parameter-update process of deep inverse reinforcement learning, thereby realizing the acceleration of deep inverse reinforcement learning training with the multi-process asynchronous method.
The step S4 specifically includes the following:
Step S41: the vehicle-control problem in the high-speed scenario is treated as a Markov decision process and represented by the tuple (state space, action space of the autonomous vehicle, environmental state-transition probability, immediate reward of the vehicle's action, discount factor); the goal of the vehicle is to maximize its own return through its policy in the Markov decision process.
Step S42: several sub-rewards are designed and their linear combination is calculated to guide the vehicle towards integrated control; the sub-rewards are the road-progress reward, speed reward, heading-angle reward, road-center-deviation reward, distance-keeping reward, vehicle torque-distribution reward, steering-wheel-angle reward, and vehicle handling-stability reward, which together form the total reward.
Step S43: a neural network with shared Actor and Critic parameters is created, so that a single forward pass of the network outputs both the integrated control strategy and the value function used to evaluate that strategy; the loss function combines the policy loss and the value loss, weighted by a hyperparameter, over the shared network parameters (see the sketch after this step).
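A sketch of the shared Actor-Critic network of step S43: one forward pass returns both the parameters of the control policy and the value used to evaluate it. The Gaussian policy head for continuous control actions and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """S43: shared hidden layers feed a policy head (integrated control
    strategy) and a value head (evaluation of that strategy)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_mean = nn.Linear(hidden, action_dim)        # Actor head
        self.log_std = nn.Parameter(torch.zeros(action_dim))    # learned action spread
        self.value = nn.Linear(hidden, 1)                       # Critic head

    def forward(self, state):
        h = self.shared(state)
        return self.policy_mean(h), self.log_std.exp(), self.value(h)
```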
Step S44: the neural-network parameters are updated with a gradient descent algorithm, where the step size is a learning-rate hyperparameter.
The beneficial effects of the invention are as follows:
1. Aiming at the difficulty of constructing a handling-stability reward function for autonomous driving, a vehicle handling-stability control strategy based on deep inverse reinforcement learning is provided: a two-layer structure, with deep inverse reinforcement learning in the upper layer and deep reinforcement learning in the lower layer, realizes both the learning of the handling-stability reward function and the final handling-stability control of the vehicle. This simplifies the construction of the stability reward function, broadens the methods available for studying vehicle handling-stability control, and provides important technical support for the safe driving of autonomous vehicles in highway scenarios.
2. A multi-process asynchronous parallel training method is introduced. By running the agent simultaneously on several CPU cores, data are processed in parallel and the control policy is updated rapidly. The agent learns and explores independently in different environment instances, which markedly improves training efficiency. In addition, the multi-process learning environment and algorithm are designed in detail to ensure the stability of training and the convergence of the policy.
Drawings
FIG. 1 shows part of the acquired expert policy;
FIG. 2 shows the overall error variation between the expert policy and the learning policy during deep inverse reinforcement learning training;
FIG. 3 shows the final approximation between part of the learning policy and the expert policy in deep inverse reinforcement learning;
FIG. 4 is the multi-process asynchronous training framework;
FIG. 5 shows the training-time differences of deep inverse reinforcement learning with different numbers of processes in multi-process asynchronous training;
FIG. 6 is an evaluation of vehicle handling stability under different driving states obtained with multi-process asynchronous training;
FIG. 7 is a schematic view of the driving environment of the vehicle in the high-speed scenario;
FIG. 8 is a schematic diagram of the lane-keeping task in the high-speed scenario;
FIG. 9 is a schematic diagram of the lane-change overtaking task in the high-speed scenario;
FIG. 10 is the neural-network architecture of the integrated control algorithm for the high-speed scenario;
FIG. 11 is the convergence curve of the integrated control algorithm in the high-speed scenario;
FIG. 12 is the state-change curve of the vehicle in the lane-keeping task in the high-speed scenario;
FIG. 13 is the state-change curve of the vehicle in the lane-change overtaking task in the high-speed scenario.
Detailed Description
For a further understanding of the present invention, the present invention will be described in detail with reference to the drawings and examples, which are to be understood as being illustrative only and not limiting.
The specific implementation steps are as follows:
S1, acquiring the expert policy required for training;
S2, obtaining the vehicle handling-stability reward using deep inverse reinforcement learning;
S3, accelerating deep inverse reinforcement learning training with a multi-process asynchronous method;
S4, performing integrated vehicle control in the high-speed complex scenario.
In step S1, in order to acquire the required expert policy, the following steps are required:
Step S11: the raw expert data are the vehicle handling-stability evaluation values corresponding to different vehicle slip rates under different front-wheel steering angles, longitudinal speeds, and road friction coefficients μ; the larger the evaluation value, the better the handling stability. Each set of fixed front-wheel steering angle, longitudinal speed, and road friction coefficient defines one working condition, and each working condition corresponds to one evaluation-value curve over the slip rate. To obtain the expert policy in the form of a probability distribution function, the raw expert data are preprocessed: to prevent the evaluation values from being too large and causing numerical overflow, the minimum evaluation value under the current working condition is subtracted from every evaluation value of that working condition.
Step S12: using the softmax method, the evaluation values of each working condition are converted into a probability distribution over the slip rate whose probabilities sum to 1.
Step S13: a parameterized normal distribution is used to approximate this probability distribution, i.e., the slip-rate distribution is taken to obey a normal distribution whose mean and variance are the parameters to be determined; the approximation is realized by minimizing the KL divergence between the two distributions.
Step S14: the mean and variance are computed with a gradient descent algorithm that iteratively minimizes the KL divergence, where the step size is the learning rate of the parameter-update process.
Step S15: as shown in FIG. 1, when there is an error between the peak of the fitted normal distribution and the peak of the original probability distribution, i.e., in the optimal slip rate, the mean of the normal distribution is corrected to the optimal slip rate of the original probability distribution while the variance is kept unchanged; the corrected probability distribution is the final expert policy.
In step S2, the vehicle handling-stability reward is obtained using deep inverse reinforcement learning, specifically comprising the following steps:
Step S21: the feature expectations of the expert policy and of the learning policy are calculated. For a group of expert trajectories, each trajectory is a sequence of vehicle states of fixed length, and the feature expectation of the expert policy is the discounted sum of the state features averaged over the trajectories, where the discount coefficient weights later states less and the state feature is an indicator that equals 1 when the vehicle state matches a specific state and 0 otherwise. The feature expectation of the learning policy, i.e., of the vehicle during deep reinforcement learning training, is calculated in the same way.
Step S22: the vehicle handling-stability reward is obtained using deep inverse reinforcement learning. The handling-stability reward function is approximated by a neural network whose input is the vehicle state and whose parameters are to be solved; solving for the reward parameters is converted into an optimization problem. The objective is the posterior probability of the reward parameters, which is equivalent to a maximum a posteriori estimation problem and can be regarded as the combination of the likelihood of the expert policy data and a prior over the parameters; after simplification, the objective consists of the data likelihood term and a regularization function of the reward parameters, with the expert and learner quantities computed over the same time range.
Step S23: the reward parameters are solved with a gradient descent algorithm; by setting a suitable learning rate, the optimal handling-stability reward is obtained.
As shown in FIG. 2 and FIG. 3, the final learning result is evaluated by the error and the degree of approximation between the expert policy and the learning policy. For the overall error variation in FIG. 2, 200 samples are drawn, and the smaller the sum of their errors, the better the learning effect. FIG. 3 plots the actual fit between part of the expert policy and the learning policy at different vehicle speeds.
In step S3, deep inverse reinforcement learning is accelerated using multi-process asynchronous training, specifically as follows:
Step S31: multi-process asynchronous training makes full use of the multi-core CPU of a single computer and distributes the training task over several CPU cores, which increases training speed and reduces training time. The multi-process asynchronous training environment is divided into two parts: a main process and several sub-processes. Each sub-process holds two neural networks, an Actor_l policy network and a Critic_l value network, which together form a sub-network; the single main process holds two neural networks, an Actor_g policy network and a Critic_g value network, which together form the main network. These networks share the same structure but their parameters are updated differently; the policy networks map the vehicle state to the vehicle slip rate. A minimal sketch of the process layout follows.
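A minimal sketch of the process layout of step S31, assuming PyTorch modules for the global networks and torch.multiprocessing for the sub-processes; `worker_fn` stands for the per-process training loop and is a placeholder, not part of the patent.

```python
import torch.multiprocessing as mp

def start_workers(global_actor, global_critic, num_workers, worker_fn):
    """S31: one main process holds Actor_g/Critic_g in shared memory; each
    sub-process builds its own Actor_l/Critic_l inside `worker_fn`."""
    global_actor.share_memory()    # expose the global parameters to all sub-processes
    global_critic.share_memory()
    workers = []
    for rank in range(num_workers):
        p = mp.Process(target=worker_fn, args=(rank, global_actor, global_critic))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
```

On platforms that spawn rather than fork processes, the call to start_workers should be guarded by an `if __name__ == "__main__":` block.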
Step S32: the network parameters of Actor_g and Critic_g are initialized and copied to Actor_l and Critic_l respectively; the number of update steps of a sub-process, the maximum number of training rounds, and the maximum capacity of the experience pool are set, and the step counter and the training-round counter are initialized.
Step S33: in each sub-process, an action is selected under the current policy from the current vehicle state, the vehicle state at the next moment is computed, and the reward value of the vehicle is then calculated from that state. The transition data (state, action, reward, next state, termination flag) are put into the experience pool: each new entry is stored at the first position and the existing entries shift backwards in order. The termination flag indicates whether the training episode has finished (1 if finished, 0 otherwise), and the step counter and round counter of the sub-process are updated accordingly.
Step S34: in each sub-process, after every fixed number of update steps, data are randomly extracted from the experience pool to calculate the Actor_l and Critic_l network loss functions. The actor loss uses the probability ratio between the new and old policies, clipped by a clipping function with a clip coefficient, together with the entropy of the policy; the critic loss uses the discounted return with a discount factor.
Step S35: whenever the Actor_l and Critic_l losses have been computed, the gradients of the two local networks are calculated and used to update the parameters of Actor_g and Critic_g by gradient descent with their respective learning rates. Each sub-process can independently update the parameters of the two networks in the main process, which realizes asynchronous updating. Every time a sub-process updates the main-process parameters, its step counter is reset to 0 and the network parameters of the main process are assigned back to the sub-process networks that provided the gradient information; the full multi-process asynchronous training framework is shown in FIG. 4, and a sketch of one asynchronous update follows.
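A sketch of one asynchronous update of step S35, assuming the gradients have already been computed on the local networks via the losses of step S34 and that the global networks and their optimizer live in shared memory; the function name is illustrative.

```python
def push_gradients_and_sync(local_net, global_net, global_optimizer):
    """S35: copy the locally computed gradients onto the global network,
    take one gradient-descent step on the global parameters, then pull the
    updated global parameters back into the local network."""
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp._grad = lp.grad                         # hand over the sub-process gradient
    global_optimizer.step()
    global_optimizer.zero_grad()
    # Re-sync the local copy; the caller resets the sub-process step counter to 0.
    local_net.load_state_dict(global_net.state_dict())
```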
Step S36: when the maximum number of training rounds is reached, the training process ends and the resulting policy is the optimal policy. This policy is then returned to step S22 and joins the parameter-update process of deep inverse reinforcement learning, thereby realizing the acceleration of deep inverse reinforcement learning training with the multi-process asynchronous method.
The training-time differences of deep inverse reinforcement learning with different numbers of processes are shown in FIG. 5: the total training time gradually decreases as the number of sub-processes increases. FIG. 6 shows a 3D plot of vehicle handling stability in different vehicle states, where the safety value represents the handling stability of the vehicle, and larger is better.
In step S4, the integrated vehicle control in the high-speed complex scenario is performed as follows:
Step S41: the vehicle-control problem in the high-speed scenario is treated as a Markov decision process and represented by the tuple (state space, action space of the autonomous vehicle, environmental state-transition probability, immediate reward of the vehicle's action, discount factor); the goal of the vehicle is to maximize its own return through its policy in the Markov decision process. The high-speed driving environment is shown in FIG. 7 and is refined into the lane-keeping task of FIG. 8 and the lane-change overtaking task of FIG. 9; the width of each lane is 3.2 m.
Step S42: several sub-rewards are designed and their linear combination is calculated to guide the vehicle towards integrated control; the sub-rewards are the road-progress reward, speed reward, heading-angle reward, road-center-deviation reward, distance-keeping reward, vehicle torque-distribution reward, steering-wheel-angle reward, and vehicle handling-stability reward, which together form the total reward (see the sketch after this step).
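A sketch of the linear sub-reward combination of step S42; the weight values and dictionary keys below are purely illustrative and are not the patent's settings.

```python
def total_reward(sub_rewards: dict, weights: dict) -> float:
    """S42: the total reward is a weighted linear combination of the
    sub-rewards, including the learned handling-stability reward."""
    return sum(weights[name] * value for name, value in sub_rewards.items())

# Illustrative call with hypothetical sub-reward values and weights:
r_total = total_reward(
    {"progress": 1.2, "speed": 0.8, "heading": -0.1, "lane_center": -0.3,
     "distance": 0.5, "torque": -0.05, "steering": -0.02, "stability": 0.9},
    {"progress": 1.0, "speed": 0.5, "heading": 0.5, "lane_center": 1.0,
     "distance": 0.5, "torque": 0.2, "steering": 0.2, "stability": 1.0})
```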
Step S43: a neural network with shared Actor and Critic parameters, shown in FIG. 10, is created so that a single forward pass outputs both the integrated control strategy and the value function used to evaluate that strategy; feature extraction of the state information is performed in the hidden layers, and the loss function combines the policy loss and the value loss, weighted by a hyperparameter, over the shared network parameters. FIG. 11 shows the convergence of the algorithm during training: as the number of training episodes increases, the return (i.e., the reward) obtained by the vehicle gradually increases and converges, while the loss function gradually decreases and converges.
Step S44: the neural-network parameters are updated with a gradient descent algorithm, where the step size is a learning-rate hyperparameter.
As shown in FIG. 12, the state-change curve of the vehicle in the lane-keeping task at a speed of 30 m/s compares the performance of the vehicle trained with the handling-stability reward (the stability reward in the figure) against an otherwise identically trained vehicle without it. FIG. 12 shows that, against the tendency of the vehicle to drift off track, the vehicle trained with the handling-stability reward performs better both in reducing the deviation from the lane center line and in maintaining a low yaw rate.
As shown in FIG. 13, the state-change curve of the vehicle in the lane-change overtaking task shows that the vehicle trained with the handling-stability reward performs better: without it, the vehicle is more easily pushed off the track; with it, the slip rate is maintained well, the tires are less likely to slip, and the acceleration changes gently, whereas without it the slip rate is maintained poorly, the tires slip more easily, and the acceleration changes violently, which is unfavorable for maintaining the handling stability of the vehicle.
Claims (5)
1. A deep reinforcement learning method for reducing vehicle hyperparameter adjustment, the method comprising the following steps:
S1, acquiring the expert policy required for training;
S2, obtaining the vehicle handling-stability reward using deep inverse reinforcement learning;
S3, accelerating deep inverse reinforcement learning training with a multi-process asynchronous method;
S4, performing integrated vehicle control in the high-speed complex scenario.
2. The deep reinforcement learning method for reducing vehicle hyperparameter adjustment according to claim 1, wherein the raw expert data in step S1 are the vehicle handling-stability evaluation values corresponding to different vehicle slip rates under different front-wheel steering angles, longitudinal vehicle speeds, and road friction coefficients; the larger the evaluation value, the better the handling stability, each set of fixed front-wheel steering angle, longitudinal speed, and road friction coefficient defines one working condition, each working condition corresponds to one evaluation-value curve over the slip rate, and the raw expert data need to be converted into a form described by a probability distribution function, so the following processing is performed:
Step S11: to prevent the evaluation values from being too large and causing numerical overflow, the minimum evaluation value under the current working condition is subtracted from every evaluation value of that working condition;
Step S12: using the softmax method, the evaluation values of each working condition are converted into a probability distribution over the slip rate whose probabilities sum to 1;
Step S13: a parameterized normal distribution is used to approximate this probability distribution, i.e., the slip-rate distribution is taken to obey a normal distribution whose mean and variance are the parameters to be determined, and the approximation is realized by minimizing the KL divergence between the two distributions;
Step S14: the mean and variance are computed with a gradient descent algorithm that iteratively minimizes the KL divergence, where the step size is the learning rate of the parameter-update process;
Step S15: when there is an error between the peak of the fitted normal distribution and the peak of the original probability distribution, i.e., in the optimal slip rate, the mean of the normal distribution is corrected to the optimal slip rate of the original probability distribution while the variance is kept unchanged, and the corrected probability distribution is the final expert policy.
3. The deep reinforcement learning method for reducing vehicle hyperparameter adjustment according to claim 1, wherein the specific process of obtaining the vehicle handling-stability reward using deep inverse reinforcement learning in step S2 is as follows:
Step S21: the feature expectations of the expert policy and of the learning policy are calculated; for a group of expert trajectories, each trajectory is a sequence of vehicle states of fixed length, the feature expectation of the expert policy is the discounted sum of the state features averaged over the trajectories, where the discount coefficient weights later states less and the state feature is an indicator that equals 1 when the vehicle state matches a specific state and 0 otherwise, and the feature expectation of the learning policy, i.e., of the vehicle during deep reinforcement learning training, is calculated in the same way;
Step S22: the vehicle handling-stability reward is obtained using deep inverse reinforcement learning; the handling-stability reward function is approximated by a neural network whose input is the vehicle state and whose parameters are to be solved, and solving for the reward parameters is converted into an optimization problem whose objective is the posterior probability of the reward parameters, which is equivalent to a maximum a posteriori estimation problem and can be regarded as the combination of the likelihood of the expert policy data and a prior over the parameters; after simplification, the objective consists of the data likelihood term and a regularization function of the reward parameters, with the expert and learner quantities computed over the same time range;
Step S23: the reward parameters are solved with a gradient descent algorithm, and by setting a suitable learning rate the optimal handling-stability reward is obtained.
4. The deep reinforcement learning method for reducing vehicle hyperparameter adjustment according to claim 1, wherein the specific training process of accelerating deep inverse reinforcement learning with the multi-process asynchronous method in step S3 is as follows:
Step S31: a multi-process asynchronous training environment is constructed; multi-process asynchronous training makes full use of the multi-core CPU of a single computer and distributes the training task over several CPU cores, which increases training speed and reduces training time, and the environment is divided into two parts: a main process and several sub-processes, each sub-process holding two neural networks, an Actor_l policy network and a Critic_l value network, and the single main process holding two neural networks, an Actor_g policy network and a Critic_g value network; these networks share the same structure but their parameters are updated differently, and the policy networks map the vehicle state to the vehicle slip rate;
Step S32: the network parameters of Actor_g and Critic_g are initialized and copied to Actor_l and Critic_l respectively; the number of update steps of a sub-process, the maximum number of training rounds, and the maximum capacity of the experience pool are set, and the step counter and the training-round counter are initialized;
Step S33: in each sub-process, an action is selected under the current policy from the current vehicle state, the vehicle state at the next moment is computed, and the reward value of the vehicle is then calculated from that state; the transition data (state, action, reward, next state, termination flag) are put into the experience pool, each new entry being stored at the first position while the existing entries shift backwards in order, where the termination flag indicates whether the training episode has finished (1 if finished, 0 otherwise), and the step counter and round counter of the sub-process are updated accordingly;
Step S34: in each sub-process, after every fixed number of update steps, data are randomly extracted from the experience pool to calculate the Actor_l and Critic_l network loss functions, where the actor loss uses the probability ratio between the new and old policies, clipped by a clipping function with a clip coefficient, together with the entropy of the policy, and the critic loss uses the discounted return with a discount factor;
Step S35: whenever the Actor_l and Critic_l losses have been computed, the gradients of the two local networks are calculated and used to update the parameters of Actor_g and Critic_g by gradient descent with their respective learning rates; each sub-process can independently update the parameters of the two networks in the main process, which realizes asynchronous updating, and every time a sub-process updates the main-process parameters, its step counter is reset to 0 and the network parameters of the main process are assigned back to the sub-process networks that provided the gradient information;
Step S36: when the maximum number of training rounds is reached, the training process ends and the resulting policy is the optimal policy, which is then returned to step S22 and joins the parameter-update process of deep inverse reinforcement learning, thereby realizing the acceleration of deep inverse reinforcement learning training with the multi-process asynchronous method.
5. The deep reinforcement learning method for reducing vehicle hyperparameter adjustment according to claim 1, wherein the integrated vehicle decision in the high-speed complex scenario in step S4 is implemented as follows:
Step S41: the vehicle-control problem in the high-speed scenario is treated as a Markov decision process and represented by the tuple (state space, action space of the autonomous vehicle, environmental state-transition probability, immediate reward of the vehicle's action, discount factor), and the goal of the vehicle is to maximize its own return through its policy in the Markov decision process;
Step S42: several sub-rewards are designed and their linear combination is calculated to guide the vehicle towards integrated control, the sub-rewards being the road-progress reward, speed reward, heading-angle reward, road-center-deviation reward, distance-keeping reward, vehicle torque-distribution reward, steering-wheel-angle reward, and vehicle handling-stability reward, which together form the total reward;
Step S43: a neural network with shared Actor and Critic parameters is created so that a single forward pass outputs both the integrated control strategy and the value function used to evaluate that strategy, the loss function combining the policy loss and the value loss, weighted by a hyperparameter, over the shared network parameters;
Step S44: the neural-network parameters are updated with a gradient descent algorithm, where the step size is a learning-rate hyperparameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
CN202411023989.8A | 2024-07-29 | 2024-07-29 | Deep reinforcement learning method for reducing vehicle hyperparameter adjustment
Publications (1)
Publication Number | Publication Date
---|---|
CN118569350A (en) | 2024-08-30
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541570A (en) * | 2020-11-24 | 2021-03-23 | 北京三快在线科技有限公司 | Multi-model training method and device, electronic equipment and storage medium |
GB202117359D0 (en) * | 2021-04-16 | 2022-01-12 | Motional Ad Llc | Techniques for navigating an autonomous vehicle based on perceived risk |
CN116215559A (en) * | 2021-12-02 | 2023-06-06 | 沃尔沃卡车集团 | Redundant vehicle control system based on tire sensor load estimation |
CN115525019A (en) * | 2022-10-13 | 2022-12-27 | 长春工业大学 | Composite working condition chassis integrated control method based on operation stability probability distribution |
Non-Patent Citations (2)
Title |
---|
BIN ZHAO ET AL: "Learning to Evaluate Potential Safety Risk of Ego-Vehicle via Inverse Reinforcement Learning", IEEE IAEAC, 25 April 2024 (2024-04-25), pages 1-5 *
HUANG Dailin: "Research on Adaptive Traffic Signal Control Based on Multi-Agent Reinforcement Learning", China Master's Theses Full-text Database, Engineering Science and Technology II, 15 February 2023 (2023-02-15), page 5 *
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |