CN118569350A - Deep reinforcement learning method for reducing vehicle hyperparameter tuning - Google Patents

Deep reinforcement learning method for reducing vehicle hyperparameter tuning

Info

Publication number
CN118569350A
Authority
CN
China
Prior art keywords
vehicle
training
reinforcement learning
rewards
parameters
Prior art date
Legal status
Pending
Application number
CN202411023989.8A
Other languages
Chinese (zh)
Inventor
赵彬
王泽�
刘向进
李何为
刘畅
孙福弘
Current Assignee
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date
2024-07-29
Filing date
2024-07-29
Publication date
2024-08-30
Application filed by Changchun University of Technology
Priority to CN202411023989.8A
Publication of CN118569350A
Legal status: Pending

Landscapes

  • Feedback Control In General (AREA)

Abstract

A deep reinforcement learning method for reducing vehicle hyperparameter tuning, relating to the technical field of reducing hyperparameter tuning for autonomous vehicles based on deep reinforcement learning. A nonlinear reward-learning architecture for vehicle handling stability based on deep inverse reinforcement learning is provided, together with an integrated autonomous-driving control strategy that combines the ego vehicle's nonlinear handling-stability reward with driving-rule rewards in a highway scenario, which greatly reduces the hyperparameter settings of the handling-stability reward function during training of the integrated control strategy. The method comprises the following steps: acquiring the expert policy required for deep inverse reinforcement learning training; obtaining the vehicle handling-stability reward using deep inverse reinforcement learning; accelerating deep inverse reinforcement learning training using a multi-process asynchronous method; and integrated vehicle control in high-speed complex scenarios. The invention realizes integrated control that maintains vehicle stability in high-speed scenarios without relying on a large number of stability hyperparameter settings.

Description

Deep reinforcement learning method for reducing vehicle hyperparameter tuning
Technical Field
The invention provides a deep reinforcement learning method for reducing vehicle hyperparameter tuning, and relates to the technical field of reducing hyperparameter tuning for autonomous vehicles based on deep reinforcement learning.
Background
With the rapid development of economic globalization and the rapid expansion of urbanization, the number of vehicles on the road has increased sharply, and the incidence of traffic accidents has risen accordingly. Autonomous driving is expected to make a great contribution to reducing traffic accidents, improving road-use efficiency, and relieving traffic congestion, but the technology faces many challenges in its implementation. To ensure the driving safety of an autonomous vehicle in a highway environment, an algorithm based on deep reinforcement learning must be constructed, and a reward function must be designed to evaluate, during training, the influence of vehicle hyperparameters such as vehicle speed, yaw angle, and steering-wheel angle on the handling stability of the vehicle, so that the vehicle remains stable while complex tasks such as lane keeping and lane-change overtaking are also accomplished quickly. At present, most handling-stability reward functions rely on empirical assumptions and vehicle dynamics models, and setting the large number of hyperparameters of such reward functions is difficult; in particular, the nonlinear reward function of vehicle hyperparameters under composite tire operating conditions is even harder to tune.
Disclosure of Invention
To solve these problems, the invention provides a nonlinear reward-learning architecture for vehicle handling stability based on deep inverse reinforcement learning, and further provides an integrated autonomous-driving control strategy that combines the vehicle's nonlinear handling-stability reward with driving-rule rewards in the expressway scenario, thereby greatly reducing the hyperparameter settings of the vehicle handling-stability reward function during training of the integrated control strategy.
The deep reinforcement learning method for reducing vehicle hyperparameter tuning according to the invention comprises the following steps:
S1, acquiring the expert policy required for deep inverse reinforcement learning training;
S2, obtaining the vehicle handling-stability reward using deep inverse reinforcement learning;
S3, accelerating deep inverse reinforcement learning training using a multi-process asynchronous method;
S4, integrated vehicle control in the high-speed complex scenario.
Step S1 specifically comprises the following steps:
Step S11: the original expert data are vehicle handling-stability evaluation values for different vehicle slip ratios under different front-wheel steering angles, longitudinal vehicle speeds, and road friction coefficients; the larger the evaluation value, the better the handling stability of the vehicle. Each fixed combination of front-wheel steering angle, longitudinal vehicle speed, and road friction coefficient constitutes one operating condition, and each operating condition corresponds to one set of evaluation values. To obtain the expert policy in the form of a probability distribution function, the original expert data are first preprocessed: to prevent excessively large evaluation values from causing numerical overflow, the minimum evaluation value under the current operating condition is subtracted from every evaluation value of that condition;
Step S12: using the softmax method, the shifted evaluation values of each operating condition are converted into a probability distribution over slip ratios whose probabilities sum to 1;
Step S13: a parameterized normal distribution is used to approximate this probability distribution, i.e., the distribution is assumed to be normal with a mean and a variance to be determined; the approximation is obtained by minimizing the KL divergence between the two distributions;
Step S14: the mean and variance are computed with the gradient descent algorithm, which iteratively updates them toward the optimal solution at a given learning rate;
Step S15: when the peak probability of the fitted normal distribution deviates from that of the original probability distribution, i.e., when there is an error in the optimal slip ratio, the mean is corrected to the optimal slip ratio of the original probability distribution while the variance is kept unchanged; the corrected probability distribution is the final expert policy.
Step S2 specifically comprises the following steps:
Step S21: compute the feature expectations of the expert policy and of the learned policy. For a set of expert policy data, each expert trajectory is a sequence of vehicle states of given length, and the feature expectation of the expert policy is the discounted sum of state features along each trajectory, averaged over all expert trajectories; the discount coefficient weights later states, and the state feature equals 1 when the vehicle state matches a specific state and 0 otherwise. The feature expectation of the learned policy, i.e., of the vehicle during deep reinforcement learning training, is computed in the same way as that of the expert policy;
Step S22: obtain the vehicle handling-stability reward using deep inverse reinforcement learning. The handling-stability reward function is approximated by a neural network whose input is the vehicle state, and solving for the reward parameters is converted into an optimization problem: the probability of the reward parameters is treated as a maximum a posteriori estimation problem, i.e., as the combination of the likelihood of the expert policy data and a prior over the parameters. After simplification, the objective contains a regularization function of the reward parameters, and the expert and learned feature expectations are both computed within the same time range, under which assumption the objective simplifies further;
Step S23: the reward parameters are solved by the gradient descent algorithm, and by setting an appropriate learning rate the optimal handling-stability reward is obtained.
Step S3 specifically comprises the following steps:
Step S31: multi-process asynchronous training makes full use of the multi-core CPU of a single computer and distributes the training task over several CPU cores, which increases training speed and shortens training time. The multi-process asynchronous training learning environment is divided into two parts, a main process and several sub-processes: each sub-process holds two neural networks, an actor_l policy network and a critic_l value network; the main process holds only two neural networks, an actor_g policy network and a critic_g value network. These neural networks have the same structure but are updated in different ways; the state is the vehicle state and the action is the vehicle slip ratio;
Step S32: initialize the network parameters of actor_g and critic_g and assign them to actor_l and critic_l respectively; set the number of update steps of the sub-processes, the maximum number of training rounds, and the maximum capacity of the experience pool, and initialize the action step counter and the training round counter;
Step S33: in each sub-process, an action is selected for the current vehicle state under the current policy, the vehicle state at the next moment is computed, and the reward value of the vehicle is then computed from that state; the transition data are placed into the experience pool, each new record being stored at the first position while the existing records are shifted backwards in turn. A done flag records whether the training episode has ended, taking the value 1 if it has ended and 0 otherwise; within the sub-process, the step counter is incremented after each interaction and the round counter is incremented when an episode ends;
Step S34: in each sub-process, after every fixed number of update steps, data are randomly drawn from the experience pool to compute the actor_l and critic_l network loss functions; the actor_l loss combines a clipped surrogate term, in which a clipping function bounded by the clipping coefficient is applied, with the entropy of the policy, and the critic_l loss uses discounted returns formed with the discount factor;
Step S35: whenever the actor_l and critic_l losses have been computed, the gradients of the two networks are computed and used to update the parameters of actor_g and critic_g with the respective learning rates of the gradient descent algorithm. Each sub-process can independently update the parameters of the two networks in the main process, thereby realizing asynchronous updating; every time a sub-process updates the main-process parameters, its update-step counter is reset to 0 and the network parameters of the main process are assigned back to the networks of the sub-process that provided the gradient information;
Step S36: when the maximum number of training rounds is reached, the training process ends and the resulting policy is the optimal policy; this policy is then returned to step S22 for the parameter-update process of deep inverse reinforcement learning, thereby realizing the acceleration of deep inverse reinforcement learning training by the multi-process asynchronous method.
Step S4 specifically comprises the following steps:
Step S41: the vehicle-control problem in the high-speed scenario is treated as a Markov decision process described by a tuple consisting of the state space, the action space of the autonomous vehicle, the environment state-transition probability, the immediate reward of the vehicle's action, and the discount factor; the goal of the vehicle is to maximize its own return through its policy in this Markov decision process;
Step S42: several sub-rewards are designed, and a linear combination of these sub-rewards is computed to guide the vehicle toward integrated control; the sub-rewards comprise a road-progress reward, a speed reward, a heading-angle reward, a road-center-deviation reward, a distance-keeping reward, a vehicle torque-distribution reward, a steering-wheel-angle reward, and the vehicle handling-stability reward, and the total reward is their linear combination;
Step S43: a neural network in which the Actor and Critic share parameters is created, so that a single forward pass of the network outputs both the integrated control policy and the value function used to evaluate that policy; the loss function is expressed in terms of the policy output, the value output, the network parameters, and a weighting hyperparameter;
Step S44: the parameters of the neural network are updated using the gradient descent algorithm with a learning-rate hyperparameter.
The beneficial effects of the invention are as follows:
1. Aimed at the difficulty of constructing a handling-stability reward function for autonomous driving, a vehicle handling-stability control strategy based on deep inverse reinforcement learning is proposed; through a two-layer structure of upper-layer deep inverse reinforcement learning and lower-layer deep reinforcement learning, both the learning of the vehicle handling-stability reward function and the final vehicle stability control are realized. This not only simplifies the construction of the stability reward function and broadens the research methods for vehicle handling-stability control, but also provides important technical support for the safe driving of autonomous vehicles in highway scenarios.
2. A multi-process asynchronous parallel training method is introduced. By running agents on several CPU cores simultaneously, data are processed in parallel and the control policy is updated rapidly. With this method, the agents learn and explore independently in different environment instances, which markedly improves training efficiency. In addition, the multi-process learning environment and algorithm are designed in detail to ensure the stability of the training process and the convergence of the policy.
Drawings
FIG. 1 shows part of the acquired expert policy;
FIG. 2 shows the overall error between the expert policy and the learned policy during deep inverse reinforcement learning training;
FIG. 3 shows the final fit between part of the learned policy and the expert policy in deep inverse reinforcement learning;
FIG. 4 is the multi-process asynchronous training framework;
FIG. 5 shows the difference in training time of deep inverse reinforcement learning for different numbers of processes in multi-process asynchronous training;
FIG. 6 shows the evaluation of vehicle handling stability under different vehicle driving states in multi-process asynchronous training;
FIG. 7 is a schematic view of the driving environment of the vehicle in the high-speed scenario;
FIG. 8 is a schematic diagram of the lane-keeping task in the high-speed scenario;
FIG. 9 is a schematic diagram of the lane-change overtaking task in the high-speed scenario;
FIG. 10 is the neural-network architecture of the integrated control algorithm for the high-speed scenario;
FIG. 11 is the convergence curve of the integrated control algorithm in the high-speed scenario;
FIG. 12 shows the state-change curves of the vehicle in the lane-keeping task in the high-speed scenario;
FIG. 13 shows the state-change curves of the vehicle in the lane-change overtaking task in the high-speed scenario.
Detailed Description
For a further understanding of the present invention, the present invention will be described in detail with reference to the drawings and examples, which are to be understood as being illustrative only and not limiting.
The specific implementation steps are as follows:
Step S1, acquiring the expert policy required for training;
Step S2, obtaining the vehicle handling-stability reward using deep inverse reinforcement learning;
Step S3, accelerating deep inverse reinforcement learning training using a multi-process asynchronous method;
Step S4, integrated vehicle control in the high-speed complex scenario.
In step S1, in order to acquire the required expert policy, the following steps are performed:
Step S11: the original expert data are vehicle handling-stability evaluation values for different vehicle slip ratios under different front-wheel steering angles, longitudinal vehicle speeds, and road friction coefficients; the larger the evaluation value, the better the handling stability of the vehicle. Each fixed combination of front-wheel steering angle, longitudinal vehicle speed, and road friction coefficient constitutes one operating condition, and each operating condition corresponds to one set of evaluation values. To obtain the expert policy in the form of a probability distribution function, the original expert data are first preprocessed: to prevent excessively large evaluation values from causing numerical overflow, the minimum evaluation value under the current operating condition is subtracted from every evaluation value of that condition;
Step S12: using the softmax method, the shifted evaluation values of each operating condition are converted into a probability distribution over slip ratios whose probabilities sum to 1;
Step S13: a parameterized normal distribution is used to approximate this probability distribution, i.e., the distribution is assumed to be normal with a mean and a variance to be determined; the approximation is obtained by minimizing the KL divergence between the two distributions;
Step S14: the mean and variance are computed with the gradient descent algorithm, which iteratively updates them toward the optimal solution at a given learning rate;
Step S15: as shown in FIG. 1, when the peak probability of the fitted normal distribution deviates from that of the original probability distribution, i.e., when there is an error in the optimal slip ratio, the mean is corrected to the optimal slip ratio of the original probability distribution while the variance is kept unchanged; the corrected probability distribution is the final expert policy.
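A minimal Python sketch of the expert-policy construction described in steps S11–S15, assuming the evaluation values of one operating condition are given on a discrete slip-ratio grid; the function names, the grid, and the learning-rate and iteration settings are illustrative assumptions rather than the patent's values.

```python
import numpy as np
import torch

def expert_distribution(q_values: np.ndarray) -> np.ndarray:
    """Steps S11-S12: shift the evaluation values of one operating condition by
    their minimum (the overflow guard) and convert them to probabilities via softmax."""
    shifted = q_values - q_values.min()
    exp_q = np.exp(shifted)
    return exp_q / exp_q.sum()

def fit_expert_policy(slip_grid, p_expert, lr=0.05, steps=2000):
    """Steps S13-S15: fit the mean and variance of a normal distribution by gradient
    descent on KL(P_expert || N(mu, sigma)), then correct the mean to the expert
    peak (the optimal slip ratio) while keeping the variance unchanged."""
    slip = torch.as_tensor(slip_grid, dtype=torch.float32)
    p = torch.as_tensor(p_expert, dtype=torch.float32)
    mu = torch.tensor(float((slip * p).sum()), requires_grad=True)   # initial guess
    log_sigma = torch.tensor(0.0, requires_grad=True)                # keeps sigma > 0
    opt = torch.optim.SGD([mu, log_sigma], lr=lr)
    for _ in range(steps):
        # discretised normal on the grid; normalisation makes it a distribution
        q = torch.softmax(-0.5 * ((slip - mu) / log_sigma.exp()) ** 2, dim=0)
        kl = (p * (torch.log(p + 1e-12) - torch.log(q + 1e-12))).sum()
        opt.zero_grad()
        kl.backward()
        opt.step()
    mu_corrected = slip[p.argmax()].item()   # S15: move the mean to the expert peak
    return mu_corrected, log_sigma.exp().item()

# illustrative operating condition: hypothetical stability scores over a slip-ratio grid
slip_grid = np.linspace(0.0, 0.3, 31)
q_values = -np.abs(slip_grid - 0.12) * 10.0
p_expert = expert_distribution(q_values)
mu, sigma = fit_expert_policy(slip_grid, p_expert)   # final expert policy N(mu, sigma^2)
```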
In step S2, the vehicle handling-stability reward is obtained using deep inverse reinforcement learning, specifically comprising the following steps:
Step S21: compute the feature expectations of the expert policy and of the learned policy. For a set of expert policy data, each expert trajectory is a sequence of vehicle states of given length, and the feature expectation of the expert policy is the discounted sum of state features along each trajectory, averaged over all expert trajectories; the discount coefficient weights later states, and the state feature equals 1 when the vehicle state matches a specific state and 0 otherwise. The feature expectation of the learned policy, i.e., of the vehicle during deep reinforcement learning training, is computed in the same way as that of the expert policy;
Step S22: obtain the vehicle handling-stability reward using deep inverse reinforcement learning. The handling-stability reward function is approximated by a neural network whose input is the vehicle state, and solving for the reward parameters is converted into an optimization problem: the probability of the reward parameters is treated as a maximum a posteriori estimation problem, i.e., as the combination of the likelihood of the expert policy data and a prior over the parameters. After simplification, the objective contains a regularization function of the reward parameters, and the expert and learned feature expectations are both computed within the same time range, under which assumption the objective simplifies further;
Step S23: the reward parameters are solved by the gradient descent algorithm, and by setting an appropriate learning rate the optimal handling-stability reward is obtained.
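The sketch below, assuming states are fixed-length feature vectors, illustrates the discounted feature expectation of step S21 and one gradient-descent update of the neural-network reward of steps S22–S23. The concrete objective (expert states scored above learner states, with an L2 prior standing in for the regularization term), the layer sizes, and all names are assumptions; the patent's own simplified objective is given by its equations.

```python
import numpy as np
import torch
import torch.nn as nn

def feature_expectation(trajectories, phi, gamma=0.99):
    """Step S21: average discounted feature expectation over a set of trajectories.
    phi maps a vehicle state to a feature vector (e.g. an indicator of a specific state)."""
    total = None
    for traj in trajectories:
        disc = sum(gamma ** t * np.asarray(phi(s), dtype=float) for t, s in enumerate(traj))
        total = disc if total is None else total + disc
    return total / len(trajectories)

class StabilityReward(nn.Module):
    """Step S22: neural-network approximation of the handling-stability reward."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state)

def irl_update(reward_net, expert_states, learner_states, opt, l2=1e-3):
    """Step S23: one gradient-descent step on the reward parameters.
    Assumed objective: expert states should receive higher reward than states
    visited by the current learned policy, plus an L2 prior on the parameters."""
    loss = (reward_net(learner_states).mean() - reward_net(expert_states).mean()
            + l2 * sum(p.pow(2).sum() for p in reward_net.parameters()))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage sketch (state dimension and learning rate are assumptions)
reward_net = StabilityReward(state_dim=4)
opt = torch.optim.SGD(reward_net.parameters(), lr=1e-3)
```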
As shown in FIG. 2 and FIG. 3, the final learning result is evaluated by the error and the degree of fit between the expert policy and the learned policy; in the overall error curve of FIG. 2, 200 samples are drawn, and the smaller the sum of errors over the samples, the better the learning effect; FIG. 3 plots the actual fit between part of the expert policy and the learned policy at different vehicle speeds.
In step S3, deep inverse reinforcement learning is accelerated using multi-process asynchronous training, specifically as follows:
Step S31: multi-process asynchronous training makes full use of the multi-core CPU of a single computer and distributes the training task over several CPU cores, which increases training speed and shortens training time. The learning environment is divided into two parts, a main process and several sub-processes: each sub-process holds two neural networks, an actor_l policy network and a critic_l value network, which together form a sub-network; the main process holds a single main network, likewise composed of two neural networks, an actor_g policy network and a critic_g value network. These neural networks have the same structure but are updated in different ways; the state is the vehicle state and the action is the vehicle slip ratio;
Step S32: initialize the network parameters of actor_g and critic_g and assign them to actor_l and critic_l respectively; set the number of update steps of the sub-processes, the maximum number of training rounds, and the maximum capacity of the experience pool, and initialize the action step counter and the training round counter;
Step S33: in each sub-process, an action is selected for the current vehicle state under the current policy, the vehicle state at the next moment is computed, and the reward value of the vehicle is then computed from that state; the transition data are placed into the experience pool, each new record being stored at the first position while the existing records are shifted backwards in turn. A done flag records whether the training episode has ended, taking the value 1 if it has ended and 0 otherwise; within the sub-process, the step counter is incremented after each interaction and the round counter is incremented when an episode ends;
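A minimal container matching the experience-pool behaviour described in step S33, with the newest transition stored at the first position and a bounded capacity; the class and method names are illustrative.

```python
import random
from collections import deque

class ExperiencePool:
    """Step S33: bounded pool in which each new record is stored at the first
    position and older records shift backwards (the oldest drops out when full)."""
    def __init__(self, capacity: int):
        self.buf = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # done is 1 if the training episode has ended, otherwise 0
        self.buf.appendleft((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # random draw used by step S34 when computing the network losses
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```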
Step S34: in each sub-process, after every fixed number of update steps, data are randomly drawn from the experience pool to compute the actor_l and critic_l network loss functions; the actor_l loss combines a clipped surrogate term, in which a clipping function bounded by the clipping coefficient is applied, with the entropy of the policy, and the critic_l loss uses discounted returns formed with the discount factor;
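The quantities named in step S34 (policy entropy, clipping coefficient, clipping function, discount factor) are consistent with a PPO-style clipped surrogate objective; the sketch below assumes that form, which may differ in detail from the patent's exact expressions.

```python
import torch
import torch.nn.functional as F

def actor_critic_losses(dist, actions, old_log_probs, returns, values,
                        clip_eps=0.2, entropy_coef=0.01):
    """Step S34 (assumed form): clipped policy loss with an entropy term for
    actor_l and a value-regression loss for critic_l.

    dist          : torch.distributions object produced by actor_l (scalar action assumed)
    actions       : actions drawn from the experience pool
    old_log_probs : log-probabilities of those actions under the behaviour policy
    returns       : discounted returns computed with the discount factor
    values        : critic_l estimates for the sampled states
    """
    advantages = (returns - values).detach()
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)   # the clipping function
    actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean() \
                 - entropy_coef * dist.entropy().mean()            # policy-entropy term
    critic_loss = F.mse_loss(values, returns)
    return actor_loss, critic_loss
```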
Step S35: whenever the actor_l and critic_l losses have been computed, the gradients of the two networks are computed and used to update the parameters of actor_g and critic_g with the respective learning rates of the gradient descent algorithm. Each sub-process can independently update the parameters of the two networks in the main process, thereby realizing asynchronous updating; every time a sub-process updates the main-process parameters, its update-step counter is reset to 0 and the network parameters of the main process are assigned back to the networks of the sub-process that provided the gradient information. The specific multi-process asynchronous training framework is shown in FIG. 4;
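An A3C-style push-and-pull sketch of the asynchronous update in step S35: the sub-process computes gradients locally, hands them to the corresponding main-process network, then copies the main-process parameters back. The function name, the shared optimiser, and leaving the reset of the update-step counter to the caller are assumptions.

```python
import torch

def push_and_pull(local_net, global_net, global_opt, loss):
    """Step S35: apply gradients computed in a sub-process to the corresponding
    main-process network (actor_g or critic_g), then sync the sub-process copy."""
    local_net.zero_grad()
    global_opt.zero_grad()
    loss.backward()                       # gradients land on the local (sub-process) network
    for lp, gp in zip(local_net.parameters(), global_net.parameters()):
        gp.grad = None if lp.grad is None else lp.grad.clone()
    global_opt.step()                     # gradient-descent update of the global parameters
    local_net.load_state_dict(global_net.state_dict())   # assign global parameters back
```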
Step S36: when the maximum number of training rounds is reached, the training process ends and the resulting policy is the optimal policy; this policy is then returned to step S22 for the parameter-update process of deep inverse reinforcement learning, thereby completing the acceleration of deep inverse reinforcement learning training by the multi-process asynchronous method.
The difference in training time of deep inverse reinforcement learning for different numbers of processes in multi-process asynchronous training is shown in FIG. 5: the total time gradually decreases as the number of sub-processes increases. A 3D plot of vehicle handling stability in different vehicle states is shown in FIG. 6, where the safety value represents the handling stability of the vehicle, larger values being better.
In step S4, integrated vehicle control in the high-speed complex scenario is performed as follows:
Step S41: the vehicle-control problem in the high-speed scenario is treated as a Markov decision process described by a tuple consisting of the state space, the action space of the autonomous vehicle, the environment state-transition probability, the immediate reward of the vehicle's action, and the discount factor; the goal of the vehicle is to maximize its own return through its policy in this Markov decision process. The high-speed driving environment of the vehicle is shown in FIG. 7 and is divided, by refining the task, into the lane-keeping task shown in FIG. 8 and the lane-change overtaking task shown in FIG. 9; the width of each lane is 3.2 m;
Step S42: several sub-rewards are designed, and a linear combination of these sub-rewards is computed to guide the vehicle toward integrated control; the sub-rewards comprise a road-progress reward, a speed reward, a heading-angle reward, a road-center-deviation reward, a distance-keeping reward, a vehicle torque-distribution reward, a steering-wheel-angle reward, and the vehicle handling-stability reward, and the total reward is their linear combination;
Step S43: as shown in FIG. 10, a neural network in which the Actor and Critic share parameters is created, so that a single forward pass of the network outputs both the integrated control policy and the value function used to evaluate that policy; feature extraction of the state information is performed in the hidden layers, and the loss function is expressed in terms of the policy output, the value output, the network parameters, and a weighting hyperparameter. FIG. 11 shows the convergence of the algorithm during training: as the number of training iterations increases, the return (i.e., the reward) obtained by the vehicle gradually increases and converges, while the loss function gradually decreases and converges;
Step S44: the parameters of the neural network are updated using the gradient descent algorithm with a learning-rate hyperparameter;
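A sketch of the shared-parameter Actor-Critic network of step S43 and the gradient-descent update of step S44, assuming a Gaussian policy head, a combined policy/value/entropy loss, and plain SGD; the layer sizes, coefficients, dimensions, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedActorCritic(nn.Module):
    """Step S43: Actor and Critic share a feature-extraction trunk, so one forward
    pass returns both the integrated control policy and the state value."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU())
        self.policy_mean = nn.Linear(128, action_dim)     # integrated control actions
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.value_head = nn.Linear(128, 1)               # value used to evaluate the policy

    def forward(self, state):
        h = self.trunk(state)
        dist = torch.distributions.Normal(self.policy_mean(h), self.log_std.exp())
        return dist, self.value_head(h).squeeze(-1)

def update(net, opt, states, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Steps S43-S44 (assumed form): combined policy/value/entropy loss minimised
    by one gradient-descent step; the learning rate lives inside the optimiser."""
    dist, values = net(states)
    advantages = (returns - values).detach()
    policy_loss = -(dist.log_prob(actions).sum(-1) * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy = dist.entropy().sum(-1).mean()
    loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# usage sketch: dimensions and learning rate are assumptions
net = SharedActorCritic(state_dim=24, action_dim=3)
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
```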
as shown in FIG. 12, a state change curve of the vehicle in the lane keeping task at a speed of 30m/s is provided for the vehicle in the lane keeping task, wherein a reward is given for comparison with a steering stability (I.e. steady rewards in the figure) vehicle performance, especially with the same training withoutVehicle performance behavior at that time; as can be seen from fig. 12, in order to counter the off-tracking characteristics of the vehicle, a vehicle is added withThe vehicle performs better, both in reducing lane centerline deviations and in maintaining a low yaw rate.
As shown in FIG. 13, which gives the state-change curves of the vehicle in the lane-change overtaking task, the vehicle with the stability reward performs better at this task, whereas without it the vehicle is more easily thrown off the track; with the stability reward the slip ratio is better maintained, the tires are less prone to slipping, and the acceleration changes more gently, while without it the slip ratio is maintained worse, the tires slip more easily, and the acceleration changes sharply, which is unfavorable for maintaining vehicle handling stability.

Claims (5)

1. A deep reinforcement learning method for reducing vehicle hyperparameter tuning, the method comprising the following steps:
Step S1, acquiring the expert policy required for training;
Step S2, obtaining the vehicle handling-stability reward using deep inverse reinforcement learning;
Step S3, accelerating deep inverse reinforcement learning training using a multi-process asynchronous method;
Step S4, integrated vehicle control in the high-speed complex scenario.
2. The deep reinforcement learning method for reducing vehicle hyperparameter tuning according to claim 1, wherein the original expert data in step S1 are vehicle handling-stability evaluation values for different vehicle slip ratios under different front-wheel steering angles, longitudinal vehicle speeds, and road friction coefficients; the larger the evaluation value, the better the handling stability of the vehicle; each fixed combination of front-wheel steering angle, longitudinal vehicle speed, and road friction coefficient constitutes one operating condition, and each operating condition corresponds to one set of evaluation values; in use, the original expert data need to be converted into a form described by a probability distribution function, which requires the following processing:
Step S11: to prevent excessively large evaluation values from causing numerical overflow, the minimum evaluation value under the current operating condition is subtracted from every evaluation value of that condition;
Step S12: using the softmax method, the shifted evaluation values of each operating condition are converted into a probability distribution over slip ratios whose probabilities sum to 1;
Step S13: a parameterized normal distribution is used to approximate this probability distribution, i.e., the distribution is assumed to be normal with a mean and a variance to be determined, and the approximation is obtained by minimizing the KL divergence between the two distributions;
Step S14: the mean and variance are computed with the gradient descent algorithm, which iteratively updates them toward the optimal solution at a given learning rate;
Step S15: when the peak probability of the fitted normal distribution deviates from that of the original probability distribution, i.e., when there is an error in the optimal slip ratio, the mean is corrected to the optimal slip ratio of the original probability distribution while the variance is kept unchanged; the corrected probability distribution is the final expert policy.
3. The deep reinforcement learning method for reducing vehicle hyperparameter tuning according to claim 1, wherein the specific process of obtaining the vehicle handling-stability reward using deep inverse reinforcement learning in step S2 is as follows:
Step S21: compute the feature expectations of the expert policy and of the learned policy; for a set of expert policy data, each expert trajectory is a sequence of vehicle states of given length, and the feature expectation of the expert policy is the discounted sum of state features along each trajectory averaged over all expert trajectories, where the discount coefficient weights later states and the state feature equals 1 when the vehicle state matches a specific state and 0 otherwise; the feature expectation of the learned policy, i.e., of the vehicle during deep reinforcement learning training, is computed in the same way;
Step S22: obtain the vehicle handling-stability reward using deep inverse reinforcement learning; the handling-stability reward function is approximated by a neural network whose input is the vehicle state, and solving for the reward parameters is converted into an optimization problem; the probability of the reward parameters is treated as a maximum a posteriori estimation problem, i.e., as the combination of the likelihood of the expert policy data and a prior over the parameters; after simplification, the objective contains a regularization function of the reward parameters, and the expert and learned feature expectations are both computed within the same time range, under which assumption the objective simplifies further;
Step S23: the reward parameters are solved by the gradient descent algorithm, and by setting an appropriate learning rate the optimal handling-stability reward is obtained.
4. The deep reinforcement learning method for reducing vehicle hyperparameter tuning according to claim 1, wherein the specific training process of accelerating deep inverse reinforcement learning using the multi-process asynchronous method in step S3 is as follows:
Step S31: construct the multi-process asynchronous training learning environment; multi-process asynchronous training makes full use of the multi-core CPU of a single computer and distributes the training task over several CPU cores, which increases training speed and shortens training time; the learning environment is divided into two parts, a main process and several sub-processes: each sub-process holds two neural networks, an actor_l policy network and a critic_l value network, while the single main process holds two neural networks, an actor_g policy network and a critic_g value network; these neural networks have the same structure but are updated in different ways, the state being the vehicle state and the action being the vehicle slip ratio;
Step S32: initialize the network parameters of actor_g and critic_g and assign them to actor_l and critic_l respectively; set the number of update steps of the sub-processes, the maximum number of training rounds, and the maximum capacity of the experience pool, and initialize the action step counter and the training round counter;
Step S33: in each sub-process, an action is selected for the current vehicle state under the current policy, the vehicle state at the next moment is computed, and the reward value of the vehicle is then computed from that state; the transition data are placed into the experience pool, each new record being stored at the first position while the existing records are shifted backwards in turn; the done flag indicates whether the training episode has ended, taking the value 1 if it has ended and 0 otherwise; within the sub-process, the step counter is incremented after each interaction and the round counter is incremented when an episode ends;
Step S34: in each sub-process, after every fixed number of update steps, data are randomly drawn from the experience pool to compute the actor_l and critic_l network loss functions, where the actor_l loss combines a clipped surrogate term, bounded by the clipping coefficient through a clipping function, with the entropy of the policy, and the critic_l loss uses discounted returns formed with the discount factor;
Step S35: whenever the actor_l and critic_l losses have been computed, the gradients of the two networks are computed and used to update the parameters of actor_g and critic_g with the respective learning rates of the gradient descent algorithm; each sub-process can independently update the parameters of the two networks in the main process, thereby realizing asynchronous updating; every time a sub-process updates the main-process parameters, its update-step counter is reset to 0 and the network parameters of the main process are assigned back to the networks of the sub-process that provided the gradient information;
Step S36: when the maximum number of training rounds is reached, the training process ends and the resulting policy is the optimal policy; this policy is then returned to step S22 for the parameter-update process of deep inverse reinforcement learning, thereby realizing the acceleration of deep inverse reinforcement learning training by the multi-process asynchronous method.
5. The deep reinforcement learning method for reducing vehicle hyperparameter tuning according to claim 1, wherein the integrated vehicle decision in the high-speed complex scenario in step S4 is implemented as follows:
Step S41: the vehicle-control problem in the high-speed scenario is treated as a Markov decision process described by a tuple consisting of the state space, the action space of the autonomous vehicle, the environment state-transition probability, the immediate reward of the vehicle's action, and the discount factor; the goal of the vehicle is to maximize its own return through its policy in this Markov decision process;
Step S42: several sub-rewards are designed, and a linear combination of these sub-rewards is computed to guide the vehicle toward integrated control; the sub-rewards comprise a road-progress reward, a speed reward, a heading-angle reward, a road-center-deviation reward, a distance-keeping reward, a vehicle torque-distribution reward, a steering-wheel-angle reward, and the vehicle handling-stability reward, and the total reward is their linear combination;
Step S43: a neural network in which the Actor and Critic share parameters is created, so that a single forward pass of the network outputs both the integrated control policy and the value function used to evaluate that policy; the loss function is expressed in terms of the policy output, the value output, the network parameters, and a weighting hyperparameter;
Step S44: the parameters of the neural network are updated using the gradient descent algorithm with a learning-rate hyperparameter.


