CN115071758A

CN115071758A - Man-machine common driving control right switching method based on reinforcement learning

Info

Publication number: CN115071758A
Application number: CN202210758672.3A
Authority: CN
Inventors: 陈慧勤; 朱嘉祺
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2022-09-20
Anticipated expiration: 2042-06-29
Also published as: CN115071758B

Abstract

The application discloses a reinforcement learning-based man-machine driving sharing control right switching method, which is suitable for distribution of a reinforcement learning-based man-machine driving sharing control right switching system to driving weights between a driver and a driving system, and comprises the following steps: calculating a driving operation action prediction index according to the driver information and the vehicle road prediction information; and inputting the driving operation action prediction index and the comprehensive driving operation action index into the control weight switching system, and calculating the driving weight between the driver and the driving system. Through the technical scheme in the application, the risk of longitudinal and transverse synthesis of the vehicle is effectively solved, the influence of uncertainty caused by a driver is weakened, and the driver is comprehensively considered at different angles, so that the judgment error of the driver is reduced.

Description

Man-machine common driving control right switching method based on reinforcement learning

Technical Field

The application relates to the technical field of intelligent driving, in particular to a man-machine driving sharing control right switching method based on reinforcement learning.

Background

In the conventional automatic driving technology, a control right switching mode is generally adopted to correct the driving behavior of a driver so as to improve the driving safety of a vehicle.

For example, in patent CN 109795486 a, the common driving coefficient (range is 0-1) is dynamically adjusted according to the input torque Td of the driver and the time TLC from the left and right wheels to the lane boundary, so as to realize gradual transition from the driver to the auxiliary control system, and the common driving coefficient at this time is determined through fuzzy control. However, this approach, while addressing the risk of lateral deviation from driving, does not take into account the longitudinal risks during driving.

For another example, patent CN 108469806 a performs key factor construction on the current driving environment, the state of the vehicle and the driver, performs situational assessment on the key factors, and synchronously assesses the driving abilities of the automatic driving system and the driver, so as to determine whether the driving right transfer can be performed. Although a plurality of factors which may affect driving safety are considered in the scheme, the evaluation mode of the driving capacity in the driving right switching process is too complex, subjective and random factors are large, the considered data are too much, and the instantaneity and the stability are poor.

And quantifying the environmental risk as in a thesis 'human-computer co-driving model based on a driver risk response mechanism', obtaining a safety risk response strategy by fitting the environmental risk action and the driving acceleration of the driver, and flexibly switching the human-computer co-driving control right through strategy deviation. The safety control method solves the coupling problem of the state of a driver and the environmental safety, but the safety strategy is established on a large number of driving segments which cannot completely summarize all safety operations and only solves the switching problem when the driver overtakes the following vehicles on the highway. Meanwhile, the control right switching mode only considers the safety problem at the current moment and does not consider traffic hazards possibly caused in the future time period.

Therefore, the safety and stability of the control right switching scheme in the existing automatic driving need to be improved.

Disclosure of Invention

The purpose of this application lies in: how to effectively solve the risk of vehicle vertically and transversely synthesizing reduces the judgement error to the driver in order to improve the accuracy and the security of driving right switching.

The technical scheme of the application is as follows: the utility model provides a reinforcement learning-based man-machine common driving control right switching method, which comprises the following steps: the method is suitable for the distribution of the man-machine common driving control right switching system based on reinforcement learning to the driving weight between the driver and the driving system, and comprises the following steps: calculating a driving operation action prediction index according to the driver information and the vehicle road prediction information; and inputting the driving operation action prediction index and the comprehensive driving operation action index into a control right switching system, and calculating the driving weight between the driver and the driving system.

In any of the above technical solutions, further, the driver information at least includes a driver state, a driver intention, a driver style, and a driver subconscious driving influence deviation, the vehicle path prediction information at least includes a predicted vehicle path risk degree and a predicted vehicle path risk threshold,

the calculation formula of the driving operation action prediction index is as follows:

in the formula (I), the compound is shown in the specification,

for predicting an index of driving performance, Z _t Delay of the driver 'S state response, σ is the driver' S subconscious driving influence deviation, δ is the driver 'S intention, S is the driver' S style, v _risk To predict the degree of risk of the vehicle road, A _arisk To predict a threshold vehicle route hazard.

In any of the above technical solutions, further, the calculation formula of the driver subconscious driving influence deviation σ is:

R _d ＝|d-q _ki |

wherein sigma is the driver subconscious driving influence deviation, sum is the collected traffic scene number, and D _i For a series of subconscious driving strengths in a traffic scene time period, rho', tau and omega are undetermined parameters, alpha is subconscious side weight, beta is personal safety tendency weight of a driver, d is the current transverse position of a vehicle, and q is the current transverse position of the vehicle _ki Is the fitted lateral position of the vehicle under this scenario (label), a is the vehicle acceleration, R _d Is a location parameter.

In any one of the above technical solutions, further, the driver information at least includes a driver state, a driver intention, and a driver style, and the calculation process of the comprehensive driving operation action index specifically includes:

determining current vehicle path information according to the position of a current vehicle in a road, wherein the current vehicle path information at least comprises a current vehicle path danger degree and a current vehicle path danger threshold;

determining a comprehensive driving operation action index by combining an environmental response factor and a piecewise function according to the driver information and the current vehicle path information, wherein the calculation formula of the comprehensive driving operation action index is as follows:

in the formula (I), the compound is shown in the specification,

for the comprehensive driving operation action index, z ₁ For driver state, gamma is the environmental response factor, H _x,y For the current vehicle road risk, σ is a road correction parameter, a _pre For real-time operation of the quantized parameters, risk is the current vehicle route risk threshold.

In any one of the above technical solutions, further, determining the current vehicle path information according to the position of the current vehicle on the road specifically includes:

determining the position of the current vehicle in the road, wherein the position at least comprises the distance between the current vehicle and the front vehicle and the transverse position of the current vehicle;

determining a longitudinal vehicle road danger value according to the distance between the current vehicle and the front vehicle;

determining a transverse vehicle road danger value according to the transverse position of the current vehicle;

calculating the current vehicle road danger degree according to the longitudinal vehicle road danger value and the transverse vehicle road danger value, wherein the corresponding calculation formula is as follows:

in the formula, H _x,y The current degree of danger of the vehicle road is,

the risk distance influence factors of different road sections have the value range of [1,10 ]]，y ₁ As longitudinal road hazard value, y ₂ Is a lateral vehicle road danger value;

and calculating current vehicle road danger thresholds of different scenes according to the current vehicle road danger degree, and recording the current vehicle road danger threshold and the current vehicle road danger degree as current vehicle road information.

In any of the above technical solutions, further, the calculation formula of the environmental response factor γ is:

wherein M is the vehicle mass, M is the vehicle type and purpose correction parameter, k ₁ In order to correct the parameters for the dynamics,

representing the desired speed and direction of speed, v, of the vehicle _limleast (t) is the minimum velocity value, k ₂ The parameters are corrected for the traffic scene,

as a vehicle interaction force parameter, k ₃ A correction parameter for the degree to which the pedestrian complies with the traffic regulations,

is a pedestrian interaction force parameter, k ₄ The parameters are corrected for the complexity of the surrounding physical environment,

as an environmental interaction force parameter, k ₅ A correction parameter for the degree of influence of the traffic regulations,

are rule parameters.

In any one of the above technical solutions, further, calculating a driving weight between the driver and the driving system specifically includes: step 9.1, using the Z-score standardized formula, predicting the index of the driving operation action at the current moment

And comprehensive driving operation action index

Normalizing, and calculating the prediction index of the driving operation from the beginning to the current driving operation during the driving

And comprehensive driving operation action index

Mean and standard deviation of; step 9.2, driving operation action prediction index after Z-Score standardization

And comprehensive driving operation action index

Inputting the current corresponding mean value and standard deviation as input parameters into a human-computer co-driving control right switching system based on reinforcement learning to judge whether weight distribution conditions are met, if so, executing the step 9.3, and if not, acquiring driver information and vehicle path prediction information again; step 9.3, based on the Q learning algorithm, adjusting the learning state in the Q learning algorithm by using the input parameters, and performing the driving weight of the driver according to the action in the value maximum value of the next state in the Q learning algorithmAnd assigning, wherein the driving weight of the driving system is the difference between 1 and the driving weight of the driver.

In any one of the above technical solutions, further, the weight assignment condition specifically includes: 5 times in succession of the first parameter

And a second parameter

Are both less than or equal to a first trigger threshold; or, the second parameter is continued for 3 times

Less than or equal to a second trigger threshold; or, the first parameter is continued for 3 times

Less than or equal to a second trigger threshold, wherein the first parameter

Predicting index for currently inputted driving operation action

And a driving operation action prediction index corresponding to all the inputs from the driving behavior to the current time

By the number of standard deviations, second parameter

For the currently inputted comprehensive driving operation action index

And the comprehensive driving operation action index of all the input from the driving behavior to the current time

By the number of standard deviations.

The beneficial effect of this application is:

according to the technical scheme, the risk of longitudinal and transverse integration of the vehicle is effectively solved, the influence of uncertainty caused by a driver is weakened, the driver is comprehensively considered from different angles, so that the judgment error of the driver is reduced, the method is suitable for multiple traffic scenes, traffic dangers possibly caused in future time periods are comprehensively considered, the accuracy and the safety of driving right switching are further improved, finally, all factors are integrated into two index input switching systems, the data volume is small and accurate, and the real-time performance is higher.

In the preferred implementation mode of the application, the influence of experience and subconscious of a driver on driving is considered, the judgment burden of a switching system is reduced, and the real-time performance is better. And the risk that other vehicles may cause the vehicle can be predicted in advance, and the rear-end collision and collision in the driving process are avoided.

Drawings

The advantages of the above and/or additional aspects of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

fig. 1 is a schematic flow chart of a reinforcement learning-based human-machine co-driving control right switching method according to an embodiment of the present application;

FIG. 2 is a diagram of relative positions of roads and relative safe positions according to one embodiment of the present application;

FIG. 3 is a schematic diagram of a model-free reinforcement learning process according to an embodiment of the present application;

fig. 4 is a schematic diagram illustrating an overall structure of a reinforcement learning-based human-machine co-driving control right switching mechanism according to an embodiment of the present application;

FIG. 5 is a diagram illustrating Q-tables in a Q-learning algorithm in reinforcement learning according to an embodiment of the present application.

Detailed Description

In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.

As shown in fig. 1, the present embodiment provides a method for switching a driving control right of a human-computer based on reinforcement learning, including:

step 1, constructing a simulator based on a real vehicle road environment, and constructing vehicle road scenes of various situations in the simulator;

further, step 1 is realized by:

step 1.1, simulator hardware needs to have a camera for acquiring a human image of a driver and a driving operation environment for simulating a real vehicle;

step 1.2, constructing a large number of typical traffic environments which can be met in the real world, wherein the typical traffic environments comprise a full-type road section car following scene, a full-type road section overtaking scene, a road intersection scene, a congestion road section scene and the like;

step 1.3, inserting a certain number of dangerous traffic scenes and accident simulation scenes in the construction of different typical traffic environment scenes.

Step 2, continuously collecting vehicle information of surrounding roads, current driver state, driver intention, driver style and action information, control weight distribution information and relevant information of an automatic driving system, and calculating to obtain subconscious driving influence deviation of the driver;

further, step 2 may include the following processes:

step 2.1, the driver needs to complete a complete driving process in different scenes;

step 2.2, under the condition of no intervention of a control right switching system, a driver needs to drive normally in a certain amount of different driving scenes, collect and record the driving operation and road conditions of the driver in the driving process, obtain the style of the driver through statistical analysis, and calculate the subconscious driving influence deviation of the driver (the subconscious driving operation influence of the driver is described by the acceleration and deceleration caused by the experience accumulated by the driver in different driving scenes and the change of the transverse position of the road):

wherein, the calculation formula of the driver subconscious driving influence deviation is as follows:

R _d ＝|d-q _ki |

Specifically, the driver subconscious driving influence deviation does not consider the influence of other traffic participants, and only considers from the perspective of personal safety. Based on the maximum entropy principle, a maximum entropy method related to the subconscious of the driver is established.

Firstly, an entropy function is constructed:

where H (x) is entropy, a measure of uncertainty in the thing; p is a radical of _k Is a probability distribution; c is a constant, depending on the measure of entropy, here taken to be 1.

At the entropy functionIn particular, it is desirable that the driver's subconscious driving influence deviation, i.e. how subconscious the environment is currently in, has an influence on the behavior, but due to the probability distribution p _k Is a decimal fraction of 0 to 1, such that log ₂ p _k Is a negative number, so a non-negative integer q is introduced in the embodiment _i Substituting probability distribution p in entropy function _k 。

Defining a parameter q _i The method is characterized in that relative safety positions of different road scenes are provided, wherein the relative positions are shown in fig. 2, a transverse coordinate axis is established by taking the left side of a road as an origin, the half width of a single lane of the road is taken as a driving position, the road is divided into eight areas, and the relative safety positions are that more than half of vehicles are located at the position when the vehicles normally drive.

Roads in different scenes have large difference, and specific positions which can be used for completely representing road paths cannot be obtained accurately, so that a mode that ln replaces log with base 2 is adopted, and q is used as a base _i The main value is an integer greater than one, so the negative sign of the original entropy function needs to be removed, and the difference can be expressed by the following correction entropy:

secondly, establishing a constraint condition of the correction entropy: first, the road conditions are constrained, and each driver will choose the side with good road conditions. Second, under the constraints of traffic regulations, drivers tend to drive more as specified by traffic regulations. Thirdly, the constraint of traffic demand, namely whether the driver needs to overtake, follow or go straight in the road scene, is as follows:

constraint 1:

constraint 2:

constraint 3: b (q) _i )∈S

In the formula, A _min 、A _max The lower limit and the upper limit of the road traffic capacity score are set;

interference coefficients for unfamiliar degrees of different road sections; b is a traffic demand impact weight; b is the maximum boundary of the traffic rule; b (q) _i ) For the determination of traffic demands, i.e. knowing whether a traffic demand is overtaking, following or going straight, q is estimated by the demand _i (ii) a And S is a traffic demand set, and all normal driving behavior position results are in the set.

Three constraint conditions and different road scenes are set, the correction entropy is used for calculation, and the relative safe position q of the different road scenes is obtained when the value of the correction entropy E is maximum _i For a relative safety position q _i Clustering is performed, labels are marked for each type (such as overtaking, following and going straight), and then the relative safety position q is determined _i Fitting to obtain a fitted transverse position q _ki This position is the safe position that this driver is most inclined to walk under different labels.

In summary, fitting the lateral position q _ki For correcting the relative safe position q when the entropy E value is maximum under the constraint condition _i 。

D _i The calculation formula of a series of subconscious driving strengths in one time period of the traffic scene is as follows:

R _d ＝|d-q _ki |

where d is the current lateral position of the vehicle and q is _ki The fitting transverse position of the vehicle in the scene (label) is shown as a, the acceleration of the vehicle is shown as a, the subconscious side weight is shown as alpha, the personal safety tendency weight of a driver is shown as beta, rho' is shown as a undetermined parameter, the value of the undetermined parameter meets the change trend of subconscious driving strength in different traffic scenes, and the change trend is as follows:

when R is _d More than or equal to Z (Z is a safety value and a set value, and the values of different roads are differentSample), subconscious Driving Strength D _i When the value of (a) is larger, the values of the undetermined parameters rho', tau and omega follow R _d And | a | increases, i.e., becomes more and more unsafe, at which time the greater the strength of the subconsciously driven operational action;

when R is _d <Z、D _i When the value of (a) is small, the values of the undetermined parameters rho', tau and omega follow R _d And | a | decreases, i.e., becomes more and more safe, at which time the strength of the subconsciously driven operational action is less.

sum is the number of collected traffic scenes, and the result sigma of the averaging of the intensity is the subconscious driving influence deviation of the driver.

Step 2.3, simulating conditions possibly encountered in the real driving process by a driver, such as dangerous states of fatigue, emotional excitement, distraction and the like, and normal driving;

and 2.4, collecting data to obtain the speed, distance and road surface information of surrounding vehicles, the brake, accelerator and steering wheel data of the own vehicle, the driving weight distribution and the intention and operation data of a driver in a driving system, and obtaining the state and intention information of the driver in a data statistical processing mode.

Step 3, obtaining interaction force and environmental response factor gamma with each peripheral unit according to the collected peripheral road information and vehicle state;

further, step 3 is realized by:

and 3.1, the environmental response factor gamma is an interaction force under the influence of the interaction of the vehicle and the vehicle road environment, and particularly responds to different units. The environmental response factor γ was calculated using the following formula:

the formula:

v _limleast (t) is the minimum speed value of the speed limit and the vehicle speed in the current time period scene;

m is the mass of the vehicle;

m is a vehicle type and a target correction parameter;

representing the desired speed and direction of speed of the vehicle,

it is derived from Newton's second law and kinematic formula.

k ₁ Correcting parameters for dynamics;

k ₂ parameters are corrected for traffic scenarios, such as highway segments, congested road segments, etc.,

is an interaction force with other vehicles, wherein the vehicle interaction force parameter

Comprises the following steps:

θ _1l is the angle between the direction of travel of the vehicle and the direction of travel of other vehicles, Δ v _1l /Δμ _1l The expression that u is the safe distance and ρ is the distance to other vehicles in the ratio of the speed difference to the distance difference indicates that a distance greater than the safe distance represents an attractive force, the attractive force is smaller as the distance is closer to the safe distance, and the attractive force is converted into a repulsive force when the distance is smaller than the safe distance, and the repulsive force is larger as the distance is closer to other vehicles. The vehicles in the transverse parallel position and the parallel advancing direction do not have interaction force, and the absolute value of the interaction force of the longitudinal same laneAnd max.

k ₃ A correction parameter for the degree to which the pedestrian complies with the traffic regulations,

is an interaction force with a pedestrian, wherein the pedestrian interaction force parameter

Comprises the following steps:

v is the current speed of the vehicle, θ _1j The angle between the center of the vehicle head and the pedestrian r _1j Is the distance difference, t _1j The formula shows that when the vehicle speed is 0, no interaction force exists between the vehicle and the pedestrian, the closer the vehicle and the pedestrian are, the smaller the angle difference is, the shorter the estimated meeting time is, and the higher the vehicle speed is, the larger the repulsive force is caused.

k ₄ The parameters are corrected for the complexity of the surrounding physical environment,

is an interaction force with a surrounding physical environment such as a non-moving object like a building, wherein the environmental interaction force parameter

Comprises the following steps:

t is the volume of the non-moving object, the larger the volume is, the larger the repulsive force is, when the volume is smaller than or equal to the passable size of the vehicle, the interaction force is attractive force, when the volume is larger than the passable size of the vehicle, when the collision time T is _1R The smaller the repulsion force, the greater the repulsion force when the vehicle mass is greater, and the greater the vehicle speed, the greater the exclusion forceLarge, at a speed of 0, there is no interaction force.

k ₅ In order to reflect the attention degree of the vehicle to the traffic regulation as the correction parameter of the influence degree of the traffic regulation,

acting as a resistance to traffic regulations, wherein the regulation parameters

Comprises the following steps:

v _lim the maximum speed is limited for the traffic regulations and the traffic signs, the lower the limited speed is, the larger the resistance is, and when the traffic regulations and the traffic signs require parking, the resistance is infinite under the red light condition.

Step 4, carrying out normalization preprocessing according to the collected current action information, namely the brake, the accelerator and the steering wheel, so as to obtain a real-time operation quantization parameter a _pre ；

Further, step 4 is implemented by:

step 4.1, extracting brake force, accelerator force and steering wheel angle data through a sensor;

and 4.2, normalizing the three data by using min-max standardization:

braking:

accelerator:

steering wheel corner:

the value is the current value, min is the minimum value, and max is the maximum value;

the operation specification can know that the accelerator and the brake are mutually exclusive operations, so the normalization results are combined as follows:

longitudinal operation interval: [ -1:1 ];

the transverse operation interval: [ -1:1 ];

step 4.3, for the longitudinal operation interval and the transverse operation interval: -1:1, constructing bijections from-1: 1, -1:1 to-1:

longitudinal value of (0. a) ₁ a ₂ a ₃ a ₄ …) and has a transverse value of (0. b) ₁ b ₂ b ₃ b ₄ …), constructing a crossover method, segmenting the two types of decimal, segmenting after all non-0 digits, and performing crossover recombination on the segmented segments to obtain a one-dimensional real-time operation quantization parameter a _pre 。

And 5, determining current vehicle path information according to the position of the current vehicle on the road, wherein the current vehicle path information at least comprises the current vehicle path danger degree and the current vehicle path danger threshold.

Further, step 5 is implemented by:

and 5.1, determining the position of the current vehicle in the road, wherein the position at least comprises the distance between the current vehicle and the front vehicle and the transverse position of the current vehicle, and determining a longitudinal vehicle road danger value according to the distance between the current vehicle and the front vehicle.

The risk of the longitudinal position is inversely proportional to the distance from the tail of the front vehicle, namely the closer the distance from the tail of the front vehicle, the greater the risk, the longitudinal vehicle road danger function is established, the tail of the front vehicle is used as the original point to set a coordinate axis, and the specified normal safe distance is zeta ₁ . Setting a minimum safety distance eta ₁ ，η ₁ Is set to the maximum deceleration braking to the just-no-collision distance of the preceding vehicle.

y ₁ Is a longitudinal vehicle road risk value,

x ₁ is the distance from the front vehicle;

and 5.2, determining a transverse vehicle road danger value according to the transverse position of the current vehicle:

and (3) establishing a transverse vehicle road danger function by taking the vehicle head central point as an original point:

y ₂ ＝0.5cos[(π/T)x ₂ ]-0.5,－T≤x ₂ ≤T

y ₂ is a value of the risk of the lateral vehicle,

x ₂ is the current lateral position;

t is the distance from the center line of the lane to the sideline;

step 5.3, calculating to obtain the current vehicle road danger degree H _x,y ：

The risk distance influence factors of different road sections have the value range of [1,10 ]]When the value is 1, the current road section and the driving state are standard traffic road sections and driving environments under the regulation of the intersection standard. When the value is 10, the conditions that the current driving environment is severe, the road traffic capacity is extremely poor and rear-end accidents happen frequently around the road, such as a heavy fog and frozen road section, are indicated.

Step 5.4, calculating the current vehicle road danger threshold values of different scenes:

risk＝ωγH _x,y

omega is a scene impact parameter. Environmental response factor gamma

Step 6, quantizing the parameters a according to the collected driver state, the driver intention and the real-time operation _pre Determining and obtaining self synthesis by combining the environmental response factor gamma, the current vehicle road danger degree and the current vehicle road danger threshold value in a piecewise function modeIndex of driving operation action

The corresponding calculation formula is:

z ₁ the driver states are different, the environmental response degrees are different,

gamma is an environmental response factor and is a specific factor,

delta is the intention of the driver, representing the degree to which the current operation coincides with the recognized intention of the driver,

H _x,y the current degree of danger of the vehicle road is,

sigma is a road correction parameter, and the road correction parameter,

a _pre in order to operate the quantization parameter in real-time,

risk is the current roadway hazard threshold.

Step 7, obtaining a predicted vehicle road risk degree and a predicted vehicle road risk threshold according to the interaction force growth rate;

specifically, the interaction force in step 3 is inversely proportional to the distance, and the faster the interaction force increases and the more danger is likely to occur, so the following derivation formula can be derived:

A _arisk ＝ρa _risk

in the formula, v _f Is the rate of increase of single unit interaction force, v _risk To predict the degree of risk of the vehicle road, a _f Acceleration for single unit interaction force increase, a _risk For the sum of the acceleration increases for all peripheral units of interaction force, A _arisk In order to predict the threshold value of the vehicle road danger, rho is a vehicle road danger influence factor and is determined by the complexity of the current road, and the value range is [0,1]]。

Step 8, calculating a driving operation action prediction index according to the driver information and the vehicle path prediction information

The driver information at least comprises a driver state, a driver intention, a driver style and a driver subconscious driving influence deviation, the vehicle path prediction information at least comprises a predicted vehicle path danger degree and a predicted vehicle path danger threshold, and a calculation formula of the driving operation action prediction index

Comprises the following steps:

in the formula, sigma is the subconscious driving influence deviation of the driver, and the subconscious driving influence deviation of the historical driver under the most similar scene is obtained by comparing traffic scenes;

s is the style of the driver, namely the quantitative evaluation of the style of the driver of [0,10] is obtained through the style test of the driver, and the style of the driver is less than 1, so that the style of the driver is extremely unsuitable, and the time delay of the intention of the driver and the operation reaction of the driving state is influenced.

Z _t The state operation reaction of the driver is delayed, the delay is a set value, and the larger the delay is, the smaller the prediction index of the driving operation action is;

delta is the intention of the driver, and different driver intention paths have larger influence on the driver subconscious driving influence deviation.

Step 9, predicting the driving operation action index

And comprehensive driving operation action index

And inputting the driving weight into a man-machine driving sharing control weight switching system based on reinforcement learning, and calculating and adjusting driving weight weights respectively needed by a driver and a driving system.

Specifically, as shown in fig. 3 and 4, the driving operation action prediction index

Various risk factors representing the future time period influence the operation safety degree, and the influence of other units after the operation of the vehicle is emphasized. Integrated driving operation action index

Representing that each risk factor influences the operation safety degree at the current time, and emphasizing whether the current position is safe or not and whether the current state can be effectively driven or not;

step 9.1, using the Z-score standardized formula, predicting the index of the driving operation action at the current moment

And comprehensive driving operation action index

And comprehensive driving operation action index

Mean and standard deviation of.

Step 9.2, driving operation action prediction index after Z-Score standardization

And comprehensive driving operation action index

And inputting the current corresponding mean value and standard deviation as input parameters into a human-computer co-driving control right switching system based on reinforcement learning to judge whether weight distribution conditions are met, if so, executing the step 9.3, and if not, re-acquiring driver information and vehicle path prediction information.

Specifically, the parameter Z-Score represents the number of the sampled sample values differing from the data mean value by several standard deviations, so as to predict the index of the driving operation action

As an example, the first parameter

Predicting index for currently inputted driving operation action

(sample value sampling) and index of prediction of driving operation behavior for all inputs from driving behavior to current time

Is different by the number of standard deviations that are prediction indexes of all the inputted driving operation actions from the start of the driving behavior to the current time

Standard deviation of (2). Second parameter

Similarly, no further description is given.

In this embodiment, the weight distribution conditions in the control weight switching system include three types:

(1) 5 times in succession of the first parameter

And a second parameter

Are both less than or equal to a first trigger threshold;

specifically, the first parameter is judged

And a second parameter

Whether the first trigger threshold values are all smaller than or equal to the first trigger threshold value, wherein the value of the first trigger threshold value can be-3, namely when the input driving operation action prediction index is input

And comprehensive driving operation action index

Whether all are less than or equal to 3 standard deviations from the current corresponding mean, the corresponding formula is:

and is

Such a situation indicates that the current state does not satisfy the safe state and that the safe state is not satisfied in the future, and therefore, the control right switching system is triggered to start operating. Under the condition, the index of five inputs in succession is (

And

) All satisfy

And is

In time, the driving weights respectively required by the driver and the driving system need to be adjusted.

(2) Continuous 3 times of the second parameter

Less than or equal to a second trigger threshold;

specifically, the value of the second trigger threshold may be-4. When the inputted comprehensive driving operation action index

The current mean value is less than or equal to 4 standard deviations, i.e. the second parameter

When the current state does not meet the safety state, the driving system needs to be intervened emergently, and the control right switching system is triggered to start working. Under the condition, when the indexes are input three times in succession

Satisfies the second parameter

And adjusting the driving weight respectively required by the driver and the driving system.

(3) 3 times in succession of the first parameter

Less than or equal to a second trigger threshold;

specifically, when the driving operation is inputtedIndex of action prediction

The current mean is less than or equal to 4 standard deviations, i.e. the first parameter

And when the future state does not meet the safe state and the state is safe after the intervention of a driver cannot be corrected by self, triggering the control right switching system to start working. Under this condition, when the index is inputted three times in succession

Satisfies the first parameter

And 9.3, based on the Q learning algorithm, adjusting the learning state in the Q learning algorithm by using the input parameters, and assigning a driving weight of the driver according to the action in the value maximum value of the next state in the Q learning algorithm, wherein the driving weight of the driving system is the difference between 1 and the driving weight of the driver.

In this embodiment, the driving right weight algorithm respectively required for the driver and the driving system in the control right switching system is a Q learning algorithm, and the specific training process is as follows:

(1) the transition rule for Q learning is:

Q(state,action)＝R(state,action)+Gamma*MaxQ(next state,all actions)

that is, Q (state, action) + Gamma max [ Q (next state, all actions) ]

Gamma is a discount factor (discount factor), and the larger the discount factor is, the greater the MaxQ plays a role. Here, the value (R) before the eye, and the value in memory can be understood. MaxQ refers to the value in memory, and it refers to the maximum value of the value in the action of the next state in memory.

(2) A "matrix Q" is added as a learning-intensive agent, i.e. the brain of the driving right switching system, i.e. something learned empirically. The rows of the "matrix Q" represent the current state of the driving right switching system and the columns represent the possible actions of the next state (link between nodes). The driving right switching system is initialized to 0, i.e., the "matrix Q" is initialized to zero. The "matrix Q" can only start with one element. If a new state is found, the "matrix Q" is updated, which is referred to as unsupervised learning.

(3) The driving weight of the driver is power (driver), the driving weight of the driving system is power (system) 1-power (driver), the control right switching system adjusts the driving weight of the driver by using a reinforcement learning Q learning algorithm, the Q learning action is directly assigned to the driving weight of the driver, the value range of the weight value is [0,1], and the step length is 0.05.

(4) The Q learning state is set to 0,1, 2, 3, 4, 5. Wherein:

0 represents

And is

1 represents

2 represents

3 represents

4 represents

5 represents

And is

(5) After triggering the control right switching system to start working, obtaining one of the

initial states

0,1 and 2, when the state is still one of the

states

0,1 and 2 through the action (adjusting the weight) in the step (3), rewarding to be-1, updating a matrix Q, and assigning the element in the matrix corresponding to the state and the use action to be-1;

when the state reaches the

states

3 and 4 through the action in the step (3), the reward is 1, the matrix Q is updated, and the element in the corresponding matrix of the state and the use action is assigned as 1;

when the state reaches the state 5 through the action in (3), the reward is 100, the matrix Q is updated, the element in the corresponding matrix of the state and the use action is assigned as 100, the state 5 is the target state, and finally the Q-table is obtained as shown in fig. 5 (the element is not assigned).

(6) Selecting a road environment, applying (1) to (5), and obtaining a Q-table of an initial "matrix Q" as a selection of MaxQ (next state, all actions) in Q learning Q (state, action) ═ R (state, action) + Gamma (next state, all actions), wherein Gamma is selected according to the road similarity degree [0,1], and R (state, action) is a value of a state obtained by the current road environment: the reward is-1 when the state is 0,1, 2, 1 when the state is 3, 4, and 100 when the state is 5.

When the control right switching system calculates the driving weight, the weight which can be adjusted and the reward obtained by the achieved state are calculated in advance according to the Q-table of the similar road section, and the reward is MaxQ (next state) in the formula. Therefore, Q (state, action) is the sum of the R (state, action) value in the current road environment and MaxQ (next state, all actions).

When Q (state, action) is maximum, the action next state in MaxQ (next state, all actions) is the weight that needs to be adjusted, and is recorded as the driver driving weight power (driver).

(7) And updating the Q-table according to the calculated Q (state) value.

(8) The switching of weights is stopped when state 5 is reached.

The technical scheme of the application is explained in detail by combining the attached drawings, the application provides a reinforcement learning-based man-machine driving control right switching method, the method is suitable for a reinforcement learning-based man-machine driving control right switching system to distribute driving weights between a driver and a driving system, and the method comprises the following steps: calculating a driving operation action prediction index according to the driver information and the vehicle road prediction information; and inputting the driving operation action prediction index and the comprehensive driving operation action index into the control weight switching system, and calculating the driving weight between the driver and the driving system. Through the technical scheme in the application, the risk of longitudinal and transverse integration of the vehicle is effectively solved, the influence of uncertainty caused by a driver is weakened, and the driver is comprehensively considered from different angles, so that the judgment error of the driver is reduced.

The steps in the present application may be sequentially adjusted, combined, and subtracted according to actual requirements.

The units in the device can be merged, divided and deleted according to actual requirements.

Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and is not intended to limit the application of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.

Claims

1. A reinforcement learning-based man-machine driving control right switching method is applicable to distribution of driving weights between a driver and a driving system by a reinforcement learning-based man-machine driving control right switching system, and comprises the following steps:

calculating a driving operation action prediction index according to the driver information and the vehicle road prediction information;

and inputting the driving operation action prediction index and the comprehensive driving operation action index into the control weight switching system, and calculating the driving weight between the driver and the driving system.

2. The reinforcement learning-based human-machine co-driving control right switching method according to claim 1, wherein the driver information at least includes a driver state, a driver intention, a driver style, and a driver subconscious driving influence deviation, the vehicle path prediction information at least includes a predicted vehicle path risk and a predicted vehicle path risk threshold,

in the formula (I), the compound is shown in the specification,

predicting an index, Z, for the driving maneuver _r For a driver state operation response delay, σ is the driver subconscious Driving influence deviation, δ is the driver intent, S is the driver style, v _risk For the prediction of the degree of risk of the vehicle road, A _arisk And the predicted vehicle road danger threshold value is used.

3. The reinforcement learning-based man-machine driving sharing control right switching method according to claim 2, wherein the calculation formula of the driver subconscious driving influence deviation σ is as follows:

R _d ＝|d-q _ki |

4. The reinforcement learning-based human-computer co-driving control right switching method as claimed in claim 1, wherein the driver information at least includes a driver state, a driver intention, and a driver style, and the calculation process of the comprehensive driving operation action index specifically includes:

determining the comprehensive driving operation action index by combining an environmental response factor and adopting a piecewise function mode according to the driver information and the current vehicle path information, wherein the calculation formula of the comprehensive driving operation action index is as follows:

in the formula (I), the compound is shown in the specification,

is the index of the integrated driving maneuver z ₁ Is the driver state, gamma is the environmental response factor, H _x,y For the current vehicle road risk, sigma is a road correction parameter, a _pre And for real-time operation of the quantized parameters, risk is the current vehicle road risk threshold.

5. The reinforcement learning-based man-machine co-driving control right switching method as claimed in claim 4, wherein the determining current vehicle path information according to the current vehicle position in the road specifically comprises:

determining a position of a current vehicle in a road, including at least a distance to a preceding vehicle of the current vehicle and a lateral position of the current vehicle;

calculating the current vehicle road risk degree according to the longitudinal vehicle road risk value and the transverse vehicle road risk value, wherein the corresponding calculation formula is as follows:

in the formula, H _x,y In order to obtain the current vehicle road risk degree,

the value range of the risk distance influence factors of different road sections is [1,10 ]]，y ₁ Is said longitudinal road hazard value, y ₂ The value is the lateral vehicle road danger value;

and calculating current vehicle road danger thresholds of different scenes according to the current vehicle road danger degrees, and recording the current vehicle road danger thresholds and the current vehicle road danger degrees as the current vehicle road information.

6. The reinforcement learning-based man-machine co-driving control right switching method as claimed in claim 4, wherein the environmental response factor γ is calculated by the formula:

wherein M is the vehicle mass, M is the vehicle type andtarget correction parameter, k ₁ In order to correct the parameters for the dynamics,

are rule parameters.

7. The reinforcement learning-based human-computer co-driving control right switching method according to any one of claims 1 to 6, wherein the calculating the driving weight between the driver and the driving system specifically includes:

And comprehensive driving operation action index

Standardizing, and calculating the operation steps from the start to the current operation in the current drivingIndex of measurement

And comprehensive driving operation action index

Mean and standard deviation of (d);

And comprehensive driving operation action index

Inputting the current corresponding mean value and standard deviation as input parameters into a human-computer co-driving control right switching system based on reinforcement learning to judge whether weight distribution conditions are met, if so, executing the step 9.3, and if not, acquiring driver information and vehicle path prediction information again;

8. The reinforcement learning-based human-computer co-driving control weight switching method as claimed in claim 7, wherein the weight distribution condition specifically includes:

5 times in succession of the first parameter