CN116872971A - Automatic driving control decision-making method and system based on man-machine cooperation enhancement - Google Patents

Automatic driving control decision-making method and system based on man-machine cooperation enhancement Download PDF

Info

Publication number
CN116872971A
Authority
CN
China
Prior art keywords
data
driving
control
driver
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310986664.9A
Other languages
Chinese (zh)
Inventor
程吉禹
丁俊锋
陈佳铭
张伟
宋然
李晓磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202310986664.9A priority Critical patent/CN116872971A/en
Publication of CN116872971A publication Critical patent/CN116872971A/en
Pending legal-status Critical Current

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00 Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 Planning or execution of driving tasks
    • B60W60/005 Handover processes
    • B60W60/0053 Handover processes from vehicle to occupant
    • B60W2510/00 Input parameters relating to a particular sub-units
    • B60W2510/20 Steering systems
    • B60W2520/00 Input parameters relating to overall vehicle dynamics
    • B60W2520/10 Longitudinal speed
    • B60W2540/00 Input parameters relating to occupants
    • B60W2540/18 Steering angle

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The application discloses an automatic driving control decision-making method and system based on man-machine cooperation enhancement, comprising the following steps: acquiring mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving; based on the supervision and correction actions, predicting driving simulation data under the current supervision and correction action, and scoring the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data; and training the control decision model based on the mixed data, introducing the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.

Description

Automatic driving control decision-making method and system based on man-machine cooperation enhancement
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an automatic driving control decision method and system based on man-machine cooperation enhancement.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Automatic driving technology has great potential for reducing the probability of traffic accidents, improving road driving safety, relieving traffic congestion, and so on; however, the real road environment is complex, environmental information is diverse, and other road users exhibit high uncertainty, which poses great challenges to automatic driving control decisions.
To improve the robustness and adaptability of automatic driving control decision algorithms and their ability to handle more complex decision problems, and thus better cope with complex and changeable road environments, researchers in recent years have trained the behavior decisions of intelligent vehicles with deep reinforcement learning, replacing traditional rule-based or efficiency-based methods. Automatic driving decision algorithms trained with deep reinforcement learning avoid the complex rule design, time-consuming parameter tuning and other problems faced by traditional methods, and achieve better performance.
However, methods that train the intelligent vehicle with deep reinforcement learning have their own limitations: the reward function used during reinforcement learning training must be designed carefully, otherwise training may suffer from sparse rewards; in addition, intelligent vehicles trained with deep reinforcement learning rarely exhibit highly intelligent and highly complex behavior strategies, such as changing lanes in advance, overtaking in time, or waiting and yielding. Humans, as highly intelligent beings, excel in exactly these respects, so introducing human intelligence into the agent's learning process through a man-machine collaborative paradigm has become an important research direction for solving complex automatic driving decision problems.
Existing man-machine cooperation methods mainly include behavior cloning, inverse reinforcement learning, and reinforcement learning based on human feedback. Behavior cloning is trained by imitation learning: before training, a simulation system and sensors are used to collect driving data of real human drivers, and the collected data serve as a dataset for supervised training, so that the intelligent vehicle learns the driving behavior of human drivers; however, behavior cloning not only requires a large amount of real data, but the trained model also generalizes poorly and has difficulty handling situations that do not appear in the expert data.
Inverse reinforcement learning also requires driving data of human drivers to be collected in advance, but unlike behavior cloning, where the collected data are used directly as an expert dataset for training, inverse reinforcement learning usually relies on the expert driver's behavior data to infer a latent reward function underlying the expert behavior, which is then used for further reinforcement learning. Because of the uncertainty of surrounding participants in complex traffic scenes, and because different expert drivers may make quite different decisions under the same conditions, the guidance provided by the inferred reward function is limited and the resulting strategy adapts poorly.
Reinforcement learning based on human feedback adopts a human-in-the-loop training paradigm: human experts participate directly in the reinforcement learning training process and give feedback signals according to the behavior decision performance of the intelligent vehicle during training, and these feedback signals are used to further adjust the reinforcement learning network parameters. The feedback signals are mainly of two kinds, evaluation-based and preference-based, but they are often too sparse and convey too little information, so their effect is limited.
Disclosure of Invention
In order to solve the above problems, the application provides an automatic driving control decision method and system based on man-machine cooperation enhancement, which adopt a man-machine cooperation enhancement training framework so that expert drivers play a role both before and within the reinforcement learning training loop, and which use driver driving demonstration data, vehicle self-driving data and driver supervision and correction data jointly in training, thereby improving training speed and accuracy.
In order to achieve the above purpose, the present application adopts the following technical scheme:
in a first aspect, the present application provides an automatic driving control decision method based on man-machine cooperation enhancement, including:
acquiring mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving;
based on the supervision and correction actions, predicting driving simulation data under the current supervision and correction action, and scoring the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data;
training the control decision model based on the mixed data, introducing the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.
As an alternative embodiment, the driver driving demonstration data and the vehicle self-driving data both comprise states and actions; the state comprises the image data, radar monitoring data, vehicle speed and vehicle tire direction of each decision time step, and both the actions and the supervision and correction actions comprise steering wheel steering and braking force.
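As one possible way to organize these inputs in software, the minimal sketch below groups the state and action fields named above into simple containers; the field names and types are illustrative assumptions rather than part of the claimed method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingState:
    """Observation at one decision time step (field names are assumptions)."""
    camera_image: np.ndarray   # image data from the on-board camera
    radar_points: np.ndarray   # radar monitoring data
    speed: float               # current vehicle speed
    tire_direction: float      # current vehicle tire direction

@dataclass
class DrivingAction:
    """Two-dimensional continuous action used for both the vehicle and the driver."""
    steering: float            # steering wheel steering
    braking: float             # braking force
```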
As an alternative embodiment, in training the control decision model, an actor-critic reinforcement learning paradigm is used, including a value function Q(s_t, a_t; φ) optimization objective and a policy function π(a_t|s_t; θ) optimization objective;
wherein the value function Q(s_t, a_t; φ) optimization objective is expressed as:
min_φ E[(Q(s_t, â_t; φ) − (r(s_t, â_t) + γ·Q(s_{t+1}, a'; φ)))^2] + α·E[I(s_t)·(Q(s_t, a_n,t; φ) − Q(s_t, a_h,t; φ))^2]
wherein φ is the critic network model parameter; â_t is the action finally executed by the vehicle in the training data; α is the weight introduced when the supervising driver takes over control; I(s_t) is a binary function indicating whether the supervising driver takes over control in state s_t, equal to 0 if not taking over and 1 if taking over; a_n,t and a_h,t respectively denote the actions of the vehicle and of the supervising driver in state s_t; r(s_t, a_t) is the reward function; γ is the discount factor; a' is the action predicted for the next moment; s_{t+1} is the state at time t+1.
As an alternative embodiment, the reward function is designed to give a reward value when the vehicle reaches the end point, when the vehicle collides with an obstacle, and when the supervising driver intervenes.
As an alternative embodiment, the policy function π(a_t|s_t; θ) optimization objective is expressed as maximizing:
max_θ E[Q(s_t, a_t) − β·log π(a_t|s_t; θ)]
wherein θ is the actor network model parameter; β is a hyperparameter; a_t is the action of the vehicle in the training data.
As an alternative embodiment, the driving simulation data is predicted by a strategy generator under the current supervision and correction action, and the current supervision and correction action is scored by a discriminator.
As an alternative embodiment, when training the control decision model based on the mixed data, three types of data in the mixed data are sampled proportionally, so that the three types of data together play a role in training the control decision model.
In a second aspect, the present application provides an automatic driving control decision system based on man-machine cooperation enhancement, comprising:
an acquisition module configured to acquire mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving;
a judging module configured to predict, based on the supervision and correction actions, driving simulation data under the current supervision and correction action, and to score the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data;
a decision module configured to train the control decision model based on the mixed data and to introduce the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.
In a third aspect, the application provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the application has the beneficial effects that:
(1) The application adopts a man-machine cooperative reinforcement training framework, so that expert drivers play a role both before the reinforcement learning training loop and within it. This alleviates the problems of poor generalization, unstable strategies and poor adaptability found in traditional behavior cloning and inverse reinforcement learning; by combining with reinforcement learning, human intelligence is integrated into reinforcement learning model training, effectively improving the performance of the automatic driving control decision algorithm network model.
(2) The application stores the experience obtained by the intelligent vehicle exploring the environment, the online demonstration actions of the supervising driver, and the driver driving demonstration data obtained before the reinforcement learning training loop starts in a data buffer, so that within the reinforcement learning loop a rich data buffer comprising vehicle self-driving data, driver supervision and correction data and driver driving demonstration data is formed. In the network model parameter updating stage, an offline reinforcement learning training mode is adopted: the data are randomly sampled in proportion, participate in training together and act on the gradient together. This improves the training speed of the network model while better blending human intelligence into the reinforcement learning training model, improves the performance of the automatic driving control decision algorithm, and effectively alleviates the problems of sparse rewards and slow training in the training of automatic driving control decision algorithms.
(3) While the supervising driver supervises the intelligent vehicle online, the supervision and correction action provided by the supervising driver may not be the optimal decision in the current scene, or the action decisions provided at different times for the same scene may be completely different; the instability in such cases may adversely affect the convergence of model parameter updates. Therefore, the application provides a dynamic weight learning method based on a discriminant model: a discriminator is trained by generative adversarial imitation learning and used as the basis for scoring the driver supervision and correction data, and the weight with which the driver supervision and correction data participate in gradient calculation is dynamically adjusted accordingly, effectively alleviating the problems of non-optimal online data and unstable online action decisions.
Additional aspects of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application.
Fig. 1 is a flow chart of an automatic driving control decision method based on man-machine cooperation enhancement provided in embodiment 1 of the present application.
Detailed Description
The application is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.
Embodiments of the application and features of the embodiments may be combined with each other without conflict.
Example 1
The embodiment provides an automatic driving control decision method based on man-machine cooperation enhancement, as shown in fig. 1, comprising the following steps:
acquiring mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving;
based on the supervision and correction actions, predicting driving simulation data under the current supervision and correction action, and scoring the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data;
training the control decision model based on the mixed data, introducing the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.
This embodiment is oriented to a traffic road environment in which a large number of vehicles controlled by a given algorithm are distributed. The example task is to train a network model that controls the intelligent vehicle to travel safely and efficiently from a starting point to an end point, where safe and efficient means that the intelligent vehicle takes the shortest time on the premise of not colliding with other vehicles or obstacles in the road.
In this embodiment, before reinforcement learning training starts, a human expert driver controls the vehicle on the road, and driver driving demonstration data are recorded, including the image data captured by the camera, radar monitoring data, current vehicle speed, vehicle tire direction and driving demonstration action at each decision time step; the driving demonstration action is a two-dimensional continuous variable representing steering wheel steering and braking force respectively.
The collected driver driving demonstration data are filtered: data with poor driver demonstration action decisions are removed, including scenes in which a collision occurs, situations in which improper operation puts the vehicle in danger even though no collision results, and scenes in which the driver's demonstration actions deviate greatly from one another for the same scene, so as to ensure stable and efficient action decisions, as sketched below.
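A minimal sketch of how such filtering might be applied to recorded demonstration episodes; the episode fields, threshold value and helper name are assumptions introduced only for illustration.

```python
def filter_demonstrations(episodes, deviation_threshold=0.3):
    """Keep only demonstration episodes without collisions or dangerous situations,
    and drop scenes whose demonstrated actions deviate strongly from other
    demonstrations of the same scene (threshold value is an assumption)."""
    kept = []
    for ep in episodes:
        if ep["collision"] or ep["dangerous_situation"]:
            continue  # remove collision scenes and near-dangerous situations
        if ep["action_deviation"] > deviation_threshold:
            continue  # remove scenes with large demonstration-action deviation
        kept.append(ep)
    return kept
```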
In this embodiment, during the reinforcement learning training stage, vehicle self-driving data and the supervision and correction actions of the supervising driver when taking over control during vehicle self-driving are also acquired;
the vehicle self-running data are data which are interacted and explored by the intelligent vehicle and the environment in the simulation environment, and also comprise image data shot by a camera, radar monitoring data, current vehicle speed, vehicle tire direction, steering wheel steering and braking force;
the driver supervises the corrective action as: when the action exhibited by the intelligent vehicle is not in line with the expectation, the supervising driver can interrupt the exploration of the intelligent vehicle on line and give out corresponding correct driving actions, including steering wheel steering and braking force.
While the supervising driver supervises the intelligent vehicle online, because demonstrations are provided online in real time, the supervision and correction action provided by the supervising driver may not be the optimal decision for the current scene; in addition, the action decisions the driver provides for the same scene at different times may be completely different, and such instability is likely to adversely affect the convergence of model parameter updates and ultimately the decision performance. Therefore, this embodiment provides a dynamic weight learning method based on a discriminant model to address the problems of non-optimal online data and unstable online action decisions.
Specifically:
A strategy generator G and a discriminator D are learned from the driver driving demonstration data in the manner of generative adversarial imitation learning (GAIL); the training goal is to obtain the generator G* that minimizes max_D V(G, D):
G* = arg min_G max_D V(G, D)
V(G, D) = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 − D(x))]
wherein P_data and P_G respectively denote the driver driving demonstration data and the driving simulation data generated by the strategy generator G; x is training sample data.
Unlike existing methods that use the strategy generator as the final control decision, this embodiment uses the discriminator D as a scoring device: the sample label of the driver driving demonstration data P_data is 1, and the sample label of the driving simulation data generated by the strategy generator is 0, so the discriminator can score the supervision and correction data of the supervising driver. The higher the discriminator score D(x), the closer the action decision is to the driver driving demonstration data in the current scene; and because the driver driving demonstration data have been filtered to remove data with poor driver action decisions and data with large action deviation in the same scene, a higher score indicates a better action decision in that scene.
The discriminator is used as a scoring device in the subsequent reinforcement learning training stage: when gradients are later computed, the weight with which the data collected when the supervising driver takes over control participate in the gradient calculation is dynamically adjusted according to the score of the discriminator D, which effectively addresses the problems of non-optimal online data and unstable online action decisions.
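A minimal PyTorch sketch of this idea, assuming the discriminator takes a state-action pair: driver demonstrations are labelled 1, generator rollouts 0, and the trained D(x) is then read out as a score for a supervision and correction sample. The network architecture, optimizer handling and tensor shapes are assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(x) in (0, 1) for a state-action pair x (architecture is an assumption)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def discriminator_step(D, optimizer, demo_batch, gen_batch):
    """One GAIL-style update: demonstration samples labelled 1, generated samples 0."""
    bce = nn.BCELoss()
    d_demo = D(demo_batch["state"], demo_batch["action"])
    d_gen = D(gen_batch["state"], gen_batch["action"])
    loss = bce(d_demo, torch.ones_like(d_demo)) + bce(d_gen, torch.zeros_like(d_gen))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def score_correction(D, state, human_action):
    """Score one supervision-and-correction sample; later used as the weight alpha."""
    with torch.no_grad():
        return D(state, human_action).mean().item()
```

Keeping the discriminator, rather than the generator, as the artifact used downstream matches the scoring role described above.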
In this embodiment, during the reinforcement learning training stage, the intelligent vehicle randomly explores the environment and accumulates experience. While the intelligent vehicle explores the environment freely, a supervising driver supervises it in real time; when the action exhibited by the intelligent vehicle deviates from the supervising driver's expectation, the driver can directly interrupt the intelligent vehicle's exploration and take over control of the vehicle, and after providing an action demonstration online, the driver releases control and the intelligent vehicle continues to explore the environment freely.
The data P_data+ collected in the learning process are represented as {(s_t, a_n,t, a_h,t, I(s_t)), …}, where a_n,t and a_h,t respectively denote the actions of the intelligent vehicle and of the supervising driver in state s_t; I(s_t) is a binary function indicating whether the supervising driver takes over control of the intelligent vehicle in state s_t: if the supervising driver does not take over at time t, I(s_t) is 0 and a_h,t is empty. The state s_t comprises the image data currently captured by the camera, the radar monitoring data, the current vehicle speed and the vehicle tire direction, and the actions are steering wheel steering and braking force.
In addition, the driver driving demonstration data P_data collected before the reinforcement learning training loop are mixed into the data buffer to participate in training, and this portion of data is also represented in the data storage form described above. The difference is that for this portion of data, I(s_t) is 1 for any state s_t and a_n,t is empty.
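The data layout just described might be stored as transition records such as the following sketch; the container and field names are assumptions for illustration.

```python
from collections import namedtuple

# One record of the mixed buffer: a_h is None when the driver did not take over
# (I(s_t) = 0), and a_n is None for offline demonstration data P_data (I(s_t) = 1).
Transition = namedtuple(
    "Transition", ["state", "a_n", "a_h", "takeover", "reward", "next_state"])

class MixedBuffer:
    """Keeps the three kinds of data separate so they can be sampled in proportion."""
    def __init__(self):
        self.self_driving = []    # vehicle self-driving data
        self.corrections = []     # driver supervision-and-correction data
        self.demonstrations = []  # driver driving demonstration data

    def add(self, t: Transition):
        if t.a_n is None:
            self.demonstrations.append(t)
        elif t.takeover:
            self.corrections.append(t)
        else:
            self.self_driving.append(t)
```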
The reward function required by the intelligent vehicle in the training process follows the design described above: a reward value is given when the vehicle reaches the end point, when the vehicle collides with an obstacle, and when the supervising driver intervenes.
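The concrete reward values are not reproduced here; the sketch below only mirrors the stated design of assigning values for reaching the end point, for collisions and for supervising-driver interventions. The magnitudes, signs and flag names are assumptions.

```python
def reward(reached_goal: bool, collided: bool, driver_intervened: bool) -> float:
    """Sparse reward following the stated design; all magnitudes are assumed values."""
    r = 0.0
    if reached_goal:
        r += 10.0   # reaching the end point (assumed value)
    if collided:
        r -= 10.0   # colliding with an obstacle (assumed penalty)
    if driver_intervened:
        r -= 1.0    # supervising-driver intervention (assumed penalty)
    return r
```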
in the model parameter updating stage, the sample data of the mixed data buffer area is randomly extracted by adopting an offline learning mode to calculate the gradient, and in the stage, the driving demonstration data of the driver, the self-driving data of the vehicle and the supervision and correction data of the driver participate in gradient calculation together.
In this embodiment, training is performed using an actor-critic reinforcement learning paradigm: the actor outputs actions so as to maximize the cumulative expected reward obtained, and the critic network scores the actor's actions. Q(s_t, a_t; φ) and π(a_t|s_t; θ) denote the value function and the policy function respectively, where φ and θ represent their network model parameters.
The value function optimization objective function is expressed as:
min_φ E[(Q(s_t, â_t; φ) − (r(s_t, â_t) + γ·Q(s_{t+1}, a'; φ)))^2] + α·E[I(s_t)·(Q(s_t, a_n,t; φ) − Q(s_t, a_h,t; φ))^2]
wherein â_t is the action finally executed by the intelligent vehicle in the sampled data; "finally executed" is used because there are cases in which the supervising driver intervenes; when the sampled data come from the driver driving demonstration data P_data, the action a_n,t obeys the policy function distribution, i.e. a_n,t ~ π(·|s_t; θ); a' is the action predicted by the policy function for the next moment; γ is the discount factor.
The objective function comprises two parts: the former is the temporal-difference error over the mixed data P_data+ and P_data, constrained by the reward function; the latter is mainly a human supervision signal, whose learning target is to reduce the difference between the Q values of the supervision and correction action and of the intelligent vehicle action when the supervising driver intervenes, thereby encouraging the intelligent vehicle to learn the human strategy; α is the weight of the latter term in the objective function, and is calculated from the score of the discriminator D trained before the reinforcement learning training loop, which eliminates the influence of non-optimal online demonstration data.
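A PyTorch sketch of one plausible reading of this two-part objective: a temporal-difference term over the mixed data plus a human-supervision term, active only where the driver took over, whose per-sample weight alpha comes from the discriminator score. The squared form of the supervision term, the discount factor and the batch field names are assumptions; all tensors are assumed to have shape [batch, 1] where applicable.

```python
import torch
import torch.nn.functional as F

def critic_loss(Q, Q_target, pi, D, batch, gamma=0.99):
    """Two-part value-function objective (one plausible reading of the text above)."""
    s, a_exec = batch["state"], batch["action_executed"]       # the finally executed action
    r, s_next = batch["reward"], batch["next_state"]
    mask = batch["takeover"].float()                           # I(s_t)

    # Temporal-difference term over P_data+ and P_data, constrained by the reward.
    with torch.no_grad():
        a_next = pi(s_next)                                    # a', predicted by the policy
        td_target = r + gamma * Q_target(s_next, a_next)
    loss_td = F.mse_loss(Q(s, a_exec), td_target)

    # Human-supervision term, active only where the supervising driver intervened.
    # For demonstration data P_data, a_n is assumed to be pre-sampled from pi(.|s_t).
    a_n, a_h = batch["agent_action"], batch["human_action"]
    with torch.no_grad():
        alpha = D(s, a_h)                                      # per-sample weight from the discriminator
    gap = Q(s, a_n) - Q(s, a_h)
    loss_human = (alpha * mask * gap.pow(2)).mean()

    return loss_td + loss_human
```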
The policy function optimization objective is a maximum value function expressed as:
max_θ E[Q(s_t, a_t) − β·log π(a_t|s_t; θ)]
a_t ~ π(·|s_t; θ)
wherein β is a hyperparameter.
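A matching sketch of the policy update, minimizing the negative of E[Q(s_t, a_t) - β·log π(a_t|s_t; θ)]; it assumes the policy object exposes a reparameterized sample together with its log-probability, as in common soft actor-critic style implementations, and the beta value is an assumption.

```python
def actor_loss(Q, pi, batch, beta=0.2):
    """Maximize E[Q(s, a) - beta * log pi(a|s)] by minimizing its negative
    (beta value and the pi.sample interface are assumptions)."""
    s = batch["state"]
    a, log_prob = pi.sample(s)          # a_t ~ pi(.|s_t; theta), with log-probability
    return (beta * log_prob - Q(s, a)).mean()
```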
After the gradient is calculated, the network model parameters are updated by using a gradient descent algorithm. After long-time training, rewards obtained by the intelligent vehicle gradually tend to be stable, the algorithm gradually converges, the intelligent vehicle can well complete an automatic driving control decision task, and the training task is completed. The training time and the training step number required by the training algorithm to achieve convergence are far smaller than those of the reinforcement learning method, and the algorithm performance is better.
In the training process, experience obtained by the intelligent vehicle exploring environment and online demonstration actions of a supervising driver are respectively stored in different data buffer areas for updating and using subsequent network model parameters; in addition, expert data collected before the reinforcement learning training loop begins is also added to the data buffer in training; the reinforcement learning loop forms a rich data buffer zone comprising vehicle self-driving data, driver supervision and correction data and driver driving demonstration data, adopts an off-line learning mode in a network model parameter updating stage, randomly samples the data in proportion, acts on gradients together, better integrates human intelligence into the reinforcement learning training model while improving the training speed of the network model, and improves the performance of an automatic driving control decision algorithm.
Example 2
The embodiment provides an automatic driving control decision system based on man-machine cooperation enhancement, which comprises:
an acquisition module configured to acquire mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving;
a judging module configured to predict, based on the supervision and correction actions, driving simulation data under the current supervision and correction action, and to score the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data;
a decision module configured to train the control decision model based on the mixed data and to introduce the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.
It should be noted that the above modules correspond to the steps described in embodiment 1, and the above modules are the same as examples and application scenarios implemented by the corresponding steps, but are not limited to those disclosed in embodiment 1. It should be noted that the modules described above may be implemented as part of a system in a computer system, such as a set of computer-executable instructions.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method described in embodiment 1. For brevity, the description is omitted here.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate array FPGA or other programmable logic device, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in embodiment 1 may be directly embodied as a hardware processor executing or executed with a combination of hardware and software modules in the processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
While the foregoing description of the embodiments of the present application has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the application, but rather, it is intended to cover all modifications or variations within the scope of the application as defined by the claims of the present application.

Claims (10)

1. An automatic driving control decision-making method based on man-machine cooperation enhancement is characterized by comprising the following steps:
acquiring mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving;
based on the supervision and correction actions, predicting driving simulation data under the current supervision and correction action, and scoring the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data;
training the control decision model based on the mixed data, introducing the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.
2. The automatic driving control decision method based on man-machine cooperation enhancement according to claim 1, wherein the driver driving demonstration data and the vehicle self-driving data both comprise states and actions; the state comprises the image data, radar monitoring data, vehicle speed and vehicle tire direction of each decision time step, and both the actions and the supervision and correction actions comprise steering wheel steering and braking force.
3. The automatic driving control decision method based on man-machine cooperation enhancement according to claim 1, wherein the control decision model is trained using an actor-critic reinforcement learning paradigm comprising a value function Q(s_t, a_t; φ) optimization objective and a policy function π(a_t|s_t; θ) optimization objective;
wherein the value function Q(s_t, a_t; φ) optimization objective is expressed as:
min_φ E[(Q(s_t, â_t; φ) − (r(s_t, â_t) + γ·Q(s_{t+1}, a'; φ)))^2] + α·E[I(s_t)·(Q(s_t, a_n,t; φ) − Q(s_t, a_h,t; φ))^2]
wherein φ is the critic network model parameter; â_t is the action finally executed by the vehicle in the training data; α is the weight introduced when the supervising driver takes over control; I(s_t) is a binary function indicating whether the supervising driver takes over control in state s_t, equal to 0 if not taking over and 1 if taking over; a_n,t and a_h,t respectively denote the actions of the vehicle and of the supervising driver in state s_t; r(s_t, a_t) is the reward function; γ is the discount factor; a' is the action predicted for the next moment; s_{t+1} is the state at time t+1.
4. The automatic driving control decision method based on man-machine cooperation enhancement according to claim 3, wherein the reward function is designed to give a reward value when the vehicle reaches the end point, when the vehicle collides with an obstacle, and when the supervising driver intervenes.
5. The automatic driving control decision method based on man-machine cooperation enhancement according to claim 3, wherein the policy function π(a_t|s_t; θ) optimization objective is expressed as maximizing:
max_θ E[Q(s_t, a_t) − β·log π(a_t|s_t; θ)]
wherein θ is the actor network model parameter; β is a hyperparameter; a_t is the action of the vehicle in the training data.
6. The automatic driving control decision method based on man-machine cooperation enhancement according to claim 1, wherein the driving simulation data are predicted by a strategy generator under the current supervision and correction action, and the current supervision and correction action is scored by a discriminator.
7. The automatic driving control decision method based on man-machine cooperation enhancement according to claim 1, wherein when the control decision model is trained based on the mixed data, the three types of data in the mixed data are sampled proportionally, so that the three types of data jointly contribute to the training of the control decision model.
8. An automatic driving control decision system based on man-machine cooperation enhancement, comprising:
an acquisition module configured to acquire mixed data, where the mixed data comprise driver driving demonstration data, vehicle self-driving data, and the supervision and correction actions taken when a supervising driver takes over control during vehicle self-driving;
a judging module configured to predict, based on the supervision and correction actions, driving simulation data under the current supervision and correction action, and to score the current supervision and correction action so as to determine the weight given, when the control decision model is trained, to the difference between the driving data after the current supervision and correction action takes over control and the vehicle self-driving data;
a decision module configured to train the control decision model based on the mixed data and to introduce the corresponding weight when the supervising driver takes over control, so as to obtain the automatic driving control strategy from the trained control decision model.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-7.
CN202310986664.9A 2023-08-07 2023-08-07 Automatic driving control decision-making method and system based on man-machine cooperation enhancement Pending CN116872971A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310986664.9A CN116872971A (en) 2023-08-07 2023-08-07 Automatic driving control decision-making method and system based on man-machine cooperation enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310986664.9A CN116872971A (en) 2023-08-07 2023-08-07 Automatic driving control decision-making method and system based on man-machine cooperation enhancement

Publications (1)

Publication Number Publication Date
CN116872971A true CN116872971A (en) 2023-10-13

Family

ID=88264593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310986664.9A Pending CN116872971A (en) 2023-08-07 2023-08-07 Automatic driving control decision-making method and system based on man-machine cooperation enhancement

Country Status (1)

Country Link
CN (1) CN116872971A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117227834A (en) * 2023-11-10 2023-12-15 中国矿业大学 Man-machine cooperative steering control method for special vehicle
CN117227834B (en) * 2023-11-10 2024-01-30 中国矿业大学 Man-machine cooperative steering control method for special vehicle

Similar Documents

Publication Publication Date Title
CN110969848B (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
Zhang et al. Query-efficient imitation learning for end-to-end autonomous driving
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
Zhang et al. Query-efficient imitation learning for end-to-end simulated driving
Liu et al. Improved deep reinforcement learning with expert demonstrations for urban autonomous driving
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN111679660B (en) Unmanned deep reinforcement learning method integrating human-like driving behaviors
CN116872971A (en) Automatic driving control decision-making method and system based on man-machine cooperation enhancement
US20210341886A1 (en) System and Method of Efficient, Continuous, and Safe Learning Using First Principles and Constraints
US20230351200A1 (en) Autonomous driving control method, apparatus and device, and readable storage medium
CN116476825B (en) Automatic driving lane keeping control method based on safe and reliable reinforcement learning
CN113581182A (en) Method and system for planning track change of automatic driving vehicle based on reinforcement learning
Hou et al. Autonomous driving at the handling limit using residual reinforcement learning
Jiang et al. Path tracking control based on Deep reinforcement learning in Autonomous driving
CN117406756A (en) Method, device, equipment and storage medium for determining motion trail parameters
CN116486356A (en) Narrow scene track generation method based on self-adaptive learning technology
CN114997048A (en) Automatic driving vehicle lane keeping method based on TD3 algorithm improved by exploration strategy
CN112835362B (en) Automatic lane change planning method and device, electronic equipment and storage medium
CN110378460B (en) Decision making method
CN113867332B (en) Unmanned vehicle self-learning control method, device, equipment and readable storage medium
Wang et al. An End-to-End Deep Reinforcement Learning Model Based on Proximal Policy Optimization Algorithm for Autonomous Driving of Off-Road Vehicle
CN116540602B (en) Vehicle unmanned method based on road section safety level DQN
CN116822659B (en) Automatic driving motor skill learning method, system, equipment and computer medium
Yuan et al. From Naturalistic Traffic Data to Learning-Based Driving Policy: A Sim-to-Real Study
Yang et al. Decision-making in autonomous driving by reinforcement learning combined with planning & control

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination