CN116880218A - Robust driving strategy generation method and system based on driving style misunderstanding - Google Patents

Robust driving strategy generation method and system based on driving style misunderstanding

Info

Publication number
CN116880218A
Authority
CN
China
Prior art keywords
self
network
driving
misunderstanding
vehicle
Prior art date
Legal status
Granted
Application number
CN202311141653.7A
Other languages
Chinese (zh)
Other versions
CN116880218B (en)
Inventor
王越
张冬堃
崔瑜翔
王云凯
熊蓉
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311141653.7A priority Critical patent/CN116880218B/en
Publication of CN116880218A publication Critical patent/CN116880218A/en
Application granted granted Critical
Publication of CN116880218B publication Critical patent/CN116880218B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a robust driving strategy generation method and system based on driving style misunderstanding, belonging to the fields of artificial intelligence and automatic driving. Firstly, a background strategy network continuously interacts with an automatic driving simulator, agent samples are collected, and the parameters of the background evaluation function network and the background strategy network are updated until training of the background strategy network is completed. Then the trained background strategy network is fixed, and the self-vehicle driving strategy network and the misunderstanding strategy network jointly interact with the automatic driving simulator; the collected self-vehicle samples are used to update the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network until training of the self-vehicle driving strategy network is completed. The invention provides diverse adversarial training data for the self-vehicle strategy network and improves the robustness of the self-vehicle driving strategy to traffic flow variation.

Description

Robust driving strategy generation method and system based on driving style misunderstanding
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to a method and a system for generating a robust driving strategy through a neural network and reinforcement learning in an automatic driving scene.
Background
In recent years, reinforcement learning has achieved great success in game tasks such as Go and Atari, and has also been widely applied in robotics and autonomous driving. However, deploying driving strategies learned by reinforcement learning in dense real-world traffic flows remains an open problem: even a sufficiently trained driving strategy can degrade significantly when it encounters traffic flows whose behavioral patterns differ from those of the training environment. Since real-world traffic flow behavior cannot be predicted at the time the driving strategy is trained, achieving zero-shot transfer of driving strategies across different traffic flows is particularly important.
To address this challenge, one line of work enriches behavior patterns by introducing interactions between agents and then learns the self-vehicle (ego vehicle) driving strategy from the traffic flow formed by these agents. Such work explores both analytical (rule-based) construction and multi-agent reinforcement learning to build traffic flows. These methods focus on improving traffic flow efficiency or on introducing cognitive priors to diversify agent behavior, but they cannot account for efficiency and diversity at the same time; more importantly, they do not evaluate zero-shot transfer of the self-vehicle driving strategy. Another line of work attempts to improve the zero-shot transfer capability of self-vehicle driving strategies by adding safety-critical driving data. Such methods typically employ robust adversarial reinforcement learning, using agents in the traffic flow to attack the self-vehicle and cause it to fail. This approach works well in sparse traffic but does not scale to dense traffic flows: on the one hand, increasing the number of adversarial agents makes it difficult for the self-vehicle to resist such strong environmental disturbance; on the other hand, the approach prevents the adversarial agents from pursuing their own objectives, destroying the interactive behavior inside the traffic flow.
Disclosure of Invention
The invention aims to solve the defect of poor robustness of a driving strategy generated in traffic flow in the prior art, and provides a robust driving strategy generation method and system based on misunderstanding of driving style.
The specific technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides a robust driving strategy generation method based on driving style misunderstanding, which includes:
S1: randomly initializing the network parameters of a multi-agent background strategy network and a corresponding background evaluation function network, using the background strategy network to interact with an automatic driving simulator, and collecting the sample pairs of all agents and storing them in a first buffer, wherein each agent's sample pair comprises the current observation, current action, return reward, future observation and driving preference value;
S2: randomly extracting a batch of agent sample pairs from the first buffer; first calculating weighting coefficients from the driving preference value of each extracted sample pair and weighting the return rewards of all the other agents in the current batch into that agent's own return reward; then calculating loss functions based on all the samples with updated return rewards, and updating the parameters of the background evaluation function network and the background strategy network accordingly; finally, using the updated background strategy network to interact with the automatic driving simulator again and collecting new sample pairs to update the first buffer, completing one round of iteration; iterating continuously until training of the background strategy network is completed, so that the background strategy network can generate driving behaviors of different styles according to different driving preference values;
S3: fixing the trained background strategy network; randomly initializing the network parameters of a self-vehicle driving strategy network, a misunderstanding strategy network, and the corresponding self-vehicle evaluation function network and misunderstanding evaluation function network; then using the background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network to interact with the automatic driving simulator, collecting self-vehicle sample pairs and storing them in a second buffer, wherein each self-vehicle sample pair comprises the current observation, self-vehicle action, misunderstanding action, return reward and future observation;
S4: randomly extracting a batch of self-vehicle sample pairs from the second buffer; first calculating loss functions based on the extracted sample pairs and updating the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network accordingly; then using the updated self-vehicle driving strategy network and misunderstanding strategy network to interact with the automatic driving simulator again and collecting new self-vehicle sample pairs to update the second buffer, completing one round of iteration; iterating continuously until training of the self-vehicle driving strategy network is completed, so that the self-vehicle driving strategy network can continuously receive self-vehicle observations and generate self-vehicle actions in traffic flows with unknown behaviors, realizing robust driving.
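The overall flow of S1-S4 can be summarized in a short sketch, shown below. The Python pseudocode is illustrative only: the callables `collect_bg`, `update_bg`, `collect_ego` and `update_ego`, the iteration counts and the batch size are assumptions introduced for readability, not identifiers taken from the patent.

```python
import random

def train_robust_driving_strategy(collect_bg, update_bg, collect_ego, update_ego,
                                  bg_iters=1000, ego_iters=1000, batch_size=256):
    """Two-stage pipeline of S1-S4 (sketch; every callable is a placeholder).

    collect_bg()      -> list of agent sample pairs from the simulator (S1)
    update_bg(batch)  -> one update of the background critic and policy (S2)
    collect_ego()     -> list of self-vehicle sample pairs gathered with the frozen
                         background policy, the self-vehicle driving policy and the
                         misunderstanding policy (S3)
    update_ego(batch) -> one update of the self-vehicle critic, misunderstanding
                         critic, self-vehicle policy and misunderstanding policy (S4)
    """
    # Stage 1 (S1-S2): train the background strategy network.
    buffer_1 = collect_bg()
    for _ in range(bg_iters):
        update_bg(random.sample(buffer_1, k=min(batch_size, len(buffer_1))))
        buffer_1 += collect_bg()          # refresh the first buffer

    # Stage 2 (S3-S4): the background policy stays frozen inside collect_ego/update_ego.
    buffer_2 = collect_ego()
    for _ in range(ego_iters):
        update_ego(random.sample(buffer_2, k=min(batch_size, len(buffer_2))))
        buffer_2 += collect_ego()         # refresh the second buffer
```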
As a preference of the first aspect, the automatic driving simulator is modeled as a partially observable stochastic game based on driving preferences, which is specified by a joint driving preference space, a set of agent numbers, a state space, a joint action space, a joint observation space, a state transition probability, a set of individual reward functions, an initial state distribution, a discount factor, and the time range of one game round; the driving preference of each agent is an angle in the range of 0° to 90°.
As a preference of the above first aspect, the data space of each agent's sample pair includes a state space, an observation space, an action space, and a reward function;
in the state space, the system state at each time step comprises static elements and dynamic elements, wherein the static elements comprise the lane center lines, road shoulders and global paths of all the agents, and the dynamic elements comprise the current poses and speeds of all the agents;
in the observation space, the observation of each agent at any moment covers only part of the system state, each agent cannot obtain the global paths of the other agents, all vectors in the system state are converted into each agent's own coordinate system, and the observation of each agent includes its historical pose and speed information;
in the action space, the action comprises a reference speed and a steering value;
in the reward function, the individual reward functions of all the agents are identical in structure and are obtained by the weighted summation of a dense reward function that encourages fast driving and a sparse reward function that punishes catastrophic failures.
As a preference of the first aspect, the catastrophic failures include: collision with another agent, leaving the drivable area, leaving the global path, and entering a wrong lane; if any agent encounters a catastrophic failure, it is terminated and removed from the environment.
As a preference of the first aspect, in S2, when updating the return reward of each extracted sample pair, the sample pair's own return reward is used as the first weighted term, the mean of the return rewards of all the other agents in the current batch is used as the second weighted term, and the weighted sum of the two terms is used as the new return reward of the sample pair, wherein the weights of the first and second weighted terms are respectively the cosine and the sine of the driving preference value of that sample pair.
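This weighting rule is simply a cosine/sine blend of an agent's own reward with the mean reward of the other agents in the batch. A minimal sketch follows; the function and argument names are assumptions for illustration.

```python
import math

def mix_reward(own_reward: float, other_rewards: list, preference_deg: float) -> float:
    """Preference-weighted return reward used in S2.
    preference_deg is the driving preference z in [0, 90] degrees:
    z = 0  keeps only the agent's own reward (cos 0 = 1, sin 0 = 0);
    z = 90 keeps only the mean reward of the other agents."""
    z = math.radians(preference_deg)
    others_mean = sum(other_rewards) / len(other_rewards)
    return math.cos(z) * own_reward + math.sin(z) * others_mean

print(mix_reward(1.0, [0.2, 0.4], preference_deg=0.0))   # 1.0   (fully selfish)
print(mix_reward(1.0, [0.2, 0.4], preference_deg=90.0))  # ~0.3  (fully cooperative)
```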
As a preference of the first aspect, in S2, the specific method for updating the parameters of the background evaluation function network and the background strategy network is as follows:
S201: for each agent sample pair after the return-reward update, first input the future observation and the driving preference values of all agents in the sample pair into the background strategy network to calculate the future action; then input the future observation, the driving preference values of all agents and the future action into the background evaluation function network to obtain a target evaluation value, and input the current observation, the driving preference values of all agents and the current action into the background evaluation function network to obtain a current evaluation value; finally, compute the temporal-difference error from the current evaluation value, the target evaluation value and the return reward, and take it as the loss function of the background evaluation function network; the parameters of the background evaluation function network are updated using the sum of the loss function values of all agents in the whole batch;
S202: for each agent sample pair after the return-reward update, input the current observation and the driving preference values of all agents into the background strategy network to compute a new current action and its log probability density, then input the new current action and the current observation into the background evaluation function network to obtain a background evaluation value, and finally take the weighted sum of the negative background evaluation value and the log probability density as the loss function of the background strategy network; the parameters of the background strategy network are updated using the sum of the loss function values of all agents in the whole batch.
As a preferred aspect of the first aspect, the input of the misunderstanding policy network is an observation of a vehicle, the output is a false preference of the vehicle, and the false preference of the vehicle is required to be embedded into the combined driving preference of all the agents to input the background policy network, so as to generate misunderstanding actions.
As a preference of the first aspect, in S4, the specific method for updating the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network is as follows:
S401: for each self-vehicle sample pair extracted in the current batch, input the future observation in the sample pair into the self-vehicle driving strategy network to calculate the future self-vehicle action, input the future observation and the future self-vehicle action into the self-vehicle evaluation function network to obtain a target self-vehicle evaluation value, input the current observation and the current self-vehicle action into the self-vehicle evaluation function network to obtain a current self-vehicle evaluation value, compute the self-vehicle temporal-difference error from the current self-vehicle evaluation value, the target self-vehicle evaluation value and the return reward, and use the self-vehicle temporal-difference error as the loss function of the self-vehicle evaluation function network to update its parameters;
s402: inputting future observations in a sample pair into a misunderstanding strategy network to calculate future misunderstanding actions aiming at a sample pair of the current batch of extracted self vehicles, inputting the future observations and the future misunderstanding actions into a misunderstanding evaluation function network to acquire a target misunderstanding evaluation value, inputting the current observations and the current misunderstanding actions into the misunderstanding evaluation function network to acquire a current misunderstanding evaluation value, calculating misunderstanding time difference errors according to the current misunderstanding evaluation value, the target misunderstanding evaluation value and negative return rewards, and using the misunderstanding time difference errors as a loss function of the misunderstanding evaluation function network to update parameters of the misunderstanding evaluation function network;
S403: aiming at the sample pairs of the self-vehicles extracted in the current batch, inputting the current observation into a self-vehicle driving strategy network, calculating to obtain new current self-vehicle actions and corresponding logarithmic probability densities, inputting the new current self-vehicle actions and the current observation into a self-vehicle evaluation function network at the same time to obtain a self-vehicle evaluation value, weighting and summing the negative numbers of the self-vehicle evaluation values and the logarithmic probability densities to serve as a loss function of the self-vehicle driving strategy network, and updating parameters of the self-vehicle driving strategy network;
s404: aiming at the sample pairs of the self-vehicle extracted from the current batch, inputting the current observation into a misunderstanding strategy network, calculating to obtain new current misunderstanding actions and corresponding logarithmic probability densities, inputting the new current misunderstanding actions and the current observation simultaneously into a misunderstanding evaluation function network to obtain misunderstanding evaluation values, taking the weighted sum of the negative numbers of the misunderstanding evaluation values and the logarithmic probability densities as a loss function of the misunderstanding strategy network, and updating parameters of the misunderstanding strategy network.
As a preferable mode of the first aspect, the background strategy network, the self-driving strategy network, the misunderstanding strategy network, the background evaluation function network, the self-driving evaluation function network, and the misunderstanding evaluation function network all use a VectorNet neural network model.
In a second aspect, the present invention provides a robust driving strategy generation system based on driving style misunderstanding, comprising:
the agent sample pair acquisition module is used for randomly initializing the network parameters of the multi-agent background strategy network and the corresponding background evaluation function network, interacting with the automatic driving simulator using the background strategy network, and collecting the sample pairs of all agents and storing them in the first buffer, wherein each agent's sample pair comprises the current observation, current action, return reward, future observation and driving preference value;
the background strategy network training module is used for randomly extracting a batch of agent sample pairs from the first buffer, first calculating weighting coefficients from the driving preference value of each extracted sample pair and weighting the return rewards of all the other agents in the current batch into that agent's own return reward, then calculating loss functions based on all the samples with updated return rewards and updating the parameters of the background evaluation function network and the background strategy network accordingly, and finally using the updated background strategy network to interact with the automatic driving simulator again and collecting new sample pairs to update the first buffer, completing one round of iteration; iterating continuously until training of the background strategy network is completed, so that the background strategy network can generate driving behaviors of different styles according to different driving preference values;
the self-vehicle sample pair acquisition module is used for fixing the trained background strategy network, randomly initializing the network parameters of the self-vehicle driving strategy network, the misunderstanding strategy network, and the corresponding self-vehicle evaluation function network and misunderstanding evaluation function network, interacting with the automatic driving simulator using the background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network, and collecting self-vehicle sample pairs and storing them in the second buffer, wherein each self-vehicle sample pair comprises the current observation, self-vehicle action, misunderstanding action, return reward and future observation;
the self-vehicle driving strategy network training module is used for randomly extracting a batch of self-vehicle sample pairs from the second buffer, calculating loss functions based on the extracted samples to update the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network, then using the updated self-vehicle driving strategy network and misunderstanding strategy network to interact with the automatic driving simulator again and collecting new self-vehicle sample pairs to update the second buffer, completing one round of iteration; iterating continuously until training of the self-vehicle driving strategy network is completed, so that the self-vehicle driving strategy network can continuously receive self-vehicle observations and generate self-vehicle actions in traffic flows with unknown behaviors, realizing robust driving.
Compared with the prior art, the invention has the following beneficial effects:
1. The background strategy model adopted by the invention makes decisions according to the driving preferences of all the agents, and under specified, diverse driving preferences the generated traffic flow improves overall efficiency.
2. The invention introduces the misunderstanding strategy to generate false preferences, causing the traffic flow to misunderstand the self-vehicle strategy; this avoids unreasonable and unrealistic adversarial behaviors of traffic-flow agents against the self-vehicle, and maintains efficiency while preserving the adversarial nature of the traffic flow.
3. The invention trains the self-vehicle and misunderstanding strategies simultaneously, providing diverse adversarial training data for the self-vehicle strategy, improving the robustness of the self-vehicle driving strategy to traffic flow variation, and improving the zero-shot transfer performance of the self-vehicle driving strategy in other traffic flows.
Drawings
Fig. 1 is a schematic step diagram of a robust driving strategy generation method based on driving style misunderstanding in the present invention.
Fig. 2 is a block diagram of a robust driving strategy generation system based on driving style misunderstanding in the present invention.
Detailed Description
In order that the above objects, features and advantages of the invention will be readily understood, a more particular description of the invention will be rendered by reference to the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be embodied in many other forms than described herein and similarly modified by those skilled in the art without departing from the spirit of the invention, whereby the invention is not limited to the specific embodiments disclosed below. The technical features of the embodiments of the invention can be combined correspondingly on the premise of no mutual conflict.
In the description of the present invention, it should be understood that the terms "first" and "second" are used solely for the purpose of distinguishing between the descriptions and not necessarily for the purpose of indicating or implying a relative importance or implicitly indicating the number of features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
As shown in fig. 1, in a preferred embodiment of the present invention, a robust driving strategy generation method based on driving style misunderstanding is provided, and implementation steps thereof include S1 to S4. The implementation of the steps of the method is described below.
S1: network parameters of a background strategy network of multiple agents and a corresponding background evaluation function network are randomly initialized, the background strategy network is utilized to interact with an automatic driving simulator, sample pairs of all agents are collected and stored in a first buffer memory, and the sample pairs of each agent comprise current observation, current action, rewarding rewards, future observation and driving preference values.
It should be noted that the background strategy network adopted in the present invention is a neural network used to output an agent's action from that agent's input observation; the background evaluation function network corresponds to the background strategy network and is another neural network that evaluates the output of the background strategy network in order to train it. The form of neural network that can be used for the background strategy network and the background evaluation function network is not limited; in embodiments of the invention both can be implemented with a VectorNet neural network model.
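As a rough illustration of the kind of model this implies, the sketch below is a much-simplified VectorNet-style encoder: each input vector is encoded by a small MLP, vectors are max-pooled into per-polyline features, the polyline features interact through self-attention, and a task-specific head produces the network output (an action, a preference, or an evaluation value). All layer sizes and module names are assumptions for illustration; the patent only states that VectorNet is used.

```python
import torch
import torch.nn as nn

class TinyVectorNet(nn.Module):
    """Simplified VectorNet-style encoder (illustrative, not the patent's exact model).
    Input:  vectors of shape [batch, num_polylines, num_vectors, feat_dim]
    Output: one feature vector per sample, mapped by a small head."""

    def __init__(self, feat_dim: int, hidden: int = 64, out_dim: int = 2):
        super().__init__()
        self.vector_mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))
        self.attention = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))

    def forward(self, vectors: torch.Tensor) -> torch.Tensor:
        node_feat = self.vector_mlp(vectors)          # encode every vector
        poly_feat = node_feat.max(dim=2).values       # max-pool vectors -> polyline features
        global_feat, _ = self.attention(poly_feat, poly_feat, poly_feat)  # polyline interaction
        pooled = global_feat.max(dim=1).values        # pool polylines -> scene feature
        return self.head(pooled)                      # e.g. action mean, value or preference

# Example: a batch of 4 scenes, each with 8 polylines of 10 vectors (6 features each).
model = TinyVectorNet(feat_dim=6)
print(model(torch.randn(4, 8, 10, 6)).shape)          # torch.Size([4, 2])
```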
In an embodiment of the present invention, the automatic driving simulator is modeled as a partially observable stochastic game based on driving preferences; this game is specified by the joint driving preference space, the set of agent numbers, the state space, the joint action space, the joint observation space, the state transition probability, the set of individual reward functions, the initial state distribution, the discount factor, and the time range of one game round. Specifically, the driving preference of each agent is an angle in the range of 0° to 90°.
In an embodiment of the invention, the driving-preference-based partially observable stochastic game ultimately modeled by the automatic driving simulator is formally defined as the tuple $\langle N, \mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{Z}, P, \{r_i\}_{i\in\mathcal{N}}, \rho_0, \gamma, T \rangle$, where $N$ denotes the number of agents; $\mathcal{N}=\{1,\dots,N\}$ denotes the set of agent numbers; $\mathcal{S}$ is the state space, with $s\in\mathcal{S}$ representing the state of the entire environment; $\mathcal{A}=\mathcal{A}_1\times\dots\times\mathcal{A}_N$ denotes the joint action space of the $N$ agents, with $\mathcal{A}_i$ the action space available to agent $i$; $a=(a_1,\dots,a_N)$ denotes the joint action of the $N$ agents, where $a_i$ denotes the action of agent $i$; $\mathcal{O}=\mathcal{O}_1\times\dots\times\mathcal{O}_N$ denotes the joint observation space, with $\mathcal{O}_i$ the state space visible to agent $i$; $o=(o_1,\dots,o_N)$ denotes the joint observation, where $o_i$ is the partial observation of agent $i$; $\mathcal{Z}=\mathcal{Z}_1\times\dots\times\mathcal{Z}_N$ denotes the joint driving preference space, with $\mathcal{Z}_i$ the driving preference space of agent $i$; $z=(z_1,\dots,z_N)$ denotes the joint driving preference, where $z_i$ is the driving preference of agent $i$; $P$ denotes the state transition probability; $\{r_i\}_{i\in\mathcal{N}}$ denotes the set of agent-specific individual reward functions, where the individual reward function $r_i$ of agent $i$ is bounded for all $i$ (i.e. each agent obtains its individual reward only from its own reward function); $\rho_0$ denotes the initial state distribution; $\gamma$ denotes the discount factor; and $T$ denotes the time range of a single game round.
In addition, for a single agent, the spaces to which the data in its sample pair belong include a state space, an observation space, an action space, and a reward function. The state space, observation space, action space and reward function of a single agent's sample pair are defined as follows:
(1) In the state space, the system state $s_t$ at each time step comprises static elements and dynamic elements; the static elements comprise the lane center lines, road shoulders and global paths of all the agents, and the dynamic elements comprise the current poses and speeds of all the agents.
In an embodiment of the invention, the static elements and dynamic elements contained in the state $s_t$ at a given time step are defined as follows. A lane center line in the static elements is represented by a series of polylines, and each polyline is composed of a series of vectors, where the $k$-th static vector is $v_k = (p_k, d_k, w_k)$, in which $p_k$ is the position of the point, $d_k$ is its direction, and $w_k$ is the lane width of the polyline to which it belongs. A road shoulder is defined similarly to a center line and is also composed of a series of polylines, the only difference being that its lane width is 0. The global path of each agent remains unchanged within one round and is likewise composed of a series of vectors. The dynamic element of each agent $i$ is defined as a vector containing its current pose together with its speed $u_i$.
(2) In the observation space, the observation $o_{i,t}$ of each agent $i$ at any time $t$ can only partially observe the system state $s_t$, with two restrictions: on the one hand, each agent cannot obtain the global paths of the other agents; on the other hand, all vectors in the system state $s_t$ are converted into the observing agent's own coordinate system (i.e. the vectors in $s_t$ are expressed relative to the agent's current pose). In addition, the agent's historical poses and speeds are included in the observation to capture its higher-order dynamic information. Thus, in an embodiment of the invention, the observation of each agent is represented by a series of polylines over a historical time range $H$, and each vector of such a polyline additionally records the time distance between its corresponding historical moment and the current moment, i.e. each vector describes the agent at that historical moment.
(3) In the action space, the action includes a reference speed and a steering value. In an embodiment of the present invention, $a_i = (u_i^{\mathrm{ref}}, \delta_i)$, where $u_i^{\mathrm{ref}}$ is the reference speed and $\delta_i$ is the steering.
(4) In the reward function, the individual reward functions of all the agents are identical in structure and are obtained by the weighted summation of a dense reward function $r_i^{\mathrm{dense}}$ that encourages fast driving and a sparse reward function $r_i^{\mathrm{sparse}}$ that punishes catastrophic failures.
It should be noted that a catastrophic failure refers to a catastrophic event encountered by an agent during operation. In embodiments of the present invention, catastrophic failures include collision with another agent (Collision), leaving the drivable area (Off Road), leaving the global path (Off Route), and entering a wrong lane (Wrong Lane). If an agent encounters a catastrophic failure, it is terminated and removed from the environment.
In addition, the dense reward function $r_i^{\mathrm{dense}}$ should encourage the agent to drive at as high a speed as possible to increase the operating efficiency of the traffic flow, taking a form in which the higher the driving speed, the larger the value of the dense reward function. In an embodiment of the invention, the dense reward function is defined in terms of $u_i$ and $u_i^{\max}$, the current speed and the maximum speed of the agent, respectively, such that the reward encourages high-speed driving and penalizes low-speed driving.
In addition, the sparse reward function $r_i^{\mathrm{sparse}}$ should discourage catastrophic failures of the agent during operation, so if any catastrophic failure occurs, a negative sparse reward value is assigned to the agent as a penalty. In an embodiment of the invention, the sparse reward function is defined as
$$r_i^{\mathrm{sparse}} = -\mathbb{1}\!\left[\text{Collision} \vee \text{Off Road} \vee \text{Off Route} \vee \text{Wrong Lane}\right],$$
where $\mathbb{1}[\cdot]$ is the indicator function and $\vee$ denotes logical OR: the penalty is assigned to the agent if any of the catastrophic failures occurs, and the function value is 0 if no catastrophic failure occurs.
Thus, the dense reward function $r_i^{\mathrm{dense}}$ and the sparse reward function $r_i^{\mathrm{sparse}}$ are weighted and summed to form the individual reward function of agent $i$:
$$r_i = w_{\mathrm{dense}}\, r_i^{\mathrm{dense}} + w_{\mathrm{sparse}}\, r_i^{\mathrm{sparse}}.$$
It should be noted that in an actual implementation the weights should satisfy $w_{\mathrm{sparse}} \gg w_{\mathrm{dense}}$, to avoid the agent sacrificing safety for fast driving.
S2: A batch of agent sample pairs is randomly extracted from the first buffer; a weighting coefficient is calculated from the driving preference value of each extracted sample pair, and the return rewards of all the other agents in the current batch are weighted into that agent's own return reward; loss functions are then calculated based on all the samples with updated return rewards, and the parameters of the background evaluation function network and the background strategy network are updated accordingly; finally, the updated background strategy network is used to interact with the automatic driving simulator again, new sample pairs are collected and updated into the first buffer, thereby finishing one round of iteration. This is iterated until training of the background strategy network is completed, so that the background strategy network can generate driving behaviors of different styles according to different driving preference values.
In the iteration process of S2, the sample pair in the first buffer memory is updated after each iteration round for the next iteration round to be extracted, so that the background strategy network and the corresponding background evaluation function network are continuously iterated and trained. In each iteration, after a new batch sample pair is extracted, the rewards of all the agents in the whole batch are required to be updated mutually, and then the parameters of the background evaluation function network and the background strategy network are updated by the agent samples after the rewards are updated.
In the embodiment of the present invention, in step S2, the specific method for updating the return reward of each sample pair is as follows:
The sample pair whose return reward is currently to be updated is taken as the current sample pair; the current sample pair's own return reward is taken as the first weighted term, the mean of the return rewards of all the other agents in the current batch is taken as the second weighted term, and the weighted sum of the two terms replaces the original return reward of the current sample pair as its new return reward. In the weighted summation, the weight of the first weighted term is the cosine of the driving preference value of the current sample pair, and the weight of the second weighted term is the sine of the driving preference value of the current sample pair.
In the embodiment of the present invention, in step S2, the specific method for updating the parameters of the background evaluation function network and the background strategy network is as follows:
S201: for each agent sample pair after the return-reward update, first input the future observation and the driving preference values of all agents in the sample pair into the background strategy network to calculate the future action; then input the future observation, the driving preference values of all agents and the future action into the background evaluation function network to obtain a target evaluation value, and input the current observation, the driving preference values of all agents and the current action into the background evaluation function network to obtain a current evaluation value; finally, compute the temporal-difference error from the current evaluation value, the target evaluation value and the return reward, and take it as the loss function of the background evaluation function network; the parameters of the background evaluation function network are updated using the sum of the loss function values of all agents in the whole batch.
S202: for each agent sample pair after the return-reward update, input the current observation and the driving preference values of all agents into the background strategy network to compute a new current action and its log probability density; then input the new current action and the current observation into the background evaluation function network to obtain a background evaluation value; finally, take the weighted sum of the negative background evaluation value and the log probability density as the loss function of the background strategy network; the parameters of the background strategy network are updated using the sum of the loss function values of all agents in the whole batch.
The specific procedures for updating the parameters of the background evaluation function network and the background strategy network are standard neural-network practice and can be realized by gradient descent and similar methods, so they are not described in detail; a minimal sketch is given below for reference. After the background strategy model has been iterated until training is finished, a certain number of agents are placed in the automatic driving simulator, each with a different global path, pose and driving preference value, and the background strategy model drives each agent to continuously receive observations and generate actions, realizing the running of the traffic flow.
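The sketch below shows one SAC-style update of the kind described in S201 and S202, written with PyTorch. The network objects, the batch layout and the entropy weight `alpha` are placeholders chosen for readability; they are assumptions, not identifiers from the patent.

```python
import torch

def background_critic_loss(critic, policy, batch, gamma=0.99):
    """S201 (sketch): temporal-difference loss for one agent's background critic.
    batch = (obs, joint_pref, act, mixed_reward, next_obs), each a tensor."""
    obs, pref, act, rew, next_obs = batch
    with torch.no_grad():
        next_act, _ = policy(next_obs, pref)                        # future action
        target_q = rew + gamma * critic(next_obs, pref, next_act)   # target evaluation value
    td_error = target_q - critic(obs, pref, act)                    # temporal-difference error
    return (td_error ** 2).mean()

def background_policy_loss(critic, policy, batch, alpha=0.2):
    """S202 (sketch): weighted sum of the negative evaluation value and the log-probability."""
    obs, pref, _, _, _ = batch
    new_act, log_prob = policy(obs, pref)                           # re-sampled current action
    return (alpha * log_prob - critic(obs, pref, new_act)).mean()
```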
S3: and (3) fixing the trained background strategy network, randomly initializing the self-vehicle driving strategy network, the misunderstanding strategy network, the corresponding network parameters of the self-vehicle evaluation function network and the misunderstanding evaluation function network, and then utilizing the background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network to interact with an automatic driving simulator, collecting and storing a self-vehicle sample pair in a second buffer storage, wherein each self-vehicle sample pair comprises current observation, self-vehicle action, misunderstanding action, return rewards and future observation.
The self-vehicle driving strategy network and the misunderstanding strategy network are both neural networks, and their corresponding evaluation functions take the form of a self-vehicle evaluation function network and a misunderstanding evaluation function network, which are also neural networks. The form of neural network that can be used for the self-vehicle driving strategy network, the misunderstanding strategy network, the self-vehicle evaluation function network and the misunderstanding evaluation function network is not limited; in the embodiment of the invention all four networks can be implemented with a VectorNet neural network model. The self-vehicle driving strategy network and the misunderstanding strategy network output actions from their input observations, while the self-vehicle evaluation function network and the misunderstanding evaluation function network evaluate those outputs and are used to train the self-vehicle driving strategy network and the misunderstanding strategy network, respectively.
In particular, in order to address the misunderstanding of the self-vehicle strategy by the traffic flow, the misunderstanding strategy network is introduced to inject interference into the training process of the self-vehicle driving strategy network, so that the output of the self-vehicle driving strategy network becomes more robust, unreasonable and unrealistic adversarial behaviors of traffic-flow agents against the self-vehicle are avoided, and efficiency is maintained while the adversarial nature of the traffic flow is preserved. In the embodiment of the invention, the input of the misunderstanding strategy network is the observation of the self-vehicle and its output is a false preference of the self-vehicle; this false preference is embedded into the joint driving preference of all the agents in place of the self-vehicle's own entry, and the resulting joint driving preference is input into the background strategy network, thereby producing the misunderstanding action.
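A minimal sketch of this mechanism is given below, assuming the background strategy network takes (observation, joint preference) and the misunderstanding strategy network maps the self-vehicle observation to a false preference; the tensor layout and the `ego_index` argument are assumptions for illustration.

```python
import torch

def misunderstanding_actions(background_policy, misunderstanding_policy,
                             ego_obs, background_obs_list, joint_pref, ego_index=0):
    """Sketch of S3: the misunderstanding policy fakes the self-vehicle's driving
    preference, and the frozen background policy then drives the background agents
    as if the self-vehicle really had that style."""
    false_pref, _ = misunderstanding_policy(ego_obs)   # the "misunderstanding action"
    perturbed_pref = joint_pref.clone()
    perturbed_pref[ego_index] = false_pref             # embed the false ego preference
    with torch.no_grad():                              # the background policy stays fixed
        return [background_policy(obs, perturbed_pref) for obs in background_obs_list]
```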
S4: randomly extracting a batch of sample pairs of the self-vehicle from the second buffer storage, firstly respectively calculating a loss function based on the extracted sample pairs, further updating parameters of a self-vehicle evaluation function network, a misunderstanding evaluation function network, a self-vehicle driving strategy network and a misunderstanding strategy network, then reusing the updated self-vehicle driving strategy network and the misunderstanding strategy network to interact with an automatic driving simulator, collecting new sample pairs of the self-vehicle, updating the new sample pairs into the second buffer storage, and completing one round of iteration. And continuously iterating until the training of the self-vehicle driving strategy network is completed, enabling the self-vehicle driving strategy network to continuously receive the self-vehicle observation and generate the self-vehicle action in the traffic flow with unknown behaviors, and realizing robust driving.
In the embodiment of the present invention, in step S4, the specific method for updating the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network is as follows:
S401: for each self-vehicle sample pair extracted in the current batch, input the future observation in the sample pair into the self-vehicle driving strategy network to calculate the future self-vehicle action, input the future observation and the future self-vehicle action into the self-vehicle evaluation function network to obtain a target self-vehicle evaluation value, input the current observation and the current self-vehicle action into the self-vehicle evaluation function network to obtain a current self-vehicle evaluation value, compute the self-vehicle temporal-difference error from the current self-vehicle evaluation value, the target self-vehicle evaluation value and the return reward, and use the self-vehicle temporal-difference error as the loss function of the self-vehicle evaluation function network to update its parameters;
s402: inputting future observations in a sample pair into a misunderstanding strategy network to calculate future misunderstanding actions aiming at a sample pair of the current batch of extracted self vehicles, inputting the future observations and the future misunderstanding actions into a misunderstanding evaluation function network to acquire a target misunderstanding evaluation value, inputting the current observations and the current misunderstanding actions into the misunderstanding evaluation function network to acquire a current misunderstanding evaluation value, calculating misunderstanding time difference errors according to the current misunderstanding evaluation value, the target misunderstanding evaluation value and negative return rewards, and using the misunderstanding time difference errors as a loss function of the misunderstanding evaluation function network to update parameters of the misunderstanding evaluation function network;
S403: aiming at the sample pairs of the self-vehicles extracted in the current batch, inputting the current observation into a self-vehicle driving strategy network, calculating to obtain new current self-vehicle actions and corresponding logarithmic probability densities, inputting the new current self-vehicle actions and the current observation into a self-vehicle evaluation function network at the same time to obtain a self-vehicle evaluation value, weighting and summing the negative numbers of the self-vehicle evaluation values and the logarithmic probability densities to serve as a loss function of the self-vehicle driving strategy network, and updating parameters of the self-vehicle driving strategy network;
s404: aiming at the sample pairs of the self-vehicle extracted from the current batch, inputting the current observation into a misunderstanding strategy network, calculating to obtain new current misunderstanding actions and corresponding logarithmic probability densities, inputting the new current misunderstanding actions and the current observation simultaneously into a misunderstanding evaluation function network to obtain misunderstanding evaluation values, taking the weighted sum of the negative numbers of the misunderstanding evaluation values and the logarithmic probability densities as a loss function of the misunderstanding strategy network, and updating parameters of the misunderstanding strategy network.
It should be noted that, similarly, the specific procedures for updating the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network are standard neural-network practice and can be implemented with gradient descent and similar methods, so they are not described in detail. After training of these networks is completed, the self-vehicle driving strategy network can be applied to actual self-vehicle driving control as follows: a certain number of agents are placed in the simulator, each with a different global path, pose and driving preference value; the self-vehicle agent is taken over by the self-vehicle driving strategy model, while the other agents are taken over by the background strategy model trained in S2 or by an existing model; the self-vehicle driving strategy model continuously receives observations, generates self-vehicle actions and interacts with the other agents in the environment, remaining robust to different, unknown background strategy models and thereby realizing zero-shot transfer of the self-vehicle strategy.
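Looking back at S401-S404, the only structural difference between the self-vehicle critic update and the misunderstanding critic update is the sign of the reward used to build the bootstrap target: the self-vehicle critic learns from the return reward, while the misunderstanding critic learns from its negation, which is what makes the misunderstanding strategy adversarial. A compact PyTorch-style sketch, with all names being placeholders:

```python
import torch

def ego_and_misunderstanding_td_losses(ego_critic, mis_critic, ego_policy, mis_policy,
                                       obs, ego_act, mis_act, rew, next_obs, gamma=0.99):
    """S401/S402 sketch: both critics share the same transition, but the
    misunderstanding critic bootstraps on the negated return reward."""
    with torch.no_grad():
        next_ego_act, _ = ego_policy(next_obs)
        next_mis_act, _ = mis_policy(next_obs)
        ego_target = rew + gamma * ego_critic(next_obs, next_ego_act)    # S401 target
        mis_target = -rew + gamma * mis_critic(next_obs, next_mis_act)   # S402 target
    ego_td = ego_target - ego_critic(obs, ego_act)
    mis_td = mis_target - mis_critic(obs, mis_act)
    return (ego_td ** 2).mean(), (mis_td ** 2).mean()
```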
In addition, based on the same inventive concept as the robust driving strategy generation method based on driving style misunderstanding provided in the above embodiment, another preferred embodiment of the present invention provides a robust driving strategy generation system based on driving style misunderstanding, as shown in fig. 2, which includes the following functional modules:
the agent sample pair acquisition module is used for randomly initializing the network parameters of the multi-agent background strategy network and the corresponding background evaluation function network, interacting with the automatic driving simulator using the background strategy network, and collecting the sample pairs of all agents and storing them in the first buffer, wherein each agent's sample pair comprises the current observation, current action, return reward, future observation and driving preference value;
the background strategy network training module is used for randomly extracting a batch of agent sample pairs from the first buffer, first calculating weighting coefficients from the driving preference value of each extracted sample pair and weighting the return rewards of all the other agents in the current batch into that agent's own return reward, then calculating loss functions based on all the samples with updated return rewards and updating the parameters of the background evaluation function network and the background strategy network accordingly, and finally using the updated background strategy network to interact with the automatic driving simulator again and collecting new sample pairs to update the first buffer, completing one round of iteration; iterating continuously until training of the background strategy network is completed, so that the background strategy network can generate driving behaviors of different styles according to different driving preference values;
the self-vehicle sample pair acquisition module is used for fixing the trained background strategy network, randomly initializing the network parameters of the self-vehicle driving strategy network, the misunderstanding strategy network, and the corresponding self-vehicle evaluation function network and misunderstanding evaluation function network, interacting with the automatic driving simulator using the background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network, and collecting self-vehicle sample pairs and storing them in the second buffer, wherein each self-vehicle sample pair comprises the current observation, self-vehicle action, misunderstanding action, return reward and future observation;
the self-vehicle driving strategy network training module is used for randomly extracting a batch of self-vehicle sample pairs from the second buffer, calculating loss functions based on the extracted samples to update the parameters of the self-vehicle evaluation function network, the misunderstanding evaluation function network, the self-vehicle driving strategy network and the misunderstanding strategy network, then using the updated self-vehicle driving strategy network and misunderstanding strategy network to interact with the automatic driving simulator again and collecting new self-vehicle sample pairs to update the second buffer, completing one round of iteration; iterating continuously until training of the self-vehicle driving strategy network is completed, so that the self-vehicle driving strategy network can continuously receive self-vehicle observations and generate self-vehicle actions in traffic flows with unknown behaviors, realizing robust driving.
Because the principle of solving the problem of the robust driving strategy generation system based on driving style misunderstanding is similar to that of the robust driving strategy generation method based on driving style misunderstanding in the above embodiment of the present application, the specific implementation forms of each module of the system in this embodiment may be referred to the specific implementation forms of the method parts shown in S1 to S4, and the repetition is omitted.
In addition, in the system provided in the above embodiment, each module is executed as a program module executed in sequence, so that it is essentially a flow of data processing. It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again. In the embodiments provided in the present application, the division of steps or modules in the method and the system is only one logic function division, and there may be another division manner in actual implementation, for example, multiple modules or steps may be combined or may be integrated together, and one module or step may also be split.
The following description will show, by way of a specific example, the application effect of the robust driving strategy generating method based on driving style misunderstanding described in S1 to S4 in the above embodiment on a specific simulator, so as to facilitate understanding of the essence and effect of the present invention.
Examples
In the robust driving strategy generation method based on driving style misunderstanding in this embodiment, the background strategy network, the self-vehicle driving strategy network, the misunderstanding strategy network, the background evaluation function network, the self-vehicle evaluation function network and the misunderstanding evaluation function network are all constructed with a VectorNet neural network model. The steps of the method are as shown in the foregoing S1 to S4; the specific implementation of the steps of this framework in the present embodiment is described below and can be decomposed into steps 1 to 13:
step 1: network parameters of a multi-agent background strategy network and a background evaluation function network are randomly initialized, the background strategy network is utilized to interact with an automatic driving simulator, sample pairs of all agents are collected, and each sample pair of the agents comprises current observation, current action, rewarding rewards, future observation and driving preference values and are stored in a first buffer memory for training of subsequent steps. In this embodiment, the autopilot simulator is modeled as a partially observable random game based on driving preferences formally defined as the foregoing The specific definition of each information is as described above. Meanwhile, the state space, the observation space, the action space and the rewarding function included in the data space in the sample pair of each agent are as described above, and are not repeated.
Step 2: randomly extract a batch of sample pairs from the first buffer maintained in step 1; for each agent sample pair, recalculate that agent's return reward from the driving preference values and return rewards of all the agents, and replace the original return reward of the sample pair.
In step 2 of this embodiment, the recalculated return reward of agent $i$ takes the form:
$$\tilde{r}_i = \cos(\psi_i)\, r_i + \sin(\psi_i)\cdot \frac{1}{N-1}\sum_{j \neq i} r_j$$
wherein $\psi_i$ is the driving preference of agent $i$ (an angular value in the range 0° to 90°), $r_i$ is the original return reward of agent $i$, and $r_j$ are the return rewards of the other $N-1$ agents.
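For concreteness, a minimal NumPy sketch of this preference-weighted reward relabeling is shown below; the function name, array layout and example values are illustrative assumptions and not part of the patented method.

```python
import numpy as np

def relabel_rewards(rewards, preferences_deg):
    """Preference-weighted reward relabeling (step 2).

    rewards:         shape (N,), original return rewards of the N agents
    preferences_deg: shape (N,), driving preference of each agent in degrees (0-90)
    Returns cos(psi) * own reward + sin(psi) * mean reward of the other agents.
    """
    rewards = np.asarray(rewards, dtype=float)
    psi = np.deg2rad(np.asarray(preferences_deg, dtype=float))
    n = rewards.shape[0]
    # Mean reward of all other agents, computed for each agent i.
    others_mean = (rewards.sum() - rewards) / (n - 1)
    return np.cos(psi) * rewards + np.sin(psi) * others_mean

# Example: a preference of 0 deg keeps only the agent's own reward (egoistic),
# while 90 deg means the agent only cares about the average reward of the others.
print(relabel_rewards([1.0, 0.5, -0.2], [0.0, 45.0, 90.0]))
```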
Step 3: for each agent sample pair recalculated in step 2, the future observation and the driving preference values of all agents in the sample pair are input into the background strategy network to compute the future action; the future observation, the driving preference values of all agents and the future action are then input into the background evaluation function network to obtain the target evaluation value, and the current observation, the driving preference values of all agents and the current action are input into the background evaluation function network to obtain the current evaluation value. The time-difference error is calculated from the current evaluation value, the target evaluation value and the return reward, and is used as the loss function of the background evaluation function network to update its parameters.
In this embodiment, the specific implementation sub-steps of step 3 are as follows:
Step 3-1: based on the future observation $o_i'$ and the joint preference $\Psi$, the background strategy network determines the action mean $\mu_{\theta_i}(o_i',\Psi)$ and variance $\sigma_{\theta_i}^2(o_i',\Psi)$, and the future action is obtained by sampling:
$$a_i' \sim \mathcal{N}\!\left(\mu_{\theta_i}(o_i',\Psi),\ \sigma_{\theta_i}^2(o_i',\Psi)\right)$$
wherein $\mathcal{N}$ denotes a Gaussian distribution and $\theta_i$ are the policy network parameters of agent $i$;
Step 3-2: the current evaluation value and the target evaluation value are calculated respectively:
$$q_i = Q_{\phi_i}(o_i,\Psi,a_i), \qquad q_i' = Q_{\phi_i}(o_i',\Psi,a_i')$$
wherein $\phi_i$ are the background evaluation function network parameters of agent $i$;
Step 3-3: the time-difference error is calculated:
$$\delta_i = \tilde{r}_i + \gamma\, q_i' - q_i$$
wherein $\gamma$ is the discount factor;
Step 3-4: the loss function of the background evaluation function network shown below is calculated over the batch, and the parameters of the background evaluation function network are updated with the sum of the loss values of all agents:
$$L(\phi_i) = \mathbb{E}_{B}\!\left[\delta_i^{\,2}\right]$$
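As a non-authoritative illustration of steps 3-1 to 3-4, a short PyTorch-style sketch of one background evaluation function update is given below; the network interfaces, tensor names and the use of a mean-squared time-difference error are assumptions made for the example.

```python
import torch

def update_background_critic(critic, critic_opt, policy, batch, gamma=0.99):
    """One update of the background evaluation function network (steps 3-1..3-4).

    batch: dict of tensors with keys
      'obs', 'next_obs' -- current / future observations
      'joint_pref'      -- driving preference values of all agents
      'act'             -- current actions
      'reward'          -- preference-relabeled return rewards from step 2
    """
    obs, next_obs = batch["obs"], batch["next_obs"]
    pref, act, rew = batch["joint_pref"], batch["act"], batch["reward"]

    with torch.no_grad():
        # Step 3-1: sample the future action from the Gaussian background policy.
        mu, sigma = policy(next_obs, pref)
        next_act = torch.normal(mu, sigma)
        # Step 3-2: target evaluation value built from the future transition.
        target_q = rew + gamma * critic(next_obs, pref, next_act).squeeze(-1)

    # Step 3-2 / 3-3: current evaluation value and time-difference error.
    current_q = critic(obs, pref, act).squeeze(-1)
    td_error = target_q - current_q

    # Step 3-4: squared TD error as the critic loss, followed by a gradient step.
    loss = (td_error ** 2).mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```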
Step 4: for each agent sample pair recalculated in step 2, the current observation and the driving preference values of all agents are input into the background strategy network to compute a new current action and its logarithmic probability density; the new current action and the current observation are simultaneously input into the background evaluation function network to obtain an evaluation value. The weighted sum of the negative evaluation value and the logarithmic probability density is taken as the loss function of the background strategy network, and the parameters of the background strategy network are updated.
In this embodiment, the specific implementation sub-steps of step 4 are as follows:
Step 4-1: the new current action is sampled from the background strategy network and its logarithmic probability density is computed:
$$\hat{a}_i \sim \mathcal{N}\!\left(\mu_{\theta_i}(o_i,\Psi),\ \sigma_{\theta_i}^2(o_i,\Psi)\right), \qquad \log \pi_{\theta_i}(\hat{a}_i \mid o_i,\Psi)$$
Step 4-2: the current evaluation value is calculated:
$$\hat{q}_i = Q_{\phi_i}(o_i,\Psi,\hat{a}_i)$$
Step 4-3: the loss function of the background strategy network shown below is calculated, and the parameters of the background strategy network are updated accordingly:
$$L(\theta_i) = \mathbb{E}_{B}\!\left[\alpha \log \pi_{\theta_i}(\hat{a}_i \mid o_i,\Psi) - \hat{q}_i\right]$$
wherein $\alpha$ is a positive weight coefficient.
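A matching sketch of the background strategy (actor) update in steps 4-1 to 4-3 follows; the reparameterized Gaussian sampling, the name `alpha` and the network interfaces are again illustrative assumptions rather than a definitive implementation.

```python
import torch

def update_background_actor(policy, policy_opt, critic, batch, alpha=0.2):
    """One update of the background strategy network (steps 4-1..4-3)."""
    obs, pref = batch["obs"], batch["joint_pref"]

    # Step 4-1: new current action and its log probability density
    # (reparameterized sampling keeps the action differentiable w.r.t. the policy).
    mu, sigma = policy(obs, pref)
    dist = torch.distributions.Normal(mu, sigma)
    new_act = dist.rsample()
    log_prob = dist.log_prob(new_act).sum(dim=-1)

    # Step 4-2: evaluation value of the new action.
    q_value = critic(obs, pref, new_act).squeeze(-1)

    # Step 4-3: weighted sum of the negative evaluation value and the log density.
    loss = (alpha * log_prob - q_value).mean()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    return loss.item()
```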
Step 5: the background strategy network updated in step 4 is used to interact with the simulator again; new sample pairs are collected and added to the first buffer maintained in step 1.
Step 6: steps 2 to 5 are executed iteratively. After training, the background strategy network can generate driving behaviors of different styles when given different driving preference values. A number of agents are then set up in the simulator, each with a different global path, pose and driving preference value; each agent is driven by the background strategy model, continuously receiving observations and generating actions, so that a running traffic flow is realized.
Step 7: given the background strategy network trained in step 6, the network parameters of the self-vehicle driving strategy network, the misunderstanding strategy network and the corresponding evaluation function networks (namely the self-vehicle evaluation function network and the misunderstanding evaluation function network) are randomly initialized. The output of the misunderstanding strategy network is a false preference for the self-vehicle; this false preference is embedded as the self-vehicle's entry in the joint preference fed to the background strategy network, thereby creating the misunderstanding. The background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network together interact with the automatic driving simulator, and self-vehicle sample pairs are collected, each comprising the current observation, self-vehicle action, misunderstanding action, return reward and future observation, and stored in a newly built second buffer for training in the subsequent steps.
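The following short sketch illustrates how such a false self-vehicle preference might be embedded into the joint preference before the background strategy network is queried; the index convention (self-vehicle at slot `ego_index`) and all function names are assumptions made only for illustration.

```python
import torch

def misunderstood_background_actions(background_policy, misunderstanding_policy,
                                     ego_obs, background_obs, true_joint_pref,
                                     ego_index=0):
    """Query the background policy under a falsified self-vehicle preference (step 7).

    true_joint_pref: tensor of shape (N,), the driving preferences of all agents.
    The misunderstanding policy maps the self-vehicle observation to a Gaussian over
    false preferences; a sampled false preference replaces the self-vehicle's entry
    in the joint preference before it is handed to the background policy.
    """
    mu, sigma = misunderstanding_policy(ego_obs)
    false_pref = torch.normal(mu, sigma).squeeze()     # misunderstanding action
    joint_pref = true_joint_pref.clone()
    joint_pref[ego_index] = false_pref                 # embed the false preference
    act_mu, act_sigma = background_policy(background_obs, joint_pref)
    return torch.normal(act_mu, act_sigma)             # actions of the background agents
```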
Step 8: a batch of sample pairs is randomly extracted from the second buffer maintained in step 7. The future observation in each sample pair is input into the self-vehicle driving strategy network to compute the future self-vehicle action; the future observation and the future self-vehicle action are input into the self-vehicle evaluation function network to obtain the target self-vehicle evaluation value, and the current observation and the current self-vehicle action are input into the self-vehicle evaluation function network to obtain the current self-vehicle evaluation value. The self-vehicle time-difference error is then calculated from the current self-vehicle evaluation value, the target self-vehicle evaluation value and the return reward, and is used as the loss function of the self-vehicle evaluation function network to update its parameters.
In this embodiment, the specific implementation sub-steps of step 8 are as follows:
Step 8-1: based on the future observation $o_e'$, the self-vehicle driving strategy network determines the action mean $\mu_{\theta_e}(o_e')$ and variance $\sigma_{\theta_e}^2(o_e')$, and the future self-vehicle action is obtained by sampling:
$$a_e' \sim \mathcal{N}\!\left(\mu_{\theta_e}(o_e'),\ \sigma_{\theta_e}^2(o_e')\right)$$
wherein $\mathcal{N}$ denotes a Gaussian distribution and $\theta_e$ are the self-vehicle driving strategy network parameters;
Step 8-2: the current self-vehicle evaluation value and the target self-vehicle evaluation value are calculated respectively:
$$q_e = Q_{\phi_e}(o_e,a_e), \qquad q_e' = Q_{\phi_e}(o_e',a_e')$$
wherein $\phi_e$ are the self-vehicle evaluation function network parameters;
Step 8-3: the self-vehicle time-difference error is calculated:
$$\delta_e = r_e + \gamma\, q_e' - q_e$$
Step 8-4: the loss function of the self-vehicle evaluation function network shown below is calculated, and the parameters of the self-vehicle evaluation function network are updated accordingly:
$$L(\phi_e) = \mathbb{E}_{B}\!\left[\delta_e^{\,2}\right]$$
Step 9: for the batch of sample pairs extracted in step 8, the future observation in each sample pair is input into the misunderstanding strategy network to compute the future misunderstanding action; the future observation and the future misunderstanding action are input into the misunderstanding evaluation function network to obtain the target misunderstanding evaluation value, and the current observation and the current misunderstanding action are input into the misunderstanding evaluation function network to obtain the current misunderstanding evaluation value. The misunderstanding time-difference error is then calculated from the current misunderstanding evaluation value, the target misunderstanding evaluation value and the negated return reward, and is used as the loss function of the misunderstanding evaluation function network to update its parameters.
In this embodiment, the specific implementation sub-steps of step 9 are as follows:
Step 9-1: based on the future observation $o_e'$, the misunderstanding strategy network determines the mean $\mu_{\theta_m}(o_e')$ and variance $\sigma_{\theta_m}^2(o_e')$, and the future misunderstanding action is obtained by sampling:
$$a_m' \sim \mathcal{N}\!\left(\mu_{\theta_m}(o_e'),\ \sigma_{\theta_m}^2(o_e')\right)$$
wherein $\mathcal{N}$ denotes a Gaussian distribution and $\theta_m$ are the misunderstanding strategy network parameters;
Step 9-2: the current misunderstanding evaluation value and the target misunderstanding evaluation value are calculated respectively:
$$q_m = Q_{\phi_m}(o_e,a_m), \qquad q_m' = Q_{\phi_m}(o_e',a_m')$$
wherein $\phi_m$ are the misunderstanding evaluation function network parameters;
Step 9-3: the misunderstanding time-difference error is calculated with the negated return reward:
$$\delta_m = -r_e + \gamma\, q_m' - q_m$$
Step 9-4: the loss function of the misunderstanding evaluation function network shown below is calculated, and the parameters of the misunderstanding evaluation function network are updated accordingly:
$$L(\phi_m) = \mathbb{E}_{B}\!\left[\delta_m^{\,2}\right]$$
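Because the only difference from the self-vehicle evaluation function update is the sign of the reward, a minimal sketch of the misunderstanding evaluation function update is given below; the negated-reward target follows the description in step 9, while the function and tensor names are illustrative assumptions.

```python
import torch

def update_misunderstanding_critic(mis_critic, mis_critic_opt, mis_policy,
                                   batch, gamma=0.99):
    """One update of the misunderstanding evaluation function network (steps 9-1..9-4)."""
    obs, next_obs = batch["obs"], batch["next_obs"]
    mis_act, rew = batch["mis_act"], batch["reward"]

    with torch.no_grad():
        # Step 9-1: future misunderstanding action from the Gaussian misunderstanding policy.
        mu, sigma = mis_policy(next_obs)
        next_mis_act = torch.normal(mu, sigma)
        # Steps 9-2 / 9-3: adversarial target -- the self-vehicle reward enters negated,
        # so the misunderstanding policy is trained to make the self-vehicle fail.
        target_q = -rew + gamma * mis_critic(next_obs, next_mis_act).squeeze(-1)

    current_q = mis_critic(obs, mis_act).squeeze(-1)
    td_error = target_q - current_q

    # Step 9-4: squared TD error as the loss, followed by a gradient step.
    loss = (td_error ** 2).mean()
    mis_critic_opt.zero_grad()
    loss.backward()
    mis_critic_opt.step()
    return loss.item()
```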
Step 10: for the batch of sample pairs extracted in step 8, the current observation is input into the self-vehicle driving strategy network to compute a new current self-vehicle action and its logarithmic probability density; the new current self-vehicle action and the current observation are simultaneously input into the self-vehicle evaluation function network to obtain a self-vehicle evaluation value. The weighted sum of the negative evaluation value and the logarithmic probability density is taken as the loss function of the self-vehicle driving strategy network, and its parameters are updated.
In this embodiment, the specific implementation sub-steps of step 10 are as follows:
Step 10-1: the new current self-vehicle action is sampled from the self-vehicle driving strategy network and its logarithmic probability density is computed:
$$\hat{a}_e \sim \mathcal{N}\!\left(\mu_{\theta_e}(o_e),\ \sigma_{\theta_e}^2(o_e)\right), \qquad \log \pi_{\theta_e}(\hat{a}_e \mid o_e)$$
Step 10-2: the current self-vehicle evaluation value is calculated:
$$\hat{q}_e = Q_{\phi_e}(o_e,\hat{a}_e)$$
Step 10-3: the loss function of the self-vehicle driving strategy network shown below is calculated, and the parameters of the self-vehicle driving strategy network are updated accordingly:
$$L(\theta_e) = \mathbb{E}_{B}\!\left[\alpha \log \pi_{\theta_e}(\hat{a}_e \mid o_e) - \hat{q}_e\right]$$
Step 11: for the batch of sample pairs extracted in step 8, the current observation is input into the misunderstanding strategy network to compute a new current misunderstanding action and its logarithmic probability density; the new current misunderstanding action and the current observation are simultaneously input into the misunderstanding evaluation function network to obtain a misunderstanding evaluation value. The weighted sum of the negative evaluation value and the logarithmic probability density is taken as the loss function of the misunderstanding strategy network, and its parameters are updated.
In this embodiment, the specific implementation sub-steps of step 11 are as follows:
Step 11-1: the new current misunderstanding action is sampled from the misunderstanding strategy network and its logarithmic probability density is computed:
$$\hat{a}_m \sim \mathcal{N}\!\left(\mu_{\theta_m}(o_e),\ \sigma_{\theta_m}^2(o_e)\right), \qquad \log \pi_{\theta_m}(\hat{a}_m \mid o_e)$$
Step 11-2: the current misunderstanding evaluation value is calculated:
$$\hat{q}_m = Q_{\phi_m}(o_e,\hat{a}_m)$$
Step 11-3: the loss function of the misunderstanding strategy network shown below is calculated, and the parameters of the misunderstanding strategy network are updated accordingly:
$$L(\theta_m) = \mathbb{E}_{B}\!\left[\alpha \log \pi_{\theta_m}(\hat{a}_m \mid o_e) - \hat{q}_m\right]$$
Step 12: the self-vehicle driving strategy network and the misunderstanding strategy network updated in steps 10 and 11 are used to interact with the simulator; new self-vehicle sample pairs are collected and added to the second buffer maintained in step 7.
Step 13: steps 8 to 12 are executed iteratively. After training, a number of agents are set up in the simulator, each with a different global path, pose and driving preference value. The self-vehicle driving strategy model takes over the self-vehicle agent, while the other agents are taken over by the background strategy model trained in step 6 or by existing models; the self-vehicle strategy model continuously receives observations and generates self-vehicle actions, interacting with the other agents in the environment while remaining robust to different, unknown background strategy models, thereby realizing zero-shot transfer of the self-vehicle strategy.
Steps 1 to 13 above are the specific implementation steps of the method in this embodiment. To verify performance, this embodiment uses the publicly available CARLA simulator as the automatic driving simulator for the experiments. The CARLA simulator is intended to support the development, training and validation of automatic driving systems. In addition to open-source code and protocols, CARLA provides rich and diverse urban layouts, buildings and vehicles, support for flexible sensor suite specifications and environmental conditions, and full control and map generation for all static and dynamic traffic participants.
In order to objectively evaluate the performance of the present invention, the present embodiment selects a crossroad without signal lights as a test scene, generates 200 different initial conditions (with different global paths, initial poses and driving preferences) as test cases offline, uses 5 different random seed tests for each test case, and averages the obtained results. Meanwhile, three background strategy algorithms of IDM, CARLA-TM and CoPO in the prior art are compared.
In this embodiment, three indices are used to measure the performance of the different background strategy algorithms: success rate, collision rate and efficiency. When testing the background strategy, the success rate is the ratio of the number of agents that successfully reach their target points to the total number of agents, the collision rate is the ratio of the number of agents that collide to the total number, and the efficiency is the average speed of all agents over all time steps. When testing the self-vehicle driving strategy, the success rate is whether the self-vehicle reaches its target point, the collision rate is whether the self-vehicle collides, and the efficiency is the average speed of the self-vehicle over all time steps.
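For clarity, a small sketch of how these three indices might be computed from per-episode logs is shown below; the record format (one dictionary per agent per test run) is an assumption made for the example and is not specified by the patent.

```python
def evaluate_runs(episodes):
    """Compute success rate, collision rate and efficiency from episode logs.

    episodes: list of dicts, one per agent per test run, each with keys
      'reached_goal' (bool), 'collided' (bool), and 'speeds' (list of floats,
      the agent's speed at every time step).
    """
    n = len(episodes)
    success_rate = sum(ep["reached_goal"] for ep in episodes) / n
    collision_rate = sum(ep["collided"] for ep in episodes) / n
    all_speeds = [v for ep in episodes for v in ep["speeds"]]
    efficiency = sum(all_speeds) / len(all_speeds)  # average speed over all time steps
    return success_rate, collision_rate, efficiency
```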
The experimental results obtained in this embodiment are shown in Table 1 and Table 2. The results show that the generation method provided by the invention allows the traffic flow formed by the background strategy to achieve a higher success rate and efficiency, and that the self-vehicle driving strategy network trained against this background strategy network also achieves a higher success rate and efficiency when facing unknown traffic flow behaviors, indicating stronger robustness and the ability to realize zero-shot transfer.
Table 1 shows a comparison of different background strategies
Table 2 shows a comparison of different self-vehicle driving strategies against IDM and CARLA-TM background strategies
The above embodiment is only a preferred embodiment of the present invention, but it is not intended to limit the present invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, all the technical schemes obtained by adopting the equivalent substitution or equivalent transformation are within the protection scope of the invention.

Claims (10)

1. A robust driving strategy generation method based on driving style misunderstanding, comprising:
s1: randomly initializing network parameters of a background strategy network of multiple agents and a corresponding background evaluation function network, utilizing the background strategy network to interact with an automatic driving simulator, collecting sample pairs of all agents and storing the sample pairs in a first buffer storage, wherein each sample pair of each agent comprises current observation, current action, rewarding rewards, future observation and driving preference values;
S2: randomly extracting a batch of sample pairs of the intelligent agents from a first buffer memory, firstly calculating a weighting weight according to a driving preference value of each extracted sample pair, updating the reward weights of all the other intelligent agents in the current batch to the reward rewards of the intelligent agents, then respectively calculating a loss function based on all the samples after the reward rewards are updated, further updating parameters of a background evaluation function network and a background strategy network, finally, re-using the updated background strategy network to interact with an automatic driving simulator, and collecting new sample pairs to update to the first buffer memory to finish a round of iteration; the training of the background strategy network is continuously iterated until the training of the background strategy network is completed, so that the background strategy network can generate driving behaviors of different styles according to different driving preference values;
s3: fixing a trained background strategy network, randomly initializing a self-vehicle driving strategy network, a misunderstanding strategy network, network parameters of a self-vehicle evaluation function network and a misunderstanding evaluation function network which correspond to the self-vehicle driving strategy network, and then utilizing the background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network to interact with an automatic driving simulator, collecting a self-vehicle sample pair and storing the self-vehicle sample pair in a second buffer storage, wherein each self-vehicle sample pair comprises current observation, self-vehicle action, misunderstanding action, return rewards and future observation;
S4: randomly extracting a batch of sample pairs of the self-vehicle from the second buffer storage, firstly respectively calculating a loss function based on the extracted sample pairs, further updating parameters of a self-vehicle evaluation function network, a misunderstanding evaluation function network, a self-vehicle driving strategy network and a misunderstanding strategy network, then reusing the updated self-vehicle driving strategy network and the misunderstanding strategy network to interact with an automatic driving simulator, and collecting new sample pairs of the self-vehicle to update to the second buffer storage to finish a round of iteration; and continuously iterating until the training of the self-vehicle driving strategy network is completed, enabling the self-vehicle driving strategy network to continuously receive the self-vehicle observation and generate the self-vehicle action in the traffic flow with unknown behaviors, and realizing robust driving.
2. The driving style misunderstanding-based robust driving strategy generation method of claim 1, wherein the autopilot simulator is modeled as a partially observable stochastic game based on driving preferences, whose information comprises a joint driving preference space, a set of agent numbers, a state space, a joint action space, a joint observation space, state transition probabilities, a set of individual reward functions, an initial state distribution, a discount factor, and the time range of one game round; the driving preference of each agent is an angular value in the range of 0° to 90°.
3. The driving style misunderstanding-based robust driving strategy generation method according to claim 2, wherein the sample pair data space of each agent includes a state space, an observation space, an action space, and a rewards function;
in the state space, the system state of each time step comprises static elements and dynamic elements, wherein the static elements comprise a lane center line, road shoulders and global paths of all the intelligent agents, and the dynamic elements comprise the current postures and speeds of all the intelligent agents;
in the observation space, the observation of each intelligent agent at any moment can only partially observe the system state, each intelligent agent can not acquire the global paths of other intelligent agents, all vectors in the system state are converted into the own coordinate system of each intelligent agent, and the observation of each intelligent agent comprises the historical pose and speed information of the intelligent agent;
in the motion space, the motion comprises a reference speed and steering;
in the rewarding function, the individual rewarding functions of all the intelligent agents are identical in structure and are obtained by weighted summation of a dense rewarding function for exciting rapid driving and a sparse rewarding function for punishing catastrophic failure.
4. A driving style misunderstanding-based robust driving strategy generation method according to claim 3, wherein the catastrophic failure comprises: collision with other agents, deviation from the drivable area, departure from the global path, and entry into a wrong lane; if any agent encounters a catastrophic failure, it is terminated and removed from the environment.
5. The driving style misunderstanding-based robust driving strategy generation method according to claim 1, wherein in the step S2, when each sampled pair updates its own rewards, the samples ' own rewards are used as a first weighted item, the average value of the rewards of all other agents in the current batch is used as a second weighted item, and the weighted sum result of the two weighted items is used as the samples ' own new rewards, wherein the weights of the first weighted item and the second weighted item are respectively the cosine value and the sine value of the samples ' own driving preference value.
6. The robust driving strategy generation method based on driving style misunderstanding according to claim 1, wherein in S2, the specific method for updating parameters of the background evaluation function network and the background strategy network is as follows:
S201: for each intelligent agent sample pair after rewarding and rewarding update, firstly inputting future observation and driving preference values of all intelligent agents in the sample pair into a background strategy network to calculate future actions, then inputting the future observation, the driving preference values of all intelligent agents and the future actions into a background evaluation function network to obtain target evaluation values, inputting the current observation, the driving preference values of all intelligent agents and the current actions into the background evaluation function network to obtain current evaluation values, and finally calculating time difference errors according to the current evaluation values, the target evaluation values and rewarding and taking the time difference errors as loss functions of the background evaluation function network; carrying out parameter updating on the background evaluation function network by using the sum of the loss function values of all the intelligent agents in the whole batch;
s202: inputting the current observation and all the driving preference values of the agents into a background strategy network for each sample pair of the agents after rewarding and rewarding updating, calculating to obtain new current actions and corresponding logarithmic probability densities, inputting the new current actions and the current observation into a background evaluation function network at the same time to obtain a background evaluation value, and finally weighting and summing the negative numbers of the background evaluation values and the logarithmic probability densities to be used as a loss function of the background strategy network; and updating parameters of the background strategy network by using the sum of the loss function values of all the agents in the whole batch.
7. The robust driving strategy generation method based on driving style misunderstanding as claimed in claim 1, wherein the input of the misunderstanding strategy network is the observation of the own vehicle, the output is the false preference of the own vehicle, and the false preference of the own vehicle is required to be embedded into the combined driving preference of all the agents to input the background strategy network, thereby generating misunderstanding actions.
8. The robust driving strategy generation method based on driving style misunderstanding according to claim 1, wherein in S4, the specific method for updating parameters of the vehicle evaluation function network, the misunderstanding evaluation function network, the vehicle driving strategy network and the misunderstanding strategy network is as follows:
s401: for a sample pair of the self-vehicle extracted in the current batch, inputting future observation in the sample pair into a self-vehicle driving strategy network to calculate non-self-vehicle actions, inputting the future observation and the future self-vehicle actions into a self-vehicle evaluation function network to obtain a target self-vehicle evaluation value, inputting the current observation and the current self-vehicle actions into the self-vehicle evaluation function network to obtain a current self-vehicle evaluation value, calculating a self-vehicle time difference error according to the current self-vehicle evaluation value, the target self-vehicle evaluation value and a return reward, and using the self-vehicle time difference error as a loss function of the self-vehicle evaluation function network to update parameters of the self-vehicle evaluation function network;
S402: inputting future observations in a sample pair into a misunderstanding strategy network to calculate future misunderstanding actions aiming at a sample pair of the current batch of extracted self vehicles, inputting the future observations and the future misunderstanding actions into a misunderstanding evaluation function network to acquire a target misunderstanding evaluation value, inputting the current observations and the current misunderstanding actions into the misunderstanding evaluation function network to acquire a current misunderstanding evaluation value, calculating misunderstanding time difference errors according to the current misunderstanding evaluation value, the target misunderstanding evaluation value and negative return rewards, and using the misunderstanding time difference errors as a loss function of the misunderstanding evaluation function network to update parameters of the misunderstanding evaluation function network;
s403: aiming at the sample pairs of the self-vehicles extracted in the current batch, inputting the current observation into a self-vehicle driving strategy network, calculating to obtain new current self-vehicle actions and corresponding logarithmic probability densities, inputting the new current self-vehicle actions and the current observation into a self-vehicle evaluation function network at the same time to obtain a self-vehicle evaluation value, weighting and summing the negative numbers of the self-vehicle evaluation values and the logarithmic probability densities to serve as a loss function of the self-vehicle driving strategy network, and updating parameters of the self-vehicle driving strategy network;
s404: aiming at the sample pairs of the self-vehicle extracted from the current batch, inputting the current observation into a misunderstanding strategy network, calculating to obtain new current misunderstanding actions and corresponding logarithmic probability densities, inputting the new current misunderstanding actions and the current observation simultaneously into a misunderstanding evaluation function network to obtain misunderstanding evaluation values, taking the weighted sum of the negative numbers of the misunderstanding evaluation values and the logarithmic probability densities as a loss function of the misunderstanding strategy network, and updating parameters of the misunderstanding strategy network.
9. The robust driving strategy generation method based on driving style misunderstanding according to claim 1, wherein the background strategy network, the self-driving strategy network, the misunderstanding strategy network, the background evaluation function network, the self-driving evaluation function network and the misunderstanding evaluation function network all adopt a VectorNet neural network model.
10. A robust driving strategy generation system based on driving style misunderstanding, comprising:
the intelligent agent sample pair acquisition module is used for randomly initializing network parameters of a background strategy network and a corresponding background evaluation function network of a plurality of intelligent agents, interacting with the automatic driving simulator by utilizing the background strategy network, acquiring sample pairs of all the intelligent agents and storing the sample pairs in a first buffer storage, wherein each sample pair of the intelligent agents comprises current observation, current action, rewarding rewards, future observation and driving preference values;
the background strategy network training module is used for randomly extracting a batch of sample pairs of the intelligent agents from the first buffer storage, firstly calculating a weighting weight according to a driving preference value of each extracted sample pair, updating the reward rewards of all other intelligent agents in the current batch to the reward rewards of the intelligent agents, then respectively calculating a loss function based on all samples after the reward rewards are updated, further updating parameters of a background evaluation function network and a background strategy network, finally, reusing the updated background strategy network to interact with an automatic driving simulator, and collecting new sample pairs to update to the first buffer storage to complete one round of iteration; the training of the background strategy network is continuously iterated until the training of the background strategy network is completed, so that the background strategy network can generate driving behaviors of different styles according to different driving preference values;
The self-vehicle sample pair acquisition module is used for fixing a trained background strategy network, randomly initializing a self-vehicle driving strategy network, a self-vehicle misunderstanding strategy network, network parameters of a self-vehicle evaluation function network and a self-vehicle misunderstanding evaluation function network which are respectively corresponding to the self-vehicle driving strategy network, interacting with an automatic driving simulator by using the background strategy network, the self-vehicle driving strategy network and the misunderstanding strategy network, acquiring a self-vehicle sample pair and storing the self-vehicle sample pair in a second buffer storage, wherein each self-vehicle sample pair comprises current observation, self-vehicle action, misunderstanding action, rewarding and future observation;
the self-driving strategy network training module is used for randomly extracting a batch of self-driving sample pairs from the second buffer storage, respectively calculating a loss function based on the extracted samples to update parameters of a self-driving evaluation function network, a misunderstanding evaluation function network, a self-driving strategy network and a misunderstanding strategy network, then reusing the updated self-driving strategy network and the misunderstanding strategy network to interact with an automatic driving simulator, and collecting new self-driving sample pairs to update to the second buffer storage to complete a round of iteration; and continuously iterating until the training of the self-vehicle driving strategy network is completed, enabling the self-vehicle driving strategy network to continuously receive the self-vehicle observation and generate the self-vehicle action in the traffic flow with unknown behaviors, and realizing robust driving.
CN202311141653.7A 2023-09-06 2023-09-06 Robust driving strategy generation method and system based on driving style misunderstanding Active CN116880218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311141653.7A CN116880218B (en) 2023-09-06 2023-09-06 Robust driving strategy generation method and system based on driving style misunderstanding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311141653.7A CN116880218B (en) 2023-09-06 2023-09-06 Robust driving strategy generation method and system based on driving style misunderstanding

Publications (2)

Publication Number Publication Date
CN116880218A true CN116880218A (en) 2023-10-13
CN116880218B CN116880218B (en) 2023-12-19

Family

ID=88271926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311141653.7A Active CN116880218B (en) 2023-09-06 2023-09-06 Robust driving strategy generation method and system based on driving style misunderstanding

Country Status (1)

Country Link
CN (1) CN116880218B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
JP2021067258A (en) * 2020-03-31 2021-04-30 トヨタ自動車株式会社 Vehicle control device, vehicle control system, and vehicle learning device
CN111708355A (en) * 2020-06-19 2020-09-25 中国人民解放军国防科技大学 Multi-unmanned aerial vehicle action decision method and device based on reinforcement learning
WO2022223952A1 (en) * 2021-04-21 2022-10-27 Zeta Specialist Lighting Limited Traffic control at an intersection
CN113253739A (en) * 2021-06-24 2021-08-13 深圳慧拓无限科技有限公司 Driving behavior decision method for expressway
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN113867354A (en) * 2021-10-11 2021-12-31 电子科技大学 Regional traffic flow guiding method for intelligent cooperation of automatic driving of multiple vehicles
CN114379540A (en) * 2022-02-21 2022-04-22 东南大学 Decision-making method for rollover-prevention driving of large commercial vehicle by considering influence of front obstacle
CN115965879A (en) * 2022-12-12 2023-04-14 四川观想科技股份有限公司 Unmanned training method for incomplete information scene in sparse high-dimensional state
CN116224996A (en) * 2022-12-28 2023-06-06 上海交通大学 Automatic driving optimization control method based on countermeasure reinforcement learning
CN116432454A (en) * 2023-04-10 2023-07-14 浙江大学 Character selection-based automatic driving automobile decision planning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG DONGKUN, ET AL.: "adversarial driving policy learning by misunderstanding the traffic flow", ICLR 2023 CONFERENCE *
ZHANG DONGKUN, ET AL.: "Zero-shot Transfer Learning of Driving Policy via Socially Adversarial Traffic Flow", ARXIV *

Also Published As

Publication number Publication date
CN116880218B (en) 2023-12-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant