CN117227763B - Automatic driving behavior decision method and device based on game theory and reinforcement learning - Google Patents
- Publication number
- CN117227763B (application number CN202311490770.4A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- state information
- agents
- intelligent agents
- moment
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an automatic driving behavior decision method and device based on game theory and reinforcement learning, in the technical field of automatic driving. The method comprises the following steps: calculating the average probability that the other agents chose to preempt (go first rather than yield) at a historical moment, based on the state information of the own vehicle and of the other agents at that moment, and calibrating the parameters of a game model accordingly; calculating each other agent's preempting probability at the historical moment with the calibrated game model; training a reinforcement learning model based on the state information of the own vehicle and of the other agents at the historical moment, a preset longitudinal action set of the own vehicle, each other agent's preempting probability at the historical moment, a preset state transition model, and a preset reward function; and inputting the current state information of the own vehicle and of the other agents, together with the other agents' current preempting probabilities, into the trained reinforcement learning model to obtain the longitudinal action of the own vehicle at the next moment. This embodiment is applicable to different scenarios.
Description
Technical Field
The invention relates to the technical field of automatic driving, in particular to an automatic driving behavior decision method and device based on game theory and reinforcement learning.
Background
To improve an automated vehicle's adaptability to dynamic, complex environments, it is generally necessary to predict the future interaction intentions of other agents from their state information, so that the automated vehicle can respond to those intentions appropriately.
The prior art generally describes the driving environment with a mathematical model and then computes an optimal solution with an optimization algorithm. This approach not only requires a mathematical description of the driving environment, but also assumes that the environment is static, making it difficult to apply to non-static, nonlinear scenarios.
Disclosure of Invention
In view of the above, the embodiment of the invention provides an automatic driving behavior decision method and device based on game theory and reinforcement learning that are applicable to different scenarios without describing the driving environment through a mathematical model.
In a first aspect, an embodiment of the present invention provides an automatic driving behavior decision method based on game theory and reinforcement learning, comprising:
acquiring state information of the own vehicle at a historical moment and state information of a plurality of other agents associated with the own vehicle at that moment;
calculating the average probability that the other agents chose to preempt at the historical moment, based on the state information of the own vehicle and of the other agents;
calibrating the parameters of a game model based on that average preempting probability and the state information of the own vehicle and of the other agents at the historical moment;
calculating each other agent's preempting probability at the historical moment, based on the state information of the own vehicle, the state information of the other agents, and the calibrated game model;
training a reinforcement learning model based on the state information of the own vehicle and of the other agents at the historical moment, a preset longitudinal action set of the own vehicle, each other agent's preempting probability at the historical moment, a preset state transition model, and a preset reward function;
determining the other agents associated with the own vehicle at the current moment based on the own vehicle's current position;
acquiring state information of the own vehicle and of those other agents at the current moment;
calculating the other agents' preempting probabilities at the current moment, based on the state information of the own vehicle, the state information of the other agents, and the calibrated game model; and
inputting the state information of the own vehicle, the state information of the other agents, and the other agents' preempting probabilities at the current moment into the trained reinforcement learning model to obtain the longitudinal action of the own vehicle at the next moment.
In a second aspect, an embodiment of the present invention provides an automatic driving behavior decision device based on game theory and reinforcement learning, comprising:
an acquisition module configured to acquire state information of the own vehicle at a historical moment and state information of a plurality of other agents associated with the own vehicle at that moment;
a calibration module configured to calculate the average probability that the other agents chose to preempt at the historical moment, based on the state information of the own vehicle and of the other agents, and to calibrate the parameters of a game model based on that average preempting probability and the state information of the own vehicle and of the other agents at the historical moment;
a training module configured to calculate each other agent's preempting probability at the historical moment from the state information and the calibrated game model, and to train a reinforcement learning model based on the state information, a preset longitudinal action set of the own vehicle, those preempting probabilities, a preset state transition model, and a preset reward function;
a prediction module configured to determine the other agents associated with the own vehicle at the current moment from the own vehicle's current position, acquire the current state information of the own vehicle and of those agents, calculate the agents' current preempting probabilities with the calibrated game model, and input the current state information and preempting probabilities into the trained reinforcement learning model to obtain the longitudinal action of the own vehicle at the next moment.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the embodiments above.
In a fourth aspect, an embodiment of the present invention provides a computer readable medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements a method as in any of the embodiments described above.
One embodiment of the above invention has the following advantages: the game model infers the interaction intention of other agents — the probability that they will preempt — from game theory, taking the payoffs of the different strategies into account, which improves the accuracy of intention recognition. The reinforcement learning model learns the agents' behavior characteristics through interaction with the environment, so the driving environment need not be described in advance by a mathematical model; it can therefore handle nonlinear, non-static interaction environments and meet the requirements of different scenarios. The reinforcement learning model also accounts for the long-term return of decisions, improving its policy through continued interaction with the environment and gradually approaching the optimal policy, which gives it strong adaptability.
Further effects of the above optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a flow chart of a method for automated driving behavior decision making based on game theory and reinforcement learning provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a conflict area provided by one embodiment of the present invention;
FIG. 3 is a schematic diagram of a pedestrian crossing probability over time provided by an embodiment of the present invention;
FIG. 4 is a schematic representation of pedestrian speed over time as provided by one embodiment of the invention;
FIG. 5 is a schematic illustration of a longitudinal position of a pedestrian as a function of time provided by an embodiment of the invention;
FIG. 6 is a schematic representation of vehicle acceleration over time as provided by one embodiment of the present invention;
FIG. 7 is a schematic representation of a vehicle speed versus time provided by one embodiment of the present invention;
FIG. 8 is a schematic illustration of a longitudinal distance of a host vehicle from a collision zone provided in accordance with one embodiment of the present invention;
FIG. 9 is a schematic diagram of an autopilot decision device based on game theory and reinforcement learning according to one embodiment of the present invention;
fig. 10 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, an automatic driving behavior decision method based on game theory and reinforcement learning according to an embodiment of the present invention includes:
step 101: and acquiring the state information of the own vehicle at the historical moment and the state information of a plurality of other intelligent agents related to the own vehicle at the historical moment.
Specifically, the state information of the own vehicle and of the other agents at the historical moment can be read from a data pool, which stores the state information of the own vehicle and of the other agents at each decision time together with the actual decision behavior, such as the longitudinal action the own vehicle took. The other agents associated with the own vehicle can be determined from the own vehicle's position and its perception range. The state information may include position, speed, and the like.
Step 102: based on the state information of the own vehicle at the historical moment and the state information of the plurality of other agents, the average probability of selecting to rob by the plurality of other agents at the historical moment is calculated.
The average probability is obtained by counting the behavior results of a plurality of other agents, for example, 6 selection lines among 10 other agents and 4 selection lines are found, and the average probability is 60%.
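The counting step can be sketched as follows (a minimal illustration, not part of the claims; the outcome encoding 1 = preempted, 0 = yielded is an assumption):

```python
# Minimal sketch: average preempt probability from logged outcomes.
# Encoding assumption: 1 = the agent preempted, 0 = the agent yielded.
def average_preempt_probability(outcomes):
    """Fraction of observed agents that chose to preempt."""
    if not outcomes:
        raise ValueError("need at least one observed outcome")
    return sum(outcomes) / len(outcomes)

# 6 of 10 agents preempted, as in the example above.
p_avg = average_preempt_probability([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(p_avg)  # 0.6
```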
Step 103: and calibrating parameters of the game model based on the average probability of selecting robbing of the plurality of other agents at the historical moment, the state information of the own vehicle at the historical moment and the state information of the plurality of other agents.
The gaming model can take a variety of implementations and will be described in detail in the following embodiments.
Step 104: based on the state information of the own vehicle at the historical moment, the state information of a plurality of other agents and the calibrated game model, the probability of selecting robbery of each other agent at the historical moment is calculated.
Step 105: the reinforcement learning model is trained based on state information of the vehicle at the historical moment, state information of a plurality of other agents, a preset longitudinal action set of the vehicle and probability of selecting robbing of each other agent at the historical moment, a preset state transition model and a preset rewarding function.
The reinforcement learning model may be formulated as an MDP (Markov Decision Process), which can be solved with the A3C (Asynchronous Advantage Actor-Critic) algorithm. The embodiment of the invention considers the motion of the own vehicle along its direction of travel, so the longitudinal action set may comprise actions such as accelerating, decelerating, and holding speed. Preferably, the acceleration lies in the range [-5, 3] m/s^2; for example, in one scenario the longitudinal action taken by the own vehicle is an acceleration of 2 m/s^2.
The parameters of the A3C solution process include the number of workers, the discount factor, the exploration rate, the maximum number of episodes, the maximum number of steps per episode, the actor-network learning rate, and the critic-network learning rate.
The reinforcement learning model may also be solved with the DQN (Deep Q-Network) or DDPG (Deep Deterministic Policy Gradient) algorithm.
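As an illustration only, a discrete longitudinal action set over the stated acceleration range [-5, 3] m/s^2 might be enumerated as follows (the 1 m/s^2 step size is an assumption, not stated in the patent):

```python
ACCEL_MIN, ACCEL_MAX = -5.0, 3.0  # m/s^2, the range stated above

def longitudinal_action_set(step=1.0):
    """Enumerate candidate accelerations from ACCEL_MIN to ACCEL_MAX."""
    n = int(round((ACCEL_MAX - ACCEL_MIN) / step)) + 1
    return [ACCEL_MIN + i * step for i in range(n)]

actions = longitudinal_action_set()
print(actions)  # -5.0, -4.0, ..., 2.0, 3.0
```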
Step 106: and determining other intelligent agents associated with the own vehicle at the current moment based on the position of the own vehicle at the current moment.
Step 107: and acquiring the state information of the own vehicle at the current moment and the state information of other intelligent agents at the current moment.
The state information of the own vehicle and other intelligent agents at the current moment is acquired through a perception module installed on the own vehicle, and the information is stored in a data pool so as to be used for calibrating a game model and training a reinforcement learning model offline. In an actual application scene, the game model and the offline training reinforcement learning model can be calibrated online at preset time intervals, and the specific time intervals can be adjusted according to service requirements.
Step 108: and calculating the probability of selecting robbing of other intelligent agents at the current moment based on the state information of the own vehicle at the current moment, the state information of other intelligent agents at the current moment and the calibrated game model.
Step 109: and inputting the state information of the own vehicle at the current moment, the state information of other intelligent agents at the current moment and the probability of selecting robbing of the other intelligent agents at the current moment into a trained reinforcement learning model to obtain the longitudinal action of the own vehicle at the next moment.
In the embodiment of the invention, the game model determines the interaction intention of other agents based on the game theory, namely the probability of the other agents selecting robbery, and the benefits of different strategies are considered, so that the accuracy of the interaction intention recognition can be improved. The reinforcement learning model learns the behavior characteristics of the agent through interaction with the environment, and the driving environment does not need to be described in advance through a mathematical model. The reinforcement learning model can be applied to various interaction environments such as nonlinearity, non-static and the like, and the requirements of different scenes are met. The reinforcement learning model can consider the long-term return of decision behaviors, can improve self decision strategies in the continuous interaction process with the environment, gradually approaches to the optimal strategy, and has strong adaptability.
In one embodiment of the present invention, determining other agents associated with the own vehicle at the current time based on the location of the own vehicle at the current time includes:
the position of the own vehicle at the current moment and the position of any associated other agent satisfy formula (1):

$$|x_i^t - x_j^t| + |y_i^t - y_j^t| \le R \quad (1)$$

where $x_i^t$ and $y_i^t$ are the position coordinates of the own vehicle $i$ at the current moment, $x_j^t$ and $y_j^t$ are the position coordinates of another agent $j$ at the current moment, and $R$ is the perception range of the own vehicle.
Whether the own vehicle is associated with another agent can thus be determined from the absolute-value (Manhattan) distance between them: if this distance does not exceed the own vehicle's perception range, the two are associated; otherwise they are not. In practical applications, other distance measures, such as the Euclidean distance, can also be used.
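The association test of formula (1) can be sketched as follows (an illustrative sketch; coordinates and perception range are assumed to be in consistent units):

```python
def is_associated(ego_xy, agent_xy, perception_range):
    """Formula (1): the agent is associated with the own vehicle if the
    absolute-value (Manhattan) distance between them does not exceed
    the own vehicle's perception range."""
    dx = abs(ego_xy[0] - agent_xy[0])
    dy = abs(ego_xy[1] - agent_xy[1])
    return dx + dy <= perception_range

print(is_associated((0.0, 0.0), (30.0, 10.0), 50.0))  # True
print(is_associated((0.0, 0.0), (60.0, 10.0), 50.0))  # False
```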
The embodiment of the invention can accurately determine other intelligent agents associated with the own vehicle, and improves the decision accuracy and safety.
In one embodiment of the present invention, calculating an other agent's preempting probability at the current moment from the state information of the own vehicle, the state information of the other agent, and the calibrated game model includes:

based on the current state information of the own vehicle and of the other agent, the calibrated game-model parameters, and a preset payoff matrix, calculating the other agent's payoff when it preempts and the own vehicle yields, the payoffs when the other agent and the own vehicle both preempt, the other agent's payoff when it yields and the own vehicle preempts, and the payoffs when both yield;

calculating the other agent's preempting probability at the current moment from these payoffs and the calibrated game model.
The game model provided by the embodiment of the invention simulates the decision processes of both parties when another agent and the own vehicle interact: each may either pass through the conflict area first (preempt) or wait and let the other pass first (yield).
The conflict area is determined by the shapes and paths of the own vehicle and the other agent. As shown in fig. 2, CD is the path of the other vehicle, AB is the path of the own vehicle, the dashed rectangle indicates the other vehicle's outline, and the area enclosed by boundary lines 1 to 4 is the conflict area.
The game model involves agents, strategies, and payoffs: the two agents are the own vehicle and the other agent, and the strategies are preempting and yielding.
The game model decomposes each payoff into the sum of two utilities:
(1) the risk-perception utility, which characterizes the unpleasant experience when the two agents are about to collide, modeled as 1/TTC (Time To Collision);
(2) the time-delay utility, which characterizes the time lost when an agent yields, equal to the time the other agent needs to pass through the conflict area in the current state.
The game payoffs of the own vehicle and the other agent satisfy the following principles:
(1) if the own vehicle and the other agent both preempt, both lose the risk-perception utility and part of the time-delay utility;
(2) when the other agent preempts and the own vehicle yields, the other agent gains both the time-delay and risk-perception utilities, while the own vehicle gains the risk-perception utility but loses the time-delay utility by waiting;
(3) when the other agent yields and the own vehicle preempts, the own vehicle gains both the time-delay and risk-perception utilities, while the other agent gains the risk-perception utility but loses the time-delay utility by waiting;
(4) when both the own vehicle and the other agent yield, both gain the risk-perception utility but lose the time-delay utility by waiting.
Specifically, the payoff matrix is shown in Table 1. As Table 1 shows, the Nash equilibrium of the game is not unique: both pure-strategy combinations, "the other agent preempts and the own vehicle yields" and "the other agent yields and the own vehicle preempts", are equilibria, and neither is stable under evolution. Since no single pure-strategy equilibrium can be selected, the game admits a mixed-strategy equilibrium, and the embodiment of the invention therefore solves the game model with a mixed-strategy algorithm when predicting the other agent's behavior. The specific form of the payoff matrix may be adjusted to the requirements of the service scenario.
From Table 1: when both the other agent and the own vehicle preempt, the own vehicle's payoff is $-k_1 - a c t_1$ and the other agent's is $-k_2 - a c t_2$; when the other agent yields and the own vehicle preempts, the own vehicle's payoff is $k_1 + a t_1$ and the other agent's is $k_2 - a t_2$; when the other agent preempts and the own vehicle yields, the own vehicle's payoff is $k_1 - a t_1$ and the other agent's is $k_2 + a t_2$; when both yield, the own vehicle's payoff is $k_1 - a t_1$ and the other agent's is $k_2 - a t_2$.
TABLE 1 (own vehicle payoff, other agent payoff)
                          Other agent preempts                  Other agent yields
Own vehicle preempts      ($-k_1 - a c t_1$, $-k_2 - a c t_2$)  ($k_1 + a t_1$, $k_2 - a t_2$)
Own vehicle yields        ($k_1 - a t_1$, $k_2 + a t_2$)        ($k_1 - a t_1$, $k_2 - a t_2$)
Here $a$ is a game-model parameter that varies with the other agent's accumulated waiting time; $c$ is a game-model parameter giving the coefficient of the time-delay utility when the own vehicle and the other agent preempt simultaneously; $t_1$ is the time the own vehicle needs to pass through the conflict area (its time-delay utility), and $t_2$ is the corresponding time for the other agent; $k_1$ and $k_2$ are the risk-perception utilities of the own vehicle and of the other agent, computed from the vehicle's speed and its distance to the conflict area.
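Under the payoff structure described above, the preset payoff matrix could be sketched as follows (illustrative only; 'P' = preempt, 'Y' = yield, and the parameter values in the example are arbitrary):

```python
def payoff_matrix(k1, k2, a, c, t1, t2):
    """(own-vehicle payoff, other-agent payoff) for each strategy pair,
    keyed as (own_vehicle_action, other_agent_action)."""
    return {
        ("P", "P"): (-k1 - a * c * t1, -k2 - a * c * t2),
        ("P", "Y"): (k1 + a * t1, k2 - a * t2),
        ("Y", "P"): (k1 - a * t1, k2 + a * t2),
        ("Y", "Y"): (k1 - a * t1, k2 - a * t2),
    }

# Arbitrary example parameters, for illustration only.
m = payoff_matrix(k1=1.0, k2=1.0, a=0.5, c=0.8, t1=2.0, t2=3.0)
print(m[("P", "Y")])  # (2.0, -0.5)
```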
In the mixed-strategy solution, suppose the other agent adopts the mixed strategy $(p, 1-p)$ (preempt with probability $p$, yield with probability $1-p$) and the own vehicle adopts the mixed strategy $(q, 1-q)$. By the payoff-maximization principle, the expected utility of the own vehicle is formula (2):

$$E(q) = q\,[\,p\,U_{rr} + (1-p)\,U_{ry}\,] + (1-q)\,[\,p\,U_{yr} + (1-p)\,U_{yy}\,] \quad (2)$$

Letting $\partial E / \partial q = 0$ gives formula (3):

$$p\,U_{rr} + (1-p)\,U_{ry} - p\,U_{yr} - (1-p)\,U_{yy} = 0 \quad (3)$$

Rearranging gives formula (4), the game model:

$$p = \frac{U_{ry} - U_{yy}}{(U_{ry} - U_{yy}) + (U_{yr} - U_{rr})} \quad (4)$$

where $p$ is the probability that the other agent preempts; $U_{rr}$ is the own vehicle's payoff when both the own vehicle and the other agent preempt; $U_{ry}$ its payoff when it preempts and the other agent yields; $U_{yr}$ its payoff when it yields and the other agent preempts; $U_{yy}$ its payoff when both yield; and $q$ is the probability that the own vehicle preempts.
Based on game theory, the embodiment of the invention takes the payoffs of the different strategies into account and can therefore compute the other agents' preempting probabilities more accurately.
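The mixed-strategy solution of formula (4) can be sketched as follows (an illustrative reading: $U_{xy}$ denotes the own vehicle's payoff when it plays $x$ and the other agent plays $y$, with r = preempt and y = yield):

```python
def other_preempt_probability(U_rr, U_ry, U_yr, U_yy):
    """Indifference condition on the own vehicle's expected utility:
    the probability p with which the other agent preempts that makes
    the own vehicle indifferent between preempting and yielding."""
    num = U_ry - U_yy
    den = (U_ry - U_yy) + (U_yr - U_rr)
    return num / den

# With the Table 1 payoffs and k1 = 1, a = 1, c = 1, t1 = 1:
# U_rr = -2, U_ry = 2, U_yr = 0, U_yy = 0  ->  p = 2 / 4 = 0.5
print(other_preempt_probability(-2.0, 2.0, 0.0, 0.0))  # 0.5
```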
Equation (4) is only a preferred implementation, and the game model may also be equation (5).
(5)
In one embodiment of the invention, the parameters of the game model are calibrated by gradient descent, with the step size selected by the Adam algorithm.
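One way to sketch this calibration step (illustrative only: a scalar parameter, a numerical gradient, and textbook Adam updates; the actual loss and parameterization are not specified in this excerpt):

```python
def calibrate_adam(loss, theta, steps=300, lr=0.05,
                   beta1=0.9, beta2=0.999, eps=1e-8, h=1e-5):
    """Minimize a scalar loss(theta) with Adam, using a central-difference
    numerical gradient in place of an analytic one."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = (loss(theta + h) - loss(theta - h)) / (2 * h)  # numerical gradient
        m = beta1 * m + (1 - beta1) * g                    # first moment
        v = beta2 * v + (1 - beta2) * g * g                # second moment
        m_hat = m / (1 - beta1 ** t)                       # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta

# Fit a parameter so the model-predicted average preempt probability
# matches an observed value of 0.6 (identity model, purely illustrative).
theta = calibrate_adam(lambda x: (x - 0.6) ** 2, theta=0.0)
```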
In one embodiment of the present invention, if the speed of the own vehicle at the next historical moment obtained by the state transition is not less than 0, the state transition model comprises formulas (6) and (7):

$$d^{t+1} = d^{t} - v^{t}\,\Delta t - \tfrac{1}{2}\,a^{t}\,\Delta t^{2} \quad (6)$$

$$v^{t+1} = v^{t} + a^{t}\,\Delta t \quad (7)$$

Otherwise, the state transition model comprises formulas (8) and (9):

$$d^{t+1} = d^{t} - \frac{(v^{t})^{2}}{2\,|a^{t}|} \quad (8)$$

$$v^{t+1} = 0 \quad (9)$$

where $d^{t+1}$ is the distance between the own vehicle and the conflict area at the next historical moment (the conflict area being determined by the shapes and travel paths of the own vehicle and the other agents), $d^{t}$ is that distance at the current historical moment, $v^{t}$ is the speed of the own vehicle at the current historical moment, $\Delta t$ is the time difference between the two moments, $a^{t}$ is the acceleration of the own vehicle at the current historical moment, and $v^{t+1}$ is the speed of the own vehicle at the next historical moment.
In this embodiment, the minimum speed of the vehicle after decelerating during normal driving is 0; therefore, if the value computed by equation (7) is less than 0, equations (8) and (9) are selected as the state transition model.
This embodiment can accurately characterize the ego vehicle's speed at the next historical time and its distance to the conflict zone.
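The state transition above can be sketched as follows, assuming constant acceleration within each time step; when braking would drive the speed below zero, the vehicle is treated as stopping inside the interval, so the distance shrinks only by the stopping distance. The function name is illustrative.

```python
def step(d, v, a, dt):
    """One state transition: distance-to-conflict-zone and speed update.

    d: current distance to the conflict zone (m)
    v: current speed (m/s), a: current acceleration (m/s^2), dt: step (s)
    """
    v_next = v + a * dt
    if v_next >= 0.0:
        # Normal branch: constant-acceleration kinematics.
        d_next = d - v * dt - 0.5 * a * dt * dt
    else:
        # The vehicle stops within the interval: clamp speed to zero and
        # advance only by the stopping distance v^2 / (2*|a|).
        v_next = 0.0
        d_next = d - v * v / (2.0 * abs(a))
    return d_next, v_next

# 20 m from the zone at 10 m/s, braking at -2 m/s^2 for 1 s:
d1, v1 = step(20.0, 10.0, -2.0, 1.0)   # d1 = 11.0, v1 = 8.0
# Hard braking that stops the car inside the interval:
d2, v2 = step(20.0, 3.0, -4.0, 1.0)    # stops after 3**2/8 = 1.125 m
```

The clamped branch prevents the model from predicting a vehicle that reverses away from the conflict zone under heavy braking.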
The objectives of the interaction process between the ego vehicle and the other agents include:
(1) maintaining safety, i.e., avoiding collisions with the other agents;
(2) keeping an appropriate distance from the other agents to improve safety and comfort;
(3) passing through the conflict zone promptly to improve efficiency;
(4) avoiding hard acceleration or hard braking as far as possible, so the vehicle runs smoothly.
Based on the above objectives, in one embodiment of the present invention the reward function comprises:

$$r_{t+1} = w_1 R_{\text{safe}} + w_2 R_{\text{sep}} + w_3 R_{\text{eff}} + w_4 R_{\text{smooth}} + w_5 R_{\text{goal}} \tag{10}$$

where $t$ is the current historical time and $t+1$ the next historical time; $r_{t+1}$ is the reward value at the next historical time; $R_{\text{safe}}$ is the safety value, measuring the safety of the interaction between the ego vehicle and the other agents; $R_{\text{sep}}$ is the separation degree, measuring the safety and comfort of the interaction; $R_{\text{eff}}$ is the efficiency value, measuring the interaction efficiency; $R_{\text{smooth}}$ is the smoothness value, measuring the speed change during driving; $R_{\text{goal}}$ is a preset target reward, characterizing the reward obtained after the ego vehicle passes through and exits the conflict zone, the conflict zone being determined by the shape and travel path of the ego vehicle and the shapes and travel paths of the other agents; and $w_1$, $w_2$, $w_3$, $w_4$ and $w_5$ are the respective weights of the safety value, separation degree, efficiency value, smoothness value and target reward.
This embodiment measures the interaction process along multiple dimensions, ensuring that the interaction simultaneously accounts for safety, comfort and interaction efficiency, which improves the quality of the decision results. Note that the reward function may include only some of the safety, separation, efficiency, smoothness and target-reward terms, for example only the safety, efficiency and smoothness values, and the way each term is computed may be adjusted to the actual application scenario.
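The reward described above is a weighted sum of five terms. A minimal sketch follows; the weight values and the example term values are illustrative placeholders, not those calibrated in the patent:

```python
def reward(r_safe, r_sep, r_eff, r_smooth, r_goal,
           weights=(1.0, 0.5, 0.1, 0.05, 10.0)):
    """Weighted sum of the five reward terms (weights are illustrative)."""
    w1, w2, w3, w4, w5 = weights
    return w1 * r_safe + w2 * r_sep + w3 * r_eff + w4 * r_smooth + w5 * r_goal

# Example: no zone overlaps, moderate speed, gentle braking, goal not yet reached.
r = reward(r_safe=0.0, r_sep=0.0, r_eff=8.0, r_smooth=-1.5, r_goal=0.0)
```

Making the target reward's weight much larger than the per-step terms is a common design choice, so that reaching and clearing the conflict zone dominates small per-step efficiency gains.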
In one embodiment of the present invention, the safety value $R_{\text{safe}}$ is as shown in equation (11).
(11)
where $S$ characterizes the overlapping area, at the next historical time, of the ego vehicle's collision zone with the other agent's collision zone; $v_{t+1}$ characterizes the ego vehicle's speed at the next historical time; $v'_{t+1}$ characterizes the other agent's speed at the next historical time; and $\kappa$ is the weight of $S$: $\kappa$ is 1 when the collision zones overlap and 0 otherwise. An agent's collision zone is the zone obtained by expanding the agent's shape by a factor of 1 in the longitudinal and lateral travel directions; here, the agents include the ego vehicle and the other agents.
Through the overlapping area of the ego vehicle's collision zone and the other agents' collision zones, this embodiment can accurately measure the safety of the interaction between the ego vehicle and the other agents, ensuring that the ego vehicle drives safely.
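The patent does not fix a shape representation for the zones; assuming, for illustration, axis-aligned rectangular footprints, the overlap area S used by the safety value can be computed as follows (the helper names and dimensions are mine):

```python
def expand(cx, cy, length, width, factor):
    """Axis-aligned box of an agent's shape expanded by `factor` times its
    own size in the longitudinal (x) and lateral (y) directions, so
    factor=1 doubles the footprint and factor=2 triples it."""
    half_l = length * (1 + factor) / 2.0
    half_w = width * (1 + factor) / 2.0
    return (cx - half_l, cy - half_w, cx + half_l, cy + half_w)

def overlap_area(box_a, box_b):
    """Overlap area of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    dx = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    dy = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(dx, 0.0) * max(dy, 0.0)

# Ego car 4.5 x 1.8 m and a pedestrian 0.5 x 0.5 m, 3 m apart longitudinally.
ego = expand(0.0, 0.0, 4.5, 1.8, factor=1)   # collision zone: doubled size
ped = expand(3.0, 0.0, 0.5, 0.5, factor=1)
s = overlap_area(ego, ped)                   # > 0: collision zones overlap
```

The separation degree uses the same computation with `factor=2`, which is why it triggers earlier and at a larger distance than the collision-zone term.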
In one embodiment of the present invention, the separation degree $R_{\text{sep}}$ is as shown in equation (12).
(12)
where $S'$ characterizes the overlapping area, at the next historical time, of the ego vehicle's separation zone with the other agent's separation zone; $v_{t+1}$ characterizes the ego vehicle's speed at the next historical time; $v'_{t+1}$ characterizes the other agent's speed at the next historical time; and $\kappa'$ is the weight of $S'$: $\kappa'$ is 1 when the separation zones overlap and 0 otherwise. An agent's separation zone is the zone obtained by expanding the agent's shape by a factor of 2 in the longitudinal and lateral travel directions.
The separation zone differs from the collision zone in having a larger safety margin: its expansion factor is 2. It mainly serves to encourage the ego vehicle to keep a reasonable distance from the other agents and to avoid aggressive behavior.
This embodiment keeps the distance between the ego vehicle and the other agents reasonable and improves driving comfort.
Because a higher speed means a shorter time to traverse the interaction region, in one embodiment of the invention the efficiency value $R_{\text{eff}}$ is as shown in equation (13).

$$R_{\text{eff}} = v_{t+1} \tag{13}$$

where $v_{t+1}$ characterizes the ego vehicle's speed at the next historical time.
This embodiment can accurately measure the efficiency of the interaction between the ego vehicle and the other agents.
In one embodiment of the present invention, the smoothness value $R_{\text{smooth}}$ is as shown in equation (14).

$$R_{\text{smooth}} = -\,\lvert a_{t+1} \rvert \tag{14}$$

where $a_{t+1}$ characterizes the ego vehicle's acceleration at the next historical time.
To encourage the vehicle to drive smoothly and avoid excessive speed changes, this embodiment sets a small penalty on acceleration and deceleration actions, as shown in equation (14), which improves the stability of the vehicle's motion.
In one embodiment of the present invention, taking pedestrians as the other agents, the input to the reinforcement learning model is a vector of discretized values: the vehicle's longitudinal position, the vehicle's speed, the pedestrian's longitudinal position, the pedestrian's crossing speed, and the pedestrian's crossing probability.
Figs. 3-8 illustrate the complete interaction between a vehicle and a crossing pedestrian, with the vehicle initially 30 m from the conflict zone. Fig. 3 shows how the pedestrian's crossing probability, inferred from game theory, evolves over time; Fig. 4 shows the pedestrian's speed over time; Fig. 5 the pedestrian's longitudinal position over time; Fig. 6 the vehicle's acceleration over time; Fig. 7 the vehicle's speed over time; and Fig. 8 the longitudinal distance between the vehicle and the conflict zone.
As Figs. 3-8 show, when the ego vehicle is 30 m from the conflict zone, the pedestrian's crossing probability is high, so the pedestrian is likely to actually cross. After observing that the pedestrian has started to cross, the ego vehicle drives with a small longitudinal acceleration, and as soon as the pedestrian completes the crossing it accelerates rapidly through the conflict zone, thereby obtaining a larger payoff.
As shown in Fig. 9, an embodiment of the present invention provides an automatic driving behavior decision device based on game theory and reinforcement learning, comprising:

an acquisition module 901, configured to acquire state information of the ego vehicle at a historical time and state information of a plurality of other agents associated with the ego vehicle at that historical time;

a calibration module 902, configured to calculate the average probability that the plurality of other agents choose to go first at the historical time, based on the state information of the ego vehicle at the historical time and the state information of the plurality of other agents; and to calibrate the parameters of the game model based on that average probability, the state information of the ego vehicle at the historical time and the state information of the plurality of other agents;

a training module 903, configured to calculate the probability that each other agent chooses to go first at the historical time, based on the state information of the ego vehicle at the historical time, the state information of the plurality of other agents and the calibrated game model; and to train the reinforcement learning model based on that state information, a preset set of longitudinal actions of the ego vehicle, the probability that each other agent chooses to go first at the historical time, a preset state transition model and a preset reward function;

a prediction module 904, configured to determine the other agents associated with the ego vehicle at the current time based on the ego vehicle's position at the current time; to acquire the state information of the ego vehicle and of those other agents at the current time; to calculate the probability that the other agents choose to go first at the current time, based on that state information and the calibrated game model; and to input the state information of the ego vehicle at the current time, the state information of the other agents at the current time and the probability that the other agents choose to go first at the current time into the trained reinforcement learning model, obtaining the ego vehicle's longitudinal action for the next time.
In one embodiment of the present invention, the prediction module 904 is configured to determine, from the ego vehicle's position at the current time, the other agents associated with the ego vehicle as those satisfying:

$$\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} \le \rho$$

where $x_i$ and $y_i$ are the position coordinates of the ego vehicle $i$ at the current time, $x_j$ and $y_j$ are the position coordinates of another agent $j$ at the current time, and $\rho$ is the ego vehicle's perception range.
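The perception-range check can be sketched as follows (the helper name and the dictionary layout for agent positions are assumptions for illustration):

```python
import math

def associated_agents(ego_pos, agents, perception_range):
    """Return the ids of agents within the ego vehicle's perception range.

    `agents` maps an agent id to its (x, y) position; agent j is associated
    with the ego vehicle when the Euclidean distance between them does not
    exceed the perception range.
    """
    xi, yi = ego_pos
    return [
        j for j, (xj, yj) in agents.items()
        if math.hypot(xi - xj, yi - yj) <= perception_range
    ]

nearby = associated_agents((0.0, 0.0),
                           {"ped1": (10.0, 5.0), "car2": (80.0, 0.0)},
                           perception_range=50.0)
# ped1 is ~11.2 m away (inside the range); car2 is 80 m away (outside)
```

Only the agents returned here feed the game model and the reinforcement learning model at the current time, which bounds the per-step computation by the local traffic density.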
In one embodiment of the invention, the prediction module 904 is configured to calculate, based on the state information of the ego vehicle at the current time, the state information of the other agents at the current time, the calibrated game-model parameters and a preset payoff matrix, the payoff when the other agent goes first and the ego vehicle yields, the payoff when both the other agent and the ego vehicle yield, the payoff when the other agent yields and the ego vehicle goes first, and the payoff when both go first; and then to calculate, from these four payoffs and the calibrated game model, the probability that the other agents choose to go first at the current time.
The embodiment of the invention provides electronic equipment, which comprises:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of the embodiments described above.
The present invention provides a computer readable medium having stored thereon a computer program which when executed by a processor implements a method as in any of the embodiments described above.
Referring now to FIG. 10, there is illustrated a schematic diagram of a computer system 1000 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 10 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 10, the computer system 1000 includes a Central Processing Unit (CPU) 1001, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the system 1000 are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. The above-described functions defined in the system of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 1001.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, for example, as: a processor includes a sending module, an obtaining module, a determining module, and a first processing module. The names of these modules do not in some cases limit the module itself, and for example, the transmitting module may also be described as "a module that transmits a picture acquisition request to a connected server".
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. An automatic driving behavior decision method based on game theory and reinforcement learning is characterized by comprising the following steps:
acquiring state information of the ego vehicle at a historical time and state information of a plurality of other agents associated with the ego vehicle at the historical time;

calculating the average probability that the plurality of other agents choose to go first at the historical time, based on the state information of the ego vehicle at the historical time and the state information of the plurality of other agents;

calibrating parameters of a game model based on the average probability that the plurality of other agents choose to go first at the historical time, the state information of the ego vehicle at the historical time and the state information of the plurality of other agents;

calculating the probability that each other agent chooses to go first at the historical time, based on the state information of the ego vehicle at the historical time, the state information of the plurality of other agents and the calibrated game model;

training a reinforcement learning model based on the state information of the ego vehicle at the historical time, the state information of the plurality of other agents, a preset set of longitudinal actions of the ego vehicle, the probability that each other agent chooses to go first at the historical time, a preset state transition model and a preset reward function;

determining the other agents associated with the ego vehicle at the current time based on the ego vehicle's position at the current time;

acquiring state information of the ego vehicle at the current time and state information of the other agents at the current time;

calculating the probability that the other agents choose to go first at the current time, based on the state information of the ego vehicle at the current time, the state information of the other agents at the current time and the calibrated game model;

inputting the state information of the ego vehicle at the current time, the state information of the other agents at the current time and the probability that the other agents choose to go first at the current time into the trained reinforcement learning model to obtain the ego vehicle's longitudinal action at the next time;

wherein calculating the probability that the other agents choose to go first at the current time, based on the state information of the ego vehicle at the current time, the state information of the other agents at the current time and the calibrated game model, comprises:

calculating, based on the state information of the ego vehicle at the current time, the state information of the other agents at the current time, the calibrated parameters of the game model and a preset payoff matrix, the payoff when the other agent goes first and the ego vehicle yields, the payoff when both the other agent and the ego vehicle yield, the payoff when the other agent yields and the ego vehicle goes first, and the payoff when both go first;

calculating the probability that the other agents choose to go first at the current time according to these four payoffs and the calibrated game model.
2. The method of claim 1, wherein,
the game model comprises:

$$p = \frac{u_{yy} - u_{yg}}{u_{gg} - u_{gy} - u_{yg} + u_{yy}}$$

wherein $p$ characterizes the probability that the other agent chooses to go first; $u_{gy}$ characterizes the payoff when the other agent goes first and the ego vehicle yields; $u_{yy}$ characterizes the payoff when the other agent and the ego vehicle both yield; $u_{yg}$ characterizes the payoff when the other agent yields and the ego vehicle goes first; and $u_{gg}$ characterizes the payoff when the other agent and the ego vehicle both go first.
3. The method of claim 1, wherein,
the reward function includes:
wherein (1)>For the current history time->For the next history time +.>For the prize value for the next historical time, and (2)>For safety value, for measuring the safety of interaction process of the bicycle with other agents, +.>For the degree of separation, for measuring the safety and comfort of the interaction of a vehicle with other agents +.>For efficiency value, for measuring interaction efficiency of the self-vehicle with other agents, +.>For a plateau value, for measuring the speed change during the travel of the motor vehicle, < >>For a preset target reward, a reward obtained after the passing and driving away of the collision area is characterized by the self-vehicle, wherein the collision area is determined by the shape and driving path of the self-vehicle, the shape and driving path of other intelligent agents, and the weight of the self-vehicle is determined by the shape and driving path of other intelligent agents>、/>、/>、/>And->The safe value, the separation degree, the efficiency value, the stable value and the weight of the target rewards are respectively.
4. The method of claim 3, wherein,
wherein $S$ characterizes the overlapping area, at the next historical time, of the ego vehicle's collision zone with the other agent's collision zone; $v_{t+1}$ characterizes the ego vehicle's speed at said next historical time; $v'_{t+1}$ characterizes the other agent's speed at the next historical time; and $\kappa$ is the weight of $S$: $\kappa$ is 1 when the collision zones overlap and 0 otherwise; the collision zone of an agent is the zone obtained by expanding the agent's shape by a factor of 1 in the longitudinal and lateral travel directions.
5. The method of claim 3, wherein,
wherein $S'$ characterizes the overlapping area, at the next historical time, of the ego vehicle's separation zone with the other agent's separation zone; $v_{t+1}$ characterizes the ego vehicle's speed at said next historical time; $v'_{t+1}$ characterizes the other agent's speed at the next historical time; and $\kappa'$ is the weight of $S'$: $\kappa'$ is 1 when the separation zones overlap and 0 otherwise; the separation zone of an agent is the zone obtained by expanding the agent's shape by a factor of 2 in the longitudinal and lateral travel directions.
6. The method of claim 3, wherein,
$$R_{\text{eff}} = v_{t+1}; \qquad R_{\text{smooth}} = -\,\lvert a_{t+1} \rvert$$

wherein $v_{t+1}$ characterizes the ego vehicle's speed at said next historical time, and $a_{t+1}$ characterizes the ego vehicle's acceleration at the next historical time.
7. The method of claim 1, wherein,
if the ego vehicle's speed at the next historical time obtained from the state transition is not less than 0, the state transition model comprises:

$$d_{t+1} = d_t - v_t\,\Delta t - \tfrac{1}{2}\,a_t\,\Delta t^2; \qquad v_{t+1} = v_t + a_t\,\Delta t$$

otherwise, the state transition model comprises:

$$d_{t+1} = d_t - \frac{v_t^2}{2\,\lvert a_t \rvert}; \qquad v_{t+1} = 0$$

wherein $d_{t+1}$ characterizes the ego vehicle's distance at the next historical time to the conflict zone, the conflict zone being determined by the shape and travel path of the ego vehicle and the shapes and travel paths of the other agents; $d_t$ characterizes the ego vehicle's distance to the conflict zone at the current historical time; $v_t$ characterizes the ego vehicle's speed at the current historical time; $\Delta t$ characterizes the time difference between the current and the next historical time; $a_t$ characterizes the ego vehicle's acceleration at the current historical time; and $v_{t+1}$ characterizes the ego vehicle's speed at the next historical time.
8. An automatic driving behavior decision device based on game theory and reinforcement learning, which is characterized by comprising:
an acquisition module, configured to acquire state information of the ego vehicle at a historical time and state information of a plurality of other agents associated with the ego vehicle at the historical time;

a calibration module, configured to calculate the average probability that the plurality of other agents choose to go first at the historical time, based on the state information of the ego vehicle at the historical time and the state information of the plurality of other agents; and to calibrate parameters of a game model based on that average probability, the state information of the ego vehicle at the historical time and the state information of the plurality of other agents;

a training module, configured to calculate the probability that each other agent chooses to go first at the historical time, based on the state information of the ego vehicle at the historical time, the state information of the plurality of other agents and the calibrated game model; and to train a reinforcement learning model based on the state information of the ego vehicle at the historical time, the state information of the plurality of other agents, a preset set of longitudinal actions of the ego vehicle, the probability that each other agent chooses to go first at the historical time, a preset state transition model and a preset reward function;

a prediction module, configured to determine the other agents associated with the ego vehicle at the current time based on the ego vehicle's position at the current time; to acquire state information of the ego vehicle at the current time and state information of the other agents at the current time; to calculate the probability that the other agents choose to go first at the current time, based on the state information of the ego vehicle at the current time, the state information of the other agents at the current time and the calibrated game model; and to input the state information of the ego vehicle at the current time, the state information of the other agents at the current time and the probability that the other agents choose to go first at the current time into the trained reinforcement learning model to obtain the ego vehicle's longitudinal action at the next time;

wherein the prediction module is configured to calculate, based on the state information of the ego vehicle at the current time, the state information of the other agents at the current time, the calibrated parameters of the game model and a preset payoff matrix, the payoff when the other agent goes first and the ego vehicle yields, the payoff when both the other agent and the ego vehicle yield, the payoff when the other agent yields and the ego vehicle goes first, and the payoff when both go first; and to calculate the probability that the other agents choose to go first at the current time according to these four payoffs and the calibrated game model.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
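The two-step computation claimed for the prediction module (four benefits derived from both agents' states and a payoff matrix, then a probability that the other agent chooses to go first, which is fed to the reinforcement learning policy) can be sketched as follows. This is a minimal illustration under assumed state fields, payoff values and a logit choice model; the patent's calibrated game model and payoff matrix would supply the actual parameterisation.

```python
import math

# Hedged sketch of the claimed prediction step. The function names, the
# state fields and the payoff numbers are illustrative assumptions, not
# values taken from the patent.

def payoff_table(own_state, other_state):
    """Toy 2x2 payoff table for the go-first / yield game.

    Keys are (other_action, own_action); values are
    (benefit_to_other, benefit_to_own). The gap term stands in for the
    state-dependent benefit the claim computes from both agents' states.
    """
    gap = other_state["dist_to_conflict"] - own_state["dist_to_conflict"]
    return {
        ("go", "yield"):    (2.0 + 0.1 * gap, -1.0),   # other goes, own yields
        ("go", "go"):       (-5.0, -5.0),              # both go: collision risk
        ("yield", "go"):    (-1.0, 2.0 - 0.1 * gap),   # other yields, own goes
        ("yield", "yield"): (-0.5, -0.5),              # both yield: lost time
    }

def prob_other_goes(payoffs, rationality=1.0):
    """Quantal-response (logit) probability that the other agent goes first.

    Averages the other agent's benefit over the own vehicle's two actions,
    then applies a softmax scaled by a calibrated rationality parameter.
    """
    u_go = sum(payoffs[("go", a)][0] for a in ("go", "yield")) / 2.0
    u_yield = sum(payoffs[("yield", a)][0] for a in ("go", "yield")) / 2.0
    e_go = math.exp(rationality * u_go)
    e_yield = math.exp(rationality * u_yield)
    return e_go / (e_go + e_yield)

own = {"dist_to_conflict": 20.0}
other = {"dist_to_conflict": 15.0}
p_go = prob_other_goes(payoff_table(own, other))
# Per the claim, p_go is then concatenated with both agents' state
# information and fed to the trained reinforcement learning model, which
# outputs the own vehicle's longitudinal action for the next moment.
print(round(p_go, 3))  # prints 0.269
```

With the other agent closer to the conflict point, its averaged go-first benefit is still dragged down by the both-go collision penalty, so the sketch yields a below-even probability of contesting; the rationality parameter controls how sharply the calibrated payoff gap translates into choice probability.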
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311490770.4A CN117227763B (en) | 2023-11-10 | 2023-11-10 | Automatic driving behavior decision method and device based on game theory and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117227763A CN117227763A (en) | 2023-12-15 |
CN117227763B true CN117227763B (en) | 2024-02-20 |
Family
ID=89086394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311490770.4A Active CN117227763B (en) | 2023-11-10 | 2023-11-10 | Automatic driving behavior decision method and device based on game theory and reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117227763B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
CN110111605A (en) * | 2019-06-12 | 2019-08-09 | 吉林大学 | Automatic driving vehicle entrance ring road based on dynamic game travels decision-making technique |
EP3751465A1 (en) * | 2019-06-14 | 2020-12-16 | Bayerische Motoren Werke Aktiengesellschaft | Methods, apparatuses and computer programs for generating a reinforcement learning-based machine-learning model and for generating a control signal for operating a vehicle |
CN114162144A (en) * | 2022-01-06 | 2022-03-11 | 苏州挚途科技有限公司 | Automatic driving decision method and device and electronic equipment |
CN114261404A (en) * | 2020-09-16 | 2022-04-01 | 华为技术有限公司 | Automatic driving method and related device |
CN114644018A (en) * | 2022-05-06 | 2022-06-21 | 重庆大学 | Game theory-based man-vehicle interaction decision planning method for automatic driving vehicle |
CN114852105A (en) * | 2022-06-21 | 2022-08-05 | 长安大学 | Method and system for planning track change of automatic driving vehicle |
CN115457782A (en) * | 2022-09-19 | 2022-12-09 | 吉林大学 | Deep reinforcement learning-based conflict-free cooperation method for intersection of automatic driving vehicles |
CN115469663A (en) * | 2022-09-15 | 2022-12-13 | 中国科学技术大学 | End-to-end navigation obstacle avoidance method facing automatic driving and based on deep reinforcement learning |
CN116205298A (en) * | 2023-02-08 | 2023-06-02 | 武汉理工大学 | Opponent behavior strategy modeling method and system based on deep reinforcement learning |
CN116279484A (en) * | 2023-03-27 | 2023-06-23 | 东南大学 | Multi-wind-grid driver forced lane change prediction method integrating evolutionary game and machine learning |
CN116311905A (en) * | 2023-01-31 | 2023-06-23 | 重庆理工大学 | Game theory-based signal-free right-hand fork pedestrian track prediction method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115246415A (en) * | 2021-04-26 | 2022-10-28 | 华为技术有限公司 | Decision method and device and vehicle |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022052406A1 (en) | Automatic driving training method, apparatus and device, and medium | |
EP3583553A1 (en) | Neural architecture search for convolutional neural networks | |
CN112937564A (en) | Lane change decision model generation method and unmanned vehicle lane change decision method and device | |
EP4000015A1 (en) | Occupancy prediction neural networks | |
CN113682318B (en) | Vehicle running control method and device | |
CN112406904B (en) | Training method of automatic driving strategy, automatic driving method, equipment and vehicle | |
CN113561986A (en) | Decision-making method and device for automatically driving automobile | |
CN113962390B (en) | Method for constructing diversified search strategy model based on deep reinforcement learning network | |
US11204803B2 (en) | Determining action selection policies of an execution device | |
CN112256037B (en) | Control method and device applied to automatic driving, electronic equipment and medium | |
CN114537401A (en) | Intelligent vehicle intersection decision-making method, equipment and medium based on meta reinforcement learning | |
CN112124310B (en) | Vehicle path transformation method and device | |
CN116476863A (en) | Automatic driving transverse and longitudinal integrated decision-making method based on deep reinforcement learning | |
CN117227763B (en) | Automatic driving behavior decision method and device based on game theory and reinforcement learning | |
CN113264064A (en) | Automatic driving method for intersection scene and related equipment | |
CN116653957A (en) | Speed changing and lane changing method, device, equipment and storage medium | |
CN115973179A (en) | Model training method, vehicle control method, device, electronic equipment and vehicle | |
CN113928341B (en) | Road decision method, system, equipment and medium | |
WO2021258847A1 (en) | Driving decision-making method, device, and chip | |
CN112100787A (en) | Vehicle motion prediction method, device, electronic device, and storage medium | |
Čičić et al. | Front-tracking transition system model for traffic state reconstruction, model learning, and control with application to stop-and-go wave dissipation | |
CN114780646A (en) | Vehicle processing method, device, computer equipment and storage medium | |
CN117406756B (en) | Method, device, equipment and storage medium for determining motion trail parameters | |
Wang et al. | Time-to-contact control: improving safety and reliability of autonomous vehicles | |
CN114627640B (en) | Dynamic evolution method of intelligent network-connected automobile driving strategy |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||