CN113221469A - Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator - Google Patents

Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator

Info

Publication number
CN113221469A
CN113221469A (application CN202110625802.1A; also published as CN202110625802A)
Authority
CN
China
Prior art keywords
track
traffic
reward function
data
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110625802.1A
Other languages
Chinese (zh)
Inventor
薛贵荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tianran Intelligent Technology Co ltd
Original Assignee
Shanghai Tianran Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tianran Intelligent Technology Co ltd filed Critical Shanghai Tianran Intelligent Technology Co ltd
Priority to CN202110625802.1A priority Critical patent/CN113221469A/en
Publication of CN113221469A publication Critical patent/CN113221469A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention provides an inverse reinforcement learning method and system for enhancing the authenticity of a traffic simulator, comprising the following steps: initializing a track action strategy through a generator; generating track data of a plurality of agents in combination with the current environment; mixing the track data with preset expert track data, inputting the mixed track data into a discriminator, and training the discriminator to distinguish the expert track data, the training objective being to maximize a reward function; inputting the reward function into the generator, which obtains a new track action strategy; generating track data of a plurality of agents with the new track action strategy, mixing it with the preset expert track data, and training the discriminator until convergence; finally, the traffic simulator performs traffic simulation according to the final reward function and track action strategy. The method can infer the reward function of real-world vehicles, can optimize strategies under different traffic environments, and has good scalability.

Description

Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
Technical Field
The invention relates to the fields of computer software and traffic, in particular to an inverse reinforcement learning method and system for enhancing the authenticity of a traffic simulator.
Background
In recent years, with the advance of urbanization, urban traffic flow has increased year by year and populations have become denser, making urban road traffic systems highly complex. As a result, many road construction problems, such as urban traffic network planning and evaluation, traffic congestion and traffic flow evacuation, lane restriction and speed limits, and traffic signal control, cannot be solved intuitively and scientifically.
Traffic simulators have long been one of the important research hotspots in the traffic field. Microscopic traffic simulation plays an important role in the planning, design and operation of traffic systems. At present, a traffic simulator serves two important functions. First, effect evaluation for city planning and city operation: a well-designed traffic simulator allows city operators and planners to test policies for urban road planning, traffic control and traffic congestion optimization by accurately inferring the possible impact that construction and traffic policies for various facilities have on the urban traffic environment. Second, providing learning data on urban traffic operation for researchers developing various urban intelligence algorithms: much existing work uses traffic simulators to train and test intelligent traffic signal control strategies, because a simulator can generate large amounts of data for training a signal controller, which solves the problem that real city data cannot meet the large training-data requirements of machine learning algorithms.
Currently, most advanced microscopic traffic simulators use a car-following model (CFM) to describe the movement of a single vehicle.
Each vehicle has certain attributes and parameters. When the simulation system creates a vehicle, it initializes the vehicle's parameter values and then controls the vehicle during driving by adjusting these parameters. Commonly used parameters include the acceleration parameter and the driver reaction time parameter of each vehicle; different parameter settings increase the richness and diversity of the vehicle simulation and allow urban traffic vehicle trajectories to be simulated more realistically.
At present, however, it is conventional to define the parameters of the following model with a handful of physical and empirical formulas. These parameters must be carefully calibrated using traffic data. The calibrated following model can then be used as a strategy providing the optimal behavior of a vehicle under given environmental conditions. Optimizing this strategy amounts to calibrating the parameters of the following model, which are obtained by analyzing the discrepancy between observed and simulated traffic measurements.
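For illustration only, the sketch below shows a generic IDM-style (Intelligent Driver Model) acceleration update with the kind of calibratable parameters described above (desired speed, maximum acceleration, reaction/headway time); it is an assumed example of such a physics-and-experience formula, not the specific following model of any particular simulator.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v_desired=15.0,   # desired speed (m/s) - calibrated parameter
                     a_max=1.5,        # maximum acceleration (m/s^2)
                     b_comf=2.0,       # comfortable deceleration (m/s^2)
                     s0=2.0,           # minimum standstill gap (m)
                     T_react=1.2):     # desired time headway, reflects driver reaction (s)
    """Generic IDM-style car-following update: returns the acceleration of the
    follower given its own speed v, the leader speed v_lead and the bumper gap."""
    dv = v - v_lead                                   # closing speed
    s_star = s0 + max(0.0, v * T_react + v * dv / (2 * math.sqrt(a_max * b_comf)))
    return a_max * (1 - (v / v_desired) ** 4 - (s_star / max(gap, 0.1)) ** 2)
```

Calibrating such a model means fitting its parameters so that simulated trajectories match observed traffic measurements; the invention replaces this manual calibration with a learned reward function.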
The whole process is as follows:
step 1: generating a series of traffic flow status data by a traffic simulator;
step 2: generating a traffic vehicle control strategy π_ψ according to the traffic flow state data;
Step 3: generating current vehicle traffic actions (starting, stopping, accelerating and decelerating) of each vehicle according to current vehicle traffic states (parameters such as vehicle speed, whether to use a traffic light, the distance between vehicles ahead and the like);
step 4: transmitting the generated traffic strategy to a traffic simulator control API;
step 5: applying the new traffic strategy and generating new traffic flow state data.
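A minimal sketch of this closed loop between strategy and simulator is given below; the simulator interface (get_state, apply_actions) and the policy callable are hypothetical names used only to illustrate Steps 1-5.

```python
def run_simulation(simulator, policy, num_steps=1000):
    """Closed control loop: the simulator produces traffic flow states, the
    strategy maps each vehicle's state to an action, and the actions are fed
    back through the simulator's control API (Steps 1-5 above)."""
    states = simulator.get_state()                 # Step 1: traffic flow state data
    for _ in range(num_steps):
        # Steps 2-3: the strategy produces an action (start/stop/accelerate/decelerate)
        # from each vehicle's state (speed, distance to leader, traffic light, ...)
        actions = {vid: policy(obs) for vid, obs in states.items()}
        # Steps 4-5: transmit the strategy output to the simulator and advance it
        states = simulator.apply_actions(actions)
    return states
```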
An effective traffic simulator should be able to produce accurate simulations in different traffic environments without being affected by environmental dynamics. This can be broken down into two specific challenges.
The first challenge: the goal of a conventional following model is generally to simulate the car-following behavior of a vehicle by applying physical laws and human knowledge. The movement of vehicles in the real world depends on many factors, including speed, distance to neighbors, the road network, traffic lights and the psychological factors of the driver. Models emphasizing different factors are continuously added to the car-following family, such as setting a threshold for vehicle movement according to the psychological tendency of the driver based on safe driving distance and speed. However, there is currently no general model that fully captures the authenticity of vehicle behavior patterns in a comprehensive context. A following model relies on inaccurate a priori knowledge and, even after calibration, often fails to produce realistic simulation results.
The second challenge: many studies consider learning from expert data. Although expert data are relatively normative and stable, the resulting following model is still not robust to the various environment dynamics of different traffic environments. This is a challenging problem because real-world traffic dynamics are non-stationary. For example, weather and road conditions may change the mechanical properties of a vehicle and its coefficient of friction with the road surface, and ultimately change its acceleration and braking performance. In a real scenario, a good driver adjusts the driving strategy to environmental changes and behaves differently under these dynamics (e.g., given the same observed speed, a different acceleration is used). However, given a fixed strategy (i.e., a CFM), current simulators are generally unable to adapt the strategy to different dynamics. To simulate traffic environments with significantly different dynamics, the following model must be recalibrated, i.e., relearned, using new trajectory data associated with each environment. Such relearning is inefficient.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide an inverse reinforcement learning method and system for enhancing the authenticity of a traffic simulator.
The invention provides an inverse reinforcement learning method for enhancing the authenticity of a traffic simulator, which comprises the following steps:
an initialization step: initializing a track action strategy through a generator;
a track data generation step: generating track data of a plurality of agents according to the track action strategy and the current environment;
a mixing step: mixing the track data with preset expert track data to obtain mixed track data;
a training step: inputting the mixed track data into a discriminator, and training the discriminator to distinguish the expert track data, wherein the training goal is to maximize a reward function;
an optimization step: inputting the reward function into the generator, and obtaining a new track action strategy by the generator according to the reward function, the current environment and the track action strategy;
iteration step: generating trajectory data of a plurality of agents by using a new trajectory action strategy, mixing the trajectory data with preset expert trajectory data, and training a discriminator until convergence;
an output step: and the traffic simulator carries out traffic simulation according to the final reward function and the track action strategy.
Preferably, the method treats traffic simulation as a multi-agent control problem, formally defined as a Markov decision process represented by the tuple (M, {S_m}, {A_m}, T, r, γ, ρ_0);
M represents a group of agents and m is one of the agents; S_m and A_m denote the state and action of each agent, respectively; ρ_0 is the distribution of the initial state; r(s, a) is the reward function; γ represents the long-term reward discount coefficient; and the state transition function is defined as T(s' | s, a), the microscopic traffic simulation problem being described in an inverse reinforcement learning manner.
Preferably, it is assumed that M pieces of track data D = {τ_1, τ_2, …, τ_M} are generated according to a trajectory action strategy π*(a | s), where each trajectory of n points is τ = {s_0, a_0, s_1, a_1, …, s_n, a_n}; the aim is to learn the reward function r_θ(s, a) that maximizes the log likelihood of the expert trajectory data:

max_θ Σ_{τ∈D} log p_{r_θ}(τ)

where p_{r_θ}(τ) is the distribution of trajectories under the reward function r_θ(s, a).
Preferably, the discriminator uses a state-action pair as input:

D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a | s))
where f and the generator strategy π are learned functions, and the training objective is to maximize the following reward function r(s, a):
r(s,a)=log(1-D(s,a))-logD(s,a)
in each iteration, the extracted reward value is used to guide the training of the generator strategy.
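For illustration, assuming the discriminator form D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a | s)) given above, the discriminator output and the extracted reward can be computed as in the following sketch; the use of PyTorch and the f_net/policy interfaces are assumptions, and the sign convention follows the formula r(s, a) = log(1 - D(s, a)) - log D(s, a).

```python
import torch

def discriminator_output(f_net, policy, state, action):
    """D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + pi(a | s)), computed in log space."""
    f_val = f_net(state, action)                 # learned scoring function f(s, a) (assumed interface)
    log_pi = policy.log_prob(state, action)      # log pi(a | s) from the generator (assumed interface)
    return torch.sigmoid(f_val - log_pi)         # equals exp(f) / (exp(f) + pi)

def extracted_reward(f_net, policy, state, action, eps=1e-8):
    """Reward used to guide generator training: r(s, a) = log(1 - D) - log D."""
    d = discriminator_output(f_net, policy, state, action)
    return torch.log(1.0 - d + eps) - torch.log(d + eps)
```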
Preferably, the agent comprises a vehicle interacting with the environment.
According to the invention, the inverse reinforcement learning system for enhancing the authenticity of the traffic simulator comprises:
an initialization step: initializing a track action strategy through a generator;
a track data generation step: generating track data of a plurality of agents according to the track action strategy and the current environment;
a mixing step: mixing the track data with preset expert track data to obtain mixed track data;
a training step: inputting the mixed track data into a discriminator, and training the discriminator to distinguish the expert track data, wherein the training goal is to maximize a reward function;
an optimization step: inputting the reward function into the generator, and obtaining a new track action strategy by the generator according to the reward function, the current environment and the track action strategy;
iteration step: generating trajectory data of a plurality of agents by using a new trajectory action strategy, mixing the trajectory data with preset expert trajectory data, and training a discriminator until convergence;
an output step: and the traffic simulator carries out traffic simulation according to the final reward function and the track action strategy.
Preferably, the system treats traffic simulation as a multi-agent control problem, formally defined as a Markov decision process represented by the tuple (M, {S_m}, {A_m}, T, r, γ, ρ_0);
M represents a group of agents and m is one of the agents; S_m and A_m denote the state and action of each agent, respectively; ρ_0 is the distribution of the initial state; r(s, a) is the reward function; γ represents the long-term reward discount coefficient; and the state transition function is defined as T(s' | s, a), the microscopic traffic simulation problem being described in an inverse reinforcement learning manner.
Preferably, it is assumed that M pieces of track data D = {τ_1, τ_2, …, τ_M} are generated according to a trajectory action strategy π*(a | s), where each trajectory of n points is τ = {s_0, a_0, s_1, a_1, …, s_n, a_n}; the aim is to learn the reward function r_θ(s, a) that maximizes the log likelihood of the expert trajectory data:

max_θ Σ_{τ∈D} log p_{r_θ}(τ)

where p_{r_θ}(τ) is the distribution of trajectories under the reward function r_θ(s, a).
Preferably, the discriminator uses a state-action pair as input:

D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a | s))
where f and the generator strategy π are learned functions, and the training objective is to maximize the following reward function r(s, a):
r(s,a)=log(1-D(s,a))-logD(s,a)
in each iteration, the extracted reward value is used to guide the training of the generator strategy.
Preferably, the agent comprises a vehicle interacting with the environment.
Compared with the prior art, the invention has the following beneficial effects:
the present invention is based on an Inverse Reinforcement Learning (IRL) model, which can infer the reward function of real-world vehicles. It enables us to optimize strategies in different traffic environments.
The present invention uses a parameter sharing mechanism to extend the proposed model to a multi-agent environment, giving the model good scalability.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of the inverse reinforcement learning for enhancing the authenticity of the traffic simulator according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications obvious to those skilled in the art can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention adopts an inverse reinforcement learning view of the following model to address the two challenges of the traditional following model:
for the first challenge, a direct consideration is to learn the behavior pattern of the vehicle directly from real-world observations, rather than relying on a priori knowledge that is not reliable or that is not well explored by state space. Recently, mock learning has shown the ability to learn from the demonstration example. However, direct modeling of learning methods, such as behavioral cloning, aims to extract expert strategies directly from the data. This approach may still fail in addressing the second challenge. Because the learning strategy may lose effect when the traffic environment changes dynamically, such as the weather changes or the road conditions change, the learning needs to be repeated.
Inverse Reinforcement Learning (IRL) learns from demonstration examples not only the expert's strategy but also the reward function (e.g., driving at the most appropriate speed without colliding), which can adapt to different traffic environments. Therefore, an inverse-reinforcement-learning-based approach is used to train vehicle simulation agents to generate accurate trajectories.
Meanwhile, parameter sharing is used to accelerate multi-agent learning. Considering the complex real traffic scenario of multi-vehicle interaction, we extend IRL to the multi-agent environment of traffic simulation. A parameter sharing mechanism is combined with inverse reinforcement learning to obtain a new algorithm, called parameter-sharing inverse reinforcement learning, which forms a dynamically robust traffic simulation model. Meanwhile, an online updating process is provided in which the learned reward function is used to guide strategy learning in a new environment without requiring new trajectory data.
As shown in FIG. 1, the present invention provides an inverse reinforcement learning method for enhancing the reality of a traffic simulator, comprising:
an initialization step: a track action policy is initialized by the generator.
A track data generation step: and generating track data of a plurality of agents according to the track action strategy and the current environment.
A mixing step: mixing the track data with preset expert track data to obtain mixed track data.
A training step: inputting the mixed track data into a discriminator, and training the discriminator to distinguish the expert track data, wherein the training aim is to maximize a reward function.
An optimization step: inputting the reward function into the generator, and obtaining a new track action strategy by the generator according to the reward function, the current environment and the track action strategy.
Iteration step: and generating the trajectory data of a plurality of agents by using a new trajectory action strategy, mixing the trajectory data with preset expert trajectory data, and training the discriminator until convergence.
An output step: and the traffic simulator carries out traffic simulation according to the final reward function and the track action strategy.
Considering the complex interaction between vehicles in traffic, traffic simulation is treated as a multi-agent control problem. Formally, our model is defined as a Markov Decision Process represented by the tuple (M, {S_m}, {A_m}, T, r, γ, ρ_0).
Here M denotes a group of agents and m is one of the agents. S_m and A_m denote the state and action of each agent, respectively. ρ_0 is the distribution of the initial state. r(s, a) is the reward function, and γ represents the long-term reward discount coefficient. It is assumed that the environment dynamics remain unchanged for a given set of expert demonstrations. The state transition function is defined as T(s' | s, a). The present invention describes the microscopic traffic simulation problem in the form of Inverse Reinforcement Learning (IRL).
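Purely as an illustrative data structure (the field names are assumptions, not terms of the embodiment), the tuple can be written as:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MultiAgentTrafficMDP:
    """Markov decision process (M, {S_m}, {A_m}, T, r, gamma, rho_0) for traffic simulation."""
    agents: List[str]                    # M: the group of agents (vehicles)
    state_spaces: Dict[str, object]      # {S_m}: per-agent state space
    action_spaces: Dict[str, object]     # {A_m}: per-agent action space
    transition: Callable                 # T(s' | s, a): environment dynamics
    reward: Callable                     # r(s, a): reward function to be learned
    gamma: float                         # long-term reward discount coefficient
    rho_0: Callable                      # distribution of the initial state
```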
Given the trajectory of the motion of an expert vehicle, the goal of the invention is to learn the reward function of the vehicle agent.
The traffic simulator following-model learning problem is defined as follows:
Suppose an expert generates M pieces of expert trajectory data D = {τ_1, τ_2, …, τ_M} according to a trajectory action strategy π*(a | s), where each trajectory of n points is τ = {s_0, a_0, s_1, a_1, …, s_n, a_n}; the aim is to learn the reward function r_θ(s, a) that maximizes the log-likelihood of the expert trajectories:

max_θ Σ_{τ∈D} log p_{r_θ}(τ)

where p_{r_θ}(τ) is the distribution of trajectories under the reward function r_θ(s, a).
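One common way to make this trajectory distribution concrete is the maximum-entropy IRL form below; this specific expression is an added assumption for illustration and is not stated explicitly in the embodiment:

```latex
p_{r_\theta}(\tau) \propto \rho_0(s_0) \prod_{t} T(s_{t+1} \mid s_t, a_t)\,
      \exp\big(r_\theta(s_t, a_t)\big),
\qquad
\max_\theta \; \sum_{\tau \in D} \log p_{r_\theta}(\tau)
```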
The present invention trains a discriminator-generator network in which the discriminator uses state-action pairs as input:

D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a | s))
where f and the generator strategy π are learned functions, and the training objective is to maximize the following reward function r(s, a):
r(s,a)=log(1-D(s,a))-logD(s,a)
In each iteration, the extracted reward value is used to guide the training of the generator strategy. Updating the discriminator amounts to updating the reward function, and in turn updating the strategy can be seen as improving the sampling distribution used to estimate the discriminator.
We describe traffic simulation as a multi-agent system problem, treating each vehicle in the traffic system as an agent interacting with the environment. A decentralized parameter-sharing training scheme is combined with IRL, and parameter-sharing IRL (PS-IRL) is proposed to learn a simultaneous multi-vehicle control strategy in complex traffic environments. In our algorithm, control is decentralized and learning is centralized.
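A minimal sketch of the parameter-sharing idea follows: every vehicle agent is controlled by the same policy network applied to its own observation, so experience from all agents updates one set of weights; the network architecture and the use of PyTorch are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedPolicy(nn.Module):
    """One policy network shared by all vehicle agents (parameter sharing):
    control is decentralized (each agent acts on its own observation) while
    learning is centralized (all agents' experience updates the same weights)."""
    def __init__(self, obs_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs):
        return self.net(obs)     # action scores for one agent (e.g., accelerate/keep/decelerate)

def act_all(policy, observations):
    """Apply the shared policy independently to each agent's observation."""
    return {agent_id: policy(obs) for agent_id, obs in observations.items()}
```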
The procedure of the inverse reinforcement learning system of the embodiment is as follows:
step 1: initializing a track action strategy π(a | s);
step 2: applying it to the environment, generating trajectory data D = {τ_1, τ_2, …, τ_M} of M agents;
step 3: mixing the expert trajectories and the generated trajectories, and training a discriminator to distinguish whether a trajectory is an expert trajectory;
step 4: continuing to train the reward function r_θ(s, a);
step 5: obtaining a new track action strategy π;
step 6: repeatedly executing Step 2 to Step 5 until convergence;
step 7: outputting r_θ(s, a) and π(a | s).
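Steps 1 to 7 can be read as the high-level loop sketched below; rollout_fn, discriminator.train_step, discriminator.reward and policy.improve are placeholder interfaces assumed for illustration, not parts of the embodiment.

```python
def ps_irl_training(rollout_fn, policy, discriminator, expert_trajs, num_iterations=100):
    """Sketch of the PS-IRL loop of Steps 1-7 (interfaces are assumed placeholders):
    generate trajectories with the shared policy, train the discriminator to tell
    expert from generated data, extract the reward r_theta(s, a), and use it to
    improve the policy, repeating until convergence."""
    for _ in range(num_iterations):
        # Step 2: roll out the current trajectory action strategy pi(a|s) for M agents
        generated_trajs = rollout_fn(policy)
        # Step 3: mix expert and generated trajectories, train the discriminator to
        # distinguish which trajectories are expert trajectories
        discriminator.train_step(expert_trajs, generated_trajs)
        # Step 4: the updated discriminator induces the current reward function r_theta(s, a)
        reward_fn = discriminator.reward
        # Step 5: obtain a new trajectory action strategy by optimizing against the reward
        policy.improve(generated_trajs, reward_fn)
    # Step 7: output r_theta(s, a) and pi(a|s)
    return discriminator.reward, policy
```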
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the invention can be regarded as a hardware component, and the devices, modules and units included in the system for realizing various functions can also be regarded as structures in the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. An inverse reinforcement learning method for enhancing the realism of a traffic simulator, comprising:
an initialization step: initializing a track action strategy through a generator;
a track data generation step: generating track data of a plurality of agents according to the track action strategy and the current environment;
a mixing step: mixing the track data with preset expert track data to obtain mixed track data;
a training step: inputting the mixed track data into a discriminator, and training the discriminator to distinguish the expert track data, wherein the training goal is to maximize a reward function;
an optimization step: inputting the reward function into the generator, and obtaining a new track action strategy by the generator according to the reward function, the current environment and the track action strategy;
iteration step: generating trajectory data of a plurality of agents by using a new trajectory action strategy, mixing the trajectory data with preset expert trajectory data, and training a discriminator until convergence;
an output step: and the traffic simulator carries out traffic simulation according to the final reward function and the track action strategy.
2. The inverse reinforcement learning method for enhancing the realism of a traffic simulator as claimed in claim 1, wherein the method treats traffic simulation as a multi-agent control problem formally defined as a Markov decision process represented by the tuple (M, {S_m}, {A_m}, T, r, γ, ρ_0);
M represents a group of agents and m is one of the agents; S_m and A_m denote the state and action of each agent, respectively; ρ_0 is the distribution of the initial state; r(s, a) is the reward function; γ represents the long-term reward discount coefficient; and the state transition function is defined as T(s' | s, a), the microscopic traffic simulation problem being described in an inverse reinforcement learning manner.
3. The inverse reinforcement learning method for enhancing the realism of a traffic simulator as claimed in claim 2, wherein it is assumed that M pieces of track data D = {τ_1, τ_2, …, τ_M} are generated according to a trajectory action strategy π*(a | s), each trajectory of n points being τ = {s_0, a_0, s_1, a_1, …, s_n, a_n}, the aim being to learn the reward function r_θ(s, a) maximizing the log likelihood of the expert trajectory data:

max_θ Σ_{τ∈D} log p_{r_θ}(τ)

where p_{r_θ}(τ) is the distribution of trajectories under the reward function r_θ(s, a).
4. The inverse reinforcement learning method for enhancing the realism of a traffic simulator as claimed in claim 3, wherein the discriminator uses a state-action pair as input:

D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a | s))
where f and the generator strategy π are learned functions, and the training objective is to maximize the following reward function r(s, a):
r(s,a)=log(1-D(s,a))-logD(s,a)
in each iteration, the extracted reward value is used to guide the training of the generator strategy.
5. The inverse reinforcement learning method of enhancing the realism of a traffic simulator of claim 1, wherein the agent comprises a vehicle interacting with the environment.
6. An inverse reinforcement learning system for enhancing the realism of traffic simulators, comprising:
an initialization step: initializing a track action strategy through a generator;
a track data generation step: generating track data of a plurality of agents according to the track action strategy and the current environment;
a mixing step: mixing the track data with preset expert track data to obtain mixed track data;
a training step: inputting the mixed track data into a discriminator, and training the discriminator to distinguish the expert track data, wherein the training goal is to maximize a reward function;
an optimization step: inputting the reward function into the generator, and obtaining a new track action strategy by the generator according to the reward function, the current environment and the track action strategy;
iteration step: generating trajectory data of a plurality of agents by using a new trajectory action strategy, mixing the trajectory data with preset expert trajectory data, and training a discriminator until convergence;
an output step: and the traffic simulator carries out traffic simulation according to the final reward function and the track action strategy.
7. The inverse reinforcement learning system for enhancing the realism of traffic simulators as claimed in claim 6, wherein the system treats traffic simulation as a multi-agent control problem formally defined as a Markov decision process represented by the tuple (M, {S_m}, {A_m}, T, r, γ, ρ_0);
M represents a group of agents and m is one of the agents; S_m and A_m denote the state and action of each agent, respectively; ρ_0 is the distribution of the initial state; r(s, a) is the reward function; γ represents the long-term reward discount coefficient; and the state transition function is defined as T(s' | s, a), the microscopic traffic simulation problem being described in an inverse reinforcement learning manner.
8. The inverse reinforcement learning system for enhancing the realism of traffic simulators as claimed in claim 7, wherein it is assumed that M pieces of track data D = {τ_1, τ_2, …, τ_M} are generated according to a trajectory action strategy π*(a | s), each trajectory of n points being τ = {s_0, a_0, s_1, a_1, …, s_n, a_n}, the aim being to learn the reward function r_θ(s, a) maximizing the log likelihood of the expert trajectory data:

max_θ Σ_{τ∈D} log p_{r_θ}(τ)

where p_{r_θ}(τ) is the distribution of trajectories under the reward function r_θ(s, a).
9. The inverse reinforcement learning system for enhancing the realism of traffic simulators as claimed in claim 8, wherein the discriminator uses state-action pairs as input:

D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + π(a | s))
where f and the generator strategy π are learned functions, and the training objective is to maximize the following reward function r(s, a):
r(s,a)=log(1-D(s,a))-logD(s,a)
in each iteration, the extracted reward value is used to guide the training of the generator strategy.
10. The inverse reinforcement learning system for enhancing the realism of traffic simulators as recited in claim 6, wherein the agents include vehicles that interact with the environment.
CN202110625802.1A 2021-06-04 2021-06-04 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator Pending CN113221469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110625802.1A CN113221469A (en) 2021-06-04 2021-06-04 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110625802.1A CN113221469A (en) 2021-06-04 2021-06-04 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator

Publications (1)

Publication Number Publication Date
CN113221469A true CN113221469A (en) 2021-08-06

Family

ID=77082882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110625802.1A Pending CN113221469A (en) 2021-06-04 2021-06-04 Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator

Country Status (1)

Country Link
CN (1) CN113221469A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210049415A1 (en) * 2018-03-06 2021-02-18 Waymo UK Ltd. Behaviour Models for Autonomous Vehicle Simulators
CN111091711A (en) * 2019-12-18 2020-05-01 上海天壤智能科技有限公司 Traffic control method and system based on reinforcement learning and traffic lane competition theory
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN112172813A (en) * 2020-10-14 2021-01-05 长安大学 Car following system and method for simulating driving style based on deep inverse reinforcement learning
CN112818599A (en) * 2021-01-29 2021-05-18 四川大学 Air control method based on reinforcement learning and four-dimensional track
CN112884130A (en) * 2021-03-16 2021-06-01 浙江工业大学 SeqGAN-based deep reinforcement learning data enhanced defense method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUANJIE ZHENG ET AL.: "Objective-aware Traffic Simulation via Inverse Reinforcement Learning", https://arxiv.org/abs/2105.09560v1 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023109663A1 (en) * 2021-12-17 2023-06-22 深圳先进技术研究院 Serverless computing resource configuration method based on maximum entropy inverse reinforcement learning

Similar Documents

Publication Publication Date Title
Bhattacharyya et al. Simulating emergent properties of human driving behavior using multi-agent reward augmented imitation learning
Ye et al. Automated lane change strategy using proximal policy optimization-based deep reinforcement learning
Jia et al. Advanced building control via deep reinforcement learning
CN109388073B (en) Method and device for vehicle dynamic simulation
Camponogara et al. Distributed learning agents in urban traffic control
CN111483468B (en) Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning
CN109733415A (en) A kind of automatic Pilot following-speed model that personalizes based on deeply study
Lu et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios
CN109709956A (en) A kind of automatic driving vehicle speed control multiple-objection optimization with algorithm of speeding
Li et al. Combined trajectory planning and tracking for autonomous vehicle considering driving styles
CN105700526A (en) On-line sequence limit learning machine method possessing autonomous learning capability
CN116134292A (en) Tool for performance testing and/or training an autonomous vehicle planner
CN113221469A (en) Inverse reinforcement learning method and system for enhancing authenticity of traffic simulator
Ramyar et al. A personalized highway driving assistance system
CN113657433B (en) Multi-mode prediction method for vehicle track
Venkatesh et al. Connected and automated vehicles in mixed-traffic: Learning human driver behavior for effective on-ramp merging
CN114973650A (en) Vehicle ramp entrance confluence control method, vehicle, electronic device, and storage medium
Konstantinidis et al. Parameter sharing reinforcement learning for modeling multi-agent driving behavior in roundabout scenarios
Yuan et al. Evolutionary decision-making and planning for autonomous driving based on safe and rational exploration and exploitation
Zhang et al. PlanLight: learning to optimize traffic signal control with planning and iterative policy improvement
Sukthankar et al. Evolving an intelligent vehicle for tactical reasoning in traffic
CN116894395A (en) Automatic driving test scene generation method, system and storage medium
Koeberle et al. Exploring the trade off between human driving imitation and safety for traffic simulation
CN116620327A (en) Lane changing decision method for realizing automatic driving high-speed scene based on PPO and Lattice
Yang et al. Accelerating safe reinforcement learning with constraint-mismatched policies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210806)