CN114384901B - Reinforcement-learning-assisted driving decision method for dynamic traffic environments - Google Patents
Reinforcement-learning-assisted driving decision method for dynamic traffic environments
- Publication number
- CN114384901B (application CN202210032222.6A)
- Authority
- CN
- China
- Prior art keywords
- road
- intersection
- reinforcement learning
- dynamic traffic
- traffic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/0088—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0223—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/80—Technologies aiming to reduce greenhouse gasses emissions common to all road transportation technologies
- Y02T10/84—Data processing systems or methods, management, administration
Abstract
The application discloses a reinforcement-learning-assisted driving decision method for dynamic traffic environments, comprising the following steps: extracting urban-area road information from an environmental road map and, after simplifying preprocessing, abstracting it into a road-area undirected graph; obtaining the traffic-flow density distribution of the regional roads through an API (application programming interface) for real-time road-condition query; constructing a dynamic traffic model from the traffic-flow density distribution and the road traffic rules; obtaining the transfer cost of a vehicle between any two intersections; and, in the road-area undirected graph, performing reinforcement-learning-based path planning according to the dynamic traffic model and the corresponding transfer costs, solving the minimum-cost path from the start point to the destination. The dynamic traffic environment is thereby modeled, dynamic traffic-flow information is used to provide the driver with an assisted-driving decision scheme, dynamic route planning from a local-information viewpoint is realized, and the inaccurate path-planning results that easily arise from using only static information are effectively avoided.
Description
Technical Field
The invention relates to the technical field of intelligent transportation, and in particular to a reinforcement-learning-assisted driving decision method for dynamic traffic environments.
Background
An intelligent transportation system (ITS) aims to construct a safer, more comfortable, and more stable traffic environment. Advanced driver-assistance systems (ADAS) have also developed rapidly in recent years; within them, driving-environment information acquisition, driving-environment characterization and modeling, and driving-assistance decision making are important research directions.
At present, travel navigation planning mainly makes assisted-driving route decisions based on static road information and focuses on graph-theoretic path planning. Research related to vehicle travel planning can be roughly divided into two categories. The first mainly recommends travel plans for a vehicle cluster, optimizing the overall operating objective of a fleet without reflecting the needs of individual drivers. The second recommends an optimal route for a single-vehicle driver under objectives such as shortest distance, shortest time, or lowest energy consumption, but considers only a deterministic environment.
However, the dynamic information of existing map-navigation software is mostly based on iterative updates of a historical database, with road traffic modeled in a data-driven way from batches of traffic-flow data. The road-vehicle transfer rules induced by a dynamic traffic environment are thus not truly considered, and it is difficult to provide the driver with assisted-driving decisions that comprehensively incorporate regional dynamic information and similar factors.
Therefore, how to implement an assisted-driving decision method for dynamic traffic environments is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides a reinforcement-learning-assisted driving decision method for dynamic traffic environments, which can provide a driver with an assisted-driving decision scheme in a dynamic traffic environment and effectively avoids the inaccurate path-planning results that easily arise from using only static information. The specific scheme is as follows:
A reinforcement-learning-assisted driving decision method for dynamic traffic environments comprises the following steps:
extracting urban-area road information from an environmental road map and, after simplifying preprocessing, abstracting it into a road-area undirected graph;
obtaining the traffic-flow density distribution of the regional roads through an API (application programming interface) for real-time road-condition query;
constructing a dynamic traffic model from the traffic-flow density distribution and the road traffic rules;
obtaining the transfer cost of a vehicle between any two intersections;
and, in the road-area undirected graph, performing reinforcement-learning-based path planning according to the dynamic traffic model and the corresponding transfer costs, solving the minimum-cost path from the start point to the destination.
Preferably, in the reinforcement learning aided driving decision method for a dynamic traffic environment provided in the embodiment of the present invention, obtaining a traffic density distribution of a regional road by using an API interface for real-time road condition query includes:
sampling each road, according to its longitude and latitude within the range from the trip's start to its destination, into longitude-latitude points at a certain longitude-latitude step length, and returning the congestion evaluation of each intersection node through the API for real-time road-condition query;
utilizing congestion evaluation of each point in the whole environmental road map range to draw a road congestion evaluation layered thermodynamic diagram;
calculating the traffic flow density of each road according to the congestion condition of each road in the thermodynamic diagram;
and acquiring the traffic density distribution condition of the regional roads according to the calculated traffic density of each road.
Preferably, in the method for assisting driving decision by reinforcement learning oriented to a dynamic traffic environment according to the embodiment of the present invention, the following formula is used to calculate the traffic density of each road:
where ρ_m is the traffic-flow density at the m-th longitude-latitude point of the current road, N is the total number of longitude-latitude points, c_m is the road-congestion evaluation value at the m-th point, l_m is the longitude-latitude step length, and Σ_m l_m is the total length of the road.
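The exact published formula is not reproduced in this text; as a minimal sketch, one reading consistent with the definitions above is a length-weighted average of the per-point congestion values (the function name and this exact weighting are assumptions, not the patent's formula):

```python
def road_density(c, l):
    """Length-weighted traffic-flow density of one road.

    c: congestion evaluation values c_m (0 = clear ... 4 = severely congested)
       at each longitude-latitude sample point of the road;
    l: step lengths l_m between sample points, so sum(l) is the road length.
    Assumed reading: density is sum(c_m * l_m) divided by the total length.
    """
    return sum(cm * lm for cm, lm in zip(c, l)) / sum(l)
```

For example, a road sampled at three points with congestion values 0, 2, 4 and equal step lengths gets density 2.0.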
Preferably, in the method for assisting driving decision by reinforcement learning for a dynamic traffic environment provided in an embodiment of the present invention, in a process of constructing a dynamic traffic model, the method includes:
and when the traffic flow is updated every time, the information of the current intersection and the road vehicles is put into the dynamic traffic model, and the inter-intersection vehicle transfer information of each iteration is obtained through the dynamic traffic model so as to update the state information of each intersection and road vehicle after transfer.
Preferably, in the method for assisting driving decision by reinforcement learning for a dynamic traffic environment provided in an embodiment of the present invention, in a process of constructing a dynamic traffic model, the method further includes:
and normalizing the intersection turning probability in the dynamic traffic model to meet the condition that the sum of the turning-out probabilities of all intersections is 1, wherein the turning-out probability of each intersection has four directions.
Preferably, in the reinforcement-learning-assisted driving decision method for a dynamic traffic environment provided in the embodiment of the present invention, the dynamic traffic model comprises an intersection turning probability model; the intersection turning probability model P_{i,j} is:
wherein each road is represented by the intersection labels i, j at its two ends; f(·) is a function of speed and density; ρ_{i,j} is the traffic-flow density of the road connecting the i-th and j-th intersections; β_1, β_2, β_3, β_4 are weighting coefficients, with β_4·rand denoting a random factor; V_max denotes the road speed upper limit; and ρ_max denotes the density upper limit, at which the corresponding speed is 0.
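The model equation itself is not reproduced here, so the following sketch assumes a Greenshields-style speed-density function for f (speed V_max·(1 − ρ/ρ_max), falling to 0 at ρ_max) combined as a weighted sum with a congestion term, a destination-guidance term, and the random factor β_4·rand; the weights and the guidance term are illustrative assumptions:

```python
import random

def turn_probability(rho_ij, v_max=60.0, rho_max=1.0,
                     beta=(0.5, 0.3, 0.1, 0.1), guidance=1.0):
    """Unnormalized turn-out score for the road between intersections i and j.

    Assumed form only: the speed-density term, weights beta, and the
    destination-guidance input are hypothetical placeholders."""
    b1, b2, b3, b4 = beta
    f = v_max * max(0.0, 1.0 - rho_ij / rho_max)  # speed-density function, 0 at rho_max
    return (b1 * f / v_max                        # prefer roads with high speed
            + b2 * (1.0 - rho_ij / rho_max)       # avoid dense roads
            + b3 * guidance                       # head toward the destination
            + b4 * random.random())               # random factor beta_4 * rand
```

The scores of an intersection's outgoing roads would then be normalized so they sum to 1, as the normalization condition above requires.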
Preferably, in the reinforcement-learning-assisted driving decision method for a dynamic traffic environment provided in the embodiment of the present invention, the dynamic traffic model further comprises an intersection outflow-vehicle model; the intersection outflow-vehicle model is a state-based discrete model in which the time interval of each iterative update cycle is 1, and the number L_j of vehicles flowing out of an intersection is calculated with the following formula:
where α_{i,j} is the width of the road connecting the i-th and j-th intersections (an entry of the road-width matrix).
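A minimal state-based sketch of this outflow model, under the assumption (the published formula for L_j is not reproduced here) that per-cycle outflow is capped both by the total width of the outgoing roads, as a capacity proxy, and by the vehicles actually present:

```python
def outflow(x_j, out_widths, capacity_per_width=2):
    """Vehicles leaving intersection j in one update cycle (time interval 1).

    x_j: vehicles currently at intersection j;
    out_widths: widths alpha_{j,k} of the roads leaving j.
    The capacity_per_width constant and the min() form are assumptions."""
    capacity = capacity_per_width * sum(out_widths)
    return min(x_j, capacity)
```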
Preferably, in the reinforcement-learning-assisted driving decision method for a dynamic traffic environment provided by the embodiment of the present invention, obtaining the transfer cost of a vehicle between any two intersections includes:
calculating the travel time on the road connecting two intersections from the relation between traffic-flow density and travel speed;
calculating the intersection waiting time from the relation between the waiting time and the number of vehicles at the intersection;
calculating the travel time from the target intersection to the destination from the target-intersection coordinates, the destination coordinates, and the current road speed;
linearly weighting the road travel time, the intersection waiting time, and the travel time from the target intersection to the destination to obtain the transfer cost between two adjacent intersections;
and obtaining the transfer cost between any two intersections as the product of the adjacent-intersection transfer cost and the reciprocal of the corresponding reachability-matrix element.
Preferably, in the reinforcement learning aided driving decision method for a dynamic traffic environment provided in an embodiment of the present invention, the path planning based on reinforcement learning is performed, including:
and performing path planning by using a Q-learning reinforcement learning algorithm.
Preferably, in the method for assisting driving decision by reinforcement learning for a dynamic traffic environment provided in an embodiment of the present invention, in a process of performing a path planning by using a Q-learning reinforcement learning algorithm, the method includes:
defining the reinforcement-learning state as the intersection index, over the trip from the start point to the destination;
defining the reinforcement-learning action as the state at the next moment;
defining the elements of the reinforcement-learning reward matrix as the transfer cost of each road at each intersection, where, at its current intersection, the reinforcement-learning agent only knows information about the adjacent connected roads;
obtaining the transition probability from the dynamic traffic model; the environment model the reinforcement-learning agent interacts with is described by this transition probability;
and updating the Q-value matrix in the Q-learning algorithm according to the Bellman equation.
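A compact sketch of these definitions with tabular Q-learning (the goal bonus, learning rate, and exploration rate are illustrative assumptions; the agent only "sees" the finite-cost roads adjacent to its current intersection):

```python
import numpy as np

def q_learning_route(cost, start, goal, episodes=500, alpha=0.5, gamma=0.9,
                     eps=0.2, goal_reward=100.0, seed=0):
    """States are intersection indices, the action is the next intersection,
    the reward is the negative transfer cost (plus an assumed bonus for
    reaching the goal), and the update is the Bellman equation
    Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    rng = np.random.default_rng(seed)
    n = len(cost)
    Q = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if np.isfinite(cost[i][j]):
                Q[i, j] = 0.0              # only directly connected roads are actions
    for _ in range(episodes):
        s = start
        for _step in range(4 * n):         # cap the episode length
            if s == goal:
                break
            nbrs = np.flatnonzero(np.isfinite(Q[s]))
            if rng.random() < eps:         # epsilon-greedy exploration
                a = int(rng.choice(nbrs))
            else:
                a = int(nbrs[np.argmax(Q[s, nbrs])])
            r = -cost[s][a] + (goal_reward if a == goal else 0.0)
            future = 0.0 if a == goal else Q[a][np.isfinite(Q[a])].max()
            Q[s, a] += alpha * (r + gamma * future - Q[s, a])  # Bellman update
            s = a
    path, s = [start], start               # greedy rollout of the learned Q table
    while s != goal and len(path) <= n:
        s = int(np.argmax(Q[s]))
        path.append(s)
    return path
```

On a toy 4-intersection graph where route 0→1→3 costs 2 and route 0→2→3 costs 11, the learned policy picks the cheap route.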
According to the technical scheme above, the reinforcement-learning-assisted driving decision method for dynamic traffic environments provided by the invention comprises the following steps: extracting urban-area road information from an environmental road map and, after simplifying preprocessing, abstracting it into a road-area undirected graph; obtaining the traffic-flow density distribution of the regional roads through an API for real-time road-condition query; constructing a dynamic traffic model from the traffic-flow density distribution and the road traffic rules; obtaining the transfer cost of a vehicle between any two intersections; and, in the road-area undirected graph, performing reinforcement-learning-based path planning according to the dynamic traffic model and the corresponding transfer costs, solving the minimum-cost path from the start point to the destination.
In the method, the dynamic traffic environment is modeled from the environmental road map and the data of the real-time road-condition query API, and the simple, easily obtained dynamic traffic-flow information of the dynamic traffic environment is fully used to provide the driver with an assisted-driving decision scheme, realizing dynamic route planning from a local-information viewpoint and effectively avoiding the inaccurate path-planning results that easily arise from using only static information.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the related art more clearly, the drawings needed in describing them are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a driving decision method assisted by reinforcement learning for a dynamic traffic environment according to an embodiment of the present invention;
FIG. 2 is a general framework diagram of a driving decision method assisted by reinforcement learning for dynamic traffic environment according to an embodiment of the present invention;
FIG. 3 is a comparison diagram of a full map and a processed binary rasterized map provided by an embodiment of the present invention;
fig. 4 is a simplified extracted road area undirected graph provided by the embodiment of the present invention;
fig. 5 is a map road congestion evaluation layered thermodynamic diagram provided by an embodiment of the present invention;
fig. 6 is the upper-triangular matrix of the map road traffic-flow density according to the embodiment of the present invention;
FIG. 7 is a dynamic traffic update process provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of intersection modeling provided by an embodiment of the present invention;
fig. 9 is a schematic diagram of the Q-learning reinforcement-learning assisted-driving decision result provided by the embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention provides a dynamic traffic environment-oriented reinforcement learning auxiliary driving decision method, which comprises the following steps as shown in figure 1:
s101, extracting urban area road information from an environment road map, and abstracting the urban area road information into a road area undirected graph after simplifying preprocessing;
specifically, according to an environment road map, extracting urban area road information, simplifying the road information, and abstracting the road information into a road area undirected graph;
s102, acquiring the traffic density distribution condition of regional roads by using an API (application program interface) for real-time road condition query;
s103, constructing a dynamic traffic model according to the traffic flow density distribution condition and the road traffic rule;
the above steps model the dynamic road-traffic environment; the modeling process of the dynamic traffic model can be divided into an intersection turning probability model and an intersection outflow-vehicle model;
s104, obtaining the transfer cost of the vehicle between any two intersections;
the above step models the vehicle travel cost; a transfer cost function for a vehicle moving from one intersection to another can be established from the known road and intersection information;
and S105, in the road area undirected graph, performing a path planning based on reinforcement learning according to the dynamic traffic model and the corresponding transfer cost, and solving a minimum cost path from the starting point to the end point.
Specifically, reinforcement-learning-based route planning can be performed on the undirected connectivity-graph model of the road environment obtained in steps S101 and S102, using the local road information from steps S103 and S104, solving the minimum-cost route from the start point to the destination.
If the start point or destination of the trip changes, return to step S101;
if neither the start point nor the destination changes, proceed to step S106;
and S106, drawing the minimum cost path obtained in the step S105 in an environment road map or a road area undirected graph, thereby providing a driving assistance decision scheme for a driver.
In the reinforcement learning auxiliary driving decision method for the dynamic traffic environment provided by the embodiment of the invention, the dynamic traffic environment is modeled according to the environment road map and the real-time road condition query API interface data, and the simple and easily obtained dynamic traffic flow information is fully utilized to provide an auxiliary driving decision scheme for a driver, so that the dynamic route planning under the local information view angle is realized, and the problem that the path planning result is inaccurate easily caused by only using static information is effectively avoided.
Taking the assisted-driving decision for traveling from urban area A to urban area B in a certain province as an example, the assisted-driving decision problem can be decomposed into the following three parts:
First, the dynamic traffic area is defined as the region between area A and area B, so the roads are simplified to a certain extent and only main roads with large traffic flow are considered. The area is also defined to exchange no vehicles with the outside; only vehicle flow and transfer within the area are considered.
Secondly, performing mathematical modeling on traffic flow distribution and road traffic rules, and considering factors such as road vehicle density, lane width, road length, destination guiding information and the like to construct a dynamic traffic model (which can be divided into an intersection turning probability model and an intersection outflow vehicle model).
Finally, based on the established dynamic traffic model and under the constraints of the changing traffic-flow distribution and the road traffic rules, a minimum-cost driving scheme is sought for vehicles traveling from area A to area B, with heading toward the target point, avoiding congestion, and similar aims as the combined objective.
Fig. 2 shows an overall framework diagram of a reinforcement learning aided driving decision method for a dynamic traffic environment.
Further, in a concrete implementation of the reinforcement-learning-assisted driving decision method provided by the embodiment of the present invention, the abstraction into a road-area undirected graph after simplifying preprocessing in step S101 may specifically include the following steps:
first, converting the environmental road map into a grayscale image;
dilating the grayscale image by convolution, so that whenever a pixel with gray value greater than 0 appears within the convolution-kernel range, all pixels in that range become white, and filtering to obtain a binary rasterized map;
abstracting the binary rasterized map into the road-area undirected graph, where each intersection is abstracted as a node and, if a road connects two intersections, that road is abstracted as an edge.
Specifically, taking area A as Yuquan and area B as Zijingang as an example (as shown in fig. 3), a full map view of the Yuquan-to-Zijingang area is first obtained through a certain map API, and the area map is converted into a grayscale image. The grayscale map is dilated by convolution, so that whenever a pixel with gray value greater than 0 appears within the convolution-kernel range, the pixels in that range become white (i.e., passable paths), and filtering yields a binary rasterized map. The binary rasterized map is then abstracted into an undirected graph: each intersection is abstracted as a node, and if a road connects two intersections, the road is abstracted as an edge, giving the abstract undirected graph shown in fig. 4. The horizontal and vertical coordinates of each intersection depend on its pixel location in the picture. Because all intersections are marked uniformly on the same picture, at the same scale, the undirected graph represents the original road-information map well. At this point, the extraction and simplifying preprocessing of the road information are complete.
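The dilation-by-convolution step and the graph abstraction can be sketched as follows; the toy raster and the hard-coded node coordinates are hypothetical, not the actual area map:

```python
import numpy as np

def binarize_and_dilate(gray, k=3):
    """A pixel becomes white (passable, value 1) when any pixel in its
    k x k neighborhood has gray value > 0 - i.e. dilation followed by
    thresholding, matching the preprocessing step described above."""
    h, w = gray.shape
    pad = k // 2
    padded = np.pad(gray, pad)
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            if padded[y:y + k, x:x + k].max() > 0:
                out[y, x] = 1
    return out

# abstraction: each intersection is a node carrying its pixel coordinates,
# each connecting road an undirected edge (coordinates here are hypothetical)
nodes = {1: (120, 40), 2: (200, 40), 3: (120, 160)}
edges = [(1, 2), (1, 3)]
adj = {i: set() for i in nodes}
for i, j in edges:
    adj[i].add(j)
    adj[j].add(i)  # undirected graph: store both directions
```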
Suppose the area has 30 intersections, numbered in sequence p_1 to p_30, i.e. P = {p_1, p_2, ..., p_30}; the start point of the vehicle is Yuquan, p_30, and the destination is Zijingang, p_11. Vehicle flow at the area boundary is not considered, i.e. the area is assumed closed and vehicles travel only on the roads in the figure. The formalized optimization target of the assisted-driving decision method is to minimize the total transfer cost Σ C_{i,j} accumulated from the start point to the destination, where C_{i,j} is the intersection transfer cost function; that is, the total cost of vehicle travel from start to destination is minimized.
Further, in a specific implementation, in the method for assisting driving decision by reinforcement learning for a dynamic traffic environment provided in the embodiment of the present invention, the step S102 obtains a traffic density distribution of an area road by using an API interface for real-time road condition query, which may specifically include the following steps:
sampling each road, according to its longitude and latitude within the range from the trip's start to its destination, into longitude-latitude points at a certain longitude-latitude step length, and returning the congestion evaluation of each intersection node through the API for real-time road-condition query; the congestion evaluation falls into 5 classes, with the correspondence: 0 - unblocked, 1 - basically unblocked, 2 - slow-moving, 3 - congested, 4 - severely congested; the specific conditions are shown in Table 1 and Table 2 below.
Table 1: request parameters of a certain map API (excerpt)
Table 2: return parameters of a certain map API (excerpt)
Step five, drawing a road-congestion-evaluation layered thermodynamic diagram using the congestion evaluations of all points within the environmental road map;
Step six, calculating the traffic-flow density of each road according to its congestion condition in the thermodynamic diagram; specifically, the following formula can be used:
where ρ_m is the traffic-flow density at the m-th longitude-latitude point of the current road, N is the total number of longitude-latitude points, c_m is the road-congestion evaluation value at the m-th point, l_m is the longitude-latitude step length, and Σ_m l_m is the total length of the road.
Step seven, obtaining the traffic-flow density distribution of the regional roads from the calculated density of each road.
Specifically, taking area A as Yuquan and area B as Zijingang as an example, each road within the Yuquan-to-Zijingang range is sampled, according to its longitude and latitude, into longitude-latitude points at a certain step length, and the congestion evaluations of these points are returned through the API. Using the congestion evaluations of all points in the whole Yuquan-to-Zijingang range, a road-congestion-evaluation layered thermodynamic diagram is drawn, and the total traffic-flow density of each road is calculated from its congestion condition in the thermodynamic diagram. Fig. 5 shows the resulting road-congestion-evaluation layered thermodynamic diagram, and the calculation yields the upper-triangular matrix of map road traffic-flow density shown in fig. 6. At this point, the reading and arrangement of the initial traffic-state information are complete, and an upper-triangular matrix of the map road traffic-flow density has been obtained.
Thus an initial value of the urban-area traffic-flow distribution is obtained from the environmental road map and the road-congestion-evaluation layered thermodynamic diagram; a dynamic traffic model is then constructed and driven according to the dynamic changes of the roads.
Further, in a specific implementation of the reinforcement-learning-assisted driving decision method provided in the embodiment of the present invention, the process of constructing the dynamic traffic model in step S103 may specifically include: at each traffic-flow update, putting the current intersection and road-vehicle information into the dynamic traffic model, and obtaining the inter-intersection vehicle-transfer information of each iteration through the dynamic traffic model, so as to update the post-transfer state information of each intersection and road. It can be understood that the dynamic-traffic-environment modeling of the invention borrows the idea of cellular automata to simulate traffic flow dynamically, with the traffic-flow data updates and the road traffic-flow rule updates iterating with each other as the program runs.
The state updating formulas of the traffic flow and the traffic flow density in the dynamic traffic model are respectively as follows:
where each road is denoted by the intersection numbers i, j at its two ends, X_{n×1} is the intersection vehicle-count matrix, L_{n×1} is the intersection outflow vehicle-count matrix, P_{n×n} is the outflow-vehicle steering-state matrix, ρ_{n×n} is the road traffic flow density matrix, and t denotes the current time.
Specifically, taking area A as Yuquan and area B as Zijingang as an example, state-based discrete modeling of the dynamic traffic is performed. Given that the Zijingang-Yuquan area has n intersections, the basic model variables obtained from the actual traffic network are: the intersection vehicle-count matrix X_{n×1}, the road traffic flow density matrix ρ_{n×n}, the road length matrix Y_{n×n}, and the road width matrix α_{n×n}; each road is denoted by the intersection labels i, j at its two ends. The matrices Y_{n×n} and α_{n×n} can be fixed values, obtained by abstract modeling of the actual roads in an earlier stage, while X_{n×1} and ρ_{n×n} are variables updated in real time: the inter-intersection vehicle transfer information of each iteration is obtained through the dynamic traffic model, and the post-transfer state information of each intersection and road is then updated. The specific process is shown in fig. 7, and an intersection modeling schematic is shown in fig. 8.
At each traffic flow update, the current intersection and road vehicle information is fed into the dynamic traffic network model to obtain the inter-intersection vehicle transfer information, including the intersection outflow vehicle-count matrix L_{n×1} and the outflow-vehicle steering-state matrix P_{n×n}. Based on the Markov property, the successor state S' is completely determined by the current state S, and the traffic state is updated accordingly.
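The iterated state update can be sketched as follows. The patent's exact update formulas appear only in the figures, so the form X(t+1) = X(t) - L(t) + P(t)^T L(t) (vehicles leave their intersection and arrive at the intersection selected by the steering matrix) is an assumption, as are all names.

```python
import numpy as np

def update_state(X, L, P):
    """One iteration of the traffic flow state update (assumed form).

    X: (n,) vehicle count at each intersection
    L: (n,) vehicles flowing out of each intersection this cycle
    P: (n, n) row-stochastic steering matrix; P[i, j] is the probability
       that a vehicle leaving intersection i heads to intersection j
    """
    arrivals = P.T @ L        # vehicles arriving at each intersection
    return X - L + arrivals   # Markov-style successor state

X = np.array([10.0, 5.0, 0.0])
L = np.array([4.0, 2.0, 0.0])
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
X_next = update_state(X, L, P)  # total vehicle count is conserved
```

Note that because P is row-stochastic for intersections with outflow, the update conserves the total number of vehicles, as a closed traffic network should.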
In a specific implementation, in the reinforcement learning aided driving decision method for a dynamic traffic environment provided by the embodiment of the present invention, in order to simulate the vehicle flow between intersections and convert the outflow of a single intersection into a probability problem, step S103 may specifically include the following step in the construction of the dynamic traffic model: normalizing the intersection turning probabilities in the dynamic traffic model so that the turn-out probabilities of each intersection, one for each of the four directions, sum to 1. The specific formula is as follows:
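The normalization formula itself is shown only as an image; a minimal sketch of the step it describes (all names assumed) is:

```python
def normalize_turning(weights):
    """Normalize the raw turn-out weights of one intersection (up to four
    directions) so that their sum equals 1."""
    total = sum(weights.values())
    if total == 0:
        return {d: 0.0 for d in weights}  # no admissible exit direction
    return {d: w / total for d, w in weights.items()}

p = normalize_turning({"N": 2.0, "E": 1.0, "S": 0.0, "W": 1.0})
# p == {"N": 0.5, "E": 0.25, "S": 0.0, "W": 0.25}; the values sum to 1
```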
in specific implementation, in the reinforcement learning aided driving decision method for a dynamic traffic environment provided by the embodiment of the present invention, the dynamic traffic model may include an intersection turning probability model and an intersection outgoing vehicle model. Wherein, the turning rules of the intersection turning probability model comprise generality and particularity. The generality means that the larger the traffic density is, the larger the probability is; and the greater the degree of the skeleton, the greater the probability. The particularity is obstacle avoidance, and the probability is higher when the driving speed is higher; and destination randomness.
Because the dynamic traffic network simulates the running state of general traffic flow, the destination is unknown and random. When the intersection turning probability model is constructed, general reasoning and special-case considerations are therefore combined to make a reasonable conjecture about the turning choice of vehicles at an intersection. From general reasoning, for a vehicle with an unknown destination, the probability of entering an arterial road or a road carrying a large traffic flow is obviously higher than that of other roads; that is, the probability is larger when the traffic flow density is higher and the arterial degree is higher. From special-case considerations, since the vehicle destination is unknown, if the vehicle's turning choice at the intersection is not unique, it may select a road with a small traffic flow based on the obstacle-avoidance idea; that is, the probability is larger when the attainable driving speed is larger. Conversely, if the turning choice at the intersection is unique and determined, the turn is the only purposeful option, and the model generates it by random number simulation.
Considering the above factors, the intersection turning probability model P_{i,j} can be as follows:
where each road is denoted by the intersection labels i, j at its two ends; v(ρ) is the speed as a function of density; ρ_{i,j} is the traffic flow density of the connecting road between the i-th intersection and the j-th intersection; β_1, β_2, β_3, β_4 are all weighting coefficients, with β_4·rand denoting a random factor; V_max denotes the road speed upper limit; and ρ_max denotes the density upper limit, at which the corresponding speed is 0.
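Since the published formula for P_{i,j} is shown only as an image, the sketch below assumes one plausible reading of the definitions above: a weighted sum of the density term, the arterial degree, an obstacle-avoidance term based on the linear speed-density relation v(ρ) = V_max·(1 - ρ/ρ_max), and the random factor β_4·rand. The exact combination and all names are assumptions.

```python
V_MAX = 60.0    # assumed road speed upper limit
RHO_MAX = 1.0   # assumed density upper limit (speed is 0 there)

def v_of_rho(rho):
    """Linear speed-density relation v(rho) assumed from the text."""
    return V_MAX * (1.0 - min(rho, RHO_MAX) / RHO_MAX)

def turning_weight(rho_ij, arterial_degree, rand_value,
                   b1=1.0, b2=1.0, b3=1.0, b4=0.1):
    """Unnormalized turning weight toward road (i, j); the weights of one
    intersection are normalized afterwards so that they sum to 1."""
    return (b1 * rho_ij                       # generality: denser roads attract flow
            + b2 * arterial_degree            # generality: arterial roads attract flow
            + b3 * v_of_rho(rho_ij) / V_MAX   # particularity: obstacle avoidance
            + b4 * rand_value)                # particularity: destination randomness

w = turning_weight(rho_ij=0.5, arterial_degree=1.0, rand_value=0.5)  # -> 2.05
```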
Specifically, the intersection outflow vehicle model takes into account the traffic flow density and the vehicle running speed; the running speed comprises a base running speed and a congestion-affected speed. The principle of the outflow count is: the number of vehicles flowing out of the intersection equals road width × traffic flow density × vehicle speed × time.
The intersection outflow vehicle model is a state-based discrete model with a time interval of 1 per iteration update cycle; the number of intersection outflow vehicles L_j is calculated by the following formula:
where α_{i,j} is the width-matrix element of the connecting road between the i-th intersection and the j-th intersection.
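The outflow principle above (width × density × speed × time) can be sketched as follows; the linear speed-density relation and all names are assumptions:

```python
V_MAX, RHO_MAX = 60.0, 1.0  # assumed speed and density upper limits

def v_of_rho(rho):
    # linear speed-density relation: speed falls to 0 at rho = RHO_MAX
    return V_MAX * (1.0 - min(rho, RHO_MAX) / RHO_MAX)

def outflow_count(width, rho, dt=1.0):
    """Vehicles leaving via one road during one update cycle:
    road width x traffic flow density x travel speed x time interval."""
    return width * rho * v_of_rho(rho) * dt

n_out = outflow_count(width=2.0, rho=0.5)  # 2.0 * 0.5 * 30.0 * 1 = 30.0
```

Note that the outflow vanishes both on an empty road (ρ = 0) and on a fully jammed one (ρ = ρ_max, where the speed is 0), matching the model's stated behavior.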
Further, in a specific implementation, in the above reinforcement learning aided driving decision method for a dynamic traffic environment provided by the embodiment of the present invention, the operation cost equals the time spent, and the time of an intersection transfer must account not only for the time of the transfer itself (the longer the time, the greater the cost) but also for the time spent travelling to the destination after the transfer (the farther the distance, the greater the cost). Step S104, obtaining the transfer cost of the vehicle between any two intersections, therefore specifically includes the following steps:
Firstly, calculating the travel time of the road connecting two intersections according to the relation between traffic flow density and running speed;
the steps are modeling for the road running time between intersections: knowing a traffic density matrix rho and a road length matrix Y, and assuming that the road density and the running speed are in a linear relation, the running time of a road connected between two road ports is the ratio of the road length to the running speed; specifically, the inter-intersection road running time t 1 The expression of (c) is:
where y_{ij} denotes the length of the road between the intersections.
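The first step can be sketched as follows; the linear speed-density relation is the text's assumption, and the names are illustrative:

```python
def v_of_rho(rho, v_max=60.0, rho_max=1.0):
    """Linear speed-density relation assumed in the text."""
    return v_max * (1.0 - min(rho, rho_max) / rho_max)

def road_travel_time(y_ij, rho_ij):
    """t1: road length divided by the density-dependent travel speed.
    A fully jammed road (speed 0) yields an infinite travel time."""
    v = v_of_rho(rho_ij)
    return float("inf") if v == 0 else y_ij / v

t1 = road_travel_time(y_ij=30.0, rho_ij=0.5)  # 30.0 / 30.0 = 1.0
```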
Secondly, calculating the waiting time of the intersection according to the relationship between the waiting time of the intersection and the number of vehicles at the intersection;
the steps are modeling for the intersection waiting time: knowing the matrix X of the number of vehicles at the intersection, assuming a roadThe intersection waiting time is in a linear relation with the number of the intersection vehicles, and the intersection waiting time is the product of k and the number of the intersection vehicles; in particular, the road waiting time t 2 The expression of (a) is: t is t 2 =k·x i Wherein x is i And k is a linear relation coefficient, and is the number of vehicles at the intersection corresponding to the ith intersection.
Thirdly, calculating the driving time from the target intersection to the terminal according to the coordinates of the target intersection, the coordinates of the terminal and the driving speed of the current road;
the steps are modeling for the driving time from the target intersection to the terminal: assuming that the running speed is the average running speed of the current road, the running time from the target intersection to the terminal point is the ratio of the distance to the running speed; specifically, the travel time t from the target intersection to the end point 3 The expression of (a) is:
where the distance d can be calculated from the coordinates of the target intersection and the coordinates of the end point.
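The third step can be sketched as follows; the straight-line planar distance is an assumption (with real longitude-latitude coordinates, a haversine distance would be more appropriate):

```python
import math

def time_to_destination(target_xy, end_xy, avg_speed):
    """t3: distance from the target intersection to the end point
    divided by the average running speed of the current road."""
    d = math.dist(target_xy, end_xy)  # Euclidean distance between coordinates
    return d / avg_speed

t3 = time_to_destination((0.0, 0.0), (3.0, 4.0), avg_speed=5.0)  # 5.0 / 5.0 = 1.0
```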
Fourthly, linearly weighting the travel time of the road connecting the two intersections, the intersection waiting time, and the travel time from the target intersection to the end point to obtain the transfer cost of two adjacent intersections;
specifically, according to the calculation results of the first step to the third step, since the three partial functions have the same dimension, the linear weighting is performed on the three partial functions, and the total transfer cost expression C from the ith intersection to the jth intersection can be obtained ij :
And fifthly, obtaining the transfer cost between any two intersections according to the product of the transfer cost of two adjacent intersections and the reciprocal of the reachable matrix element.
Specifically, on the basis of the result of the fourth step, the reachability of non-adjacent intersections is considered and the transfer cost expression is further corrected: for two non-adjacent intersections the transfer cost is expected to be infinite, giving the transfer cost C_{ij} between any two intersections:
where a_{ij} indicates whether intersection j is reachable from intersection i: 1 if reachable, 0 if not.
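The fourth and fifth steps can be sketched together as follows; the equal weights and all names are assumptions:

```python
import math

def transfer_cost(t1, t2, t3, a_ij, w1=1.0, w2=1.0, w3=1.0):
    """C_ij: linear weighting of road travel time t1, intersection waiting
    time t2 and time-to-destination t3, combined with the reciprocal of the
    reachability element a_ij (1 reachable, 0 not), so that unreachable
    intersection pairs get infinite cost."""
    base = w1 * t1 + w2 * t2 + w3 * t3
    return base if a_ij == 1 else math.inf

c_adjacent = transfer_cost(1.0, 0.5, 2.0, a_ij=1)     # 3.5
c_unreachable = transfer_cost(1.0, 0.5, 2.0, a_ij=0)  # infinite
```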
Further, in a specific implementation, in the reinforcement learning aided driving decision method for a dynamic traffic environment provided by the embodiment of the present invention, step S105, performing reinforcement-learning-based path planning, may specifically include: performing path planning with the Q-learning reinforcement learning algorithm. In the assisted driving decision part, the route decision is thus considered under a local-information view, and the Q-learning reinforcement learning algorithm handles the common real-world situation in which global information cannot be obtained, thereby improving the assisted driving decision effect.
In a specific implementation, the process of performing path planning with the Q-learning reinforcement learning algorithm may specifically include: the reinforcement learning state set S = {s_1, s_2, ..., s_m} is defined as the sequence numbers of the intersections passed from the driving start point to the end point; the action set A = {a_1, a_2, ..., a_n} is defined as the state at the next moment, i.e., the sequence number of the intersection to go to next; the elements of the reinforcement learning reward matrix R are defined as the transfer cost of each road at each intersection, where, at its current intersection, the reinforcement learning agent can only know the information of the surrounding connected roads; the transition probability is derived from the dynamic traffic model and describes the environment model with which the agent interacts; and the Q-value matrix of the Q-learning algorithm is updated according to the Bellman equation of reinforcement learning theory. The stepwise iterative update of the Q values in the Q-value matrix is described as follows:
wherein α is the learning rate of Q-learning reinforcement learning.
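A single Q-value update consistent with the Bellman-equation description can be sketched as follows. Since costs are minimized in this problem, the target uses the minimum over next-state actions; this sign convention, and all names, are assumptions.

```python
import numpy as np

def q_update(Q, s, a, cost, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (target - Q(s,a)),
    with target = cost + gamma * min_a' Q(s_next, a')."""
    target = cost + gamma * Q[s_next].min()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((3, 3))                         # 3 intersections, 3 actions
q_update(Q, s=0, a=1, cost=2.0, s_next=1)    # Q[0, 1] moves toward the target
```

Repeating this update over many simulated trips drives Q toward the minimum expected transfer cost of each (intersection, next-intersection) pair.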
It should be noted that, based on the analysis of the dynamic traffic model, the problem has been transformed into solving the minimum-cost path from the start point to the end point under the undirected connected graph model, with reasonable use of global or local information. A Q-learning reinforcement learning algorithm is therefore selected, using local information for path planning. The primary symbols are defined as follows:
State s: the information the reinforcement learning agent uses to decide the strategy to take at the next moment. The state at time t is denoted s_t, and the state set of the whole problem is S = {s_1, s_2, ..., s_m}.
Action a: the strategy taken by the reinforcement learning agent in a given state; by taking different actions the agent transitions between different states. The action at time t is denoted a_t, and the action set of the whole problem is A = {a_1, a_2, ..., a_n}.
Observation o: the environmental information observed by the reinforcement learning agent at different moments and in different states. The observation at time t is denoted o_t.
Policy π: the probability that the reinforcement learning agent takes action a_j in state s_i, i.e., π(a_j|s_i) = P(A_t = a_j | S_t = s_i). For a deterministic process, each state s_i should have only one optimal choice, namely:
state transition matrix P: each row of the state transition matrix corresponds to one state and each column corresponds to the other state, element p ij Represents a state of being i To state s j Corresponding probability P(s) j |s i ). It is obvious that the state transition matrix has properties
Cost value matrix R: each row of the cost value matrix corresponds to a state and each column to an action; element r_{ij} represents the cost value of taking action a_j in state s_i.
Discount factor γ: indicates the relative importance of future rewards versus the current reward.
Action value matrix Q: each row of the action value matrix corresponds to a state and each column to an action, where element q_{ij} = q(s_i, a_j) = E_π[R_{t+1} + γ·q(s_{t+1}, a_{t+1}) | s_t = s_i, a_t = a_j] represents the expected action value of taking action a_j in state s_i.
In the problem-solving process, if the reinforcement learning agent can only obtain local information, then when positioned at intersection Cross 1 its observable information comprises only the vehicle information of the adjacent intersections Cross i and the traffic flow density of the roads Road(1, i) connecting them; it is here that the advantages of the reinforcement learning method are well reflected.
During the learning process, the reinforcement learning agent needs to build an environment model step by step to estimate the changes of the environment. The model comprises two parts: the transition probability estimate and the reward value estimate. The former denotes the probability of reaching S_{t+1} = s' given S_t = s and A_t = a, while the latter is the estimate of the next-moment reward given S_t = s and A_t = a. The transition probability estimate can generally be obtained by Monte Carlo stochastic simulation, and the reward estimate typically through repeated iterations of the Markov chain to convergence.
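The Monte Carlo estimate of the transition probability can be sketched as follows (names assumed):

```python
from collections import Counter, defaultdict

def estimate_transitions(samples):
    """Monte Carlo estimate of P(s' | s, a) from (s, a, s') triples
    observed while simulating the dynamic traffic model."""
    counts = defaultdict(Counter)
    for s, a, s_next in samples:
        counts[(s, a)][s_next] += 1
    # Normalize each (s, a) row of counts into a probability distribution.
    return {sa: {sn: c / sum(cnt.values()) for sn, c in cnt.items()}
            for sa, cnt in counts.items()}

P_hat = estimate_transitions([(0, 1, 2), (0, 1, 2), (0, 1, 3), (1, 0, 0)])
# P_hat[(0, 1)] == {2: 2/3, 3: 1/3}; P_hat[(1, 0)] == {0: 1.0}
```

As more simulated triples accumulate, the empirical frequencies converge to the model's true transition probabilities.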
Considering the path planning problem based on local information, it can be abstracted into a Markov decision process with 30 states and 30 actions, and the Q-learning method is adopted here to solve it.
After the environment model is obtained, the Q-learning algorithm may be used to select a policy. In reinforcement learning, an agent may only be able to obtain a series of action values, so when taking an action it must consider not only the reward of performing the current action but also the future reward that reaching the next state may bring. In mathematical language:
q_π(s, a) = E_π[R_{t+1} + γ·q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
Therefore, the quantities involved in the Q-learning iteration include: the reward value R_{t+1} obtained by performing action A_t = a in a specific state S_t = s; the next state S_{t+1} reached by performing action A_t = a in state S_t = s; and the Q value q_π(S_{t+1}, A_{t+1}) of the action A_{t+1} taken in state S_{t+1}, chosen either by the current policy π or at random.
First, for the reward value R_{t+1}: in this problem, the cost value of each road at each intersection is used as the reward value. At its current intersection Cross i, the reinforcement learning agent only knows the cost values of the surrounding roads Road(i, j) for the intersections j connected to Cross i.
Second, the learning agent reaches the next state S_{t+1} by performing action A_t = a in a specific state S_t = s, which requires the previously obtained estimation model. The estimation model gives the probability of reaching each possible S_{t+1} = s' when action A_t = a is performed in state S_t = s, from which the agent's next state can be obtained.
Again, for the Q value q_π(S_{t+1}, A_{t+1}) of the action A_{t+1} taken in state S_{t+1} according to the current policy π or at random: since Q-learning is an iterative process, when iteratively updating q_π(S_t, A_t) the q_π(S_{t+1}, A_{t+1}) obtained in the previous iteration can be used. The update of the Q-value matrix can thus be learned iteratively step by step according to the Bellman equation.
The assisted driving decision result based on Q-learning reinforcement learning is shown in fig. 9. The darkest path represents the suggested route generated by the assisted driving decision, and the gray value of each road section represents its congestion degree. Note that since the actual program is a dynamic process of state switching, it cannot be fully depicted in a static picture here.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
To sum up, the reinforcement learning aided driving decision method for the dynamic traffic environment provided by the embodiment of the invention comprises the following steps: extracting urban area road information from an environment road map, and abstracting the urban area road information into a road area undirected graph after simplifying preprocessing; acquiring the traffic density distribution condition of regional roads by using an API (application programming interface) for real-time road condition query; constructing a dynamic traffic model according to the traffic flow density distribution condition and the road traffic rule; obtaining the transfer cost of the vehicle between any two intersections; in the road area undirected graph, a path planning based on reinforcement learning is carried out according to a dynamic traffic model and corresponding transfer cost, and a minimum cost path from a starting point to a terminal point is solved. Therefore, dynamic traffic environment modeling is carried out according to the environment road map and real-time road condition query API interface data, simple and easily-obtained dynamic traffic flow information is fully utilized to provide an auxiliary driving decision scheme for a driver, dynamic route planning under a local information view angle is further realized, and the problem that a path planning result is inaccurate easily caused by only using static information is effectively avoided.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The dynamic traffic environment-oriented reinforcement learning aided driving decision method provided by the invention is described in detail above, a specific example is applied in the method to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (8)
1. A reinforcement learning auxiliary driving decision method for a dynamic traffic environment is characterized by comprising the following steps:
extracting urban area road information from an environment road map, and abstracting the urban area road information into a road area undirected graph after simplifying preprocessing;
acquiring the traffic density distribution condition of regional roads by using an API (application programming interface) for real-time road condition query;
constructing a dynamic traffic model according to the traffic flow density distribution condition and the road traffic rule; the dynamic traffic model comprises an intersection turning probability model; the intersection turning probability model P_{i,j} is as follows:
wherein each road is represented by the intersection labels i and j at its two ends; v(ρ) is the speed as a function of density; ρ_{i,j} is the traffic flow density of the connecting road between the i-th intersection and the j-th intersection; β_1, β_2, β_3, β_4 are all weighting coefficients, with β_4·rand denoting a random factor; V_max represents the road speed upper limit; and ρ_max represents the density upper limit, at which the corresponding speed is 0;
the dynamic traffic model further comprises an intersection outflow vehicle model; the intersection outflow vehicle model is a state-based discrete model with a time interval of 1 per iteration update cycle, and the number of intersection outflow vehicles L_j is calculated by the following formula:
wherein α_{i,j} is the width-matrix element of the connecting road between the i-th intersection and the j-th intersection;
obtaining the transfer cost of the vehicle between any two intersections;
and in the road area undirected graph, performing a path planning based on reinforcement learning according to the dynamic traffic model and the corresponding transfer cost, and solving a minimum cost path from a starting point to a terminal point.
2. The dynamic traffic environment-oriented reinforcement learning aided driving decision-making method according to claim 1, wherein an API (application program interface) for real-time road condition query is used for obtaining the traffic flow density distribution condition of regional roads, and the method comprises the following steps:
according to the longitude and latitude of each road in the range from the driving starting point to the destination, dividing the road into longitude and latitude points corresponding to the road according to a certain longitude and latitude step length, and returning the congestion evaluation of each intersection node through an API (application program interface) for real-time road condition query;
utilizing the congestion evaluations of all points in the whole environmental road map range to draw a layered road congestion evaluation heat map;
calculating the traffic flow density of each road according to the congestion condition of each road in the heat map;
and acquiring the traffic density distribution condition of the regional roads according to the calculated traffic density of each road.
3. The dynamic traffic environment-oriented reinforcement learning aided driving decision method according to claim 2, is characterized in that the traffic flow density of each road is calculated by the following formula:
wherein ρ_m is the traffic flow density at the m-th longitude-latitude point of the current road, N is the total number of longitude-latitude points, c_m is the road congestion evaluation value corresponding to the m-th longitude-latitude point of the road, l_m is the longitude-latitude step length of the road, and Σ l_m is the total length of the road.
4. The dynamic traffic environment-oriented reinforcement learning aided driving decision method according to claim 1, is characterized in that in the process of constructing a dynamic traffic model, the method comprises the following steps:
and when the traffic flow is updated every time, the information of the current intersection and the road vehicles is put into the dynamic traffic model, and the inter-intersection vehicle transfer information of each iteration is obtained through the dynamic traffic model so as to update the state information of each intersection and road vehicle after transfer.
5. The dynamic traffic environment-oriented reinforcement learning aided driving decision method according to claim 4, wherein in the process of constructing the dynamic traffic model, the method further comprises the following steps:
and normalizing the probability of intersection turning in the dynamic traffic model to meet the sum of the turning-out probabilities of all intersections as 1, wherein the turning-out probability of each intersection has four directions.
6. The dynamic traffic environment-oriented reinforcement learning aided driving decision method according to claim 1, wherein the step of obtaining the transfer cost of the vehicle between any two intersections comprises the following steps:
calculating the running time of the road with two connected road ports according to the relation between the traffic density and the running speed;
calculating the waiting time of the intersection according to the relationship between the waiting time of the intersection and the number of vehicles at the intersection;
calculating the running time from the target intersection to the terminal according to the coordinates of the target intersection, the coordinates of the terminal and the running speed of the current road;
linearly weighting the driving time of the road with the two road junctions connected, the waiting time of the road junctions and the driving time from the target road junction to the terminal point to obtain the transfer cost of two adjacent road junctions;
and obtaining the transfer cost between any two intersections according to the product of the transfer cost of two adjacent intersections and the reciprocal of the reachable matrix element.
7. The dynamic traffic environment-oriented reinforcement learning aided driving decision method according to claim 1, wherein the reinforcement learning-based path planning is performed by:
and performing path planning by using a Q-learning reinforcement learning algorithm.
8. The dynamic traffic environment-oriented reinforcement learning aided driving decision method according to claim 7, wherein in the process of performing path planning by using Q-learning reinforcement learning algorithm, the method comprises the following steps:
defining the reinforcement learning state as the intersection serial number from the starting point to the end point of the driving;
defining the action of reinforcement learning as the state of the next moment;
defining elements in the reinforcement learning reward matrix as the transfer cost of each road of each intersection, wherein under the current intersection, the reinforcement learning intelligent agent only can know the information of the surrounding connected roads;
obtaining a transition probability according to the dynamic traffic model; the environmental model interacted by the reinforcement learning agent is described by the transition probability;
and updating the Q value matrix in the Q-learning reinforcement learning algorithm according to the Bellman equation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210032222.6A CN114384901B (en) | 2022-01-12 | 2022-01-12 | Reinforced learning aided driving decision-making method oriented to dynamic traffic environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210032222.6A CN114384901B (en) | 2022-01-12 | 2022-01-12 | Reinforced learning aided driving decision-making method oriented to dynamic traffic environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114384901A CN114384901A (en) | 2022-04-22 |
CN114384901B true CN114384901B (en) | 2022-09-06 |
Family
ID=81201282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210032222.6A Active CN114384901B (en) | 2022-01-12 | 2022-01-12 | Reinforced learning aided driving decision-making method oriented to dynamic traffic environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114384901B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116518989B (en) * | 2023-07-05 | 2023-09-12 | 新唐信通(浙江)科技有限公司 | Method for vehicle navigation based on sound and thermal imaging |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104157139B (en) * | 2014-08-05 | 2016-01-13 | 中山大学 | A kind of traffic congestion Forecasting Methodology and method for visualizing |
CN106767858A (en) * | 2015-11-24 | 2017-05-31 | 英业达科技有限公司 | Traffic route planning system |
CN108281023B (en) * | 2016-12-30 | 2020-08-21 | 中国移动通信集团公司 | Method and system for displaying real-time road conditions through mobile terminal |
US10796204B2 (en) * | 2017-02-27 | 2020-10-06 | Huawei Technologies Co., Ltd. | Planning system and method for controlling operation of an autonomous vehicle to navigate a planned path |
CN108847037B (en) * | 2018-06-27 | 2020-11-17 | 华中师范大学 | Non-global information oriented urban road network path planning method |
CN109765820B (en) * | 2019-01-14 | 2019-08-09 | 南栖仙策(南京)科技有限公司 | A kind of training system for automatic Pilot control strategy |
CN110363984B (en) * | 2019-06-25 | 2021-04-02 | 讯飞智元信息科技有限公司 | Traffic flow prediction method and apparatus |
2022-01-12: Application CN202210032222.6A filed in China; granted as patent CN114384901B (status: Active)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||