CN109269516B

CN109269516B - Dynamic path induction method based on multi-target Sarsa learning

Info

Publication number: CN109269516B
Application number: CN201810992284.5A
Authority: CN
Inventors: 文峰; 封筱
Original assignee: Shenyang Ligong University
Current assignee: Shenyang Ligong University
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2022-03-04
Anticipated expiration: 2038-08-29
Also published as: CN109269516A

Abstract

The invention provides a dynamic path induction method based on multi-target Sarsa learning, comprising: initializing information; updating information; and calculating the induction path, which includes normalizing the Q vector table, calculating a scalar value based on driver preference, calculating a Boltzmann probability distribution, and selecting by the roulette method the next road section that matches the driver's personal preference, until the driver's vehicle reaches the destination. The method optimizes the route of each vehicle according to the current condition of the traffic system, which improves the efficiency of the traffic system and relieves traffic congestion. From a practical perspective, dynamic path induction is carried out for multiple induction targets simultaneously, meeting the induction requirements of real life. Because the driver's induction preference is considered, each driver is given a dynamic induction path that matches personal preference, which raises the acceptance rate of the induced path and thereby further improves traffic efficiency and relieves congestion.

Description

Dynamic path induction method based on multi-target Sarsa learning

Technical Field

The invention belongs to the technical field of intelligent transportation, and particularly relates to a dynamic path induction method based on multi-target Sarsa learning.

Background

In recent years, with the rapid development of China's economy, private car ownership has risen continuously, and problems such as growing urban traffic pressure, traffic congestion, and frequent traffic accidents have become increasingly serious. Furthermore, drivers, as important participants in the traffic system, often have several induction goals at once during a trip and weigh those goals differently. Whether the driver's personal preference is considered strongly affects the acceptance of induction information and thus the efficiency of the traffic system. It is therefore necessary to realize efficient, dynamic route guidance that both relieves traffic congestion and satisfies the individual preference of the driver.

Reinforcement learning has strong self-adaptive and self-learning capability: without prior knowledge or a model, it continuously adjusts its control strategy as the system environment changes and learns from the dynamic information of the system, which meets the control requirements of a traffic guidance system that is highly random and complex. Sarsa learning, an on-policy reinforcement learning algorithm, is particularly suitable for searching for an optimal path and dynamically inducing vehicles in a traffic induction system that is complex, variable, and strongly real-time.

Most currently proposed route guidance models and guidance algorithms are single-target methods built only on road section travel time; they ignore the induction requirements of real life and the individual preference of drivers. Multi-objective reinforcement learning is commonly used to solve such multi-objective optimization problems, and its methods for finding an optimal solution set fall mainly into single-strategy methods and multi-strategy methods. Compared with a single-strategy method, however, a multi-strategy method learns a set of optimal solutions approximating the Pareto frontier at every interaction with the environment, which requires a large amount of computation and computation time. When a multi-strategy method is used in on-policy learning, the amount of computation and the storage time of the corresponding solution set are both large, so it is not suitable for a dynamic path induction system. Single-strategy multi-target Sarsa learning is therefore suitable for solving the dynamic path induction problem that covers multiple induction targets and considers driver preference.

Disclosure of Invention

In light of the above technical problems, an object of the present invention is to provide a dynamic path induction method based on multi-objective Sarsa learning. The method makes full use of real-time traffic data and the personal preference information of the driver, provides the driver with route guidance information that matches personal preference, coordinates traffic across the whole traffic system, relieves traffic congestion, and improves the efficiency of the traffic system.

The technical scheme is as follows: a dynamic path induction method based on multi-target Sarsa learning comprises steps 1 to 3:

Step 1: initializing information, specifically comprising steps 1.1 to 1.3:

Step 1.1: confirming the induction targets: one or more of minimized travel time, minimized travel distance, and minimized cost are selected;

Step 1.2: for the induction targets, the traffic information center uses a Q-value-based dynamic programming algorithm to initialize, from the road network information in the geographic information base and the historically collected static data of each road section, the Q vector table of the induction targets for each candidate destination on the road network, wherein one Q vector table corresponds to one candidate destination;

Step 1.3: setting the time interval T at which the traffic information center issues updated Q value information;

the road network information includes: road network topological structure, road length and number of lanes;

the static data of each road section comprises: historical vehicle transit time, distance, and cost;

Step 2: updating information, specifically comprising: defining the induction target weights, calculating the traffic congestion coefficient of the current road network, and updating the Q vector table by the Sarsa learning method at every interval T:

(1) defining the induction target weights:

recording the current information of all vehicles in the road network, the real-time traffic information of the road section currently being traversed, and the preference of each driver travelling in the road network; assuming that there are n induction targets in total, the preference of each driver is recorded as a weight vector ω = (ω₁, …, ω_n), where ω_o ∈ [0, 1] is the preference weight for the o-th induction target; the weight of each induction target is defined as follows:

each driver sets the degree of attention given to each induction target, and these values form that driver's preference weights;

the current information of all vehicles includes: position, desired destination, and all next traffic nodes that can be reached;

the real-time traffic information of the current road section comprises: travel time, distance, cost;

(2) calculating the traffic congestion coefficient of the current road network: the number NV of vehicles in the current road network is counted, and the traffic congestion coefficient ε of the current road network is calculated from it as follows:

wherein β and γ are parameters, and the traffic congestion coefficient ε represents the current traffic condition of the traffic system; the value of ε increases as the total number NV of vehicles in the current road network increases, so a larger ε means more congested traffic and a smaller ε means less congested traffic.

(3) updating the Q vector table by the Sarsa learning method at every interval T: at every interval T, using the current information of all vehicles in the road network and the real-time traffic information of the current road section recorded in (1) closest to the update time, together with the next road section assigned in steps 3.3 and 3.4, the Q vector table of the corresponding destination is updated for each induction target o according to the Sarsa learning method, whose formula is as follows:

wherein the Q value is that of induction target o with destination d for the road section from traffic node i to the adjacent traffic node j, k is a traffic node adjacent to traffic node j, α is the learning rate, and the reward term is the actual reward value obtained by vehicle v in passing road section s_ij;

the actual reward value is one of travel time, distance, or cost, and only one of these is selected.
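The update formula itself is not reproduced in this text. Assuming the method uses the standard Sarsa update without a discount factor (none is named among the symbols defined above), the update for one induction target might be written as follows. The symbols Q_d^o and r_{ij}^o(v) are notation chosen here for illustration and are not taken from the patent.

```latex
Q_d^{o}(i,j) \leftarrow Q_d^{o}(i,j)
  + \alpha \left[ r_{ij}^{o}(v) + Q_d^{o}(j,k) - Q_d^{o}(i,j) \right]
```

Here Q_d^o(i, j) stands for the Q value of induction target o with destination d on road section s_ij, r_{ij}^o(v) for the reward obtained by vehicle v on s_ij, and s_jk for the next road section assigned to the vehicle.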

Step 3: calculating the induction path, comprising steps 3.1 to 3.5:

Step 3.1: normalizing the Q vector table: based on the Q vector table updated in step 2, the Q values of the different induction targets are each normalized by dispersion standardization (min-max normalization), using the following formula:

wherein the normalized Q value is that of induction target o with destination d for road section s_ij, and the minimum and maximum values are respectively the smallest and largest Q values over all road sections corresponding to destination d and induction target o.
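The normalization of step 3.1 can be sketched as below, assuming the usual min-max form (Q minus the minimum, divided by the maximum minus the minimum). The function name `normalize_q`, the node labels, and the sample Q values are hypothetical, and the handling of the case where all Q values are equal is an assumption not addressed in the text.

```python
from typing import Dict, Tuple

Section = Tuple[str, str]  # (start node i, end node j) of road section s_ij


def normalize_q(q_values: Dict[Section, float]) -> Dict[Section, float]:
    """Min-max normalize the Q values of one induction target for one destination."""
    q_min = min(q_values.values())
    q_max = max(q_values.values())
    span = q_max - q_min
    if span == 0:
        # Assumption: when every section has the same Q value, map all to 0.0.
        return {section: 0.0 for section in q_values}
    return {section: (q - q_min) / span for section, q in q_values.items()}


# Hypothetical travel-time Q values (seconds) toward one destination.
time_q = {("j", "k"): 200.0, ("j", "k1"): 260.0, ("j", "k2"): 320.0}
normalized_time_q = normalize_q(time_q)
```

Applying the same function separately to each induction target puts travel time, distance, and cost on a common scale from 0 to 1 before they are combined.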

Step 3.2: calculating a scalar value based on driver preference: using the corresponding driver preference, namely the weight vector ω obtained in step 2, and the Q vector table normalized in step 3.1, a linear scalarization function converts the Q vectors of all road sections adjacent to the vehicle's current traffic node, in the Q vector table with destination d, into scalar values SQ_d(i, j) based on driver preference; the specific formula is as follows:

wherein n is the number of induction targets, ω_o is the preference weight for target o, and the normalized Q value is that of target o with destination d for road section s_ij;

Step 3.3: calculating the Boltzmann probability distribution: using the current vehicle information obtained in step 2 and the scalar values SQ_d(i, j) based on driver preference, the Boltzmann probability distribution over the road sections adjacent to the current traffic node is calculated, using the following formula:

wherein P_d(i, j) is the probability that a vehicle with destination d selects road section s_ij; i and j are traffic nodes; A(i) is the set of end nodes of the road sections starting from traffic node i, that is, the set of end nodes of the road sections adjacent to the current node obtained from the road network topology; ε is the traffic congestion coefficient; and ESQ_d(i) is the average of the driver-preference-based scalar values SQ_d(i, j) of the road sections around node i toward destination d.
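The Boltzmann formula is not reproduced in this text, so the following is only a generic sketch of a Boltzmann (softmax) distribution over candidate road sections. It assumes that a smaller scalar value yields a larger probability, which agrees with the embodiment, where the section with the smallest SQ value receives the largest probability. The parameter `tau` is a hypothetical stand-in for the temperature term that the method derives from ε and ESQ_d(i); its exact form is not assumed here, and the sample SQ values are invented.

```python
import math
from typing import Dict


def boltzmann_probabilities(sq: Dict[str, float], tau: float) -> Dict[str, float]:
    """Return selection probabilities proportional to exp(-SQ / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Subtract the smallest SQ before exponentiating for numerical stability;
    # the shift cancels out in the normalization.
    sq_min = min(sq.values())
    weights = {node: math.exp(-(value - sq_min) / tau) for node, value in sq.items()}
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Hypothetical scalar values for three candidate next nodes.
sq_values = {"k": 0.2, "k1": 0.3, "k2": 0.4}
probabilities = boltzmann_probabilities(sq_values, tau=0.3)  # tau is a placeholder
```

A larger `tau` flattens the distribution and spreads vehicles over more road sections, while a smaller `tau` concentrates the choice on the section with the lowest scalar value.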

Step 3.4: selecting the next road section that matches the driver's personal preference: based on the Boltzmann probability distribution over road sections calculated in step 3.3, the next road section is selected by the roulette method;

Step 3.5: if the vehicle has not reached the destination, steps 3.2 to 3.3 are repeated until the vehicle reaches the destination.
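The roulette method of step 3.4 can be illustrated with a minimal sketch of roulette-wheel selection over a probability table. The function name `roulette_select` and the node labels `k`, `k1`, `k2` are hypothetical; the probabilities are those reported in the embodiment, and the fixed random seed is used only to make the illustration repeatable.

```python
import random
from typing import Dict, Optional


def roulette_select(probabilities: Dict[str, float],
                    rng: Optional[random.Random] = None) -> str:
    """Pick one key with chance proportional to its probability (roulette wheel)."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    if rng is None:
        rng = random.Random()
    # Spin the wheel: a uniform point on [0, total probability mass).
    threshold = rng.random() * sum(probabilities.values())
    cumulative = 0.0
    last_node = ""
    for node, p in probabilities.items():
        cumulative += p
        last_node = node
        if threshold < cumulative:
            return node
    # Guard against floating-point round-off at the upper end of the wheel.
    return last_node


probs = {"k": 0.3705, "k1": 0.3387, "k2": 0.2908}
counts = {node: 0 for node in probs}
rng = random.Random(0)
for _ in range(10_000):
    counts[roulette_select(probs, rng)] += 1
```

Over many draws the selection counts approach the given probabilities, so road sections with better scalar values are chosen more often without the others being excluded.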

The beneficial technical effects are as follows:

1. The method makes full use of the real-time information of the current traffic system and optimizes the route of each vehicle according to the current traffic conditions, improving the efficiency of the traffic system and relieving traffic congestion.

2. The method takes a practical perspective and performs dynamic path induction for multiple induction targets simultaneously, meeting the induction requirements of real life.

3. The method takes the driver's induction preference into account and provides the driver with a dynamic induction path that matches personal preference, which raises the acceptance rate of the induced path, further improves the efficiency of the traffic system, and relieves traffic congestion.

Drawings

FIG. 1 is a flowchart of a dynamic path induction method based on multi-objective Sarsa learning according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of dynamic path induction according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a vehicle path calculation according to an embodiment of the present invention;

FIG. 4 compares traffic congestion under an embodiment of the invention with that under conventional induction methods.

Detailed Description

The invention is further explained below with reference to the drawings and a specific embodiment. The whole process of information interaction between the dynamic path guidance system and the vehicle is shown in FIG. 2. A vehicle in the road network sends data such as its position, destination, and personal preference to the dynamic path guidance system; the system combines the data sent by the vehicle with collected information such as the real-time traffic conditions of the road network, uses the path guidance algorithm to calculate a guidance path that matches the personal preference, and sends that path to the vehicle, completing the information interaction between the two parties. A dynamic path induction method based on multi-target Sarsa learning comprises steps 1 to 3, as shown in FIG. 1:

step 1: initializing information, specifically comprising steps 1.1 to 1.3:

step 1.1: confirming an induction target: travel time and travel cost;

Step 1.2: for the induction targets, the traffic information center uses a Q-value-based dynamic programming algorithm to initialize, from the road network information in the geographic information base and the historically collected static data of each road section, the Q vector table of the induction targets for each candidate destination on the road network, wherein one Q vector table corresponds to one candidate destination; first, the Q vectors of the road sections around each possible destination d are initialized, as follows:

wherein the initialized Q vector is that of road section s_ij for reaching destination d, i and j are traffic nodes, time_ij and cost_ij are respectively the time and cost of a vehicle passing road section s_ij, D is the set of destinations, A(i) is the set of end nodes of the road sections starting from traffic node i, and B(i) is the set of start nodes of the road sections ending at traffic node i.

Then the Q vectors of all road sections corresponding to destination d are updated over several iterations, using the following update formula:

wherein the Q vector is that obtained at the n-th iteration for road section s_ij with destination d, and k is a traffic node adjacent to traffic node j.

the static data of each road section comprises: historical vehicle transit time and cost;
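The initialization and iteration formulas are not reproduced in this text. The sketch below shows one plausible reading of a Q-value-based dynamic programming initialization for a single induction target: sections ending at the destination start from their own static cost, and every other section is iterated as its own cost plus the smallest Q value among the sections leaving its end node. This recurrence, the function name `initialize_q`, and the sample network are assumptions made for illustration, not the patent's stated formulas.

```python
from typing import Dict, List, Tuple

Section = Tuple[str, str]  # (start node i, end node j) of road section s_ij
INF = float("inf")


def initialize_q(costs: Dict[Section, float], destination: str,
                 iterations: int) -> Dict[Section, float]:
    """Iterate Q_d(i, j) = cost_ij + min_k Q_d(j, k) for one induction target."""
    successors: Dict[str, List[str]] = {}
    for i, j in costs:
        successors.setdefault(i, []).append(j)

    # Sections that end at the destination start from their own static cost;
    # every other section starts as "unknown" (infinite).
    q = {(i, j): (c if j == destination else INF) for (i, j), c in costs.items()}

    for _ in range(iterations):
        updated = dict(q)
        for (i, j), c in costs.items():
            if j == destination:
                continue
            best_next = min((q[(j, k)] for k in successors.get(j, [])), default=INF)
            updated[(i, j)] = c + best_next
        q = updated
    return q


# Hypothetical four-node network with historical travel times in seconds.
travel_time = {("a", "b"): 30.0, ("a", "c"): 50.0, ("b", "c"): 10.0,
               ("b", "d"): 60.0, ("c", "d"): 20.0}
q_time = initialize_q(travel_time, destination="d", iterations=3)
```

Running the same routine once per induction target (for example once with travel times and once with costs) would fill each component of the Q vector table for destination d.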

as shown in fig. 3, taking a vehicle v with a destination d and located at a traffic node j as an example, the dynamic induction process is as follows:

(1) defining the induction target weights: recording the current information of all vehicles in the road network, the real-time traffic information of the road section currently being traversed, and the preference of each driver travelling in the road network; assuming that there are n induction targets in total, the preference of each driver is recorded as a weight vector ω = (ω₁, …, ω_n), where ω_o ∈ [0, 1] is the preference weight for the o-th induction target; the weight of each induction target is defined as follows:

the real-time traffic information of the current road section comprises: travel time, cost;

recording the current information of vehicle v, such as its position (traffic node j), its desired destination (traffic node d), and all next reachable traffic nodes (k, k′, k″), together with the travel time, cost, and other traffic information for the current road section s_ij and the driver's preference. The preference weight vector ω of each driver is (0.8, 0.2), where 0.8 and 0.2 are the preference weights for the time and cost induction targets, respectively.

(2) calculating the traffic congestion coefficient of the current road network: β and γ are parameters, and the traffic congestion coefficient ε represents the current traffic condition of the traffic system; the value of ε increases as the total number NV of vehicles in the current road network increases, so a larger ε means more congested traffic and a smaller ε means less congested traffic. Here β and γ are set to 0.3 and 0.005, respectively; assuming that the number NV of vehicles in the current road network is 500, ε is 0.8.

(3) at every interval T, the update uses the current information of all vehicles in the road network and the real-time traffic information of the current road section recorded in (1) closest to the update time, for example the immediate travel time and the immediate cost of vehicle v on road section s_ij, together with the next road section s_jk assigned by the path selection of step 3. Assume that the learning rate α is 0.7 and that the two relevant entries in the current Q vector table are (250 s, 21) and (200 s, 20), respectively. The Q vector table of the corresponding destination d is then updated for each induction target according to the Sarsa learning method. The Sarsa learning formula is as follows:

wherein the Q vector is that of the route that starts from traffic node i, passes through the adjacent traffic node j, and ends at d.
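A worked sketch of this update for a two-component Q vector (travel time, cost) is given below. It assumes the standard undiscounted Sarsa form, applied to each induction target separately. The learning rate 0.7 and the Q vectors (250 s, 21) and (200 s, 20) come from the text; treating them as the entries for s_ij and s_jk is an assumption, and the immediate reward (60 s, 2) is a hypothetical value because the measured figures are not reproduced here.

```python
from typing import Tuple

QVector = Tuple[float, ...]  # (travel time in seconds, cost)


def sarsa_update(q_ij: QVector, q_jk: QVector, reward: QVector,
                 alpha: float) -> QVector:
    """Apply Q(i,j) <- Q(i,j) + alpha * (r + Q(j,k) - Q(i,j)) per induction target."""
    return tuple(q + alpha * (r + q_next - q)
                 for q, q_next, r in zip(q_ij, q_jk, reward))


alpha = 0.7
q_ij = (250.0, 21.0)   # assumed to be the entry for section s_ij
q_jk = (200.0, 20.0)   # assumed to be the entry for the assigned next section s_jk
reward = (60.0, 2.0)   # hypothetical immediate travel time and cost on s_ij
new_q_ij = sarsa_update(q_ij, q_jk, reward, alpha)
```

With these illustrative inputs the time component moves from 250 s toward 60 + 200 = 260 s and the cost component from 21 toward 2 + 20 = 22, each by 70 percent of the gap.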

Step 3.1: normalizing the Q vector table: based on the Q vector table updated in step 2, the Q values of the different induction targets are each normalized by dispersion standardization (min-max normalization), which removes the problem that different induction targets have different units and dimensions; the formula is as follows:

wherein the minimum and maximum values are taken from the Q vector table; based on these values and the Q vector updated in step 2, the normalized Q vector of road section s_ij in the normalized Q vector table corresponding to destination d is updated.

Step 3.2: calculating a scalar value based on driver preference: using the corresponding driver preference, namely the weight vector ω obtained in step 2, and the Q vector table normalized in step 3.1, a linear scalarization function converts the Q vectors of all road sections adjacent to the current traffic node of vehicle v, in the Q vector table with destination d, into scalar values SQ_d(i, j) based on driver preference; according to FIG. 3, the specific calculation is as follows:

SQ_d(j, k) = 0.8 × 0.195 + 0.2 × 0.388 = 0.2336

SQ_d(j, k′) = 0.8 × 0.253 + 0.2 × 0.306 = 0.2636

SQ_d(j, k″) = 0.8 × 0.310 + 0.2 × 0.306 = 0.3092
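The three values above are weighted sums of the normalized Q values, with one term per induction target. A minimal sketch that reproduces them is shown here; the function name `scalarize` and the ASCII node labels `k`, `k'`, `k''` are illustrative choices.

```python
from typing import Sequence


def scalarize(weights: Sequence[float], normalized_q: Sequence[float]) -> float:
    """Weighted sum of normalized Q values, one term per induction target."""
    if len(weights) != len(normalized_q):
        raise ValueError("one weight is needed per induction target")
    return sum(w * q for w, q in zip(weights, normalized_q))


weights = (0.8, 0.2)  # (time, cost) preference weights from the embodiment
normalized = {
    "k": (0.195, 0.388),
    "k'": (0.253, 0.306),
    "k''": (0.310, 0.306),
}
sq = {node: scalarize(weights, q) for node, q in normalized.items()}
```

Because the weight on time is 0.8, the ranking of the three road sections follows their normalized time values even though the section toward k has the highest normalized cost.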

wherein P_d(i, j) is the probability that a vehicle with destination d selects road section s_ij; i and j are traffic nodes; A(i) is the set of end nodes of the road sections starting from traffic node i, that is, the set of end nodes of the road sections adjacent to the current node obtained from the road network topology; ε is the traffic congestion coefficient; and ESQ_d(i) is the average of the driver-preference-based scalar values SQ_d(i, j) of the road sections around node i toward destination d.

From this, P_d(j, k) = 0.3705, P_d(j, k′) = 0.3387, and P_d(j, k″) = 0.2908.

FIG. 4 shows the traffic congestion situation in comparison with conventional induction methods; the abscissa is the simulation time step and the ordinate is the current total number of vehicles in the road network; the more vehicles there are, the more congested the road network is. Compared with the traditional path induction method Dijk, SMOSWU and the dynamic path induction method based on multi-target Sarsa learning disclosed by the invention make full use of real-time traffic information, improve the efficiency of the traffic system, and effectively relieve traffic congestion while taking the user's personal preference into account.

Claims

1. A dynamic path induction method based on multi-target Sarsa learning is characterized by comprising the following procedures:

step 1: initializing information, specifically comprising steps 1.1 to 1.3:

step 1.1: confirming the induction targets: one or more of minimized travel time, minimized travel distance, and minimized cost are selected;

step 1.2: for the induction targets, a traffic information center uses a Q-value-based dynamic programming algorithm to initialize, from the road network information in a geographic information base and the historically collected static data of each road section, the Q vector table of the induction targets for each candidate destination on the road network, wherein one Q vector table corresponds to one candidate destination;

(1) defining an induced target weight:

recording the current information of all vehicles in the road network, the real-time traffic information of the road section currently being traversed, and the preference of each driver travelling in the road network; assuming that there are n induction targets in total, the preference of each driver is recorded as a weight vector ω = (ω₁, …, ω_n), where ω_o ∈ [0, 1] is the preference weight for the o-th induction target; the weight of each induction target is defined as follows:

wherein β and γ are parameters, and the traffic congestion coefficient ε represents the current traffic condition of the traffic system;

wherein the reward term is the actual reward value obtained by vehicle v in passing road section s_ij;

wherein the minimum and maximum values are respectively the smallest and largest Q values over all road sections corresponding to destination d and induction target o;

step 3.2: calculating a scalar value based on driver preference: using the corresponding driver preference, namely the weight vector ω obtained in step 2, and the Q vector table normalized in step 3.1, a linear scalarization function converts the Q vectors of all road sections adjacent to the vehicle's current traffic node, in the Q vector table with destination d, into scalar values SQ_d(i, j) based on driver preference; the specific formula is as follows:

wherein P_d(i, j) is the probability that a vehicle with destination d selects road section s_ij; i and j are traffic nodes; A(i) is the set of end nodes of the road sections starting from traffic node i, that is, the set of end nodes of the road sections adjacent to the current node obtained from the road network topology; ε is the traffic congestion coefficient; and ESQ_d(i) is the average of the driver-preference-based scalar values SQ_d(i, j) of the road sections around node i toward destination d;

2. The dynamic path induction method based on multi-objective Sarsa learning of claim 1, wherein the road network information in step 1 comprises: road network topology, road length, number of lanes.

3. The dynamic path induction method based on multi-target Sarsa learning as claimed in claim 1, wherein the static data of each road section in step 1 comprises: historical vehicle transit time, distance, cost.

4. The dynamic path induction method based on multi-target Sarsa learning of claim 1, wherein the current information of all vehicles in step 2 comprises: position, desired destination, and all next traffic nodes that can be reached.

5. The dynamic path induction method based on multi-target Sarsa learning of claim 1, wherein the real-time traffic information of the current road section in step 2 comprises: travel time, distance, and cost.

6. The dynamic path induction method based on multi-target Sarsa learning of claim 1, wherein the actual reward value of step 2 is one of travel time, distance, or cost, only one of which is selected.