CN115713860B - Expressway traffic control method based on reinforcement learning - Google Patents

Expressway traffic control method based on reinforcement learning

Info

Publication number
CN115713860B
CN115713860B
Authority
CN
China
Prior art keywords
control
traffic
simulation model
region
model
Prior art date
Legal status
Active
Application number
CN202211472405.6A
Other languages
Chinese (zh)
Other versions
CN115713860A (en)
Inventor
金波
娄刃
何亚强
汪成立
张俊烨
杨松
张冶芳
Current Assignee
Zhejiang Scientific Research Institute of Transport
Original Assignee
Zhejiang Scientific Research Institute of Transport
Priority date
Filing date
Publication date
Application filed by Zhejiang Scientific Research Institute of Transport
Priority to CN202211472405.6A
Publication of CN115713860A
Application granted
Publication of CN115713860B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention relates to the technical field of traffic control, and in particular to an expressway traffic control method based on reinforcement learning, comprising the following steps. Step one: read the road information of the traffic control area and divide the road into regions. Step two: construct an offline simulation model according to the region division, the offline simulation model simulating the traffic of the traffic control area. Step three: train the offline simulation model and record the trained offline simulation model as an agent. Step four: deploy the agent into an actual predictive control model, periodically calculate an optimal control strategy with the maximum throughput per unit time as the objective function, and execute the optimal control strategy. The beneficial technical effects of the invention include: the agent trained by reinforcement learning overcomes the over-reliance of model predictive control on a traffic flow prediction model and improves prediction accuracy, so that an effective control strategy can be calculated accurately and a better control strategy obtained.

Description

Expressway traffic control method based on reinforcement learning
Technical Field
The invention relates to the technical field of traffic control, and in particular to an expressway traffic control method based on reinforcement learning.
Background
Due to constraints such as scarce land resources and environmental protection, the pace of expressway construction has slowed while the number of vehicles in use keeps increasing, so that traffic flow on some road sections approaches saturation. Expressways find it increasingly difficult to meet growing service demands in both capacity and efficiency, and their operating efficiency and safety need to be improved through informatization and intelligent means. Traffic control is an effective means of improving expressway operating efficiency and safety: it optimizes the spatio-temporal distribution of road resources mainly through measures such as variable speed limits and ramp metering, thereby improving road operating efficiency, relieving congestion, and improving service quality. Variable speed limit control adjusts the speed limits of the regions upstream of a bottleneck to keep the number of vehicles entering the bottleneck within a certain range, thereby improving the bottleneck's operating efficiency. Ramp metering relieves congestion in the bottleneck region by limiting the traffic flow entering from the ramp.
In current expressway traffic control practice, control strategies are mostly formulated by managers based on experience. Strategies based on experience alone rarely achieve the desired effect and can hardly meet actual demand; moreover, manual methods cannot adapt to dynamically changing traffic demand and flow. An automated, intelligent traffic control method is therefore of great importance.
The prior art discloses model predictive control (MPC). As one of the important methods of automatic traffic control, it predicts the traffic situation with a traffic flow prediction model and determines the optimal strategy by solving a globally or locally optimal problem. However, the method is limited by the accuracy of the traffic flow prediction model: as actual traffic conditions become complex, prediction accuracy drops and the control effect cannot be guaranteed. New technologies that regulate traffic flow more effectively and accurately are therefore needed.
Disclosure of Invention
The technical problem the invention aims to solve is the current lack of a technique for accurately controlling traffic flow. To this end, an expressway traffic control method based on reinforcement learning is provided that can effectively improve the traffic capacity of expressways.
To solve the above technical problem, the invention adopts the following technical scheme: an expressway traffic control method based on reinforcement learning, comprising the following steps:
step one: reading road information of a traffic control area, and dividing the road into areas;
step two: constructing an offline simulation model according to the regional division, wherein the offline simulation model simulates traffic of a traffic control region;
step three: training the offline simulation model by using historical traffic data, and marking the trained offline simulation model as an intelligent agent;
step four: deploying the agent into an actual predictive control model, periodically calculating an optimal control strategy with the maximum throughput per unit time as an objective function, and executing the optimal control strategy.
Preferably, in the first step, the method for dividing the regions comprises:
dividing the road of the traffic control area into the following region types: a variable speed limit region K0, an acceleration region J, a merging region H and a ramp region Z0;
dividing the region upstream of the variable speed limit region K0 into L-1 main-road detection regions, denoted K_1, K_2, …, K_{L-1};
dividing the region upstream of the ramp region Z0 into L-1 ramp detection regions, denoted Z_1, Z_2, …, Z_{L-1}.
Preferably, in the second step, the method for constructing the offline simulation model comprises:
setting the state quantity of the offline simulation model, the state quantity comprising: the traffic flow densities {qK_0, qK_1, …, qK_{L-1}} of the variable speed limit region K0 and its upstream regions, the traffic flow density qH of the merging region, and the traffic flow densities {qZ_0, qZ_1, …, qZ_{L-1}} of the ramp region and its upstream regions; the state quantity is recorded as {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
setting the action of the offline simulation model: a control step length L is set, and the action is the speed limit value sequence {V_0, V_1, …, V_{L-1}} of the variable speed limit region K0 over the L steps and the passing-rate sequence {P_0, P_1, …, P_{L-1}} of the ramp region Z0 over the L steps;
setting the reward of the offline simulation model, taking the total transit time TTT within the control step length L as the reward,
where T denotes the interval duration of a control step, q_in(t) denotes the density of the traffic flow entering the control region at time t, and q_out(t) denotes the density of the traffic flow leaving the control region at time t.
Preferably, in the second step, the method for constructing the offline simulation model further comprises constructing a simulation environment,
the method for constructing the simulation environment comprises the following steps:
establishing a road model according to the road information of the traffic control area;
the method comprises the steps of importing inflow traffic flow data and outflow traffic flow data into a road model, and setting a motion model of a vehicle in a road area to form a traffic flow simulation model;
the traffic flow simulation model is used as an offline simulation model.
Preferably, the method for setting the motion model of a vehicle in a road region comprises:
setting vehicle attributes including position and speed, the position and the speed being determined by the inflow traffic flow data;
setting a speed change control function, wherein the speed change control function changes the speed of the vehicle each preset period, the inputs of the speed change control function being the speed limit, the distance to the preceding vehicle, the vehicle's acceleration and the vehicle's speed in the current period, the speed limit being determined by the type of region in which the vehicle is located;
setting speed change control functions separately for the vehicle in the variable speed limit region K0, the acceleration region J, the merging region H and the ramp region Z0.
Preferably, in the third step, the method for training the offline simulation model comprises:
selecting a traffic control mode, the traffic control mode comprising a variable speed limit control mode and a ramp metering mode;
training the offline simulation model according to the selected traffic control mode;
in step four, the control strategies tried during optimization conform to the selected traffic control mode, so that the control strategy achieving the maximum throughput per unit time is executed.
Preferably, when the selected traffic control mode is the variable speed limit control mode, the method for training the offline simulation model comprises the following steps:
step A1) obtaining the current state s_t = {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
step A2) generating and executing an action a_t = {V_0, V_1, …, V_{L-1}};
step A3) obtaining the next state s_{t+1} through calculation of the offline simulation model, and acquiring the reinforcement signal r_{t+1};
step A4) updating the Q value according to the reinforcement signal r_{t+1}, the Q value being calculated according to the following formula:
Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + γ · max Q_t(s_{t+1}, a_{t+1})
where s_t denotes the state at time step t, a_t denotes the action at time step t, r_t denotes the reward at time step t, and γ is a preset parameter of Q-learning;
step A5) if the Q value has converged, training of the offline simulation model ends; otherwise, return to step A1).
Preferably, when the selected traffic control mode is the ramp metering mode, the method for training the offline simulation model comprises the following steps:
step B1) obtaining the current state s_t = {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
step B2) generating and executing an action a_t = {P_0, P_1, …, P_{L-1}};
step B3) obtaining the next state s_{t+1} through calculation of the offline simulation model, and acquiring the reinforcement signal r_{t+1};
step B4) updating the Q value according to the reinforcement signal r_{t+1}, the Q value being calculated according to the following formula:
Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + γ · max Q_t(s_{t+1}, a_{t+1})
where s_t denotes the state at time step t, a_t denotes the action at time step t, r_t denotes the reward at time step t, and γ is a preset parameter of Q-learning;
step B5) if the Q value has converged, training of the offline simulation model ends; otherwise, return to step B1).
Preferably, in the fourth step, the method for deploying the agent into the actual predictive control model comprises:
acquiring road information of a target traffic control area;
establishing a road model according to the road information of the target traffic control area to replace the road model in the agent;
importing the current inflow traffic flow data and outflow traffic flow data into the agent to complete deployment of the agent.
Preferably, in the fourth step, the method for periodically calculating the optimal control strategy comprises:
step C1) acquiring the current state of the target traffic control area;
step C2) generating a control strategy, and importing the control strategy and the current state into the agent to obtain the throughput per unit time under the current control strategy;
step C3) continuously changing the control strategy until the control strategy with the maximum throughput per unit time is found or the maximum number of changes is reached;
step C4) taking the control strategy that maximizes the throughput per unit time as the optimal control strategy, and executing the optimal control strategy.
The beneficial technical effects of the invention include: 1) the agent trained by reinforcement learning overcomes the over-reliance of model predictive control on a traffic flow prediction model and improves prediction accuracy, so that an effective control strategy can be calculated accurately and a better control strategy obtained; 2) the predictive control model calculates the optimal control strategy on a rolling basis from the traffic state of the control area and updates it continuously, achieving optimal control over the whole time horizon; 3) unlike conventional strategy-optimization methods, the control strategy is optimized not only on the traffic flow data of the variable speed limit and ramp regions but also on the traffic flow data upstream of those regions, yielding a better traffic control effect, reducing more effectively the total transit time of vehicles through the expressway bottleneck region, and markedly improving traffic efficiency.
Other features and advantages of the present invention will be disclosed in the following detailed description of the invention and the accompanying drawings.
Drawings
The invention is further described with reference to the accompanying drawings:
fig. 1 is a schematic flow chart of a highway traffic control method according to an embodiment of the present invention.
Fig. 2 is a flow chart of a method for performing region division according to an embodiment of the present invention.
Fig. 3 is a schematic view of highway regional division according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method for constructing a simulation environment according to an embodiment of the present invention.
FIG. 5 is a flowchart of a method for training an offline simulation model according to an embodiment of the present invention.
Fig. 6 is a flowchart of a variable speed limit control mode training method according to an embodiment of the present invention.
FIG. 7 is a flowchart of a ramp current limit mode training method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a method for deploying an agent to an actual predictive control model according to an embodiment of the present invention.
Fig. 9 is a flowchart of a method for periodically calculating and obtaining an optimal control policy according to an embodiment of the present invention.
In the figures: 100, divided regions; 101, newly added divided regions; 102, conventional divided regions.
Detailed Description
The technical solutions of the embodiments of the present invention are explained below with reference to the drawings of the embodiments. The following embodiments are only preferred embodiments of the invention, not all of them; all other embodiments obtained by a person skilled in the art without creative effort on the basis of these embodiments fall within the protection scope of the present invention.
In the following description, terms indicating direction or position, such as "inner", "outer", "upper", "lower", "left" and "right", are used only for convenience in describing the embodiments and simplifying the description; they do not indicate or imply that the devices or elements referred to must have a particular orientation or be constructed and operated in a particular orientation, and should therefore not be construed as limiting the invention.
Before describing the scheme of the embodiment, a related description is made of the application background of the embodiment.
As pointed out in the background art, due to constraints such as scarce land resources and environmental protection, the pace of expressway construction has slowed while the number of vehicles in use keeps increasing, so that traffic flow on some road sections approaches saturation; expressways find it increasingly difficult to meet growing service demands in both capacity and efficiency, and their operating efficiency and safety need to be improved through informatization and intelligent means. With such means, vehicle traffic on the expressway can be controlled more scientifically and efficiently, improving expressway traffic efficiency. At present, however, there is no scheme that effectively realizes the efficient generation of traffic control strategies.
The speed of a vehicle traveling on an expressway is related not only to the traffic density but also to the region of the expressway in which the vehicle is located. To generate traffic control strategies, a simulation of vehicle motion must therefore be established that matches the traffic behavior of the actual target traffic control area as closely as possible, either by modeling the road directly or by modeling the traffic flow. Owing to the diversity of vehicles and differences in driver skill, vehicles do not share a single motion model; the vehicle motion model is therefore built by learning from and simulating actual historical data, from which a traffic flow model and then the corresponding road model, i.e. the agent of this embodiment, are built.
Referring to fig. 1, the expressway traffic control method based on reinforcement learning comprises the following steps:
step one: reading road information of a traffic control area, and dividing the road into areas;
step two: constructing an offline simulation model according to the regional division, and simulating traffic of the traffic control region by the offline simulation model;
step three: training an offline simulation model by using historical traffic data, and marking the trained offline simulation model as an intelligent agent;
step four: and deploying the intelligent agent into an actual predictive control model, periodically calculating to obtain an optimal control strategy by taking the maximum throughput of unit time as an objective function, and executing the optimal control strategy.
After the road is divided into regions, the speed control mode of vehicles can be restricted, and at the same time objects are provided for traffic control of the road. In this embodiment, the control strategy is applied to the corresponding road regions: different road regions can impose different speed limits and flow limits, which together form a control strategy. In different road regions, the speed control and passing rules of vehicles differ; the corresponding offline simulation model provides a simulation of these road regions.
Referring to fig. 2, in step one, the method for dividing the regions comprises:
step S101) dividing the road of the traffic control area into the following region types: a variable speed limit region K0, an acceleration region J, a merging region H and a ramp region Z0;
step S102) dividing the region upstream of the variable speed limit region K0 into L-1 main-road detection regions, denoted K_1, K_2, …, K_{L-1};
step S103) dividing the region upstream of the ramp region Z0 into L-1 ramp detection regions, denoted Z_1, Z_2, …, Z_{L-1}.
The variable speed limit region K0 and its upstream regions can be assigned different speed limit values. The traffic rules of the acceleration region J follow common driving practice. In the merging region H, if the gap between vehicles exceeds a predetermined value, vehicles can merge alternately without decelerating and waiting; if the gap does not reach the predetermined value, vehicles must decelerate or stop to wait before merging alternately. The ramp region Z0 has its own speed limit, and the ramp region Z0 can be flow-limited, i.e. closed.
Referring to fig. 3, the four region types into which the expressway is divided in this embodiment are shown. Compared with the conventional division 102, which covers only the variable speed limit region K0, the acceleration region J, the merging region H and the ramp region Z0, this embodiment newly adds the divided regions 101, comprising the region upstream of the variable speed limit region K0 and the region upstream of the ramp region Z0. By refining the region division in this way, control of the expressway control strategy becomes finer-grained, the accuracy of the control strategy is improved, and expressway traffic efficiency can be effectively improved. The division can be represented as data, as in the sketch below.
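By way of illustration only, the following minimal Python sketch encodes the region types and the newly added upstream detection regions; the RegionType and build_layout names and the region ordering are assumptions for illustration, not part of the patent:

```python
from dataclasses import dataclass
from enum import Enum

class RegionType(Enum):
    VARIABLE_SPEED_LIMIT = "K0"   # variable speed limit region
    ACCELERATION = "J"            # acceleration region
    MERGING = "H"                 # merging region
    RAMP = "Z0"                   # ramp region
    MAIN_DETECTION = "K"          # newly added upstream main-road detection region
    RAMP_DETECTION = "Z"          # newly added upstream ramp detection region

@dataclass
class Region:
    name: str
    kind: RegionType

def build_layout(L: int) -> dict:
    """Mainline: K_{L-1}, ..., K_1, then K0, J, H; ramp branch: Z_{L-1}, ..., Z_1, Z0."""
    mainline = [Region(f"K{i}", RegionType.MAIN_DETECTION) for i in range(L - 1, 0, -1)]
    mainline += [Region("K0", RegionType.VARIABLE_SPEED_LIMIT),
                 Region("J", RegionType.ACCELERATION),
                 Region("H", RegionType.MERGING)]
    ramp = [Region(f"Z{i}", RegionType.RAMP_DETECTION) for i in range(L - 1, 0, -1)]
    ramp.append(Region("Z0", RegionType.RAMP))
    return {"mainline": mainline, "ramp": ramp}
```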
In the second step, the method for constructing the offline simulation model comprises the following steps:
setting the state quantity of the offline simulation model, the state quantity comprising: the traffic flow densities {qK_0, qK_1, …, qK_{L-1}} of the variable speed limit region K0 and its upstream regions, the traffic flow density qH of the merging region, and the traffic flow densities {qZ_0, qZ_1, …, qZ_{L-1}} of the ramp region and its upstream regions; the state quantity is recorded as {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
setting the action of the offline simulation model: a control step length L is set, and the action is the speed limit value sequence {V_0, V_1, …, V_{L-1}} of the variable speed limit region K0 over the L steps and the passing-rate sequence {P_0, P_1, …, P_{L-1}} of the ramp region Z0 over the L steps;
setting the reward of the offline simulation model, taking the total transit time TTT within the control step length L as the reward,
where T denotes the interval duration of a control step, q_in(t) denotes the density of the traffic flow entering the control region at time t, and q_out(t) denotes the density of the traffic flow leaving the control region at time t.
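Read as a reinforcement-learning interface, the state, action, and reward above can be sketched in Python as follows. The page does not reproduce the TTT formula itself, so total_transit_time below assumes the standard total-time-spent accounting built from q_in(t) and q_out(t), and the sign convention of the reward is likewise an assumption; the class and function names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class State:
    qK: list   # traffic flow densities of K0 and its upstream regions: qK_0..qK_{L-1}
    qH: float  # traffic flow density of the merging region H
    qZ: list   # traffic flow densities of Z0 and its upstream regions: qZ_0..qZ_{L-1}

    def as_vector(self):
        # the state quantity {qK_0, ..., qK_{L-1}, qH, qZ_0, ..., qZ_{L-1}}
        return tuple(self.qK) + (self.qH,) + tuple(self.qZ)

@dataclass
class Action:
    V: list  # speed limit sequence V_0..V_{L-1} for the variable speed limit region K0
    P: list  # ramp passing-rate sequence P_0..P_{L-1}, each value in [0, 1]

def total_transit_time(q_in, q_out, T):
    """Assumed reconstruction of TTT: T times the per-step count of vehicles still
    inside the control region, accumulated from the inflow/outflow difference."""
    inside, ttt = 0.0, 0.0
    for qi, qo in zip(q_in, q_out):   # one (q_in(t), q_out(t)) pair per control interval
        inside += qi - qo             # net accumulation in the control region
        ttt += T * inside
    return ttt

# The reward favors low total transit time; negating TTT is a common sign choice.
def reward(q_in, q_out, T):
    return -total_transit_time(q_in, q_out, T)
```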
In the second step, the method for constructing the offline simulation model further comprises constructing a simulation environment.
referring to fig. 4, the method for constructing the simulation environment includes:
step S201), a road model is established according to the road information of the traffic control area;
step S202), the inflow traffic flow data and the outflow traffic flow data are imported into a road model, and a motion model of a vehicle in a road area is set to form a traffic flow simulation model;
step S203) the traffic flow simulation model is used as an offline simulation model.
The prior art discloses ways of simulating the motion models of various vehicles. The offline simulation model is established to simulate expressway traffic, compare the simulation with historical data, and optimize the vehicle motion model so that it matches the motion of vehicles on the actual expressway as closely as possible. However, for vehicles of different types or models driven by different drivers, the motion patterns are not all the same; only a unified vehicle motion model can reflect, as a whole, the behavior of vehicles on the expressway and thereby enable simulation of the traffic flow.
In another aspect, this embodiment provides a specific method for setting the motion model of a vehicle in a road region, comprising: setting vehicle attributes including position and speed, the position and the speed being determined by the inflow traffic flow data;
setting a speed change control function, wherein the speed change control function changes the speed of the vehicle each preset period; its inputs are the speed limit, the distance to the preceding vehicle, the vehicle's acceleration and the vehicle's speed in the current period, the speed limit being determined by the type of region in which the vehicle is located. By setting the position and speed attributes of the vehicle and combining them with the speed change control function, the position of the vehicle can be simulated continuously through speed control, reproducing the vehicle's behavior as it moves from the entrance of the traffic control area to its exit.
Further, speed change control functions are set separately for the vehicle in the variable speed limit region K0, the acceleration region J, the merging region H and the ramp region Z0. Because the passing rules of the divided regions differ, a single speed change control function may be inaccurate; to further improve simulation accuracy, one speed change control function is therefore set per region, as in the sketch below.
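A speed change control function of the kind described might be sketched as follows; the braking rule, the thresholds, and the per-region function table are illustrative assumptions rather than the patent's calibrated model:

```python
def speed_update(v, v_limit, gap_m, accel, dt=1.0, min_safe_gap=10.0):
    """One speed change step for a vehicle.

    Inputs follow the description above: the region speed limit, the distance to
    the preceding vehicle, the vehicle's acceleration, and its current speed.
    """
    if gap_m < min_safe_gap:
        # too close to the leader: brake (illustrative fixed deceleration of 3 m/s^2)
        return max(0.0, v - 3.0 * dt)
    # otherwise accelerate toward, but never beyond, the region speed limit
    return min(v + accel * dt, v_limit)

# One function per region type, since the embodiment sets them separately;
# a full model would add alternating-merge logic for the merging region H.
SPEED_FUNCS = {
    "K0": speed_update,  # variable speed limit region (limit comes from the action)
    "J":  speed_update,  # acceleration region
    "H":  speed_update,  # merging region
    "Z0": speed_update,  # ramp region
}
```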
Referring to fig. 5, in step three, the method for training the offline simulation model includes:
step S301) selecting a traffic control mode, wherein the traffic control mode comprises a variable speed limit control mode and a ramp metering mode;
step S302), training an offline simulation model according to the selected traffic control mode;
step S303) in step four, the control strategies tried during optimization conform to the selected traffic control mode, so that the control strategy achieving the maximum throughput per unit time is executed.
Both the variable speed limit control mode and the ramp metering mode can improve expressway traffic efficiency; which one is chosen depends on the actual control conditions.
Referring to fig. 6, when the selected traffic control mode is the variable speed limit control mode, the method for training the offline simulation model comprises:
step A1) obtaining the current state s_t = {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
step A2) generating and executing an action a_t = {V_0, V_1, …, V_{L-1}};
step A3) obtaining the next state s_{t+1} through calculation of the offline simulation model, and acquiring the reinforcement signal r_{t+1};
step A4) updating the Q value according to the reinforcement signal r_{t+1}, the Q value being calculated according to the following formula:
Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + γ · max Q_t(s_{t+1}, a_{t+1})
where s_t denotes the state at time step t, a_t denotes the action at time step t, r_t denotes the reward at time step t, and γ is a preset parameter of Q-learning;
step A5) if the Q value has converged, training of the offline simulation model ends; otherwise, return to step A1).
The action a_t determines the speed limit values of the variable speed limit region K0 and its upstream regions. When the simulation results of the offline simulation model make the Q value converge, the best action a_t has been found, i.e., the optimal control strategy is obtained. A sketch of this training loop follows.
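Steps A1) through A5) describe tabular Q-learning. In the minimal sketch below, the env object, the discretization of states and actions, epsilon-greedy exploration, and the learning rate alpha are illustrative assumptions; the printed update formula omits the reward term, so the sketch uses the standard Q-learning rule, in which the reinforcement signal r_{t+1} enters the update target:

```python
import random
from collections import defaultdict

def train_vsl(env, actions, episodes=500, gamma=0.9, alpha=0.1, eps=0.1):
    """Tabular Q-learning for the variable speed limit mode (steps A1-A5).

    `env` is assumed to wrap the offline simulation model with reset()/step();
    states and actions are assumed discretized and hashable (e.g. tuples), each
    action being a speed limit sequence (V_0, ..., V_{L-1}).
    """
    Q = defaultdict(float)  # Q[(state, action)] -> value

    def best(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()                        # A1) obtain the current state
        done = False
        while not done:
            # A2) epsilon-greedy generation and execution of an action a_t
            a = random.choice(actions) if random.random() < eps else best(s)
            # A3) the offline simulation model yields s_{t+1} and the signal r_{t+1}
            s_next, r, done = env.step(a)
            # A4) Q-value update (standard rule with learning rate alpha)
            target = r + gamma * Q[(s_next, best(s_next))]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
        # A5) in practice: stop once successive Q tables change less than a tolerance
    return Q
```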
Referring to fig. 7, when the selected traffic control mode is the ramp metering mode, the method for training the offline simulation model comprises:
step B1) obtaining the current state s_t = {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
step B2) generating and executing an action a_t = {P_0, P_1, …, P_{L-1}};
step B3) obtaining the next state s_{t+1} through calculation of the offline simulation model, and acquiring the reinforcement signal r_{t+1};
step B4) updating the Q value according to the reinforcement signal r_{t+1}, the Q value being calculated according to the following formula:
Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + γ · max Q_t(s_{t+1}, a_{t+1})
where s_t denotes the state at time step t, a_t denotes the action at time step t, r_t denotes the reward at time step t, and γ is a preset parameter of Q-learning;
step B5) if the Q value has converged, training of the offline simulation model ends; otherwise, return to step B1).
The action a_t determines the passing states of the ramp region Z0 and its upstream regions, denoted P_0, P_1, …, P_{L-1}, each of which takes a value between 0 and 1: a value of 0 indicates that the corresponding ramp is closed by metering, and a value of 1 indicates that the corresponding ramp passes traffic normally. When the simulation results of the offline simulation model make the Q value converge, the best action a_t has been found, i.e., the optimal control strategy is obtained.
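Since each passing rate P_i lies in [0, 1], with 0 closing the ramp and 1 letting it pass normally, one simple way to realize an intermediate rate in simulation is to admit each arriving ramp vehicle with probability P. This realization is an assumption for illustration, not specified by the patent:

```python
import random

def admit_ramp_vehicles(arrivals, passing_rate):
    """Vehicles admitted from the ramp this interval under passing rate P in [0, 1]:
    P = 0 closes the ramp, P = 1 admits everyone, and an intermediate P admits each
    arriving vehicle independently with probability P (an illustrative choice)."""
    if passing_rate <= 0.0:
        return 0
    if passing_rate >= 1.0:
        return arrivals
    return sum(1 for _ in range(arrivals) if random.random() < passing_rate)
```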
Referring to fig. 8, in step four, the method for deploying an agent into an actual predictive control model includes:
step S401), road information of a target traffic control area is obtained;
step S402) establishing a road model according to the road information of the target traffic control area to replace the road model in the agent;
step S403) importing the current inflow traffic flow data and outflow traffic flow data into the agent to complete deployment of the agent.
The agent comprises a road model and a traffic flow simulation model, and the traffic flow simulation model in turn contains the vehicle motion model. The vehicle motion model is the same for different traffic control areas; replacing the agent's road model with the road model of the target traffic control area therefore completes the deployment, roughly as sketched below.
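Deployment thus amounts to swapping the agent's road model while reusing its learned vehicle motion model; a rough sketch, in which the class and attribute names are illustrative assumptions:

```python
class Agent:
    """Trained offline simulation model: road model plus learned motion model."""

    def __init__(self, road_model, motion_model):
        self.road_model = road_model      # geometry/region layout of a control area
        self.motion_model = motion_model  # vehicle motion model, reused unchanged

    def deploy(self, target_road_model, inflow_data, outflow_data):
        """Steps S401-S403: swap in the target road model, import current flows."""
        self.road_model = target_road_model
        self.inflow = inflow_data
        self.outflow = outflow_data
        return self
```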
In the fourth step, the method for obtaining the optimized control strategy through periodic calculation comprises the following steps:
step C1) acquiring the current state of the target traffic control area;
step C2) generating a control strategy, and importing the control strategy and the current state into the agent to obtain the throughput per unit time under the current control strategy;
step C3) continuously changing the control strategy until the control strategy with the maximum throughput per unit time is found or the maximum number of changes is reached;
step C4) taking the control strategy that maximizes the throughput per unit time as the optimal control strategy, and executing the optimal control strategy.
Through periodic, continuous optimization calculation, the optimal control strategy is continuously refreshed as actual conditions change, effectively improving expressway traffic efficiency. When the selected traffic control mode is the variable speed limit control mode, the optimal control strategy is the speed limit sequence {V_0, V_1, …, V_{L-1}} that maximizes the throughput per unit time; when the selected traffic control mode is the ramp metering mode, it is the passing-rate sequence {P_0, P_1, …, P_{L-1}} that maximizes the throughput per unit time. One optimization cycle is sketched below.
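Steps C1) through C4) amount to a rolling-horizon search over candidate strategies scored on the agent. The patent does not fix how candidates are generated, so the sketch below assumes random sampling up to the maximum number of changes; sample_strategy and agent.throughput are assumed interfaces:

```python
def optimize_control(agent, current_state, sample_strategy, max_changes=100):
    """One optimization cycle (steps C1-C4).

    `sample_strategy()` is assumed to return a candidate strategy consistent with
    the selected mode ({V_0..V_{L-1}} or {P_0..P_{L-1}}), and
    `agent.throughput(state, strategy)` the simulated throughput per unit time.
    """
    best_strategy, best_throughput = None, float("-inf")
    for _ in range(max_changes):              # C3) keep changing the strategy
        candidate = sample_strategy()         # C2) generate a control strategy
        tput = agent.throughput(current_state, candidate)
        if tput > best_throughput:
            best_strategy, best_throughput = candidate, tput
    return best_strategy                      # C4) the strategy to execute

# Rolling use: every control period, re-read the state (C1) and re-optimize.
# while True:
#     state = read_current_state()
#     execute(optimize_control(agent, state, sample_strategy))
```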
The beneficial technical effects of the embodiment include: 1) the agent trained by reinforcement learning overcomes the over-reliance of model predictive control on a traffic flow prediction model and improves prediction accuracy, so that an effective control strategy can be calculated accurately and a better control strategy obtained; 2) the predictive control model calculates the optimal control strategy on a rolling basis from the traffic state of the control area and updates it continuously, achieving optimal control over the whole time horizon; 3) unlike conventional strategy-optimization methods, the control strategy is optimized not only on the traffic flow data of the variable speed limit and ramp regions but also on the traffic flow data upstream of those regions, yielding a better traffic control effect, reducing more effectively the total transit time of vehicles through the expressway bottleneck region, and markedly improving traffic efficiency.
While the invention has been described by way of embodiments, those skilled in the art will appreciate that the invention is not limited to the embodiments described above and shown in the drawings; any modification that does not depart from the functional and structural principles of the invention is intended to fall within the scope of the appended claims.

Claims (2)

1. An expressway traffic control method based on reinforcement learning, characterized in that
the method comprises the following steps:
step one: reading road information of a traffic control area, and dividing the road into areas;
step two: constructing an offline simulation model according to the regional division, wherein the offline simulation model simulates traffic of a traffic control region;
step three: training the offline simulation model by using historical traffic data, and marking the trained offline simulation model as an intelligent agent;
step four: deploying the agent into an actual predictive control model, periodically calculating an optimal control strategy with the maximum throughput per unit time as an objective function, and executing the optimal control strategy;
in the first step, the method for dividing the regions comprises:
dividing the road of the traffic control area into the following region types: a variable speed limit region K0, an acceleration region J, a merging region H and a ramp region Z0;
dividing the region upstream of the variable speed limit region K0 into L-1 main-road detection regions, denoted K_1, K_2, …, K_{L-1};
dividing the region upstream of the ramp region Z0 into L-1 ramp detection regions, denoted Z_1, Z_2, …, Z_{L-1};
in the second step, the method for constructing the offline simulation model comprises:
setting the state quantity of the offline simulation model, the state quantity comprising: the traffic flow densities {qK_0, qK_1, …, qK_{L-1}} of the variable speed limit region K0 and its upstream regions, the traffic flow density qH of the merging region, and the traffic flow densities {qZ_0, qZ_1, …, qZ_{L-1}} of the ramp region and its upstream regions; the state quantity is recorded as {qK_0, …, qK_{L-1}, qH, qZ_0, …, qZ_{L-1}};
setting the action of the offline simulation model: a control step length L is set, and the action is the speed limit value sequence {V_0, V_1, …, V_{L-1}} of the variable speed limit region K0 over the L steps and the passing-rate sequence {P_0, P_1, …, P_{L-1}} of the ramp region Z0 over the L steps;
setting the reward of the offline simulation model, taking the total transit time TTT within the control step length L as the reward,
where T denotes the interval duration of a control step, q_in(t) denotes the density of the traffic flow entering the control region at time t, and q_out(t) denotes the density of the traffic flow leaving the control region at time t;
in the third step, the method for training the offline simulation model comprises:
selecting a traffic control mode, the traffic control mode comprising a variable speed limit control mode and a ramp metering mode;
training the offline simulation model according to the selected traffic control mode;
in the fourth step, the control strategies tried during optimization conform to the selected traffic control mode, so that the control strategy achieving the maximum throughput per unit time is executed;
in the fourth step, the method for deploying the agent into the actual predictive control model comprises:
acquiring road information of a target traffic control area;
establishing a road model according to the road information of the target traffic control area to replace the road model in the agent;
importing current inflow traffic flow data and outflow traffic flow data into the agent to complete deployment of the agent;
the method for periodically calculating the optimal control strategy comprises:
step C1) acquiring the current state of the target traffic control area;
step C2) generating a control strategy, and importing the control strategy and the current state into the agent to obtain the throughput per unit time under the current control strategy;
step C3) continuously changing the control strategy until the control strategy with the maximum throughput per unit time is found or the maximum number of changes is reached;
step C4) taking the control strategy that maximizes the throughput per unit time as the optimal control strategy, and executing the optimal control strategy.
2. The expressway traffic control method based on reinforcement learning according to claim 1, wherein
in the second step, the method for constructing the offline simulation model further comprises constructing a simulation environment,
the method for constructing the simulation environment comprises the following steps:
establishing a road model according to the road information of the traffic control area;
the method comprises the steps of importing inflow traffic flow data and outflow traffic flow data into a road model, and setting a motion model of a vehicle in a road area to form a traffic flow simulation model;
the traffic flow simulation model is used as an offline simulation model;
the method for setting the motion model of a vehicle in a road region comprises:
setting vehicle attributes including position and speed, the position and the speed being determined by the inflow traffic flow data;
setting a speed change control function, wherein the speed change control function changes the speed of the vehicle each preset period, the inputs of the speed change control function being the speed limit, the distance to the preceding vehicle, the vehicle's acceleration and the vehicle's speed in the current period, the speed limit being determined by the type of region in which the vehicle is located.
CN202211472405.6A 2022-11-23 2022-11-23 Expressway traffic control method based on reinforcement learning Active CN115713860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211472405.6A CN115713860B (en) 2022-11-23 2022-11-23 Expressway traffic control method based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN115713860A CN115713860A (en) 2023-02-24
CN115713860B (en) 2023-12-15

Family

ID=85234334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211472405.6A Active CN115713860B (en) 2022-11-23 2022-11-23 Expressway traffic control method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115713860B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112400192A (en) * 2018-04-20 2021-02-23 多伦多大学理事会 Method and system for multi-modal deep traffic signal control
DE102020211698A1 (en) * 2020-09-18 2022-03-24 Robert Bosch Gesellschaft mit beschränkter Haftung Method for controlling a traffic flow
CN115100850A (en) * 2022-04-21 2022-09-23 浙江省交通投资集团有限公司智慧交通研究分公司 Hybrid traffic flow control method, medium, and apparatus based on deep reinforcement learning
CN115206103A (en) * 2022-07-18 2022-10-18 山西省智慧交通研究院有限公司 Variable speed-limiting control system based on parallel simulation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周浩; 胡坚明; 张毅; 孙瑞. Coordinated optimization strategy for variable speed limits and ramp control on expressways (快速路可变限速与匝道控制协同优化策略). Journal of Transportation Systems Engineering and Information Technology, 2017, (02): 68-75. *
牛忠海; 贾元华; 张亮亮; 廖成. Coordinated variable speed limit and ramp control method based on adaptive fuzzy neural network (基于自适应模糊神经网络的可变限速与匝道协同控制方法). Logistics Technology, 2015, (17): 89-92. *

Also Published As

Publication number Publication date
CN115713860A (en) 2023-02-24

Similar Documents

Publication Publication Date Title
Yu et al. Integrated optimization of traffic signals and vehicle trajectories at isolated urban intersections
Jin et al. Improving traffic operations using real-time optimal lane selection with connected vehicle technology
Roncoli et al. Hierarchical model predictive control for multi-lane motorways in presence of vehicle automation and communication systems
CN103593535A (en) Urban traffic complex self-adaptive network parallel simulation system and method based on multi-scale integration
Guo et al. Merging and diverging impact on mixed traffic of regular and autonomous vehicles
Papamichail et al. Motorway traffic flow modelling, estimation and control with vehicle automation and communication systems
CN110570672B (en) Regional traffic signal lamp control method based on graph neural network
WO2022057912A1 (en) Method and system for adaptive cycle-level traffic signal control
Xia Eco-approach and departure techniques for connected vehicles at signalized traffic intersections
CN108898858A (en) The signal coordinating control method of continuous intersection under a kind of supersaturation traffic behavior
Van De Weg et al. A hierarchical control framework for coordination of intersection signal timings in all traffic regimes
Kouvelas et al. A linear formulation for model predictive perimeter traffic control in cities
CN114758497A (en) Adaptive parking lot variable access control method and device and storage medium
Ahmad et al. Applications of evolutionary game theory in urban road transport network: A state of the art review
Roncoli et al. Model predictive control for motorway traffic with mixed manual and VACS-equipped vehicles
Wu et al. Simulation and evaluation of speed and lane-changing advisory of CAVS at work zones in heterogeneous traffic flow
CN109765801A (en) The implementation method of car networking desin speed adjustment based on VISSIM emulation
Roncoli et al. Model predictive control for multi-lane motorways in presence of VACS
Jiang et al. Learning the policy for mixed electric platoon control of automated and human-driven vehicles at signalized intersection: A random search approach
Li et al. Enhancing cooperation of vehicle merging control in heavy traffic using communication-based soft actor-critic algorithm
CN115713860B (en) Expressway traffic control method based on reinforcement learning
Zhu et al. Trajectory optimization of CAVs in freeway work zone considering car-following behaviors using online multiagent reinforcement learning
De Schutter et al. Advances traffic control on highways
Bailey et al. Comasig: a collaborative multi-agent signal control to support senior drivers
Guo et al. Network Multi-scale Urban Traffic Control with Mixed Traffic Flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant