CN112927522B

CN112927522B - Internet of things equipment-based reinforcement learning variable-duration signal lamp control method

Info

Publication number: CN112927522B
Application number: CN202110067478.6A
Authority: CN
Inventors: 陈铭松; 张雯倩; 赵吴攀; 叶豫桐; 胡铭; 韩定定
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2021-01-19
Filing date: 2021-01-19
Publication date: 2022-07-05
Anticipated expiration: 2041-01-19
Also published as: CN112927522A

Abstract

The invention provides a reinforcement learning variable-duration signal lamp control method based on Internet of things equipment, which mainly comprises the following contents: a reinforcement learning method based on the concept of 'strength' of an intersection is designed, and various real-time traffic information (such as the position, speed and the like of a vehicle) collected by Internet of things equipment is used for controlling phase selection of a signal lamp. And simultaneously, the most reasonable green light duration can be selected according to the number of vehicles on each lane. The invention can quickly converge to an excellent signal lamp control strategy under the condition of traffic dynamic change, greatly shortens the learning time of the strategy and improves the control quality of the strategy.

Description

Internet of things equipment-based reinforcement learning variable-duration signal lamp control method

Technical Field

The invention belongs to the technical field of computers, relates to a deep reinforcement learning algorithm and a signal lamp control problem, and particularly relates to an effective signal lamp control strategy generated by learning according to real-time traffic data obtained by Internet of things equipment in a highly complex real-time traffic environment.

Background

In recent years, with the rapid increase of automobile holding capacity in our country, more and more road traffic problems frequently occur, such as traffic planning problems, road safety problems, road congestion problems, traffic control problems, and the like. Traffic congestion has been a key issue in designing efficient infrastructure, but due to the rapid increase in traffic demand, traffic congestion has become a prominent issue. In addition, traffic jam also causes a series of problems of traffic environmental pollution, traffic peace confusion and the like, and the travel quality and the life quality of people are seriously influenced. And the signal lamp of the intersection is used as the minimum unit for controlling the traffic, and how to make a reasonable control strategy becomes the key point of research.

Conventional fixed logic traffic signal controllers, which control signal lights using predefined signal phases and green light times, cannot be flexibly adjusted according to changes in traffic conditions, which makes it difficult to effectively control and guide traffic in the context of dynamic changes in traffic flow. Therefore, how to control the signal lamps in real time according to the traffic conditions so as to achieve dynamic control of the overall traffic flow is one of the current research hotspots. Reinforcement learning, as a learning method of "trial and error", is increasingly applied to the problem of signal lamp control.

However, due to inaccurate modeling, the existing reinforcement learning-based signal lamp control method has difficulty in quickly extracting effective contents from complex traffic information to guide the model to converge to an excellent control strategy. Meanwhile, in order to simplify traffic modeling, the existing method usually sets a fixed green time period for the signal lamp, which actually causes a waste of control time. Therefore, how to design an accurate and reasonable reinforcement learning method to quickly learn an effective variable-duration signal lamp control strategy is a problem to be solved urgently.

Disclosure of Invention

In order to solve the technical defects, the invention aims to provide a novel reinforcement learning variable-duration signal lamp control method based on internet of things equipment, designs a reinforcement learning method based on the concept of intersection strength, and controls the phase selection of a signal lamp by utilizing various real-time traffic information (such as the position, speed and the like of a vehicle) acquired by the internet of things equipment. And simultaneously, the most reasonable green light duration can be selected according to the number of vehicles on each lane.

The method comprises the following specific steps:

step 1: acquiring real-time traffic data based on the Internet of things equipment, processing the acquired traffic information, and generating newly defined intensity information according to the acquired traffic data; the Internet of things equipment comprises a velometer and a sensor; real-time traffic data includes the location and speed of the vehicle; the intensity information includes the intensity of the vehicle, lane, action, phase, and intersection.

Through extensive research and study, most of the current reinforcement learning methods tend to design a complex state to include as much traffic information as possible, however, the complex design is usually accompanied by a long learning process. The invention provides a brand-new concept of 'strength' under the condition of fully considering various traffic data, and can calculate the strength of the vehicle, the lane, the action, the phase and the intersection on the basis of the vehicle position and the speed data which can be acquired by the equipment of the Internet of things. Designing reinforcement learning methods based on the definition of strength can greatly shorten the strategy learning process.

The intensity of the vehicle is first defined. Assuming that the current vehicle speed is v, the allowable maximum driving speed of the current lane is v_maxThe length of the lane is L, the distance between the vehicle and the intersection is x, and a weight coefficient delta is introduced, so that the vehicle strength is as follows:

on the basis, the invention defines the lane strength as the sum of all vehicle strengths on the current lane, namely

vehicle_iIndicating the i-th vehicle on the lane,

indicating the intensity of the i-th vehicle on the lane.

The action intensity is the difference between the 'intensity of a lane driving into the intersection' and the 'average value of the intensity of the lane driving out of the intersection' under the current action, namely

Wherein lane_inIndicating a set of driving lanes, lane, under the action_outRepresenting a set of outgoing lanes, lane, reachable from an incoming lane_iIndicating the ith lane, in the set of lanes_jRepresents the jth lane, | lane in the lane set_outL represents the number of exiting lanes,

indicating the strength of the ith lane,

indicating the intensity of the jth lane.

The phase strength being the sum of the motion strengths of the permissible movements in that phase, i.e.

movement_iRepresenting the ith action that constitutes phase,

indicating the intensity corresponding to action i.

The invention defines the intersection intensity as the sum of all the vehicle intensities entering the intersection minus the sum of the vehicle intensities exiting the intersection, and the intersection intensity is expressed as:

wherein, lane_inIndicating a set of incoming lanes at the crossing, lane_outSet of outgoing lanes, lane, representing an intersection_iIndicating the ith lane, in the set of lanes_jRepresenting the jth lane in the set of lanes,

indicating the strength of the ith lane,

indicating the intensity of the jth lane.

In addition, in order to realize the control cooperation between the signal lamps of the adjacent intersections, the strength of the adjacent intersection of the intersection I is defined as follows:

wherein lane_inComposed of the lanes of adjacent crossings, to which vehicles will drive toward the crossings I, lane_iRepresenting the ith lane in the set of lanes,

indicating the strength of the ith lane. n is₀Indicating the number of vehicles passing through the intersection in unit time, t indicating the remaining time of green light at the adjacent intersection, and N being the time of lane_inThe total number of vehicles, ω, is a weight coefficient.

Step 2: designing a reinforcement learning method, generating a reinforcement learning state:

reinforcement learning methods generally include three elements: state, action and reward, the invention is designed as follows:

the state is as follows: the state is calculated after the intelligent agent observes the environment through the Internet of things equipment, and the state comprises the strength of each phase, the strength of a direct neighbor intersection and the current phase of the intersection. The intensity of each phase and the intensity of a direct neighbor intersection can be obtained by calculating the speed and the position of a vehicle collected by a road test speed sensor and an intersection camera, and the current state of a signal lamp can be directly read by the current phase of the intersection.

Taking a typical four-way intersection as an example, if there are 4 selectable phases and the current phase of the intersection is p, the state is represented as

If there is no direct neighbor crossing in a certain direction, the neighbor crossing in the direction is strongThe degree is 0.

The actions are as follows: representing the behavior taken by the model in interacting with the environment, and in signal light control problems, the actions are typically set to phase numbers. If there are 4 selectable phases, the motion space is {0,1,2,3 }.

Rewarding: rewards reflect how well an action is performed in a certain state, reflecting the quality of the action taken in the current state, to guide the learning process. The negative value of the crossing intensity is set. This means that if an action can reduce the intersection intensity more greatly, the action is considered as a better action.

The key point of the design is the state and the reward, and the state and the reward are designed through various intensity information obtained through real-time traffic data calculation.

And step 3: each intersection is provided with an intensified learning agent to control the phase selection of the traffic signal lamp.

When the green light duration of the current phase is used up, the intelligent agent selects a new optimal phase for the signal lamp by processing traffic data acquired by the intersection and the road Internet of things equipment. The simultaneously acquired traffic data and the selected phase are stored for use in training the agent.

The selection strategy of the phase is trained by the reinforcement learning method based on the intensity mechanism. The intelligent agent interacts with the traffic environment, the intelligent agent is trained through the traffic data acquired in real time, model parameters are continuously optimized while signal lamps are controlled, a more excellent control strategy is gradually learned, the strategy can be adjusted according to traffic changes, and the average waiting time of all traveling vehicles is minimized. The phase of the signal lamp can be optimally selected according to real-time traffic conditions by using the model.

And 4, step 4: selecting the green light time length according to the selected phase, calculating the number of vehicles on each lane, and applying the selected phase and the green light time length to the traffic signal lamp;

the agent will select the most reasonable time for the selected phase according to the number of vehicles on each lane at the current moment, so as to ensure that waiting vehicles on the passable lane under the selected phase can smoothly pass through the intersection and avoid time waste.

The intelligent agent firstly obtains the number of vehicles on each lane of the intersection, and then selects a most reasonable time length from the selectable duration set according to the number. The specific calculation mode is designed as follows:

first, the minimum value t of the duration of the green light is set_minAnd maximum value t_maxAnd a number M of selectable durations, based on which the set of selectable durations is defined as:

where Δ t is equal to t_minAnd t_maxThe time periods between are evenly divided according to M-1 sections, t_iIndicating the final selectable duration.

After the agent selects a phase, it is observed that the total number of vehicles on the lane of the crossing that are allowed to move for that phase is N, and a green duration is assigned to that phase

Wherein n is₀Representing the number of vehicles passing through the intersection per unit time, operator

(a.ltoreq.b) means that y is x when x.ltoreq.a and x.ltoreq.b, a when x.ltoreq.a, and b when x.ltoreq.b. t is a positive integer (N)^*Representing a set of positive integers) and is not less than

And 5: storing the data and updating the network parameters through a playback mechanism of the reinforcement learning agent.

The detailed process of step 5 is as follows: the learning-intensive experience playback buffer M is first initialized, and the duration t is initialized. Whenever the current green duration runs out, the agent needs to select the next phase of the traffic signal and the green duration. The intelligent agent firstly obtains a state s through interaction with the environment (information uploaded by Internet of things equipment such as a velometer and a sensor), then the state is input into a reinforcement learning model, the phase action a is obtained through model calculation, and the green light duration time t is calculated. Then apply phase a with duration t for the traffic signal, after time t the agent may take the next state s 'and calculate the reward r gained by taking action a, then store experience < s, a, r, s' > into the experience playback buffer. When the number of stored experiences is not less than the number available for training, the agent will randomly select a batch of samples from the experience replay buffer for model training and update the network weights using a random gradient descent each round.

In each cycle of interaction between the agent and the environment, the learning process can be roughly divided into five steps:

1) observing the traffic environment to obtain the state required by reinforcement learning;

2) selecting the optimal phase action by the reinforcement learning model;

3) calculating the duration time of the green light with the most reasonable phase;

4) applying the selected phase and green duration for the signal light;

5) storing the data and updating the network parameters through a reinforcement learning playback mechanism.

The invention has the beneficial effects that:

the invention provides a novel and effective 'strength' mechanism, dynamic data of a vehicle acquired in real time are abstracted into strength information, and a reinforcement learning method is designed on the basis of the strength information. The method greatly improves the learning convergence speed of the model, realizes the rapid convergence of the reinforcement learning process, greatly shortens the learning time of the strategy, ensures the control quality of the final strategy and improves the control quality of the final learning strategy.

The invention can not only control the phase of the traffic signal lamp in real time, but also distribute the most reasonable green time for the selected phase according to the observed real-time traffic condition. Different from the traditional fixed signal time length, the variable time length signal lamp control strategy adopted by the invention can further utilize time and shorten the average waiting time of the traveling vehicle. Compared with the traditional signal lamp control method and other signal lamp control algorithms based on reinforcement learning, the method can quickly converge to an excellent signal lamp control strategy under the condition of traffic dynamic change, improves the control quality of the strategy, and continuously optimizes the control strategy along with the change of traffic environment.

The method of the invention not only can control the phase of the traffic signal lamp in real time, but also can allocate the most reasonable green light time for the selected phase, thereby further refining and utilizing the time. The phase of the signal lamp can be controlled, and meanwhile, the green time can be dynamically adjusted according to the number of vehicles on each lane, rather than the fixed time length is allocated to the phase as in the traditional method.

Drawings

Fig. 1 is a schematic intersection diagram depicting the concept of motion, signal and phase. Wherein, the left graph road junction with arrow long line represents the action, and the left graph road junction enters a certain driving-in lane and a certain driving-out lane through the road junction. The signal is used to determine which actions are allowed at a time, where light dots indicate that movement is allowed and dark dots indicate that movement is prohibited. The phase is defined as the combination of non-conflicting signals, and as shown in the right diagram, the four phases adopted by a classical signal light control scheme, namely north-south going straight, east-west going straight, north-south turning left, east-west turning left.

Fig. 2 is a flow chart of signal control and policy learning.

FIG. 3 shows the results of the performance test of the method of the present invention.

Detailed Description

The invention is described in further detail in connection with the following specific examples and the accompanying drawings. The procedures, conditions, experimental methods and the like for carrying out the present invention are general knowledge and common general knowledge in the art except for the contents specifically mentioned below, and the present invention is not particularly limited.

The invention relates to a reinforcement learning variable-duration signal lamp control method based on Internet of things equipment, which comprises the following steps:

1. traffic data definition intensity information based on internet of things equipment acquisition:

the invention first defines the strength of the vehicle. Assuming that the current vehicle speed is v, the allowable maximum driving speed of the current lane is v_maxThe length of the lane is L, the distance between the vehicle and the intersection is x, and a weight coefficient delta is introduced, so that the vehicle strength is as follows:

vehicle_iIndicating the i-th vehicle on the lane,

indicating the intensity of the i-th vehicle on the lane.

Wherein lane_inIndicating a set of driving lanes, lane, under the action_outRepresenting a set of outgoing lanes reachable from an incoming lane, lane_iIndicating the ith lane, in the set of lanes_jRepresents the jth lane, | lane in the lane set_outL represents the number of exiting lanes,

indicating the strength of the ith lane,

indicating the intensity of the jth lane.

movement_iRepresenting the ith action that constitutes phase,

indicating the intensity corresponding to action i.

The invention defines the intersection intensity as the sum of all vehicle intensities entering the intersection minus the sum of vehicle intensities exiting the intersection, and the intersection intensity is expressed as:

wherein, lane_inIndicating a set of incoming lanes, lane, at the intersection_outSet of outgoing lanes, lane, representing an intersection_iIndicating the ith lane, in the set of lanes_jRepresents the jth lane in the set of lanes,

indicating the strength of the ith lane,

indicating the intensity of the jth lane.

2. Designing a reinforcement learning method:

the three element states, actions and rewards of the reinforcement learning method are specifically as follows:

the state is as follows: the state is calculated after the intelligent agent observes the environment, and comprises the strength of each phase, the strength of a direct neighbor intersection and the current phase of the intersection. The phase intensity and the intensity of the direct neighbor intersection can be obtained by calculating the speed and the position of the vehicle collected by the road measuring speed sensor and the intersection camera, and the current phase of the intersection can be obtained by directly reading the current state of the signal lamp.

If there is no direct neighbor intersection in a certain direction, the strength of the neighbor intersection in the direction is 0.

The method comprises the following steps: the main task of the signal lamp control strategy is to select the optimal phase position to minimize the intensity of the intersection. When the time for the current phase runs out, the agent needs to take action to select a new phase (phase may be repeatedly selected). The invention thus defines the action as the best signal phase selectable by the signal light. The invention sets 4 selectable phases, four phases as shown on the right side of fig. 1, so that each agent has four different predefined allowed actions, which are spatially coded as {0,1,2,3 }. The key to the design is generally the state and the reward, and the state and the reward are designed according to various intensity information calculated by real-time traffic data.

Rewarding: the reward should reflect the quality of the action taken in the current state to guide the learning process. Higher rewards mean better action. The optimization goal of existing signal light control problems is generally to minimize the average waiting time of the vehicle, which is an indicator that can be obtained only by long-term, continuous signal light action, and is not immediately available and therefore not directly rewarded. According to existing research, the convergence trend of the reinforcement learning algorithm is consistent whether the optimization goal is to shorten the vehicle driving time or minimize the intersection intensity. Therefore, based on the correlation between the strength and the average waiting time established by our model, the reward is set to be the opposite number of the strength value of the intersection. This means that if an action can reduce the intersection intensity more greatly, the action is considered as a better action.

In addition, the invention adopts a classic DQN network structure when designing the reinforcement learning network structure.

3. Control of phase and green duration of traffic lights:

the invention configures a reinforcement learning agent for each intersection. The intelligent agent interacts with the traffic environment, constantly optimizes model parameters while controlling the signal lamp, and learns more excellent control strategies. When the green light time of the current phase is used up, the intelligent agent selects an optimal phase for the next period of time by processing traffic data acquired by the intersection and road Internet of things equipment. The simultaneously acquired traffic data and the selected phase action are stored for use in training the agent. In addition, the agent will select the most reasonable time period for the selected phase based on the number of vehicles in each lane at the current time.

In each cycle of interaction between the agent and the environment, the learning process can be roughly divided into five steps: 1) observing the traffic environment to obtain the state required by reinforcement learning; 2) making the best phase action by the reinforcement learning model; 3) calculating the duration time of the green light with the most reasonable phase; 4) selected phase and green duration for the signal lamp application, and 5) storing the data and updating the network parameters through a reinforcement learning playback mechanism.

The detailed process of the step 5 is as follows: the learning-intensive experience playback buffer M is first initialized, and the duration t is initialized. Whenever the current green duration runs out, the agent needs to select the next phase of the traffic signal and the green duration. The intelligent agent firstly obtains a state s through interaction with the environment (information uploaded by Internet of things equipment such as a velometer and a sensor), then the state is input into a reinforcement learning model, the phase action a is obtained through model calculation, and the duration time t of a green light is calculated. Then apply phase a with duration t for the traffic signal, after time t the agent may take the next state s 'and calculate the reward r gained by taking action a, then store experience < s, a, r, s' > into the experience playback buffer. When the number of stored experiences is not less than the number available for training, the agent will randomly select a batch of samples from the experience replay buffer for model training and update the network weights using a random gradient descent each round.

The selection strategy of the phase is trained by the reinforcement learning method based on the intensity mechanism. And the phase green light duration is calculated according to the number of vehicles on each lane. The specific calculation mode is designed as follows:

first, the minimum value t of the duration of the green light is set_minAnd maximum value t_maxAnd a number M of selectable durations, based on which the set of selectable durations is defined as

Wherein n is₀Indicating a path of passage per unit timeNumber of vehicles per mouth, operator

Examples

The invention provides a reinforcement learning variable-duration signal lamp control method based on equipment of the Internet of things, which comprises the following code implementation parts (important interception):

as shown in code 1, this section includes the code for reinforcement learning method state acquisition:

code 1

In code 1, the state that the intensity information is generated through the traffic data acquired in real time and then the intensity information is processed to obtain the reinforcement learning method is mainly given, namely the state is used as the observation of the traffic condition of the intersection by the intelligent agent. The main functions are interaction _ info, get _ lanepressure, get _ neighbor _ compression, get _ state. The intersectioninfo obtains partial traffic data of the current intersection, including the number of vehicles on each lane, the vehicle traveling speed, the vehicle position, and the like. get _ Lanepressure returns the calculated lane strength, and get _ neighbor _ pressure returns the strength of the neighboring intersection. get _ state calculates to get the phase intensity, then combines the phase intensity, the intensity of the neighboring crossing and the current signal lamp phase to get the state and returns.

The calculation method of the phase green duration is shown as code 2:

code 2

In code 2, a calculation method of the green light duration is mainly given. The intelligent agent firstly obtains the number of vehicles on each lane of the intersection, and then selects a most reasonable time length from the selectable duration set according to the number.

The selection method of the action under the reinforcement learning control strategy and the return of the reward corresponding to the action are shown as code 3.

Code 3

In code 3, a selection function, choose _ action, reinforcement learning experience playback function, play, and reward function, get _ reward, of the beacon phase action are given. Wherein, the choose _ action function processes the state through the model and returns to the optimal phase action in the current state. The replay function updates the network parameters through the stored historical data.

In addition, in order to comprehensively test the performance of the invention, a Cityflow traffic simulation platform is used, simulation control is performed on 4 simulation data sets (1x3 intersection, 2x2 intersection, 3x3 intersection, 4x4 intersection) and 2 real data sets (Jinan 3x3 intersection, Hangzhou 4x4 intersection), and performance comparison is performed with a traditional signal lamp control method and other advanced reinforcement learning methods. The test results simulate the average waiting time of all traveling vehicles within one hour of traffic. Fig. 3 is a result of performance test of the method of the present invention, and it can be seen that the average waiting time of traveling vehicles can be minimized by applying the method.

The invention provides a reinforcement learning variable-duration signal lamp control method based on Internet of things equipment. The method is characterized in that strength information is designed based on various real-time traffic data collected by Internet of things equipment, and a reinforcement learning method is designed on the basis. The invention gets rid of the limit of fixed green time of the traditional signal lamp and can select the most reasonable green time according to the real-time traffic condition. The invention configures a reinforcement learning agent for each intersection. The intelligent agent interacts with the traffic environment, constantly optimizes model parameters while controlling the signal lamp, and learns more excellent control strategies. Under the condition of traffic dynamic change, the intelligent agent can quickly converge to an excellent signal lamp control strategy, so that the learning time of the strategy is greatly shortened, and the control quality of the strategy is improved.

The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, and the scope of the appended claims is intended to be protected.

Claims

1. A reinforcement learning variable-duration signal lamp control method based on Internet of things equipment is characterized by comprising the following steps:

step 1: generating newly defined intensity information through real-time traffic data acquired by the Internet of things equipment; the Internet of things equipment in the step 1 comprises a velometer and a sensor;

the real-time traffic data includes a position and a speed of the vehicle;

the intensity information comprises the intensity of the vehicle, the lane, the action, the phase and the intersection;

the calculation formula of the strength of the vehicle is as follows:

wherein, the vehicle speed is v, and the allowable maximum driving speed of the current lane is v_maxThe length of a lane is L, the distance between a vehicle and an intersection is x, and a weight coefficient delta is introduced;

the intensity of the lane is the sum of the intensities of all vehicles on the current lane, namely

vehicle_iIndicating the ith vehicle on the lane and,

indicating the intensity of the i-th vehicle on the lane;

the action intensity is the difference value between the 'intensity of a lane entering the intersection' and the 'average value of the intensity of the lane leaving the intersection' under the current action, namely

Wherein, lane_inIndicating a set of driving lanes, lane, under the action_outRepresenting a set of outgoing lanes, lane, reachable from an incoming lane_iIndicating the ith lane, in the set of lanes_jRepresents the jth lane, | lane in the lane set_outL represents the number of exiting lanes,

indicating the strength of the ith lane,

representing the intensity of the jth lane;

the intensity of the phase being the sum of the intensities of the movements permitted for that phase, i.e.

movement_iRepresenting the ith action that constitutes phase,

representing the intensity corresponding to the action i;

the intersection intensity is the sum of all the vehicle intensities entering the intersection minus the sum of the vehicle intensities exiting the intersection, and is expressed as:

wherein, lane_inIndicating a set of incoming lanes, lane, at the intersection_outSet of outgoing lanes, lane, representing an intersection_iIndicating the ith lane, in the set of lanes_jRepresenting the jth lane in the set of lanes,

indicates the intensity of the i-th lane,

representing the strength of the jth lane;

step 2: designing a reinforcement learning method based on the step 1;

and 3, step 3: each intersection is provided with an intensive learning agent, when the green light duration of the current phase is used up, the agent selects an optimal phase for a signal lamp by processing traffic data acquired by the intersection and road Internet of things equipment, and the acquired traffic data and the selected phase action are stored to train the agent;

and 4, step 4: step 3, the intelligent agent selects the most reasonable green light time length according to the selected phase, and applies the selected phase and the green light time length to the traffic signal lamp; the green light duration is obtained by calculating the number of vehicles on each lane at the current moment;

2. The method of claim 1, wherein the strength of the neighbor intersection of intersection I is:

wherein, lane_inComposed of the lanes of adjacent crossings, to which vehicles will drive toward the crossings I, lane_iRepresenting the ith lane in the set of lanes,

representing the intensity of the ith lane; n is₀Indicating the number of vehicles passing through the intersection in unit time, t indicating the remaining time of green light at the adjacent intersection, and N being the time of lane_inThe total number of vehicles, ω, is a weight coefficient.

3. The method of claim 1, wherein the reinforcement learning method in step 2 comprises three elements: status, actions, and rewards;

the state is calculated after the intelligent agent observes the environment through the Internet of things equipment, and comprises the strength of each phase, the strength of a direct neighbor intersection and the current phase of the intersection; the intensity information of each phase and the intensity of the direct neighbor intersection can be obtained by calculating the speed and the position of the vehicle collected by a road test speed sensor and an intersection camera; the current phase of the intersection can be obtained by directly reading the current state of the signal lamp;

the action represents the action taken by the interaction of the model and the environment and is set as a phase number;

the reward represents the degree of quality of executing a certain action in a certain state, and is set as a negative value of the intersection intensity.

4. The method as claimed in claim 3, wherein the key to the reinforcement learning method design is status and reward, which are designed by various intensity information calculated from real-time traffic data, and designed using DQN network structure.

5. The method of claim 1, wherein the agent interacts with the traffic environment in step 3, trains the agent through the traffic data acquired in real time, continuously optimizes model parameters while controlling the signal lights, learns an optimized control strategy, and the strategy can be adjusted to adapt to traffic changes, minimize average waiting time of all traveling vehicles, and make the best choice for the phase of the signal lights according to real-time traffic conditions.

6. The method of claim 1, wherein in step 4, the agent first obtains the number of vehicles on each lane entering the intersection, and then selects a most reasonable time length from the selectable time duration set according to the number of vehicles, wherein the most reasonable time length is to ensure that the waiting vehicles on the passable lane in the selected phase can pass through the intersection smoothly without wasting time, and the specific calculation method is as follows:

first, the minimum value t of the duration of the green light is set_minAnd maximum value t_maxAnd a number M of selectable durations, the set of selectable durations being:

where Δ t is equal to t_minAnd t_maxThe time periods between are evenly divided according to M-1 sections, t_iRepresents the final selectable duration;

after the agent selects a phase, it can be observed that the total number of vehicles on the lane of the entrance intersection allowed to move in the phase is N, and then the green duration is assigned to the phase as follows:

(a ≦ b) means y ═ x when x ≦ a and x ≦ b, a when x ≦ a, and b when x ≦ b; t is positive integer and not less than

N^*Representing a set of positive integers.

7. The method of claim 1, wherein the detailed process of step 5 is: firstly, initializing an experience playback buffer zone M for reinforcement learning, and initializing duration t; whenever the current green duration runs out, the agent needs to select the next phase of the traffic signal and the green duration; the intelligent agent firstly interacts with the environment, uploads information through the Internet of things equipment including a velocimeter and a sensor to obtain a state s, then inputs the state s into a reinforcement learning model, and calculates a phase action a and a green light duration t by the model; then applying a phase a with a duration t to the traffic light, after the time t, the agent may obtain the next state s ', and calculate a reward r obtained by taking the action a, and then store the experience < s, a, r, s' > into an experience playback buffer; when the number of stored experiences is not less than the number available for training, the agent will randomly select a batch of samples from the experience replay buffer for model training and update the network weights using a random gradient descent each round.