Disclosure of Invention
To overcome the above technical defects, the invention aims to provide a novel reinforcement learning variable-duration signal lamp control method based on Internet of Things equipment. It designs a reinforcement learning method based on the concept of intersection strength, and controls the phase selection of the signal lamp using the various kinds of real-time traffic information (such as vehicle position and speed) acquired by the Internet of Things equipment. At the same time, the most reasonable green light duration can be selected according to the number of vehicles on each lane.
The method comprises the following specific steps:
Step 1: acquire real-time traffic data through the Internet of Things equipment, process the acquired traffic information, and generate the newly defined intensity information from the acquired traffic data. The Internet of Things equipment comprises velometers and sensors; the real-time traffic data include the positions and speeds of vehicles; the intensity information includes the intensities of the vehicle, the lane, the action, the phase, and the intersection.
Through extensive research and study, it is found that most current reinforcement learning methods tend to design a complex state so as to include as much traffic information as possible; however, a complex design is usually accompanied by a long learning process. The invention proposes a brand-new concept of "strength" that fully considers various traffic data: on the basis of the vehicle position and speed data that the Internet of Things equipment can acquire, the strengths of the vehicle, the lane, the action, the phase, and the intersection can all be calculated. Designing the reinforcement learning method on the basis of this definition of strength can greatly shorten the strategy learning process.
The intensity of the vehicle is defined first. Assume that the current vehicle speed is v, the maximum allowed driving speed on the current lane is v_max, the length of the lane is L, and the distance between the vehicle and the intersection is x; a weight coefficient δ is also introduced. The vehicle strength is then defined in terms of v, v_max, L, x, and δ.
On this basis, the invention defines the lane strength as the sum of all vehicle strengths on the current lane, i.e.

P_lane = Σ_i P_vehicle_i,

where vehicle_i denotes the i-th vehicle on the lane and P_vehicle_i denotes the intensity of the i-th vehicle on the lane.
The action intensity is the difference between the intensity of the lanes driving into the intersection and the average intensity of the lanes driving out of the intersection under the current action, i.e.

P_action = Σ_{lane_i ∈ lane_in} P_i − (1/|lane_out|) · Σ_{lane_j ∈ lane_out} P_j,

where lane_in denotes the set of incoming lanes under the action, lane_out denotes the set of outgoing lanes reachable from an incoming lane, lane_i denotes the i-th lane in the lane set, lane_j denotes the j-th lane in the lane set, |lane_out| denotes the number of outgoing lanes, P_i denotes the strength of the i-th lane, and P_j denotes the strength of the j-th lane.
The phase strength is the sum of the action strengths of the movements permitted in that phase, i.e.

P_phase = Σ_i P_movement_i,

where movement_i denotes the i-th action constituting the phase and P_movement_i denotes the intensity corresponding to action i.
The invention defines the intersection intensity as the sum of all vehicle intensities entering the intersection minus the sum of the vehicle intensities exiting the intersection, i.e.

P_I = Σ_{lane_i ∈ lane_in} P_i − Σ_{lane_j ∈ lane_out} P_j,

where lane_in denotes the set of incoming lanes at the intersection, lane_out denotes the set of outgoing lanes of the intersection, lane_i denotes the i-th lane in the lane set, lane_j denotes the j-th lane in the lane set, P_i denotes the strength of the i-th lane, and P_j denotes the strength of the j-th lane.
In addition, in order to realize control cooperation between the signal lamps of adjacent intersections, the invention defines the strength of the intersections neighboring intersection I in terms of the following quantities: lane_in is the set of lanes of the adjacent intersections from which vehicles will drive toward intersection I; lane_i denotes the i-th lane in this set and P_i denotes the strength of the i-th lane; n_0 denotes the number of vehicles passing through the intersection per unit time; t denotes the remaining green time at the adjacent intersection; N is the total number of vehicles on lane_in; and ω is a weight coefficient.
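The intensity definitions above can be sketched in Python. The exact vehicle-strength formula is not reproduced in this text, so the expression below (combining relative speed and distance to the intersection with the weight δ) is only an illustrative assumption; the lane and intersection intensities, by contrast, follow the stated definitions directly.

```python
# Sketch of the intensity (strength) definitions in step 1.
# NOTE: vehicle_intensity below is an ASSUMED illustrative form, not the
# patented formula; only its inputs (v, v_max, x, L, delta) come from the text.

def vehicle_intensity(v, v_max, x, L, delta=0.5):
    """Assumed form: slower vehicles closer to the intersection weigh more."""
    return (1.0 - v / v_max) + delta * (1.0 - x / L)

def lane_intensity(vehicles, v_max, L, delta=0.5):
    """Lane intensity: sum of the intensities of all vehicles on the lane.
    `vehicles` is a list of (speed, distance_to_intersection) pairs."""
    return sum(vehicle_intensity(v, v_max, x, L, delta) for v, x in vehicles)

def intersection_intensity(in_lane_intensities, out_lane_intensities):
    """Intersection intensity: total incoming-lane intensity minus total
    outgoing-lane intensity, as stated in the definition."""
    return sum(in_lane_intensities) - sum(out_lane_intensities)
```

For example, a stopped vehicle at the stop line (v = 0, x = 0) contributes the maximum intensity, while a vehicle leaving at full speed contributes nothing.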
Step 2: designing a reinforcement learning method, generating a reinforcement learning state:
A reinforcement learning method generally includes three elements: state, action, and reward. The invention designs them as follows:
State: the state is calculated after the agent observes the environment through the Internet of Things equipment, and comprises the strength of each phase, the strength of the direct neighbor intersections, and the current phase of the intersection. The strength of each phase and the strength of a direct neighbor intersection can be calculated from the vehicle speeds and positions collected by roadside speed sensors and intersection cameras, and the current phase of the intersection can be obtained by directly reading the current state of the signal lamp.
Taking a typical four-way intersection as an example, if there are 4 selectable phases and the current phase of the intersection is p, the state consists of the strengths of the 4 phases, the strengths of the direct neighbor intersections, and p. If there is no direct neighbor intersection in a certain direction, the strength of the neighbor intersection in that direction is taken to be 0.
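The state construction described above can be sketched as follows. Encoding the current phase as a one-hot vector is an assumption made for illustration, since the text only says the state "comprises" the current phase:

```python
def build_state(phase_intensities, neighbor_intensities, current_phase,
                n_phases=4):
    """State = the 4 phase intensities + the direct-neighbor intersection
    intensities + the current phase (encoded one-hot here, an assumption).
    A direction with no direct neighbor is passed as None and contributes 0."""
    one_hot = [1.0 if i == current_phase else 0.0 for i in range(n_phases)]
    neighbors = [x if x is not None else 0.0 for x in neighbor_intensities]
    return list(phase_intensities) + neighbors + one_hot
```

For a four-way intersection this yields a fixed-length vector (here 4 + 4 + 4 = 12 entries) suitable as network input.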
Actions: an action represents the behavior taken by the model when interacting with the environment; in signal light control problems, the action is typically the phase number. If there are 4 selectable phases, the action space is {0, 1, 2, 3}.
Rewarding: rewards reflect how well an action is performed in a certain state, reflecting the quality of the action taken in the current state, to guide the learning process. The negative value of the crossing intensity is set. This means that if an action can reduce the intersection intensity more greatly, the action is considered as a better action.
The key points of the design are the state and the reward, both of which are designed using the various kinds of intensity information calculated from the real-time traffic data.
Step 3: a reinforcement learning agent is configured at each intersection to control the phase selection of the traffic signal lamp.
When the green light duration of the current phase runs out, the agent selects a new optimal phase for the signal lamp by processing the traffic data acquired by the Internet of Things equipment at the intersection and along the road. The traffic data acquired at the same time and the selected phase are stored for later use in training the agent.
The phase selection strategy is trained by the reinforcement learning method based on the intensity mechanism. The agent interacts with the traffic environment and is trained on the traffic data acquired in real time; it continuously optimizes its model parameters while controlling the signal lamp, gradually learns a better control strategy, adjusts the strategy as traffic changes, and minimizes the average waiting time of all traveling vehicles. With this model, the phase of the signal lamp can be selected optimally according to real-time traffic conditions.
Step 4: select the green light duration for the selected phase by calculating the number of vehicles on each lane, and apply the selected phase and the green light duration to the traffic signal lamp.
the agent will select the most reasonable time for the selected phase according to the number of vehicles on each lane at the current moment, so as to ensure that waiting vehicles on the passable lane under the selected phase can smoothly pass through the intersection and avoid time waste.
The agent first obtains the number of vehicles on each lane of the intersection, and then selects the most reasonable duration from the selectable duration set according to that number. The specific calculation is designed as follows:
First, the minimum green light duration t_min, the maximum green light duration t_max, and the number M of selectable durations are set, on the basis of which the set of selectable durations is defined as

{t_i = t_min + i · Δt | i = 0, 1, …, M − 1},

where Δt = (t_max − t_min)/(M − 1) divides the interval between t_min and t_max evenly into M − 1 segments, and t_i denotes a final selectable duration.
After the agent selects a phase, the total number of vehicles observed on the lanes of the intersection whose movements are allowed in that phase is N, and the green duration assigned to that phase is

t = clamp_{t_min ≤ · ≤ t_max}(N / n_0),

where n_0 denotes the number of vehicles passing through the intersection per unit time, and the operator clamp_{a ≤ · ≤ b}(x) returns x when a ≤ x ≤ b, returns a when x < a, and returns b when x > b. The duration t is a positive integer (N* denotes the set of positive integers) and is not less than N/n_0.
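The duration selection above can be sketched as follows: the agent picks the smallest selectable duration that is not less than N/n_0, clamped to [t_min, t_max], so that waiting vehicles can clear the intersection without wasting green time.

```python
def selectable_durations(t_min, t_max, M):
    """Evenly divide [t_min, t_max] into M selectable green durations."""
    dt = (t_max - t_min) / (M - 1)
    return [t_min + i * dt for i in range(M)]

def choose_green_duration(N, n0, t_min, t_max, M):
    """Pick the shortest selectable duration long enough for the N waiting
    vehicles to clear at n0 vehicles per unit time; clamp at t_max."""
    need = N / n0  # time needed to discharge all waiting vehicles
    for t in selectable_durations(t_min, t_max, M):
        if t >= need:
            return t
    return t_max  # clamp at the upper bound when demand exceeds t_max
```

With t_min = 10, t_max = 30, and M = 5, the selectable set is {10, 15, 20, 25, 30}; 12 waiting vehicles at one vehicle per second yield a 15-second green.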
And 5: storing the data and updating the network parameters through a playback mechanism of the reinforcement learning agent.
The detailed process of step 5 is as follows: the reinforcement learning experience replay buffer M is first initialized, and the duration t is initialized. Whenever the current green duration runs out, the agent needs to select the next phase of the traffic signal and the green light duration. The agent first obtains a state s through interaction with the environment (information uploaded by Internet of Things equipment such as velometers and sensors), then inputs the state into the reinforcement learning model, obtains the phase action a by model calculation, and calculates the green light duration t. Phase a is then applied to the traffic signal with duration t; after time t, the agent obtains the next state s' and calculates the reward r gained by taking action a, and then stores the experience <s, a, r, s'> into the experience replay buffer. When the number of stored experiences is not less than the number required for training, the agent randomly selects a batch of samples from the experience replay buffer for model training and updates the network weights using stochastic gradient descent in each round.
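The experience replay mechanism described in step 5 can be sketched as a small buffer class; the capacity and batch size below are illustrative assumptions, not values from the text:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay buffer: stores <s, a, r, s'> transitions and serves
    random minibatches once enough experience has accumulated."""
    def __init__(self, capacity=10000, batch_size=32):
        self.buffer = deque(maxlen=capacity)  # old experiences are evicted
        self.batch_size = batch_size

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def ready(self):
        # train only when stored experiences reach the training batch size
        return len(self.buffer) >= self.batch_size

    def sample(self):
        return random.sample(self.buffer, self.batch_size)
```

Each control cycle the agent calls `store` after observing s', and, once `ready()` holds, trains on a `sample()` and updates the network weights.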
In each cycle of interaction between the agent and the environment, the learning process can be roughly divided into five steps:
1) observing the traffic environment to obtain the state required by reinforcement learning;
2) selecting the optimal phase action by the reinforcement learning model;
3) calculating the duration time of the green light with the most reasonable phase;
4) applying the selected phase and green duration for the signal light;
5) storing the data and updating the network parameters through a reinforcement learning playback mechanism.
The invention has the beneficial effects that:
the invention provides a novel and effective 'strength' mechanism, dynamic data of a vehicle acquired in real time are abstracted into strength information, and a reinforcement learning method is designed on the basis of the strength information. The method greatly improves the learning convergence speed of the model, realizes the rapid convergence of the reinforcement learning process, greatly shortens the learning time of the strategy, ensures the control quality of the final strategy and improves the control quality of the final learning strategy.
The invention can not only control the phase of the traffic signal lamp in real time, but also distribute the most reasonable green time for the selected phase according to the observed real-time traffic condition. Different from the traditional fixed signal time length, the variable time length signal lamp control strategy adopted by the invention can further utilize time and shorten the average waiting time of the traveling vehicle. Compared with the traditional signal lamp control method and other signal lamp control algorithms based on reinforcement learning, the method can quickly converge to an excellent signal lamp control strategy under the condition of traffic dynamic change, improves the control quality of the strategy, and continuously optimizes the control strategy along with the change of traffic environment.
The method of the invention not only controls the phase of the traffic signal lamp in real time, but also allocates the most reasonable green light time to the selected phase, thereby making finer use of time. While controlling the phase of the signal lamp, it dynamically adjusts the green time according to the number of vehicles on each lane, rather than allocating a fixed duration to the phase as in traditional methods.
Detailed Description
The invention is described in further detail in connection with the following specific examples and the accompanying drawings. Except for the contents specifically mentioned below, the procedures, conditions, and experimental methods for carrying out the invention are common general knowledge in the art, and the invention is not particularly limited thereto.
The invention relates to a reinforcement learning variable-duration signal lamp control method based on Internet of things equipment, which comprises the following steps:
1. Defining intensity information based on traffic data acquired by Internet of Things equipment:
the invention first defines the strength of the vehicle. Assuming that the current vehicle speed is v, the allowable maximum driving speed of the current lane is vmaxThe length of the lane is L, the distance between the vehicle and the intersection is x, and a weight coefficient delta is introduced, so that the vehicle strength is as follows:
On this basis, the invention defines the lane strength as the sum of all vehicle strengths on the current lane, i.e.

P_lane = Σ_i P_vehicle_i,

where vehicle_i denotes the i-th vehicle on the lane and P_vehicle_i denotes the intensity of the i-th vehicle on the lane.
The action intensity is the difference between the intensity of the lanes driving into the intersection and the average intensity of the lanes driving out of the intersection under the current action, i.e.

P_action = Σ_{lane_i ∈ lane_in} P_i − (1/|lane_out|) · Σ_{lane_j ∈ lane_out} P_j,

where lane_in denotes the set of incoming lanes under the action, lane_out denotes the set of outgoing lanes reachable from an incoming lane, lane_i denotes the i-th lane in the lane set, lane_j denotes the j-th lane in the lane set, |lane_out| denotes the number of outgoing lanes, P_i denotes the strength of the i-th lane, and P_j denotes the strength of the j-th lane.
The phase strength is the sum of the action strengths of the movements permitted in that phase, i.e.

P_phase = Σ_i P_movement_i,

where movement_i denotes the i-th action constituting the phase and P_movement_i denotes the intensity corresponding to action i.
The invention defines the intersection intensity as the sum of all vehicle intensities entering the intersection minus the sum of the vehicle intensities exiting the intersection, i.e.

P_I = Σ_{lane_i ∈ lane_in} P_i − Σ_{lane_j ∈ lane_out} P_j,

where lane_in denotes the set of incoming lanes at the intersection, lane_out denotes the set of outgoing lanes of the intersection, lane_i denotes the i-th lane in the lane set, lane_j denotes the j-th lane in the lane set, P_i denotes the strength of the i-th lane, and P_j denotes the strength of the j-th lane.
In addition, in order to realize control cooperation between the signal lamps of adjacent intersections, the invention defines the strength of the intersections neighboring intersection I in terms of the following quantities: lane_in is the set of lanes of the adjacent intersections from which vehicles will drive toward intersection I; lane_i denotes the i-th lane in this set and P_i denotes the strength of the i-th lane; n_0 denotes the number of vehicles passing through the intersection per unit time; t denotes the remaining green time at the adjacent intersection; N is the total number of vehicles on lane_in; and ω is a weight coefficient.
2. Designing a reinforcement learning method:
the three element states, actions and rewards of the reinforcement learning method are specifically as follows:
State: the state is calculated after the agent observes the environment, and comprises the strength of each phase, the strength of the direct neighbor intersections, and the current phase of the intersection. The phase strengths and the strengths of the direct neighbor intersections can be calculated from the vehicle speeds and positions collected by roadside speed sensors and intersection cameras, and the current phase of the intersection can be obtained by directly reading the current state of the signal lamp.
Taking a typical four-way intersection as an example, if there are 4 selectable phases and the current phase of the intersection is p, the state consists of the strengths of the 4 phases, the strengths of the direct neighbor intersections, and p.
If there is no direct neighbor intersection in a certain direction, the strength of the neighbor intersection in the direction is 0.
Actions: the main task of the signal lamp control strategy is to select the optimal phase to minimize the intersection intensity. When the time for the current phase runs out, the agent needs to take an action to select a new phase (a phase may be selected repeatedly). The invention therefore defines the action as the signal phase selectable by the signal lamp. The invention sets 4 selectable phases, as shown on the right side of Fig. 1, so each agent has four different predefined allowed actions, and the action space is encoded as {0, 1, 2, 3}. The key points of the design are the state and the reward, both of which are designed from the various kinds of intensity information calculated from real-time traffic data.
Rewarding: the reward should reflect the quality of the action taken in the current state to guide the learning process. Higher rewards mean better action. The optimization goal of existing signal light control problems is generally to minimize the average waiting time of the vehicle, which is an indicator that can be obtained only by long-term, continuous signal light action, and is not immediately available and therefore not directly rewarded. According to existing research, the convergence trend of the reinforcement learning algorithm is consistent whether the optimization goal is to shorten the vehicle driving time or minimize the intersection intensity. Therefore, based on the correlation between the strength and the average waiting time established by our model, the reward is set to be the opposite number of the strength value of the intersection. This means that if an action can reduce the intersection intensity more greatly, the action is considered as a better action.
In addition, the invention adopts a classic DQN network structure when designing the reinforcement learning network structure.
3. Control of phase and green duration of traffic lights:
the invention configures a reinforcement learning agent for each intersection. The intelligent agent interacts with the traffic environment, constantly optimizes model parameters while controlling the signal lamp, and learns more excellent control strategies. When the green light time of the current phase is used up, the intelligent agent selects an optimal phase for the next period of time by processing traffic data acquired by the intersection and road Internet of things equipment. The simultaneously acquired traffic data and the selected phase action are stored for use in training the agent. In addition, the agent will select the most reasonable time period for the selected phase based on the number of vehicles in each lane at the current time.
In each cycle of interaction between the agent and the environment, the learning process can be roughly divided into five steps: 1) observing the traffic environment to obtain the state required by reinforcement learning; 2) selecting the optimal phase action with the reinforcement learning model; 3) calculating the most reasonable green light duration for the phase; 4) applying the selected phase and green duration to the signal lamp; and 5) storing the data and updating the network parameters through the reinforcement learning replay mechanism.
The detailed process of step 5 is as follows: the reinforcement learning experience replay buffer M is first initialized, and the duration t is initialized. Whenever the current green duration runs out, the agent needs to select the next phase of the traffic signal and the green light duration. The agent first obtains a state s through interaction with the environment (information uploaded by Internet of Things equipment such as velometers and sensors), then inputs the state into the reinforcement learning model, obtains the phase action a by model calculation, and calculates the green light duration t. Phase a is then applied to the traffic signal with duration t; after time t, the agent obtains the next state s' and calculates the reward r gained by taking action a, and then stores the experience <s, a, r, s'> into the experience replay buffer. When the number of stored experiences is not less than the number required for training, the agent randomly selects a batch of samples from the experience replay buffer for model training and updates the network weights using stochastic gradient descent in each round.
The selection strategy of the phase is trained by the reinforcement learning method based on the intensity mechanism. And the phase green light duration is calculated according to the number of vehicles on each lane. The specific calculation mode is designed as follows:
First, the minimum green light duration t_min, the maximum green light duration t_max, and the number M of selectable durations are set, on the basis of which the set of selectable durations is defined as

{t_i = t_min + i · Δt | i = 0, 1, …, M − 1},

where Δt = (t_max − t_min)/(M − 1) divides the interval between t_min and t_max evenly into M − 1 segments, and t_i denotes a final selectable duration.
After the agent selects a phase, the total number of vehicles observed on the lanes of the intersection whose movements are allowed in that phase is N, and the green duration assigned to that phase is

t = clamp_{t_min ≤ · ≤ t_max}(N / n_0),

where n_0 denotes the number of vehicles passing through the intersection per unit time, and the operator clamp_{a ≤ · ≤ b}(x) returns x when a ≤ x ≤ b, returns a when x < a, and returns b when x > b. The duration t is a positive integer (N* denotes the set of positive integers) and is not less than N/n_0.
Examples
The invention provides a reinforcement learning variable-duration signal lamp control method based on Internet of Things equipment, with the following code implementation (key excerpts):
as shown in code 1, this section includes the code for reinforcement learning method state acquisition:
code 1
Code 1 mainly shows how the intensity information is generated from the traffic data acquired in real time and then processed to obtain the state of the reinforcement learning method, i.e. the state used by the agent as its observation of the traffic conditions at the intersection. The main functions are intersection_info, get_lanepressure, get_neighbor_pressure, and get_state. intersection_info obtains part of the traffic data of the current intersection, including the number of vehicles on each lane, the vehicle traveling speeds, the vehicle positions, and so on. get_lanepressure returns the calculated lane strength, and get_neighbor_pressure returns the strength of the neighboring intersections. get_state calculates the phase intensities, then combines the phase intensities, the strengths of the neighboring intersections, and the current signal lamp phase into the state and returns it.
The calculation method of the phase green duration is shown as code 2:
code 2
Code 2 mainly gives the calculation method of the green light duration. The agent first obtains the number of vehicles on each lane of the intersection, and then selects the most reasonable duration from the selectable duration set according to that number.
The selection method of the action under the reinforcement learning control strategy and the return of the reward corresponding to the action are shown as code 3.
Code 3
Code 3 gives the signal phase action selection function choose_action, the reinforcement learning experience replay function replay, and the reward function get_reward. The choose_action function processes the state through the model and returns the optimal phase action in the current state. The replay function updates the network parameters using the stored historical data.
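Code 3 itself is not reproduced in this text; a minimal sketch of a choose_action function is given below. The ε-greedy exploration scheme is an assumption added for illustration, since the text only describes returning the optimal phase action for the current state:

```python
import random

def choose_action(state, q_function, n_actions=4, epsilon=0.1):
    """Phase selection sketch. With probability epsilon a random phase is
    explored (ASSUMED epsilon-greedy scheme); otherwise the phase with the
    highest Q-value returned by `q_function` is chosen."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    q = q_function(state)
    return max(range(n_actions), key=lambda a: q[a])
```

With epsilon set to 0 this reduces to the purely greedy selection the text describes.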
In addition, in order to comprehensively test the performance of the invention, the CityFlow traffic simulation platform is used; simulated control is performed on 4 synthetic data sets (1x3, 2x2, 3x3, and 4x4 intersections) and 2 real data sets (Jinan 3x3 intersections, Hangzhou 4x4 intersections), and performance is compared with traditional signal lamp control methods and other advanced reinforcement learning methods. The test metric is the average waiting time of all traveling vehicles over one hour of simulated traffic. Fig. 3 shows the performance test results of the method of the invention; it can be seen that applying the method minimizes the average waiting time of traveling vehicles.
The invention provides a reinforcement learning variable-duration signal lamp control method based on Internet of things equipment. The method is characterized in that strength information is designed based on various real-time traffic data collected by Internet of things equipment, and a reinforcement learning method is designed on the basis. The invention gets rid of the limit of fixed green time of the traditional signal lamp and can select the most reasonable green time according to the real-time traffic condition. The invention configures a reinforcement learning agent for each intersection. The intelligent agent interacts with the traffic environment, constantly optimizes model parameters while controlling the signal lamp, and learns more excellent control strategies. Under the condition of traffic dynamic change, the intelligent agent can quickly converge to an excellent signal lamp control strategy, so that the learning time of the strategy is greatly shortened, and the control quality of the strategy is improved.
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, and the scope of the appended claims is intended to be protected.