CN114639255A - Traffic signal control method, device, equipment and medium - Google Patents

Info

Publication number
CN114639255A
Authority
CN
China
Prior art keywords
value
target
intersection
characteristic
phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210314258.3A
Other languages
Chinese (zh)
Other versions
CN114639255B (en)
Inventor
相强强
程兴硕
王泽�
伍召举
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202210314258.3A
Publication of CN114639255A
Application granted
Publication of CN114639255B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/07 Controlling traffic signals
    • G08G1/081 Plural intersections under common control
    • G08G1/082 Controlling the time between beginning of the same phase of a cycle at adjacent intersections
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/01 Detecting movement of traffic to be counted or controlled
    • G08G1/0104 Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125 Traffic data processing
    • G08G1/0129 Traffic data processing for creating historical data or processing based on historical data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a traffic signal control method, device, equipment and medium. Target characteristic values collected at a target intersection and at an adjacent downstream intersection are acquired; based on an actuator discriminator model trained in advance, a target probability value is obtained for the action control parameters corresponding to the input target characteristic values; the target preset action control parameters of the target intersection are determined according to the target probability values; and the traffic signal of the target intersection is controlled accordingly. Because the intelligent device corresponding to each intersection takes the target characteristic values of the adjacent downstream intersection into account when determining the target preset action control parameters of its own intersection, competition among the intelligent devices of the intersections is reduced and the overall control effect of the traffic trunk line is improved.

Description

Traffic signal control method, device, equipment and medium
Technical Field
The present invention relates to the technical field of traffic signal control, and in particular, to a traffic signal control method, apparatus, device, and medium.
Background
With population growth and accelerating urbanization, urban travel demand has increased sharply, and existing traffic infrastructure struggles to meet it, which causes both recurrent and non-recurrent congestion in urban traffic. Traffic signal control is the core of urban traffic management and control: a scientific and reasonable signal control scheme can maximize intersection throughput, improve the operating efficiency of the urban road network and the traffic capacity of intersections, and reduce the frequency and severity of traffic conflicts, thereby relieving urban traffic congestion.
In the prior art, adaptive traffic signal control schemes control traffic signals mainly on the basis of predictions from a fixed traffic model, selection among preset signal control schemes, or real-time prediction by a traffic simulation model. Such schemes are essentially driven by the simulation model and require the traffic simulation model parameters to be calibrated, and a predefined signal control scheme to be designed, in advance for the actual traffic scene.
The multipoint traffic signal control based on deep reinforcement learning provided in the prior art mostly transplants single-point signal control directly to a multipoint scene: the intelligent devices at all intersections use the same neural network model obtained by deep reinforcement learning, and each intelligent device, when using the model, only seeks the optimal control effect for the traffic signal of its own intersection. However, adjacent intersections affect each other. If the vehicle passing rate is used as the criterion for evaluating the control effect, an intelligent device is likely to set the traffic signal lamp of its own intersection to green for the preset time period after the current moment without regard to the influence on the adjacent downstream intersection. Traffic congestion then occurs at the adjacent downstream intersection, and the overall control effect of the traffic trunk line is poor.
Disclosure of Invention
The invention provides a traffic signal control method, device, equipment and medium, which are used to solve the problem in the prior art that the overall control effect of a traffic trunk line is poor.
The invention provides a traffic signal control method, applied to the intelligent device corresponding to each intersection of a traffic trunk line, the method comprising:
acquiring target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest;
inputting the target characteristic values into an actuator discriminator model trained in advance, and acquiring the target probability value of each set output by the model;
and determining the target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.
Further, the determining the target set of the target intersection according to the target probability value of each set comprises:
determining the target set of the target intersection by sampling based on the target probability value of each set, wherein the higher the target probability value of a set, the higher the probability that the set is sampled; or
determining the set with the maximum target probability value as the target set of the target intersection.
Further, the training process of the actuator discriminator model comprises:
acquiring a first target characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of the first preset state characteristics of each phase in a preset control period;
inputting the first target characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquiring a second sample characteristic value of the first preset state characteristics and a third sample characteristic value of second preset state characteristics of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in the preset control period after the current moment;
determining the second sample characteristic value as the updated first target characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminator model according to the reward value;
training the original actuator discriminator model with the updated parameter values according to the updated first target characteristic value; in each training iteration, calculating the first probability value of each sample set output by the original actuator discriminator model based on the parameter values and the first target characteristic value updated for that iteration, and calculating an expected value of the sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter value of each parameter in each sample set and the corresponding first probability value; and obtaining the trained actuator discriminator model when the expected value reaches its maximum.
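The training steps above can be sketched as a loop that alternates between acting in a simulator and updating the model from the resulting reward. The sketch below is only illustrative: it assumes the actuator discriminator model behaves like a standard one-step actor-critic with linear function approximation, which this passage does not confirm, and `ToySimulator`, its queue dynamics, the reward sign and all hyperparameters are hypothetical stand-ins for the real traffic simulator and network.

```python
import math
import random


def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class ToySimulator:
    """Hypothetical two-phase stand-in for the traffic simulator."""

    def __init__(self):
        self.queues = [5.0, 5.0]  # vehicles waiting per phase

    def observe(self):
        # Normalized queue per phase plus a constant bias feature.
        return [self.queues[0] / 50.0, self.queues[1] / 50.0, 1.0]

    def step(self, green_splits):
        demand = (7.0, 3.0)  # vehicles arriving per phase in each control period
        for i in range(2):
            served = 12.0 * green_splits[i]
            self.queues[i] = min(50.0, max(0.0, self.queues[i] + demand[i] - served))
        reward = -sum(self.queues) / 100.0  # assumed sign: shorter queues are better
        return self.observe(), reward


def train(candidate_sets, steps=2000, gamma=0.9, lr_actor=0.05, lr_critic=0.05, seed=0):
    rng = random.Random(seed)
    sim = ToySimulator()
    state = sim.observe()
    theta = [[0.0] * len(state) for _ in candidate_sets]  # actor: one row per set
    w = [0.0] * len(state)                                # critic: linear state value
    for _ in range(steps):
        probs = softmax([dot(row, state) for row in theta])
        k = rng.choices(range(len(candidate_sets)), weights=probs, k=1)[0]
        next_state, reward = sim.step(candidate_sets[k])
        # One-step temporal-difference error computed by the critic.
        delta = reward + gamma * dot(w, next_state) - dot(w, state)
        w = [wi + lr_critic * delta * xi for wi, xi in zip(w, state)]
        # Policy-gradient step: grad of log softmax is (1[j == k] - probs[j]) * state.
        for j, row in enumerate(theta):
            scale = lr_actor * delta * ((1.0 if j == k else 0.0) - probs[j])
            theta[j] = [t + scale * xi for t, xi in zip(row, state)]
        state = next_state
    return theta, w, softmax([dot(row, state) for row in theta])


theta, w, final_probs = train([(0.3, 0.6), (0.45, 0.45), (0.6, 0.3)])
```

Here each candidate tuple plays the role of one sample set of green signal ratios, and `final_probs` corresponds to the first probability values the model assigns to those sets in the last observed state.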
Further, the determining a reward value for the original actuator discriminator model from the third sample feature value comprises:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of each phase, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of each phase, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum;
and obtaining the reward value of the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.
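As a small worked illustration of the sums and ratios described above, the following sketch computes the flow-weighted average delay for each intersection and adds the two results. The function names and sample numbers are hypothetical, and how the combined term is turned into the final reward (for example, whether it is negated so that lower delay yields higher reward) is not stated in this passage and is left open here.

```python
def weighted_average_delay(delays_s, arrivals):
    """Flow-weighted average delay: sum(d_i * q_i) / sum(q_i) over the phases."""
    total_flow = sum(arrivals)
    if total_flow == 0:
        return 0.0  # no arrivals in the period; this fallback is an assumption
    return sum(d * q for d, q in zip(delays_s, arrivals)) / total_flow


def combined_delay_term(target, downstream):
    """Sum of the target and downstream terms (the 'fifth sum' in the text)."""
    return weighted_average_delay(*target) + weighted_average_delay(*downstream)


# Hypothetical per-phase (average delay in s, arrival flow) for both intersections.
target = ([20.0, 40.0], [30, 10])
downstream = ([10.0, 30.0], [20, 20])
term = combined_delay_term(target, downstream)
```

With these sample values the target intersection contributes (20·30 + 40·10) / 40 = 25 and the downstream intersection contributes (10·20 + 30·20) / 40 = 20.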
Accordingly, the present invention provides a traffic signal control apparatus, said apparatus comprising:
an acquisition module, configured to acquire target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest;
a processing module, configured to input the target characteristic values into an actuator discriminator model trained in advance and acquire the target probability value of each set output by the model;
and a control module, configured to determine the target set of the target intersection according to the target probability value of each set, and to control the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.
Further, the control module is specifically configured to sample and determine the target set of the target intersection based on the target probability value of each set, where the higher the target probability value of a set is, the higher the possibility of being sampled is; or determining the set with the maximum target probability value as the target set of the target intersection.
Further, the apparatus further comprises:
a training module, configured to: acquire a first target characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of the first preset state characteristics of each phase in a preset control period; input the first target characteristic value into an original actuator discriminator model, acquire a first probability value of each sample set output by the original actuator discriminator model, and determine a target sample set according to the first probability value of each sample set; input the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquire a second sample characteristic value of the first preset state characteristics and a third sample characteristic value of second preset state characteristics of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in the preset control period after the current moment; determine the second sample characteristic value as the updated first target characteristic value, determine a reward value for the original actuator discriminator model according to the third sample characteristic value, and update the parameter values in the original actuator discriminator model according to the reward value; train the original actuator discriminator model with the updated parameter values according to the updated first target characteristic value; in each training iteration, calculate the first probability value of each sample set output by the original actuator discriminator model based on the parameter values and the first target characteristic value updated for that iteration, and calculate an expected value of the sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter value of each parameter in each sample set and the corresponding first probability value; and obtain the trained actuator discriminator model when the expected value reaches its maximum.
Further, the training module is specifically configured to: determine, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of each phase, and obtain a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum; determine, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of each phase, and obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum; and obtain the reward value of the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.
Accordingly, the present invention provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of any of the above-described traffic signal control methods.
Accordingly, the present invention provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of any of the above-mentioned traffic signal control methods.
The invention provides a traffic signal control method, device, equipment and medium. Target characteristic values collected at a target intersection and at an adjacent downstream intersection are acquired; based on an actuator discriminator model trained in advance, a target probability value is obtained for the action control parameters corresponding to the input target characteristic values; the target preset action control parameters of the target intersection are determined according to the target probability values; and the traffic signal of the target intersection is controlled accordingly. Because the intelligent device corresponding to each intersection takes the target characteristic values of the adjacent downstream intersection into account when determining the target preset action control parameters of its own intersection, competition among the intelligent devices of the intersections is reduced and the overall control effect of the traffic trunk line is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a traffic trunk line according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the training process of an actuator discriminator model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to improve the overall control effect of a traffic trunk line, the embodiment of the invention provides a traffic signal control method, a traffic signal control device, traffic signal control equipment and a traffic signal control medium.
Example 1:
Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention. The process includes the following steps:
S101: Acquire target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest.
The traffic signal control method provided by the embodiment of the invention is applied to the intelligent device corresponding to each intersection of a traffic trunk line. The intelligent device may be an intelligent terminal that controls traffic signal lamps, such as a PC (personal computer), a tablet computer or a mobile terminal, or it may be a server that controls traffic signal lamps; the server may be a local server, a cloud server, or the controller of a traffic signal lamp. The embodiment of the present invention does not limit this.
In order to improve the overall control effect of the traffic trunk line, the intelligent device acquires target characteristic values collected at a target intersection and an adjacent downstream intersection. The target intersection is the intersection corresponding to the intelligent device, that is, the intersection where the traffic signal lamp controlled by the intelligent device is located. The adjacent downstream intersection is the intersection that is adjacent to the target intersection and located downstream of it in the vehicle passing direction. Fig. 2 is a schematic diagram of a traffic trunk line according to an embodiment of the present invention. Taking the left-to-right direction in Fig. 2 as the vehicle passing direction, if the intersection on the left side of Fig. 2 is the target intersection, the intersection in the middle of Fig. 2 is the adjacent downstream intersection.
Specifically, the intelligent device may obtain the target characteristic value of the target intersection collected by an image acquisition device connected to the intelligent device, and the target characteristic value of the adjacent downstream intersection collected by an image acquisition device that is connected to the intelligent device and corresponds to the adjacent downstream intersection; alternatively, it may obtain the target characteristic value of the target intersection collected by its own image acquisition unit, and the target characteristic value of the adjacent downstream intersection sent by the intelligent device corresponding to that intersection.
The target characteristic value is the characteristic value of the first preset state characteristics of each phase in the preset control period before the current moment. The preset control period is the maximum of the optimal periods of the intersections in the traffic trunk line, each calculated according to the Webster optimal period formula. The first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest. The arrival flow rate per unit time is the ratio of the arrival flow of vehicles passing through the intersection in the preset control period before the current moment to the duration of the preset control period. The lane length occupied per vehicle when the vehicle queue is longest is the ratio of the maximum queuing length in the preset control period before the current moment to the number of vehicles in that queue.
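As a rough illustration of how the preset control period could be derived, the sketch below applies the commonly cited form of Webster's optimal cycle formula, C0 = (1.5 L + 5) / (1 − Y), to each intersection and takes the maximum. The function names, the per-intersection inputs (total lost time per cycle in seconds and the sum of critical-phase flow ratios) and the sample values are assumptions made for illustration and do not come from the patent text.

```python
def webster_optimal_cycle(lost_time_s, flow_ratio_sum):
    """Webster's optimal cycle length in seconds: C0 = (1.5 * L + 5) / (1 - Y)."""
    if not 0.0 <= flow_ratio_sum < 1.0:
        raise ValueError("sum of critical flow ratios must be in [0, 1)")
    return (1.5 * lost_time_s + 5.0) / (1.0 - flow_ratio_sum)


def preset_control_period(intersections):
    """Maximum Webster optimal cycle over all intersections of the trunk line."""
    return max(webster_optimal_cycle(lost, y) for lost, y in intersections)


# Hypothetical trunk line: (lost time per cycle in s, sum of critical flow ratios)
trunk = [(12.0, 0.60), (10.0, 0.50), (12.0, 0.75)]
period = preset_control_period(trunk)
```

In this example the three intersections yield optimal cycles of roughly 57.5 s, 40 s and 92 s, so the most heavily loaded intersection sets the common period.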
For example, let m denote the target intersection and n denote the adjacent downstream intersection. The target characteristic values of the target intersection and the adjacent downstream intersection are
Figure BDA0003568471590000081
where
Figure BDA0003568471590000082
In the above target characteristic value
Figure BDA0003568471590000083
the symbols have the following meanings:
Figure BDA0003568471590000084
denotes the maximum queuing length of the i-th phase of the target intersection m in the preset control period before the current moment;
Figure BDA0003568471590000085
denotes the number of vehicles corresponding to the maximum queuing length of the i-th phase of the target intersection m in the preset control period before the current moment;
Figure BDA0003568471590000086
denotes the lane length occupied per vehicle when the vehicle queue of the i-th phase of the target intersection m is longest in the preset control period before the current moment;
Figure BDA0003568471590000087
denotes the arrival flow rate per unit time of the i-th phase of the target intersection m in the preset control period before the current moment;
Figure BDA0003568471590000088
denotes the maximum queuing length of the i-th phase of the adjacent downstream intersection n in the preset control period before the current moment;
Figure BDA0003568471590000089
denotes the number of vehicles corresponding to the maximum queuing length of the i-th phase of the adjacent downstream intersection n in the preset control period before the current moment;
Figure BDA00035684715900000810
denotes the lane length occupied per vehicle when the vehicle queue of the i-th phase of the adjacent downstream intersection n is longest in the preset control period before the current moment;
Figure BDA00035684715900000811
denotes the arrival flow rate per unit time of the i-th phase of the adjacent downstream intersection n in the preset control period before the current moment.
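The two per-phase quantities defined above are simple ratios, so the target characteristic values can be assembled directly from per-phase counts. The following is a minimal sketch under stated assumptions: the `PhaseObservation` type, the ordering of features in the resulting list, the units and the handling of an empty queue are all hypothetical choices, not details given by the patent.

```python
from dataclasses import dataclass


@dataclass
class PhaseObservation:
    arrivals: int               # vehicles that arrived during the control period
    max_queue_length_m: float   # longest queue observed, in metres
    vehicles_in_max_queue: int  # vehicles in that longest queue


def phase_features(obs, period_s):
    """Return (arrival flow rate per unit time, lane length occupied per vehicle)."""
    flow_rate = obs.arrivals / period_s
    # Guard against an empty queue; using 0.0 here is an assumption of this sketch.
    if obs.vehicles_in_max_queue > 0:
        per_vehicle_length = obs.max_queue_length_m / obs.vehicles_in_max_queue
    else:
        per_vehicle_length = 0.0
    return flow_rate, per_vehicle_length


def target_characteristic_values(target_phases, downstream_phases, period_s):
    """Concatenate per-phase features of the target and downstream intersections."""
    values = []
    for obs in list(target_phases) + list(downstream_phases):
        values.extend(phase_features(obs, period_s))
    return values


m_phases = [PhaseObservation(30, 42.0, 6), PhaseObservation(12, 0.0, 0)]
n_phases = [PhaseObservation(24, 35.0, 5), PhaseObservation(18, 21.0, 3)]
state = target_characteristic_values(m_phases, n_phases, period_s=120.0)
```

The resulting list holds two values per phase, first for the target intersection m and then for the adjacent downstream intersection n.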
S102: Input the target characteristic values into the actuator discriminator model trained in advance to obtain the target probability value of each output set.
In order to control the traffic signals, in the embodiment of the present invention the intelligent device stores an actuator discriminator model trained in advance. The model is trained to determine the target probability value of each set according to the target characteristic values collected at the target intersection and the adjacent downstream intersection. Each set contains parameter values of the action control parameters of the traffic signal at the target intersection, and the action control parameters comprise the green signal ratio of each phase of the traffic signal, that is, the ratio of the green light duration to the duration of the preset control period.
For example, the action control parameters are denoted by
Figure BDA0003568471590000091
where
Figure BDA0003568471590000092
in which
Figure BDA0003568471590000093
denotes the ratio of the green light duration of the i-th phase of the target intersection m to the duration of the preset control period.
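Since each action control parameter is a ratio of green light duration to the preset control period, converting a set back into green light durations is a single multiplication per phase. The sketch below shows that conversion; the function name, the validity checks and the sample four-phase set are illustrative assumptions.

```python
def green_durations(green_splits, period_s):
    """Convert per-phase green signal ratios into green light durations in seconds."""
    if any(g < 0.0 for g in green_splits):
        raise ValueError("green signal ratios must be non-negative")
    if sum(green_splits) > 1.0 + 1e-9:
        raise ValueError("green signal ratios must not exceed the whole period")
    return [g * period_s for g in green_splits]


# Hypothetical four-phase set; the remaining 10% stands for yellow/all-red time.
durations = green_durations([0.35, 0.15, 0.30, 0.10], period_s=120.0)
```

For a 120 s period this gives green times of about 42 s, 18 s, 36 s and 12 s for the four phases.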
After the target characteristic values collected at the target intersection and the adjacent downstream intersection are obtained, they are input into the actuator discriminator model trained in advance, and the target probability value of each output set is obtained after processing by the model.
S103: Determine the target set of the target intersection according to the target probability value of each set, and control the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.
After the target probability value of each set is determined, the target set of the target intersection is determined according to these values. Specifically, the set corresponding to the median of the target probability values may be determined as the target set, the set corresponding to the maximum of the target probability values may be determined as the target set, or another determination method may be used to select the target set from the sets; the embodiment of the present invention does not limit this.
After the target set is determined, the traffic signal lamp of each phase of the target intersection is controlled in the preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set. In the embodiment of the invention, the traffic signal lamp of the target intersection is controlled once per preset control period, and each control determines the target set for the next control period according to the target characteristic values collected in the previous preset control period.
In the embodiment of the invention, target characteristic values collected at the target intersection and the adjacent downstream intersection are acquired; based on the actuator discriminator model trained in advance, the target probability values of the action control parameters output for the input target characteristic values are obtained; the target preset action control parameters of the target intersection are determined according to the target probability values; and the traffic signal of the target intersection is controlled accordingly. Because the intelligent device corresponding to each intersection takes the target characteristic values of the adjacent downstream intersection into account when determining the target preset action control parameters of its own intersection, competition among the intelligent devices of the intersections is reduced and the overall control effect of the traffic trunk line is improved.
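One control decision per preset control period, as described in steps S101 to S103, might be organized as in the sketch below. All callables passed in (`acquire_features`, `model`, `apply_durations`) and the candidate sets are hypothetical placeholders; here the set with the largest probability is chosen, which is only one of the selection options the text allows.

```python
def control_step(acquire_features, model, candidate_sets, apply_durations, period_s):
    """Run one control decision for the coming preset control period."""
    state = acquire_features()                 # S101: features of the previous period
    probabilities = model(state)               # S102: one probability per candidate set
    best = max(range(len(candidate_sets)), key=lambda k: probabilities[k])
    target_set = candidate_sets[best]          # S103: here, the most probable set
    apply_durations([g * period_s for g in target_set])
    return target_set


# Hypothetical stand-ins so the sketch runs without hardware or a trained model.
applied = []
chosen = control_step(
    acquire_features=lambda: [0.25, 7.0, 0.2, 7.0],
    model=lambda state: [0.2, 0.5, 0.3],
    candidate_sets=[(0.3, 0.6), (0.45, 0.45), (0.6, 0.3)],
    apply_durations=applied.append,
    period_s=100.0,
)
```

In a deployment, this function would be invoked once at the start of every preset control period, with the feature source reporting on the period that has just ended.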
Example 2:
in order to determine the target sets of the target intersection, on the basis of the above embodiment, in an embodiment of the present invention, the determining the target set of the target intersection according to the target probability value of each set includes:
sampling to determine a target set of the target intersection based on the target probability value of each set, wherein the probability that the target probability value of a set is sampled is higher; or the like, or, alternatively,
and determining the set with the maximum target probability value as the target set of the target intersection.
In order to determine the target set of the target intersection, in the embodiment of the invention, the target set is determined by sampling based on the target probability value of each set. The sets are not equally likely to be sampled: the larger the target probability value of a set, the higher its probability of being sampled, and the smaller the target probability value, the lower its probability of being sampled.
As a possible implementation manner, in the embodiment of the present invention, according to the target probability value of each set, the set with the largest target probability value may also be determined as the target set of the target intersection.
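The two selection options of this example can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `select_target_set`, the candidate green-light durations and the probability values are all hypothetical.

```python
import random

def select_target_set(probabilities, greedy=False, rng=random):
    """Return the index of the chosen set of action control parameter values."""
    if greedy:
        # Second option: take the set with the maximum target probability value.
        return max(range(len(probabilities)), key=probabilities.__getitem__)
    # First option: sample, so a larger probability value is drawn more often.
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Hypothetical candidate sets (green seconds per phase) and model output.
candidate_sets = [(20, 30), (25, 25), (30, 20)]
target_probs = [0.2, 0.5, 0.3]

greedy_choice = candidate_sets[select_target_set(target_probs, greedy=True)]
sampled_choice = candidate_sets[select_target_set(target_probs, rng=random.Random(0))]
```

Sampling keeps some exploration of lower-probability sets, while the maximum-probability option always applies the set the model currently prefers.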
Example 3:
for training the actuator discriminator model, on the basis of the above embodiments, in an embodiment of the present invention, the training process of the actuator discriminator model includes:
acquiring a first target characteristic value acquired by a simulator simulating the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to control the parameter value of a traffic signal lamp of the target intersection to update, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection collected by the simulator in a preset control period after the current moment;
determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminator model according to the reward value;
training the original actuator discriminator model with the updated parameter values according to the updated target first characteristic values, calculating a first probability value of each sample set output by the original actuator discriminator model during each training based on the updated parameter values and the updated target first characteristic values during each training, calculating an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter values of each parameter in each sample set and the corresponding first probability values, and obtaining the trained actuator discriminator model until the expected value is maximum.
In order to train the actuator discriminator model, in the embodiment of the present invention, a simulator for simulating a target intersection and an adjacent downstream intersection is stored in advance, and the simulator is specifically configured to simulate a change in traffic state at the intersection.
The method comprises the steps of obtaining a first target characteristic value collected by a simulator simulating a target intersection and an adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period, and the first target characteristic value comprises a first sample characteristic value corresponding to the target intersection and a first sample characteristic value corresponding to the adjacent downstream intersection.
Specifically, when the actuator discriminator model is trained for the first time, the target first characteristic value is an initial characteristic value pre-stored in the simulator; in each subsequent round of training, it is the characteristic value collected after the simulator has simulated the target intersection and the adjacent downstream intersection.
After the target first characteristic value acquired by the simulator is acquired, inputting the target first characteristic value into the original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, wherein each sample set is preset, and sampling according to the first probability value of each sample set to determine the target sample set, wherein the probability of being sampled is higher when the first probability value of the sample set is higher.
The parameter value of each parameter in the target sample set is input into the simulator, and the parameter values of the traffic signal lamp of the simulated target intersection are updated accordingly; that is, they are updated to the parameter values of the corresponding parameters of each phase for the preset control period.
After the simulator simulates the target intersection and the adjacent downstream intersection in a preset control period after the current moment, a second sample characteristic value of a first preset state characteristic and a third sample characteristic value of a second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in the preset control period, which are acquired by the simulator, are acquired. Wherein the second predetermined state characteristic comprises an average vehicle delay time and a vehicle arrival flow.
In order to carry out the next round of training, the second sample characteristic value is determined as the updated target first characteristic value, the reward value of the original actuator discriminator model is determined according to the third sample characteristic value, and the parameter values in the original actuator discriminator model are updated according to the reward value using a temporal-difference learning algorithm. The reward value combines the reward derived from the second preset state characteristic in the environmental state of the target intersection and the reward derived from the second preset state characteristic in the environmental state of the adjacent downstream intersection.
And training the original actuator discriminator model with the updated parameter value according to the updated target first characteristic value, namely, repeatedly executing the steps according to the updated target first characteristic value. And calculating the expected value of the probability value determining function corresponding to the original actuator discriminator model during each training based on the updated parameter value of each training and the updated target first characteristic value, and obtaining the trained actuator discriminator model until the expected value is maximum.
Specifically, according to the parameter value of each parameter in each sample set output during each training and the corresponding first sample probability value, the sum of the product values of the parameter value in each sample set and the corresponding first sample probability value is determined, and the expected value including the sum corresponding to each parameter is determined.
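The training procedure of this example can be read as a loop over control periods. The sketch below shows only that data flow under strong assumptions: `ToySimulator` is a hypothetical stand-in that returns random numbers instead of simulating traffic, the policy is a fixed uniform distribution, and `update_fn` is a no-op placeholder for the parameter update.

```python
import random

class ToySimulator:
    """Hypothetical stand-in for the traffic simulator; not a real traffic model."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def initial_features(self):
        # Pre-stored initial characteristic value used in the first round.
        return [0.0, 0.0]

    def step(self, green_durations):
        # Returns (second sample characteristic value, third sample characteristic value).
        next_features = [self.rng.random(), self.rng.random()]
        third = {"delays": [self.rng.uniform(5.0, 30.0)],
                 "flows": [self.rng.randint(1, 20)]}
        return next_features, third

def run_rounds(simulator, policy, reward_fn, update_fn, candidate_sets, rounds, rng):
    features = simulator.initial_features()
    history = []
    for _ in range(rounds):
        probs = policy(features)                       # first probability values
        idx = rng.choices(range(len(candidate_sets)), weights=probs, k=1)[0]
        next_features, third = simulator.step(candidate_sets[idx])
        reward = reward_fn(third)                      # from the third sample values
        update_fn(features, idx, reward, next_features)
        history.append((idx, reward))
        features = next_features                       # updated target first value
    return history

candidate_sets = [(20, 30), (30, 20)]
history = run_rounds(
    ToySimulator(seed=0),
    policy=lambda features: [0.5, 0.5],
    reward_fn=lambda third: -sum(d * f for d, f in zip(third["delays"], third["flows"]))
    / sum(third["flows"]),
    update_fn=lambda s, a, r, s_next: None,
    candidate_sets=candidate_sets,
    rounds=5,
    rng=random.Random(1),
)
```

Each entry of `history` records which sample set was applied in a round and the reward that followed it.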
Example 4:
In order to train the actuator discriminator model, in the embodiment of the invention, each round of training corresponds to one preset control period. In each round, a first target characteristic value collected by the simulator simulating the target intersection and the adjacent downstream intersection is acquired and input into the policy network serving as the actuator in the original actuator discriminator model. The first probability value of each sample set output by the policy network is acquired; these probability values depend on the state value function corresponding to the policy network, which is used for determining the sample set probability values, and on the first evaluation value provided by the value network serving as the discriminator in the previous round of training. A target sample set is then determined according to the first probability value of each sample set.
The target first characteristic value and the parameter value of each parameter in the target sample set are input into the value network serving as the discriminator in the original actuator discriminator model, and a first evaluation value of the value network for the target sample set is acquired.
The parameter value of each parameter in the target sample set is input into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in a preset control period after the current moment, are acquired. The reward value for the value network in the original actuator discriminator model is determined according to the third sample characteristic value.
The second sample characteristic value is input into the policy network serving as the actuator in the original actuator discriminator model, the predicted first probability value of each sample set output by the policy network is acquired, and a predicted target sample set is determined according to the predicted first probability value of each sample set. The second sample characteristic value and the parameter value of each parameter in the predicted target sample set are input into the value network serving as the discriminator in the original actuator discriminator model, and a second evaluation value of the value network for the predicted target sample set is acquired.
A temporal-difference error value is determined by a temporal-difference algorithm according to the first evaluation value, the reward value and the second evaluation value. The derivative value of the action value function corresponding to the value network with respect to the first parameter is computed, the product value of this derivative value and the temporal-difference error value is determined, and the first parameter value is updated according to the product value. Specifically, the product of this product value and a preset first learning rate is subtracted from the first parameter value to obtain the updated first parameter value.
The second parameter value of the policy network is updated by stochastic gradient ascent according to the first evaluation value and the derivative value, with respect to the second parameter of the policy network, obtained from the state value function corresponding to the policy network. Specifically, the product value of the first evaluation value and the derivative value is determined, and the product of this product value and a preset second learning rate is added to the second parameter value to obtain the updated second parameter value.
Determining the second sample characteristic value as an updated target first characteristic value, training an original actuator discriminator model with the first parameter value and the second parameter value updated, calculating a first probability value of each sample set output by the original actuator discriminator model during each training based on the updated parameter value and the updated target first characteristic value during each training, calculating an expected value of a state value function of a determined sample set probability value corresponding to an actuator in the original actuator discriminator model according to the parameter value of each parameter in each sample set and the corresponding first probability value, and obtaining the trained actuator discriminator model until the expected value is maximum.
In the embodiment of the invention, the actuator discriminator model adopts an actor-critic algorithm to perform cooperative optimization control of the traffic signals of the traffic trunk. The actor-critic algorithm is a form of policy learning in which a neural network approximates the policy function $\pi(a \mid s)$. In the embodiment of the invention, the neural network approximating the policy function is the policy network, expressed as $\pi(a \mid s; \theta)$, which serves as the actuator in the actuator discriminator model.
The input of the policy network is the characteristic value $s$ of the first preset state characteristic of each phase, collected at the target intersection and the adjacent downstream intersection in the preset control period before the current time. Its output is a probability distribution over the sets $a$ of parameter values of the action control parameters, with $\sum_{a \in A} \pi(a \mid s; \theta) = 1$.
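One common way to make a network's outputs sum to 1 is a softmax over raw scores. The source does not state which output layer the policy network uses, so the sketch below is an assumption for illustration, with hypothetical scores for three candidate sets.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the candidate sets."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate sets of green-light durations.
probs = softmax([1.0, 2.0, 0.5])
```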
The state value function corresponding to the policy function can be approximately expressed as $V_\pi(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta)\, Q_\pi(s_t, a)$. In the embodiment of the invention, the parameter $\theta$ is updated by stochastic gradient ascent, wherein
$$\frac{\partial V_\pi(s_t; \theta)}{\partial \theta}$$
is referred to as the policy gradient.
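The policy gradient may be represented by the following equation:
$$\frac{\partial V_\pi(s_t; \theta)}{\partial \theta} = \mathbb{E}_{A \sim \pi(\cdot \mid s_t; \theta)}\left[ \frac{\partial \ln \pi(A \mid s_t; \theta)}{\partial \theta}\, Q_\pi(s_t, A) \right]$$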
Since the policy gradient cannot be calculated directly, it can be approximated by the Monte Carlo method, namely
$$g(a_t, \theta) = \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\, Q_\pi(s_t, a_t), \qquad a_t \sim \pi(\cdot \mid s_t; \theta)$$
Since the action value function $Q_\pi(s_t, a_t)$ in the policy gradient is unknown, the value network $q(s, a; \omega)$ can be used to approximate it. The value network $q(s, a; \omega)$ is a neural network.
The policy network $\pi(a \mid s; \theta)$ is called the actor and the value network $q(s, a; \omega)$ is called the critic. When the policy network is learned, the supervision signal is the evaluation value $q$ provided by the value network; when the value network is learned, the supervision signal is the reward value $r$.
The parameters $\theta$ and $\omega$ are updated simultaneously when the actuator discriminator model is trained. The parameter $\theta$ of the policy network is updated with the policy gradient in order to increase the state value function $V_\pi(s_t; \theta)$, and the parameter $\omega$ of the value network is updated with the temporal-difference algorithm in order to make the output evaluation value $q$ more accurate.
Fig. 3 is a schematic diagram of an actuator discriminator model training according to an embodiment of the present invention, as shown in fig. 3, the actuator discriminator model includes a policy network serving as an actuator and a value network serving as a discriminator, a simulator model is used to simulate traffic states in environments of a target intersection and an adjacent downstream intersection, the value network is used to provide an evaluation Q value to the policy network, the simulator is used to provide a reward value and output a feature value of a first preset state feature of the environment, and the policy network is used to output a probability value of a set of motion control parameters.
The training of the actuator discriminator model of the invention is described below through a specific embodiment. The policy network $\pi(a \mid s; \theta)$ and the value network $q(s, a; \omega)$ in the actuator discriminator model are initialized randomly.
The intelligent device acquires the target characteristic value $s_t$ of the target intersection and the adjacent downstream intersection collected in the simulator; for the first round of training, $s_t$ is the pre-stored initial characteristic value $s_0$. The target characteristic value $s_t$ is input into the policy network $\pi(a \mid s; \theta)$, and the target sample set $a_t$ is determined by sampling based on the first probability value of each sample set output by the policy network. The target characteristic value $s_t$ and the parameter value of each parameter in the target sample set $a_t$ are then input into the value network $q(s, a; \omega)$ to obtain the first evaluation value $q_t = q(s_t, a_t; \omega_t)$ of the value network for the target sample set.
The parameter value of each parameter in the target sample set $a_t$ is input into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and the second sample characteristic value $s_{t+1}$ of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in a preset control period after the current moment, are acquired. The reward value $r_t$ for the value network in the original actuator discriminator model is determined according to the third sample characteristic value.
The second sample characteristic value $s_{t+1}$ is input into the policy network $\pi(a \mid s; \theta)$ serving as the actuator in the original actuator discriminator model, and the predicted target sample set $\tilde{a}_{t+1}$ is determined by sampling based on the predicted first probability value of each sample set output by the policy network. The characteristic value $s_{t+1}$ and the parameter value of each parameter in the predicted target sample set $\tilde{a}_{t+1}$ are input into the value network $q(s, a; \omega)$ to obtain the second evaluation value $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \omega_t)$ of the value network for the predicted target sample set.
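According to the first evaluation value $q_t$, the reward value $r_t$ and the second evaluation value $q_{t+1}$, the temporal-difference error value $\delta_t$ is calculated, where $\delta_t = q_t - (r_t + \gamma q_{t+1})$ and $\gamma$ is the discount factor. The action value function corresponding to the value network is differentiated to obtain the derivative value $d_{\omega,t}$, where
$$d_{\omega,t} = \left.\frac{\partial q(s_t, a_t; \omega)}{\partial \omega}\right|_{\omega = \omega_t}$$
The first parameter value $\omega_t$ of the value network is updated to obtain the updated first parameter value $\omega_{t+1} = \omega_t - \alpha\, \delta_t\, d_{\omega,t}$, where $\alpha$ is the preset first learning rate.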
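The value-network update can be illustrated with a small numeric sketch. It assumes, purely for illustration, a linear critic $q(s, a; \omega) = \omega \cdot \phi(s, a)$, so that the derivative with respect to $\omega$ is simply the feature vector; `phi`, the reward, the discount factor and the learning rate are hypothetical values.

```python
def td_error(q_t, r_t, q_next, gamma):
    """delta_t = q_t - (r_t + gamma * q_{t+1})."""
    return q_t - (r_t + gamma * q_next)

def update_critic(omega, grad_q, delta, alpha):
    """omega_{t+1} = omega_t - alpha * delta_t * d_{omega,t}, element-wise."""
    return [w - alpha * delta * g for w, g in zip(omega, grad_q)]

def linear_q(omega, phi):
    """Assumed linear critic: q(s, a; omega) = omega . phi(s, a)."""
    return sum(w * x for w, x in zip(omega, phi))

phi = [1.0, 0.5]          # hypothetical features of (s_t, a_t); equals d_omega here
omega = [0.0, 0.0]
q_t = linear_q(omega, phi)
delta = td_error(q_t, r_t=-10.0, q_next=0.0, gamma=0.9)
omega_next = update_critic(omega, phi, delta, alpha=0.01)
q_after = linear_q(omega_next, phi)
```

With these numbers the temporal-difference target is $r_t + \gamma q_{t+1} = -10$, and the update moves the critic's estimate for $(s_t, a_t)$ from 0 toward that target.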
For the policy network, the derivative value $d_{\theta,t}$ used to approximate the gradient of the corresponding state value function is obtained, where
$$d_{\theta,t} = \left.\frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$$
According to the first evaluation value $q_t$ and the derivative value $d_{\theta,t}$, the second parameter value $\theta_t$ of the policy network is updated by stochastic gradient ascent to obtain the updated second parameter value $\theta_{t+1} = \theta_t + \beta\, q_t\, d_{\theta,t}$, where $\beta$ is the preset second learning rate.
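The policy-network update can be sketched in the same spirit. The example assumes a linear-softmax policy with one weight row per candidate set, which is an illustrative choice and not something the source specifies; the state, evaluation value and learning rate are hypothetical.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(theta, state):
    """pi(. | state; theta) for an assumed linear-softmax policy."""
    return softmax([sum(w * x for w, x in zip(row, state)) for row in theta])

def grad_log_policy(theta, state, action):
    """d ln pi(action | state; theta) / d theta, one row per candidate set."""
    probs = policy_probs(theta, state)
    return [[((1.0 if b == action else 0.0) - probs[b]) * x for x in state]
            for b in range(len(theta))]

def update_actor(theta, grad, q_t, beta):
    """theta_{t+1} = theta_t + beta * q_t * d_{theta,t}, element-wise."""
    return [[w + beta * q_t * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(theta, grad)]

state = [1.0, 0.5]                      # hypothetical characteristic value s_t
theta = [[0.0, 0.0], [0.0, 0.0]]        # two candidate sets, two features
d_theta = grad_log_policy(theta, state, action=0)
theta_next = update_actor(theta, d_theta, q_t=2.0, beta=0.1)
p_before = policy_probs(theta, state)[0]
p_after = policy_probs(theta_next, state)[0]
```

Because the evaluation value is positive here, the update raises the probability of the sampled set; a negative evaluation value would lower it.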
Example 5:
in order to determine the reward value for the original actuator discriminator model for each training, on the basis of the above embodiments, in an embodiment of the present invention, the determining the reward value for the original actuator discriminator model according to the third sample feature value includes:
according to a first time value of the average vehicle delay time corresponding to each phase of the target intersection and a first quantity value of the vehicle arrival flow contained in the third sample characteristic value, determining a first sum of a product value of the first time value corresponding to each phase and the first quantity value and a second sum of the first quantity value corresponding to each phase, and obtaining a first reward value corresponding to the target intersection according to a ratio of the first sum and the second sum;
according to a second time value of the average vehicle delay time corresponding to each phase of the adjacent downstream intersection and a second numerical value of the vehicle arrival flow contained in the third sample characteristic value, determining a third sum of the product value of the second time value corresponding to each phase and the second numerical value and a fourth sum of the second numerical value corresponding to each phase, and according to the ratio of the third sum to the fourth sum, obtaining a second reward value corresponding to the adjacent downstream intersection;
and obtaining the reward value of the original actuator discriminator model according to the fifth sum of the first reward value and the second reward value.
In order to determine the reward value for the original actuator discriminator model in each round of training, in the embodiment of the invention, according to the first time value of the average vehicle delay time corresponding to each phase of the target intersection and the first quantity value of the vehicle arrival flow contained in the third sample characteristic value, the product value of the first time value and the first quantity value is determined for each phase, and the product values of all phases are added to obtain a first sum value; the first quantity values of all phases are added to obtain a second sum value; and the negative of the ratio of the first sum value to the second sum value is determined as the first reward value corresponding to the target intersection.
According to the second time value of the average vehicle delay time corresponding to each phase of the adjacent downstream intersection and the second numerical value of the vehicle arrival flow contained in the third sample characteristic value, the product value of the second time value and the second numerical value is determined for each phase, and the product values of all phases are added to obtain a third sum value; the second numerical values of all phases are added to obtain a fourth sum value; and the negative of the ratio of the third sum value to the fourth sum value is determined as the second reward value corresponding to the adjacent downstream intersection.
The fifth sum value of the first reward value and the second reward value is determined as the reward value for the original actuator discriminator model.
The process of determining the reward value for the original actuator discriminator model is described below through a specific embodiment. For each target intersection of the traffic trunk, when the actuator discriminator model is trained, not only the reward value of the target intersection m but also the reward value of the adjacent downstream intersection n needs to be considered.
In the calculation of the reward value for the actuator discriminator model, a joint reward function is used, namely
$$r_t = -\frac{\sum_i d^{m}_{i,t}\, f^{m}_{i,t}}{\sum_i f^{m}_{i,t}} - \frac{\sum_i d^{n}_{i,t}\, f^{n}_{i,t}}{\sum_i f^{n}_{i,t}}$$
where $d^{m}_{i,t}$ represents the first time value of the average vehicle delay time of the i-th phase of the target intersection m in the preset control period after the current time t; $f^{m}_{i,t}$ represents the first quantity value of the vehicle arrival flow of the i-th phase of the target intersection m in the preset control period after the current time t; $d^{n}_{i,t}$ represents the second time value of the average vehicle delay time of the i-th phase of the adjacent downstream intersection n in the preset control period after the current time t; and $f^{n}_{i,t}$ represents the second numerical value of the vehicle arrival flow of the i-th phase of the adjacent downstream intersection n in the preset control period after the current time t.
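The joint reward described above is the sum of two negative flow-weighted average delays. A minimal sketch follows, with hypothetical function names and sample numbers; the handling of zero total flow is an assumption, since the source does not address that case.

```python
def intersection_reward(delays, flows):
    """Negative flow-weighted average delay over the phases of one intersection."""
    total_flow = sum(flows)
    if total_flow == 0:
        return 0.0  # assumption: no arriving vehicles means no delay penalty
    weighted_delay = sum(d * f for d, f in zip(delays, flows))
    return -weighted_delay / total_flow

def joint_reward(target_delays, target_flows, downstream_delays, downstream_flows):
    """Sum of the target intersection reward and the downstream intersection reward."""
    return (intersection_reward(target_delays, target_flows)
            + intersection_reward(downstream_delays, downstream_flows))

# Hypothetical per-phase average delays (seconds) and arrival flows (vehicles).
r = joint_reward([10.0, 20.0], [30, 10], [5.0, 15.0], [20, 20])
```

Here the target intersection contributes -(10*30 + 20*10)/40 = -12.5 and the downstream intersection contributes -(5*20 + 15*20)/40 = -10.0, so `r` is -22.5.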
Example 6:
fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention, where the traffic signal control device includes:
an obtaining module 401, configured to obtain target feature values collected at a target intersection and an adjacent downstream intersection, where the target feature value is a feature value of a first preset state feature of each phase in a preset control period before a current time, and the first preset state feature includes an arrival flow rate per unit time and a length of a lane occupied by each vehicle when the vehicle is queued for the longest time;
a processing module 402, configured to input the target feature value to a pre-trained actuator discriminator model, and obtain a target probability value of each output set;
the control module 403 is configured to determine a target set of the target intersection according to the target probability value of each set, and control a traffic signal lamp of each phase in a preset time period after the current time of the target intersection according to a parameter value corresponding to a green light duration of each phase in the target set.
Further, the control module is specifically configured to determine the target set of the target intersection by sampling based on the target probability value of each set, where a set with a larger target probability value has a higher probability of being sampled; or to determine the set with the maximum target probability value as the target set of the target intersection.
Further, the apparatus further comprises:
the training module is used for acquiring a first target characteristic value acquired by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period; inputting the target first characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set; inputting the parameter value of each parameter in the target sample set into the simulator to control the parameter value of a traffic signal lamp of the target intersection to update, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection collected by the simulator in a preset control period after the current moment; determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminator model according to the reward value; training the original actuator discriminator model with the updated parameter values according to the updated target first characteristic values, calculating a first probability value of each sample set output by the original actuator discriminator model during each training based on the updated parameter values and the updated target first characteristic values during each training, calculating an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter values of each parameter in each sample set and the corresponding first probability values, and obtaining the trained actuator discriminator model until the expected value is maximum.
Further, the training module is specifically configured to determine, according to a first time value of an average vehicle delay time corresponding to each phase of the target intersection and a first quantity value of a vehicle arrival flow included in the third sample feature value, a first sum of a product value of the first time value corresponding to each phase and the first quantity value, and a second sum of the first quantity value corresponding to each phase, and obtain a first reward value corresponding to the target intersection according to a ratio of the first sum to the second sum; according to a second time value of the average vehicle delay time corresponding to each phase of the adjacent downstream intersection and a second numerical value of the vehicle arrival flow contained in the third sample characteristic value, determining a third sum of the product value of the second time value corresponding to each phase and the second numerical value and a fourth sum of the second numerical value corresponding to each phase, and according to the ratio of the third sum to the fourth sum, obtaining a second reward value corresponding to the adjacent downstream intersection; and obtaining the reward value of the original actuator discriminator model according to the fifth sum of the first reward value and the second reward value.
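The division of the device into an obtaining module, a processing module and a control module can be pictured as three methods on one object. The class below is a hypothetical sketch of that structure only: the class and method names are illustrative, and `model` is any callable standing in for the pre-trained actuator discriminator model.

```python
import random

class TrafficSignalControlDevice:
    """Hypothetical sketch of the obtaining, processing and control modules."""

    def __init__(self, model, candidate_sets, rng=None):
        self.model = model                    # callable: features -> probabilities
        self.candidate_sets = candidate_sets  # green-light durations per phase
        self.rng = rng or random.Random()

    def obtain(self, target_features, downstream_features):
        # Obtaining module: combine features of both intersections.
        return list(target_features) + list(downstream_features)

    def process(self, features):
        # Processing module: target probability value of each set.
        return self.model(features)

    def control(self, probabilities, greedy=True):
        # Control module: pick the target set by maximum probability or sampling.
        indices = range(len(probabilities))
        if greedy:
            idx = max(indices, key=probabilities.__getitem__)
        else:
            idx = self.rng.choices(indices, weights=probabilities, k=1)[0]
        return self.candidate_sets[idx]

# Stand-in model that ignores its input and returns fixed probabilities.
device = TrafficSignalControlDevice(lambda features: [0.1, 0.7, 0.2],
                                    [(20, 30), (25, 25), (30, 20)])
features = device.obtain([12.0, 40.0], [8.0, 25.0])
durations = device.control(device.process(features))
```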
Example 7:
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and on the basis of the foregoing embodiments, the present application further provides an electronic device including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 complete communication with each other through the communication bus 504;
the memory 503 has stored therein a computer program which, when executed by the processor 501, causes the processor 501 to perform the steps of:
acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles queue for the longest time;
inputting the target characteristic values into an actuator discriminator model which is trained in advance, and acquiring the target probability value of each output set;
and determining the target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current time of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.
Further, the processor 501 is specifically configured such that the determining of the target set of the target intersection according to the target probability value of each set includes:
sampling to determine the target set of the target intersection based on the target probability value of each set, wherein a set with a larger target probability value has a higher probability of being sampled; or,
and determining the set with the maximum target probability value as the target set of the target intersection.
Further, the processor 501 is specifically configured such that the training process of the actuator discriminator model includes:
acquiring a first target characteristic value acquired by a simulator simulating the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to control the parameter value of a traffic light of the target intersection to update, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection collected by the simulator in a preset control period after the current moment;
determining the second sample characteristic value as an updated target first characteristic value, determining an award value for the original actuator discriminator model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminator model according to the award value;
training the original actuator discriminator model with the updated parameter values according to the updated target first characteristic values, calculating a first probability value of each sample set output by the original actuator discriminator model during each training based on the updated parameter values and the updated target first characteristic values during each training, calculating an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter values of each parameter in each sample set and the corresponding first probability values, and obtaining the trained actuator discriminator model until the expected value is maximum.
Further, the processor 501 is specifically configured to determine the reward value for the original actuator discriminator model according to the third sample feature value by:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow that correspond to each phase of the target intersection and are contained in the third sample characteristic value, a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of the phases, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow that correspond to each phase of the adjacent downstream intersection and are contained in the third sample characteristic value, a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of the phases, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum;
and obtaining the reward value for the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.
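The reward computation above amounts to a flow-weighted average of the per-phase delay at each of the two intersections, followed by a sum. A minimal Python sketch follows; the function names and the handling of zero total flow are assumptions, not part of the application. The sketch returns the sum exactly as worded above; whether the value is negated or otherwise transformed before being maximized is not stated in this passage.

```python
from typing import Sequence


def weighted_average_delay(delays: Sequence[float],
                           flows: Sequence[float]) -> float:
    """Ratio of sum(delay * flow) to sum(flow) over the phases of one intersection."""
    if len(delays) != len(flows):
        raise ValueError("one delay value and one flow value per phase are required")
    total_flow = sum(flows)
    if total_flow == 0:
        # Assumption: with no arriving vehicles the weighted delay is taken as 0.
        return 0.0
    return sum(d * q for d, q in zip(delays, flows)) / total_flow


def reward_value(target_delays: Sequence[float], target_flows: Sequence[float],
                 downstream_delays: Sequence[float],
                 downstream_flows: Sequence[float]) -> float:
    """Fifth sum: first reward value (target) plus second reward value (downstream)."""
    first_reward = weighted_average_delay(target_delays, target_flows)
    second_reward = weighted_average_delay(downstream_delays, downstream_flows)
    return first_reward + second_reward
```

For example, delays of 10 s and 20 s with arrival flows of 1 and 3 vehicles give (10*1 + 20*3) / (1 + 3) = 17.5 s for that intersection.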
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 502 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
Example 8:
On the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium storing a computer program executable by a processor, wherein the program, when run on the processor, causes the processor to execute the following steps:
acquiring target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow per unit time and the length of lane occupied by vehicles when the vehicle queue is longest;
inputting the target characteristic values into a pre-trained actuator discriminator model, and acquiring the target probability value of each set output by the model;
and determining the target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.
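The three steps above can be sketched as a single control step in Python. Everything named here is hypothetical: `control_step`, the `GreenPlan` type, the stand-in `toy_model`, and the flat feature layout are assumptions made for illustration, and the trained model is represented only as a callable that returns one probability per candidate set.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation of a set: phase name -> green light duration (s).
GreenPlan = Dict[str, float]


def control_step(features: Sequence[float],
                 model: Callable[[Sequence[float]], List[float]],
                 candidate_sets: List[GreenPlan]) -> GreenPlan:
    """Map observed features to the green durations to apply next."""
    # Step 2: the model outputs one target probability value per candidate set.
    probabilities = model(features)
    if len(probabilities) != len(candidate_sets):
        raise ValueError("model must return one probability per candidate set")
    # Step 3: here the set with the maximum probability is taken as the target set.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return candidate_sets[best]


def toy_model(features: Sequence[float]) -> List[float]:
    """Stand-in for the trained model; NOT the model of the application."""
    # Assumption: features[0] is the arrival flow of the first phase.
    return [0.3, 0.7] if features[0] > 10 else [0.7, 0.3]
```

In a deployment, the returned durations would be handed to the signal controller for the next preset time period; that interface is outside the scope of this sketch.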
Further, the determining the target set of the target intersection according to the target probability value of each set comprises:
sampling the target set of the target intersection based on the target probability value of each set, wherein the higher the target probability value of a set, the higher the probability that the set is sampled; or
determining the set with the maximum target probability value as the target set of the target intersection.
Further, the training process of the actuator discriminator model comprises:
acquiring a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to update the parameter values of the traffic signal lamp of the target intersection, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of a second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in a preset control period after the current moment;
determining the second sample characteristic value as the updated target first characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminator model according to the reward value;
training the original actuator discriminator model with the updated parameter values according to the updated target first characteristic value; in each training iteration, calculating the first probability value of each sample set output by the original actuator discriminator model based on the updated parameter values and the updated target first characteristic value, and calculating an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter value of each parameter in each sample set and the corresponding first probability value; and obtaining the trained actuator discriminator model when the expected value reaches its maximum.
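The interaction loop described in the training steps (observe, pick a sample set, apply it in the simulator, read the next features and a reward, update the model) can be sketched in Python as follows. This is not the training procedure of the application: the update rule is a plain policy-gradient step with a moving-average baseline, substituted here because the passage does not spell out the update formula. `ToySimulator`, `train`, `theta`, `lr`, and the two-element feature vector are all hypothetical, and the toy reward is a negated delay (an assumption) so that a larger value is better.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToySimulator:
    """Hypothetical stand-in for the traffic simulator."""

    def __init__(self, best_set: int, rng: random.Random) -> None:
        self.best_set = best_set
        self.rng = rng

    def observe(self) -> List[float]:
        # A constant bias feature plus one noisy flow-like feature.
        return [1.0, self.rng.random()]

    def step(self, chosen_set: int) -> Tuple[List[float], float]:
        # Assumed reward: negated delay; one set gives a shorter delay.
        delay = 10.0 if chosen_set == self.best_set else 30.0
        return self.observe(), -delay


def train(num_sets: int, sim: ToySimulator, episodes: int = 500,
          lr: float = 0.01, rng: Optional[random.Random] = None) -> List[List[float]]:
    """Return linear-softmax policy weights, one row per sample set."""
    rng = rng or random.Random(0)
    features = sim.observe()
    theta = [[0.0] * len(features) for _ in range(num_sets)]
    baseline = 0.0
    for _ in range(episodes):
        logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
        probs = softmax(logits)
        # Sample a set in proportion to its current probability value.
        action = rng.choices(range(num_sets), weights=probs, k=1)[0]
        next_features, reward = sim.step(action)
        # Moving-average baseline reduces the variance of the update.
        baseline += 0.1 * (reward - baseline)
        advantage = reward - baseline
        for k in range(num_sets):
            grad = (1.0 if k == action else 0.0) - probs[k]
            for j, x in enumerate(features):
                theta[k][j] += lr * advantage * grad * x
        # The newly observed features become the input of the next iteration.
        features = next_features
    return theta
```

After training, applying `softmax` to the logits computed from `theta` for a given feature vector yields the per-set probability values used for selection.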
Further, the determining a reward value for the original actuator discriminator model from the third sample feature value comprises:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow that correspond to each phase of the target intersection and are contained in the third sample characteristic value, a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of the phases, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow that correspond to each phase of the adjacent downstream intersection and are contained in the third sample characteristic value, a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of the phases, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum;
and obtaining the reward value for the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A traffic signal control method, characterized in that the method is applied to an agent device corresponding to each intersection of a traffic main line, the method comprising:
acquiring target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow per unit time and the length of lane occupied by vehicles when the vehicle queue is longest;
inputting the target characteristic values into a pre-trained actuator discriminator model, and acquiring the target probability value of each set output by the model;
and determining the target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.
2. The method of claim 1, wherein the determining the target set of the target intersection according to the target probability value of each set comprises:
sampling the target set of the target intersection based on the target probability value of each set, wherein the higher the target probability value of a set, the higher the probability that the set is sampled; or
determining the set with the maximum target probability value as the target set of the target intersection.
3. The method of claim 1, wherein the training process of the actuator discriminator model comprises:
acquiring a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to update the parameter values of the traffic signal lamp of the target intersection, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of a second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in a preset control period after the current moment;
determining the second sample characteristic value as the updated target first characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminator model according to the reward value;
training the original actuator discriminator model with the updated parameter values according to the updated target first characteristic value; in each training iteration, calculating the first probability value of each sample set output by the original actuator discriminator model based on the updated parameter values and the updated target first characteristic value, and calculating an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter value of each parameter in each sample set and the corresponding first probability value; and obtaining the trained actuator discriminator model when the expected value reaches its maximum.
4. The method of claim 3, wherein the determining a reward value for the original actuator discriminator model according to the third sample characteristic value comprises:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow that correspond to each phase of the target intersection and are contained in the third sample characteristic value, a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of the phases, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow that correspond to each phase of the adjacent downstream intersection and are contained in the third sample characteristic value, a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of the phases, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum;
and obtaining the reward value for the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.
5. A traffic signal control apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow per unit time and the length of lane occupied by vehicles when the vehicle queue is longest;
a processing module, configured to input the target characteristic values into a pre-trained actuator discriminator model and acquire the target probability value of each set output by the model;
and a control module, configured to determine the target set of the target intersection according to the target probability value of each set, and control the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.
6. The device according to claim 5, wherein the control module is specifically configured to sample and determine the target set of the target intersection based on the target probability value of each set, wherein the higher the target probability value of a set is, the higher the probability of being sampled is; or determining the set with the maximum target probability value as the target set of the target intersection.
7. The apparatus of claim 5, further comprising:
a training module, configured to: acquire a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period; input the target first characteristic value into an original actuator discriminator model, acquire a first probability value of each sample set output by the original actuator discriminator model, and determine a target sample set according to the first probability value of each sample set; input the parameter value of each parameter in the target sample set into the simulator to update the parameter values of the traffic signal lamp of the target intersection, and acquire a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of a second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in a preset control period after the current moment; determine the second sample characteristic value as the updated target first characteristic value, determine a reward value for the original actuator discriminator model according to the third sample characteristic value, and update the parameter values in the original actuator discriminator model according to the reward value; train the original actuator discriminator model with the updated parameter values according to the updated target first characteristic value; in each training iteration, calculate the first probability value of each sample set output by the original actuator discriminator model based on the updated parameter values and the updated target first characteristic value, and calculate an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter value of each parameter in each sample set and the corresponding first probability value; and obtain the trained actuator discriminator model when the expected value reaches its maximum.
8. The device according to claim 7, wherein the training module is specifically configured to: determine, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow that correspond to each phase of the target intersection and are contained in the third sample characteristic value, a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of the phases, and obtain a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum; determine, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow that correspond to each phase of the adjacent downstream intersection and are contained in the third sample characteristic value, a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of the phases, and obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum; and obtain the reward value for the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.
9. An electronic device, characterized by comprising: a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the traffic signal control method according to any one of claims 1-4.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the traffic signal control method according to any one of claims 1-4.
CN202210314258.3A 2022-03-28 2022-03-28 Traffic signal control method, device, equipment and medium Active CN114639255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210314258.3A CN114639255B (en) 2022-03-28 2022-03-28 Traffic signal control method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114639255A (en) 2022-06-17
CN114639255B (en) 2023-06-09

Family

ID=81952690

Also Published As

Publication number Publication date
CN114639255B (en) 2023-06-09

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant