CN114639255B - Traffic signal control method, device, equipment and medium - Google Patents

Publication number
CN114639255B
Authority
CN
China
Prior art keywords
value
target
intersection
characteristic
phase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210314258.3A
Other languages
Chinese (zh)
Other versions
CN114639255A (en)
Inventor
相强强
程兴硕
王泽�
伍召举
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210314258.3A priority Critical patent/CN114639255B/en
Publication of CN114639255A publication Critical patent/CN114639255A/en
Application granted granted Critical
Publication of CN114639255B publication Critical patent/CN114639255B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/07 Controlling traffic signals
    • G08G1/081 Plural intersections under common control
    • G08G1/082 Controlling the time between beginning of the same phase of a cycle at adjacent intersections
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/01 Detecting movement of traffic to be counted or controlled
    • G08G1/0104 Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125 Traffic data processing
    • G08G1/0129 Traffic data processing for creating historical data or processing based on historical data
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Traffic Control Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a traffic signal control method, device, equipment and medium. The method acquires target characteristic values collected at a target intersection and an adjacent downstream intersection, obtains, based on a pre-trained actuator discriminant model, a target probability value of the action control parameters corresponding to the input target characteristic values, determines the target preset action control parameters of the target intersection according to the target probability value, and controls the traffic signals of the target intersection. Because the agent device of each intersection considers the target characteristic values of the adjacent downstream intersection when determining the target preset action control parameters of its own intersection, competition among the agent devices of the intersections is reduced and the overall control effect of the traffic trunk is improved.

Description

Traffic signal control method, device, equipment and medium
Technical Field
The present invention relates to the field of traffic signal control technologies, and in particular, to a traffic signal control method, device, apparatus, and medium.
Background
With population growth and accelerating urbanization, urban travel demand is increasing rapidly, and the existing traffic infrastructure struggles to meet it, which causes both recurrent and non-recurrent congestion in urban traffic. Traffic signal control is the core of urban traffic management and control: a scientific and reasonable signal control scheme can maximize intersection throughput, improve the operating efficiency of the urban road network and the capacity of intersections, reduce the frequency and severity of traffic conflicts, and thereby alleviate urban traffic congestion.
Adaptive traffic signal control schemes in the prior art mainly control traffic signals based on predictions from a fixed traffic model, selection among preset signal control schemes, or real-time prediction by a traffic simulation model. They are driven by a simulation model in nature: the parameters of the traffic simulation model must be calibrated, and the predefined signal control schemes designed, in advance for the actual traffic scene. Because the actual traffic environment is dynamic, random and uncertain, such schemes adapt poorly to it.
Multi-intersection traffic control based on deep reinforcement learning in the prior art mostly transfers single-intersection signal control directly to a multi-intersection scene; that is, the agent device of each intersection uses the same neural network model trained by deep reinforcement learning. Each agent device then seeks only the optimal control effect for the traffic signals of its own intersection, yet adjacent intersections influence one another. If the vehicle passing rate is used as the evaluation standard of the control effect, an agent device is likely to set the traffic signal lamps of its intersection to green for a preset time period after the current moment without considering the effect on the adjacent downstream intersection. This causes congestion at the adjacent downstream intersection and degrades the overall control effect of the traffic trunk.
Disclosure of Invention
The invention provides a traffic signal control method, a device, equipment and a medium, which are used for solving the problem of poor overall control effect of a traffic trunk in the prior art.
The invention provides a traffic signal control method, which is applied to the agent device corresponding to each intersection of a traffic trunk and comprises the following steps:
acquiring target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest;
inputting the target characteristic values into a pre-trained actuator discriminant model to obtain target probability values of each set;
and determining a target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.
Further, the determining the target set of the target intersection according to the target probability value of each set includes:
sampling, based on the target probability value of each set, to determine the target set of the target intersection, wherein the larger the target probability value of a set, the higher its probability of being sampled; or,
and determining the set with the maximum target probability value as the target set of the target intersection.
Further, the training process of the actuator discriminant model includes:
acquiring a target first characteristic value acquired by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquiring, as collected by the simulator after the update, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment;
determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminant model according to the reward value;
training the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; after each training pass, calculating the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating the expected value of the sample-set probability value determining function corresponding to the original actuator discriminant model according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximal, thereby obtaining the trained actuator discriminant model.
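The training steps above follow the general shape of a policy-gradient loop: observe a state, sample a set from the model's probabilities, apply it in the simulator, and adjust the model with the resulting reward. The following is a minimal sketch of that loop only, not the training procedure of the invention. It swaps in a plain REINFORCE-style update on a linear softmax policy, and every name in it (`LinearSoftmaxPolicy`, `train`, `simulator_step`) is hypothetical. It also assumes that a larger reward is better, so a delay-based reward would have to be negated before being passed in.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearSoftmaxPolicy:
    """Hypothetical stand-in for the original actuator discriminant model."""

    def __init__(self, n_features, n_sets):
        # One weight row per candidate sample set.
        self.weights = [[0.0] * n_features for _ in range(n_sets)]

    def probabilities(self, state):
        """First probability value of each sample set for the given state."""
        logits = [sum(w * x for w, x in zip(row, state)) for row in self.weights]
        return softmax(logits)

    def update(self, state, chosen, reward, lr=0.1):
        """REINFORCE step: move along reward * grad log pi(chosen | state)."""
        probs = self.probabilities(state)
        for k, row in enumerate(self.weights):
            grad_logit = (1.0 if k == chosen else 0.0) - probs[k]
            for j, x in enumerate(state):
                row[j] += lr * reward * grad_logit * x


def train(policy, simulator_step, state, steps, rng):
    """Run the observe / sample / simulate / update loop for `steps` periods.

    `simulator_step(state, chosen)` is an assumed callback returning
    (next_state, reward); it stands in for the traffic simulator.
    """
    for _ in range(steps):
        probs = policy.probabilities(state)
        chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        next_state, reward = simulator_step(state, chosen)
        policy.update(state, chosen, reward)
        state = next_state  # the second sample value becomes the new input
    return policy


def toy_simulator_step(state, chosen):
    """Toy environment: the state never changes and set 1 is always best."""
    return state, (1.0 if chosen == 1 else 0.0)


trained = train(LinearSoftmaxPolicy(1, 2), toy_simulator_step, [1.0], 200, random.Random(0))
```

In the toy environment the policy only ever receives a non-zero reward for set 1, so the probability it assigns to that set can only grow from its initial value of 0.5.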
Further, the determining a reward value for the original actuator discriminant model from the third sample feature value comprises:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value of each phase and a second sum value of the first quantity values of each phase, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value of each phase and a fourth sum value of the second quantity values of each phase, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value;
and obtaining the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
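The reward described above is a flow-weighted average delay for each intersection, with the two averages added together. A minimal sketch of that arithmetic follows; the function names are hypothetical, and returning 0.0 when no vehicles arrive is an assumption, since the text does not cover that case.

```python
from typing import Sequence


def weighted_average_delay(delays: Sequence[float], flows: Sequence[float]) -> float:
    """Sum over phases of (delay * flow), divided by the sum of flows."""
    if len(delays) != len(flows):
        raise ValueError("delays and flows need one entry per phase")
    total_flow = sum(flows)
    if total_flow == 0:
        # Assumption: with no arriving vehicles there is no delay to report.
        return 0.0
    return sum(d * q for d, q in zip(delays, flows)) / total_flow


def reward_value(target_delays, target_flows, downstream_delays, downstream_flows) -> float:
    """Fifth sum value: first reward value plus second reward value."""
    first_reward = weighted_average_delay(target_delays, target_flows)
    second_reward = weighted_average_delay(downstream_delays, downstream_flows)
    return first_reward + second_reward


# Target intersection: two phases; downstream intersection: one phase.
example = reward_value([10.0, 20.0], [1.0, 3.0], [30.0], [2.0])
```

For the example call, the target intersection contributes (10·1 + 20·3) / (1 + 3) = 17.5 and the downstream intersection contributes 30, giving 47.5.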
Accordingly, the present invention provides a traffic signal control apparatus, the apparatus comprising:
an acquisition module, configured to acquire target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest;
a processing module, configured to input the target characteristic values into a pre-trained actuator discriminant model and acquire the output target probability value of each set;
and a control module, configured to determine a target set of the target intersection according to the target probability value of each set, and to control the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.
Further, the control module is specifically configured to sample and determine a target set of the target intersection based on the target probability value of each set, where the greater the target probability value of the set, the higher the probability of being sampled; or determining the set with the maximum target probability value as the target set of the target intersection.
Further, the apparatus further comprises:
a training module, configured to: acquire a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period; input the target first characteristic value into an original actuator discriminant model, acquire a first probability value of each sample set output by the original actuator discriminant model, and determine a target sample set according to the first probability value of each sample set; input the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquire, as collected by the simulator after the update, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment; determine the second sample characteristic value as an updated target first characteristic value, determine a reward value for the original actuator discriminant model according to the third sample characteristic value, and update the parameter values in the original actuator discriminant model according to the reward value; train the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; after each training pass, calculate the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculate the expected value of the sample-set probability value determining function corresponding to the original actuator discriminant model according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximal, thereby obtaining the trained actuator discriminant model.
Further, the training module is specifically configured to: determine, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value of each phase and a second sum value of the first quantity values of each phase, and obtain a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value; determine, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value of each phase and a fourth sum value of the second quantity values of each phase, and obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value; and obtain the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
Accordingly, the present invention provides an electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of any one of the above-described traffic signal control methods.
Accordingly, the present invention provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of any one of the above-described traffic signal control methods.
The invention provides a traffic signal control method, device, equipment and medium. The method acquires target characteristic values collected at a target intersection and an adjacent downstream intersection, obtains, based on a pre-trained actuator discriminant model, a target probability value of the action control parameters corresponding to the input target characteristic values, determines the target preset action control parameters of the target intersection according to the target probability value, and controls the traffic signals of the target intersection. Because the agent device of each intersection considers the target characteristic values of the adjacent downstream intersection when determining the target preset action control parameters of its own intersection, competition among the agent devices of the intersections is reduced and the overall control effect of the traffic trunk is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention;
Fig. 2 is a schematic illustration of a traffic trunk provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of training an actuator discriminant model according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to improve the overall control effect of a traffic trunk, the embodiment of the invention provides a traffic signal control method, a device, equipment and a medium.
Example 1:
Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention, where the process includes the following steps:
S101: acquiring target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest.
The traffic signal control method provided by the embodiment of the invention is applied to the agent device corresponding to each intersection of the traffic trunk. The agent device may be an intelligent terminal that controls the traffic signal lamp, such as a PC (personal computer), a tablet computer or a mobile terminal, or a server that controls the traffic signal lamp; the server may be a local server, a cloud server or a traffic signal lamp controller. The embodiments of the present invention are not limited in this regard.
In order to improve the overall control effect of the traffic trunk, the agent device acquires target characteristic values collected at a target intersection and an adjacent downstream intersection. The target intersection is the intersection corresponding to the agent device, namely the intersection where the traffic signal lamp controlled by the agent device is located, and the adjacent downstream intersection is the intersection adjacent to the target intersection and downstream of it in the direction of traffic. Fig. 2 is a schematic diagram of a traffic trunk according to an embodiment of the present invention. Taking vehicles travelling from left to right in Fig. 2 as an example, if the left intersection in Fig. 2 is the target intersection, the middle intersection in Fig. 2 is the adjacent downstream intersection.
Specifically, the agent device may acquire the target characteristic value of the target intersection collected by an image acquisition device connected to the agent device, and the target characteristic value of the downstream intersection collected by an image acquisition device connected to the adjacent downstream intersection. Alternatively, it may acquire the target characteristic value of the target intersection collected by its own image acquisition unit, and the target characteristic value of the downstream intersection sent by the agent device corresponding to the adjacent downstream intersection.
The target characteristic value is the characteristic value of the first preset state characteristic of each phase in the preset control period before the current moment. The preset control period is the largest of the optimal periods determined for each intersection of the traffic trunk according to the Webster optimal period formula. The first preset state characteristic comprises the arrival flow rate per unit time and the lane length occupied per vehicle when the vehicle queue is longest. The arrival flow rate per unit time is the ratio of the vehicle arrival flow through the intersection in the preset control period before the current moment to the duration of that period; the lane length occupied per vehicle when the queue is longest is the ratio of the maximum queuing length in that period to the corresponding number of vehicles.
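The text names the Webster optimal period formula without writing it out. In its commonly cited form, the optimal cycle length is C = (1.5·L + 5) / (1 − Y), where L is the total lost time per cycle in seconds and Y is the sum of the critical flow ratios of the phases. The sketch below applies that form to each intersection and takes the maximum as the preset control period. The function names and the input layout are assumptions made for illustration, and the patent may use a variant of the formula.

```python
def webster_optimal_cycle(lost_time_s, critical_flow_ratios):
    """Webster optimal cycle length in seconds: (1.5 * L + 5) / (1 - Y)."""
    y_total = sum(critical_flow_ratios)
    if not 0 <= y_total < 1:
        # The formula is only defined for an undersaturated intersection.
        raise ValueError("sum of critical flow ratios must be in [0, 1)")
    return (1.5 * lost_time_s + 5.0) / (1.0 - y_total)


def preset_control_period(intersections):
    """Largest optimal cycle among the intersections of the trunk.

    `intersections` is an assumed layout: one (lost_time_s, flow_ratios)
    pair per intersection.
    """
    return max(webster_optimal_cycle(lost, ratios) for lost, ratios in intersections)


trunk = [
    (10.0, [0.25, 0.25]),  # Y = 0.50, cycle = 20 / 0.50 = 40 s
    (10.0, [0.25, 0.50]),  # Y = 0.75, cycle = 20 / 0.25 = 80 s
]
period_s = preset_control_period(trunk)
```

Taking the maximum means every intersection on the trunk runs on a period at least as long as its own optimum, so all agent devices can share one control period.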
For example, denote the target intersection by m and the adjacent downstream intersection by n. The formula images of the original publication (Figures BDA0003568471590000081 to BDA00035684715900000811) are not reproduced in this text, so the symbols below are stand-ins for them. The target characteristic value of the target intersection and the adjacent downstream intersection is
$$s = \{s_m, s_n\}, \qquad s_m = \{(l_m^i, q_m^i)\}_i, \qquad s_n = \{(l_n^i, q_n^i)\}_i,$$
where
$$l_m^i = \frac{L_m^i}{N_m^i}, \qquad l_n^i = \frac{L_n^i}{N_n^i}.$$
In the target characteristic value:
$L_m^i$ represents the maximum queuing length of the ith phase of the target intersection m in the preset control period before the current moment;
$N_m^i$ represents the number of vehicles corresponding to the maximum queuing length of the ith phase of the target intersection m in the preset control period before the current moment;
$l_m^i$ represents the lane length occupied per vehicle when the vehicle queue of the ith phase of the target intersection m is longest in the preset control period before the current moment;
$q_m^i$ represents the arrival flow rate per unit time of the ith phase of the target intersection m in the preset control period before the current moment;
$L_n^i$ represents the maximum queuing length of the ith phase of the adjacent downstream intersection n in the preset control period before the current moment;
$N_n^i$ represents the number of vehicles corresponding to the maximum queuing length of the ith phase of the adjacent downstream intersection n in the preset control period before the current moment;
$l_n^i$ represents the lane length occupied per vehicle when the vehicle queue of the ith phase of the adjacent downstream intersection n is longest in the preset control period before the current moment;
$q_n^i$ represents the arrival flow rate per unit time of the ith phase of the adjacent downstream intersection n in the preset control period before the current moment.
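The per-phase quantities defined above can be assembled into one input vector for the model. The following sketch shows one possible way to do so. The `PhaseObservation` record, the function names, the flat ordering of the vector, and the choice of 0.0 for a phase with no queued vehicles are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PhaseObservation:
    """Raw counts for one phase over one preset control period (hypothetical)."""
    arrivals: int                # vehicles that arrived during the period
    max_queue_length_m: float    # maximum queuing length, in metres
    vehicles_in_max_queue: int   # vehicles in the queue when it was longest


def phase_features(obs: PhaseObservation, period_s: float) -> Tuple[float, float]:
    """Return (lane length occupied per vehicle, arrival flow rate per second)."""
    flow_rate = obs.arrivals / period_s
    if obs.vehicles_in_max_queue == 0:
        per_vehicle_length = 0.0  # assumption: no queue means no occupied length
    else:
        per_vehicle_length = obs.max_queue_length_m / obs.vehicles_in_max_queue
    return per_vehicle_length, flow_rate


def target_characteristic_value(
    target_phases: Sequence[PhaseObservation],
    downstream_phases: Sequence[PhaseObservation],
    period_s: float,
) -> List[float]:
    """Flatten the features of intersection m, then those of intersection n."""
    vector: List[float] = []
    for obs in list(target_phases) + list(downstream_phases):
        vector.extend(phase_features(obs, period_s))
    return vector


state = target_characteristic_value(
    [PhaseObservation(30, 60.0, 10)],
    [PhaseObservation(60, 0.0, 0)],
    period_s=120.0,
)
```

With a 120-second period, the target phase yields 60 m / 10 vehicles = 6 m per vehicle and 30 / 120 = 0.25 vehicles per second, and the downstream phase yields 0 m and 0.5 vehicles per second.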
S102: and inputting the target characteristic value into a pre-trained actuator discriminant model, and obtaining the target probability value of each output set.
In order to control the traffic signals, in the embodiment of the invention the agent device stores a pre-trained actuator discriminant model. The actuator discriminant model is trained in advance to determine a target probability value for each set according to the target characteristic values collected at the target intersection and the adjacent downstream intersection. Each set comprises parameter values of the action control parameters of the traffic signals of the target intersection, and the action control parameters comprise the green signal ratio (green split) of each phase of the traffic signal, namely the ratio of the green light duration to the duration of the preset control period.
For example, the action control parameter of the target intersection m may be denoted $a_m = \{g_m^i\}_i$, where $g_m^i$ represents the ratio of the green light duration of the ith phase of the target intersection m to the duration of the preset control period. The formula images of the original publication (Figures BDA0003568471590000091 to BDA0003568471590000093) are not reproduced in this text, so these symbols are stand-ins for them.
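Because each action control parameter is a ratio of green light duration to the preset control period, the green light duration of a phase is recovered by multiplying the ratio by the period. The sketch below shows that conversion. The function name is hypothetical, and the check that the ratios sum to at most 1 is an assumption (the remainder would cover intervals such as yellow and all-red) that the text does not state.

```python
def green_durations(green_ratios, period_s):
    """Convert per-phase green signal ratios into green durations in seconds."""
    if any(ratio < 0 for ratio in green_ratios):
        raise ValueError("green signal ratios must be non-negative")
    if sum(green_ratios) > 1.0 + 1e-9:
        # Assumption: the phases share a single control period.
        raise ValueError("green signal ratios must sum to at most 1")
    return [ratio * period_s for ratio in green_ratios]


durations = green_durations([0.5, 0.25], 80.0)
```

For an 80-second period, ratios of 0.5 and 0.25 give green durations of 40 and 20 seconds.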
After the target characteristic values collected at the target intersection and the adjacent downstream intersection are acquired, they are input into the pre-trained actuator discriminant model, which processes them and outputs the target probability value of each set.
S103: and determining a target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.
After determining the target probability value of each set, determining the target set of the target intersection according to the target probability value of each set. Specifically, the set corresponding to the median value of the target probability values may be determined as the target set, the set corresponding to the maximum value of the target probability values may be determined as the target set, or the target set may be selected from each set by adopting other determining methods, which is not limited in the embodiment of the present invention.
After the target set is determined, controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set. In the embodiment of the invention, the traffic signal lamp of the target intersection is controlled once every preset control period, and each control needs to determine the target set in the next control period according to the target characteristic value acquired in the previous preset control period.
In the embodiment of the invention, the traffic signal control method acquires the target characteristic values collected at the target intersection and the adjacent downstream intersection, obtains, based on the pre-trained actuator discriminant model, the target probability value of the action control parameters corresponding to the input target characteristic values, determines the target preset action control parameters of the target intersection according to the target probability value, and controls the traffic signals of the target intersection. Because the agent device of each intersection considers the target characteristic values of the adjacent downstream intersection when determining the target preset action control parameters of its own intersection, competition among the agent devices of the intersections is reduced and the overall control effect of the traffic trunk is improved.
Example 2:
In order to determine the target set of the target intersection, on the basis of the foregoing embodiment, in an embodiment of the present invention, the determining, according to the target probability value of each set, the target set of the target intersection includes:
sampling, based on the target probability value of each set, to determine the target set of the target intersection, wherein the larger the target probability value of a set, the higher its probability of being sampled; or,
and determining the set with the maximum target probability value as the target set of the target intersection.
In order to determine the target set of the target intersection, in the embodiment of the invention, sampling is performed based on the target probability value of each set. The sets are not sampled with equal probability: the larger the target probability value of a set, the higher its probability of being sampled, and the smaller the target probability value, the lower that probability.
As another possible implementation, in the embodiment of the present invention, the set with the maximum target probability value may instead be determined as the target set of the target intersection.
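The two selection strategies of this example can be sketched in a few lines of Python. This is only an illustration: the function name `select_target_set`, the `greedy` flag and the use of list indices to identify sets are assumptions and do not appear in the patent.

```python
import random


def select_target_set(probabilities, greedy=False, rng=None):
    """Return the index of the chosen set (hypothetical helper).

    probabilities: target probability value of each candidate set.
    greedy=False:  sample an index, larger probability => sampled more often.
    greedy=True:   pick the set with the maximum target probability value.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    indices = list(range(len(probabilities)))
    if greedy:
        return max(indices, key=lambda i: probabilities[i])
    rng = rng or random
    # random.choices draws with replacement, proportionally to the weights.
    return rng.choices(indices, weights=probabilities, k=1)[0]
```

Sampling keeps some exploration among lower-probability sets, whereas the greedy variant always applies the set the model currently rates highest.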
Example 3:
In order to train the actuator discriminant model, on the basis of the above embodiments, in an embodiment of the present invention, the training process of the actuator discriminant model includes:
acquiring a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and acquiring, as collected by the simulator, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment;
determining the second sample characteristic value as the updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminant model according to the reward value;
training the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; in each training, calculating the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating, according to the parameter value of each parameter in each sample set and the corresponding first probability value, the expected value of the function of the original actuator discriminant model that determines the sample set probability values, until the expected value is maximum, thereby obtaining the trained actuator discriminant model.
In order to train the actuator discriminant model, in the embodiment of the present invention, simulators simulating the target intersection and the adjacent downstream intersection are stored in advance, and the simulators are specifically used for simulating the change of the traffic state of the intersection.
The target first characteristic value collected by the simulator simulating the target intersection and the adjacent downstream intersection is acquired. The target first characteristic value is the first sample characteristic value of the first preset state characteristic of each phase in a preset control period, and comprises the first sample characteristic value corresponding to the target intersection and the first sample characteristic value corresponding to the adjacent downstream intersection.
Specifically, when the actuator discriminant model is trained for the first time, the target first characteristic value is an initial characteristic value pre-stored in the simulator; in each subsequent training, it is the characteristic value obtained after the simulator has simulated the target intersection and the adjacent downstream intersection.
After the target first characteristic value collected by the simulator is acquired, it is input into the original actuator discriminant model, and the first probability value of each sample set output by the original actuator discriminant model is acquired. Each sample set is preset. Sampling is performed according to the first probability value of each sample set to determine the target sample set, wherein the greater the first probability value of a sample set, the greater its probability of being sampled.
The parameter value of each parameter in the target sample set is input into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection simulated in the simulator; the parameter values are updated to the parameter values of the corresponding parameters of each phase in the preset control period.
After the simulator has simulated the target intersection and the adjacent downstream intersection for a preset control period after the current moment, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in that preset control period, as collected by the simulator, are acquired. The second preset state characteristic includes the average vehicle delay time and the vehicle arrival flow.
In order to perform the next training, the second sample characteristic value is determined as the updated target first characteristic value, a reward value for the original actuator discriminant model is determined according to the third sample characteristic value, and the parameter values in the original actuator discriminant model are updated according to the reward value by a time difference learning algorithm. The reward value comprises the reward value derived from the second preset state characteristic in the environmental state of the target intersection and the reward value derived from the second preset state characteristic in the environmental state of the adjacent downstream intersection.
The original actuator discriminant model with the updated parameter values is then trained according to the updated target first characteristic value, that is, the above steps are repeated with the updated target first characteristic value. In each training, the expected value of the probability value determining function corresponding to the original actuator discriminant model is calculated based on the updated parameter values and the updated target first characteristic value, until the expected value is maximum, and the trained actuator discriminant model is obtained.
Specifically, according to the parameter value of each parameter in each sample set output in each training and the corresponding first probability value, the sum of the products of the parameter value in each sample set and the corresponding first probability value is determined for each parameter, and the expected value comprising the sum corresponding to each parameter is determined.
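The per-parameter expected value described in the preceding paragraph can be read as a probability-weighted sum. The following Python sketch illustrates that reading only; the function name and the representation of a sample set as a list of parameter values are assumptions.

```python
def expected_parameter_values(sample_sets, probabilities):
    """For each parameter position k, sum p(set) * set[k] over all sample sets.

    sample_sets:   list of sample sets, each a list of parameter values
                   (for example, one green light duration per phase).
    probabilities: first probability value of each sample set.
    """
    if len(sample_sets) != len(probabilities):
        raise ValueError("one probability is required per sample set")
    n_params = len(sample_sets[0])
    return [
        sum(p * s[k] for s, p in zip(sample_sets, probabilities))
        for k in range(n_params)
    ]
```

For instance, with the two sets `[20, 30]` and `[40, 10]` and probabilities `0.25` and `0.75`, the expected values are 35 and 15.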
Example 4:
In order to train the actuator discriminant model, in the embodiment of the invention, each round of training corresponds to one preset control period. In each round, the target first characteristic value collected by the simulator simulating the target intersection and the adjacent downstream intersection is acquired and input into the policy network that serves as the actuator in the original actuator discriminant model. The first probability value of each sample set output by the policy network is acquired, the policy network determining the sample set probability values according to its corresponding state value function and the first evaluation value provided by the value network (the discriminator) in the previous round of training, and the target sample set is determined according to the first probability value of each sample set.
The target first characteristic value and the parameter value of each parameter in the target sample set are input into the value network that serves as the discriminator in the original actuator discriminant model, and the first evaluation value of the value network on the target sample set is acquired.
The parameter value of each parameter in the target sample set is input into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment, as collected by the simulator, are acquired. A reward value for the value network in the original actuator discriminant model is determined according to the third sample characteristic value.
The second sample characteristic value is input into the policy network serving as the actuator in the original actuator discriminant model, the predicted first probability value of each sample set output by the policy network is acquired, and a predicted target sample set is determined according to the predicted first probability value of each sample set. The second sample characteristic value and the parameter value of each parameter in the predicted target sample set are input into the value network serving as the discriminator in the original actuator discriminant model, and the second evaluation value of the value network on the predicted target sample set is acquired.
A time difference error value is determined by a time difference algorithm according to the first evaluation value, the reward value and the second evaluation value. The derivative value of the action value function corresponding to the value network with respect to the first parameter value of the value network is determined, the product value of this derivative value and the time difference error value is determined, and the first parameter value is updated according to the product value. Specifically, the product of this product value and a preset first learning rate is subtracted from the first parameter value to obtain the updated first parameter value.
The second parameter value of the policy network is updated by stochastic gradient ascent according to the first evaluation value and the derivative value of the state value function corresponding to the policy network with respect to the second parameter of the policy network. Specifically, the product value of the first evaluation value and the derivative value is determined, and the product of this product value and a preset second learning rate is added to the second parameter value to obtain the updated second parameter value.
The second sample characteristic value is determined as the updated target first characteristic value, and the original actuator discriminant model with the updated first parameter value and second parameter value is trained. In each training, the first probability value of each sample set output by the original actuator discriminant model is calculated based on the updated parameter values and the updated target first characteristic value, and the expected value of the state value function that determines the sample set probability values, corresponding to the actuator in the original actuator discriminant model, is calculated according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximum, thereby obtaining the trained actuator discriminant model.
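The order of operations in one round of training can be summarized with the following Python sketch. All interfaces are hypothetical: `simulator.reset`, `simulator.step`, `policy`, `critic`, `select`, `reward_fn` and `update` are placeholder callables introduced for illustration, and a fixed number of rounds stands in for the stopping criterion (the patent trains until the expected value is maximum).

```python
def train(simulator, policy, critic, select, reward_fn, update, rounds):
    """Run `rounds` preset control periods of the training loop (sketch only)."""
    state = simulator.reset()                     # initial target first characteristic value
    for _ in range(rounds):
        action = select(policy(state))            # target sample set (an index here)
        q_t = critic(state, action)               # first evaluation value
        next_state, third_features = simulator.step(action)
        r_t = reward_fn(third_features)           # reward from the third sample characteristic value
        next_action = select(policy(next_state))  # predicted target sample set
        q_next = critic(next_state, next_action)  # second evaluation value
        update(state, action, q_t, r_t, q_next)   # update value network and policy network
        state = next_state                        # second sample value becomes the new input
    return state
```

Passing the networks and the simulator as callables keeps the loop independent of any particular simulator or neural network library.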
In the embodiment of the invention, the actuator discriminant model adopts the Actor-Critic algorithm to perform cooperative optimization control on the traffic signals of a traffic trunk. The Actor-Critic algorithm is a policy learning method that approximates the policy function π(a|s) with a neural network. In the embodiment of the invention, the neural network approximating the policy function is the policy network, expressed as π(a|s; θ), which serves as the actuator in the actuator discriminant model.
The input of the policy network is the characteristic value s of the first preset state characteristic of each phase in the preset control period before the current moment, collected at the target intersection and the adjacent downstream intersection. The output is the probability distribution over the sets a of parameter values of the action control parameters, which satisfies Σ_{a∈A} π(a|s; θ) = 1.
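A minimal way to obtain an output that sums to 1 over the candidate sets is a softmax over per-set scores. The patent does not state the network architecture, so the linear scoring layer below, the function name `softmax_policy` and the layout of `theta` (one weight row per candidate set) are assumptions made purely for illustration.

```python
import math


def softmax_policy(state, theta):
    """Return pi(a | s; theta) for every candidate set a (illustrative only).

    state: list of characteristic values s.
    theta: one weight row per candidate set, each row as long as `state`.
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]          # non-negative and sums to 1
```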
The state value function corresponding to the policy function can be approximately expressed as V_π(s_t; θ) = Σ_a π(a|s_t; θ) Q_π(s_t, a). In the embodiment of the invention, a stochastic gradient ascent method is adopted to update the parameter θ, wherein
Figure BDA0003568471590000141
is known as the policy gradient.
The policy gradient may be represented by the following formula:
Figure BDA0003568471590000151
Since the policy gradient cannot be calculated directly, it can be calculated approximately using the Monte Carlo approximation method, namely
Figure BDA0003568471590000152
Since the term
Figure BDA0003568471590000153
in the policy gradient is not known, the value network q(s, a; ω), which is a neural network, can be used to approximate the action value function.
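The formula images referenced above are not reproduced in this text. For orientation only, the standard Actor-Critic forms that match the surrounding description are given below; they are an assumption based on the usual derivation and may differ in notation from the original figures.

```latex
% Policy gradient as an expectation over actions drawn from the policy
\frac{\partial V_\pi(s;\theta)}{\partial \theta}
  = \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}
    \left[ \frac{\partial \ln \pi(A \mid s;\theta)}{\partial \theta}\, Q_\pi(s, A) \right]

% Monte Carlo approximation with one sampled action a, and Q_pi replaced by the value network
\frac{\partial V_\pi(s;\theta)}{\partial \theta}
  \approx \frac{\partial \ln \pi(a \mid s;\theta)}{\partial \theta}\, q(s, a;\omega),
  \qquad a \sim \pi(\cdot \mid s;\theta)
```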
The policy network π(a|s; θ) is called the actor, and the value network q(s, a; ω) is called the critic. When the policy network π(a|s; θ) is learned, the supervisory signal comes from the evaluation value Q provided by the value network q(s, a; ω); when the value network q(s, a; ω) is learned, the supervisory signal comes from the reward value R.
When the actuator discriminant model is trained, the parameters θ and ω are updated simultaneously. The parameter θ of the policy network is updated using the policy gradient in order to increase the state value function V_π(s_t; θ) and thereby improve the accuracy of the output probability values; the parameter ω of the value network is updated using a time difference algorithm in order to make the output evaluation value Q more accurate.
Fig. 3 is a schematic diagram of training the actuator discriminant model according to an embodiment of the present invention. As shown in Fig. 3, the actuator discriminant model includes a policy network serving as the actuator and a value network serving as the discriminator, and the traffic state in the environment of the target intersection and the adjacent downstream intersection is simulated by the simulator. The value network provides the evaluation value Q to the policy network, the simulator provides the reward value and outputs the characteristic value of the first preset state characteristic of the environment, and the policy network outputs the probability value of each set of action control parameters.
The training of the actuator discriminant model is described below with a specific example. The policy network π(a|s; θ) and the value network q(s, a; ω) in the actuator discriminant model are randomly initialized.
The agent device acquires the target characteristic value s_t collected in the simulator for the target intersection and the adjacent downstream intersection; in the first training, the target characteristic value s_t is the pre-stored initial characteristic value s_0. The target characteristic value s_t is input into the policy network π(a|s; θ), and sampling is performed based on the first probability value of each sample set output by the policy network π(a|s; θ) to determine the target sample set a_t. The target characteristic value s_t and the parameter value of each parameter in the target sample set a_t are input into the value network q(s, a; ω) to obtain the first evaluation value of the value network on the target sample set, q_t = q(s_t, a_t; ω_t).
The parameter value of each parameter in the target sample set a_t is input into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and the second sample characteristic value s_{t+1} of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment, as collected by the simulator, are acquired. The reward value r_t for the value network in the original actuator discriminant model is determined according to the third sample characteristic value.
The second sample characteristic value s_{t+1} is input into the policy network π(a|s; θ) serving as the actuator in the original actuator discriminant model, and sampling is performed based on the predicted first probability value of each sample set output by the policy network π(a|s; θ) to determine the predicted target sample set
Figure BDA0003568471590000161
The second sample characteristic value s_{t+1} and the parameter value of each parameter in the predicted target sample set
Figure BDA0003568471590000162
are input into the value network q(s, a; ω) to obtain the second evaluation value of the value network on the predicted target sample set
Figure BDA0003568471590000163
According to the first evaluation value q_t, the reward value r_t and the second evaluation value q_{t+1}, the time difference error value δ_t is calculated, wherein δ_t = q_t − (r_t + γ q_{t+1}). The action value function corresponding to the value network is differentiated to obtain the derivative value d_{ω,t}, wherein
Figure BDA0003568471590000164
The first parameter value ω_t of the value network is updated to obtain the updated first parameter value ω_{t+1}, wherein ω_{t+1} = ω_t − α δ_t d_{ω,t} and α is the preset first learning rate.
The state value function corresponding to the policy network is differentiated to obtain the derivative value d_{θ,t}, wherein
Figure BDA0003568471590000165
Figure BDA0003568471590000166
According to the first evaluation value q_t and the derivative value d_{θ,t}, the second parameter value θ_t of the policy network is updated by stochastic gradient ascent to obtain the updated second parameter value θ_{t+1}, wherein θ_{t+1} = θ_t + β q_t d_{θ,t} and β is the preset second learning rate.
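The three update rules stated in this example translate directly into code. The sketch below treats the parameters as flat lists of floats and takes the derivative values d_{ω,t} and d_{θ,t} as inputs, since their formulas are only given as figures; in the usual Actor-Critic formulation they would be the gradient of q(s_t, a_t; ω) with respect to ω and the gradient of ln π(a_t|s_t; θ) with respect to θ, which is an assumption here. The function names are hypothetical, and γ is assumed to be the discount factor.

```python
def td_error(q_t, r_t, q_next, gamma):
    """delta_t = q_t - (r_t + gamma * q_{t+1})."""
    return q_t - (r_t + gamma * q_next)


def update_value_network(omega, d_omega, delta, alpha):
    """omega_{t+1} = omega_t - alpha * delta_t * d_{omega,t}, element-wise."""
    return [w - alpha * delta * g for w, g in zip(omega, d_omega)]


def update_policy_network(theta, d_theta, q_t, beta):
    """theta_{t+1} = theta_t + beta * q_t * d_{theta,t}, element-wise."""
    return [p + beta * q_t * g for p, g in zip(theta, d_theta)]
```

The value network moves against the gradient scaled by the time difference error (descent), while the policy network moves along the gradient scaled by the evaluation value (ascent).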
Example 5:
In order to determine the reward value of the original actuator discriminant model for each training, in the embodiments of the present invention, determining the reward value of the original actuator discriminant model according to the third sample characteristic value includes:
determining, according to the first time value of the average vehicle delay time and the first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value corresponding to each phase and a second sum value of the first quantity values corresponding to each phase, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value;
determining, according to the second time value of the average vehicle delay time and the second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value corresponding to each phase and a fourth sum value of the second quantity values corresponding to each phase, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value;
and obtaining the reward value of the original actuator discriminant model according to the fifth sum value of the first reward value and the second reward value.
In order to determine the reward value of the original actuator discriminant model for each training, in the embodiment of the invention, the product of the first time value and the first quantity value corresponding to each phase is determined according to the first time value of the average vehicle delay time and the first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection contained in the third sample characteristic value, and the products corresponding to the phases are added to obtain the first sum value. The first quantity values corresponding to the phases are added to obtain the second sum value, and the negative of the ratio of the first sum value to the second sum value is determined as the first reward value corresponding to the target intersection.
Likewise, the product of the second time value and the second quantity value corresponding to each phase is determined according to the second time value of the average vehicle delay time and the second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection contained in the third sample characteristic value, and the products corresponding to the phases are added to obtain the third sum value. The second quantity values corresponding to the phases are added to obtain the fourth sum value, and the negative of the ratio of the third sum value to the fourth sum value is determined as the second reward value corresponding to the adjacent downstream intersection.
The fifth sum value of the first reward value and the second reward value is determined, and the fifth sum value is determined as the reward value of the original actuator discriminant model.
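The reward computation described above is a flow-weighted average delay with a negative sign, summed over the two intersections. A minimal Python sketch follows; the function names are hypothetical, and returning 0 when no vehicles arrive is an assumption, since the patent does not say how a zero denominator is handled.

```python
def intersection_reward(delays, flows):
    """Negative flow-weighted mean delay over the phases of one intersection.

    delays: average vehicle delay time of each phase.
    flows:  vehicle arrival flow of each phase.
    """
    total_flow = sum(flows)
    if total_flow == 0:
        return 0.0  # assumption: no arriving vehicles means no delay penalty
    weighted_delay = sum(d * f for d, f in zip(delays, flows))
    return -weighted_delay / total_flow


def joint_reward(target_delays, target_flows, downstream_delays, downstream_flows):
    """Sum of the first reward value (target) and the second reward value (downstream)."""
    return (intersection_reward(target_delays, target_flows)
            + intersection_reward(downstream_delays, downstream_flows))
```

With delays `[10.0, 20.0]` and flows `[1, 3]` at the target intersection, the first reward value is -(10 + 60) / 4 = -17.5; a downstream intersection with delay `[4.0]` and flow `[2]` contributes -4.0, so the joint reward is -21.5.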
The process of determining the reward value of the original actuator discriminant model is described below with a specific embodiment. For each target intersection of the traffic trunk, not only the reward value of the target intersection m but also the reward value of the adjacent downstream intersection n is considered when training the actuator discriminant model.
In calculating the reward value for the actuator discriminant model, a joint reward function is specifically employed, wherein the joint reward function is
Figure BDA0003568471590000181
wherein
Figure BDA0003568471590000182
represents the first time value of the average delay time of the ith phase of the target intersection m in a preset control period after the current time t;
Figure BDA0003568471590000183
represents the first quantity value of the vehicle arrival flow of the ith phase of the target intersection m in a preset control period after the current time t;
Figure BDA0003568471590000184
represents the second time value of the average delay time of the ith phase of the adjacent downstream intersection n in a preset control period after the current time t; and
Figure BDA0003568471590000185
represents the second quantity value of the vehicle arrival flow of the ith phase of the adjacent downstream intersection n in a preset control period after the current time t.
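The image of the joint reward function is not reproduced in this text. Based on the verbal description in this example, it can plausibly be written as follows, where the symbols D and F are hypothetical stand-ins for the delay and flow symbols of the original figures:

```latex
% D^{m}_{i,t}, F^{m}_{i,t}: average delay and arrival flow of phase i at target intersection m
% D^{n}_{i,t}, F^{n}_{i,t}: the same quantities at the adjacent downstream intersection n
r_t = -\frac{\sum_i D^{m}_{i,t}\, F^{m}_{i,t}}{\sum_i F^{m}_{i,t}}
      -\frac{\sum_i D^{n}_{i,t}\, F^{n}_{i,t}}{\sum_i F^{n}_{i,t}}
```

Each fraction is the flow-weighted mean delay of one intersection, so a larger (less negative) reward corresponds to shorter delays at both the target and the downstream intersection.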
Example 6:
Fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention, where the device includes:
an obtaining module 401, configured to obtain a target feature value collected at a target intersection and an adjacent downstream intersection, where the target feature value is a feature value of a first preset state feature of each phase in a preset control period before a current moment, where the first preset state feature includes an arrival flow rate per unit time and a lane length occupied by each vehicle when a vehicle queues longest;
a processing module 402, configured to input the target feature value to a pre-trained actuator discriminant model, and obtain a target probability value of each set output;
the control module 403 is configured to determine a target set of the target intersection according to the target probability value of each set, and control traffic signals of each phase in a preset time period after the current time of the target intersection according to a parameter value corresponding to a green light duration of each phase in the target set.
Further, the control module is specifically configured to sample and determine a target set of the target intersection based on the target probability value of each set, where the greater the target probability value of the set, the higher the probability of being sampled; or determining the set with the maximum target probability value as the target set of the target intersection.
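The division of work among the three modules can be pictured as a small pipeline. The class below is only a structural sketch: the class name, the callables passed to the constructor and the greedy selection in `control` are assumptions, and the real device would drive signal hardware rather than return a list.

```python
class TrafficSignalController:
    """Hypothetical composition of the obtaining, processing and control modules."""

    def __init__(self, collect_features, model, candidate_sets):
        self.collect_features = collect_features  # returns the target characteristic values
        self.model = model                        # pre-trained model: features -> probabilities
        self.candidate_sets = candidate_sets      # green light durations per phase, one list per set

    def obtain(self):
        return self.collect_features()

    def process(self, features):
        return self.model(features)

    def control(self, probabilities):
        # Greedy variant: pick the set with the maximum target probability value.
        best = max(range(len(probabilities)), key=lambda i: probabilities[i])
        return self.candidate_sets[best]

    def run_once(self):
        """One control period: obtain features, score the sets, choose the target set."""
        return self.control(self.process(self.obtain()))
```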
Further, the apparatus further comprises:
the training module, configured to: acquire a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period; input the target first characteristic value into an original actuator discriminant model, acquire a first probability value of each sample set output by the original actuator discriminant model, and determine a target sample set according to the first probability value of each sample set; input the parameter value of each parameter in the target sample set into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and acquire, as collected by the simulator, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment; determine the second sample characteristic value as the updated target first characteristic value, determine a reward value for the original actuator discriminant model according to the third sample characteristic value, and update the parameter values in the original actuator discriminant model according to the reward value; and train the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value, in each training calculating the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating, according to the parameter value of each parameter in each sample set and the corresponding first probability value, the expected value of the function of the original actuator discriminant model that determines the sample set probability values, until the expected value is maximum, thereby obtaining the trained actuator discriminant model.
Further, the training module is specifically configured to: determine, according to the first time value of the average vehicle delay time and the first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value corresponding to each phase and a second sum value of the first quantity values corresponding to each phase, and obtain a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value; determine, according to the second time value of the average vehicle delay time and the second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value corresponding to each phase and a fourth sum value of the second quantity values corresponding to each phase, and obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value; and obtain the reward value of the original actuator discriminant model according to the fifth sum value of the first reward value and the second reward value.
Example 7:
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. On the basis of the foregoing embodiments, the present application further provides an electronic device including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with each other through the communication bus 504;
the memory 503 has stored therein a computer program which, when executed by the processor 501, causes the processor 501 to perform the steps of:
acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles are queued longest;
inputting the target characteristic values into a pre-trained actuator discriminant model to obtain target probability values of each set;
and determining a target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.
Further, the processor 501 is specifically configured such that determining the target set of the target intersection according to the target probability value of each set includes:
sampling based on the target probability value of each set to determine the target set of the target intersection, wherein the larger the target probability value of a set, the higher its probability of being sampled; or,
determining the set with the maximum target probability value as the target set of the target intersection.
Further, the training process of the actuator discriminant model performed by the processor 501 includes:
acquiring a target first characteristic value collected by a simulator that simulates the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to control the update of the parameter values of the traffic signal lamp of the target intersection, and acquiring, as collected by the simulator, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment;
determining the second sample characteristic value as the updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminant model according to the reward value;
training the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; in each training, calculating the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating, according to the parameter value of each parameter in each sample set and the corresponding first probability value, the expected value of the function of the original actuator discriminant model that determines the sample set probability values, until the expected value is maximum, thereby obtaining the trained actuator discriminant model.
Further, the determining, by the processor 501, a reward value for the original actuator discriminant model according to the third sample feature value includes:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow for each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value over all phases and a second sum value of the first quantity values over all phases, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow for each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value over all phases and a fourth sum value of the second quantity values over all phases, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value;
and obtaining the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
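The reward computation above amounts to a flow-weighted average delay per intersection, followed by a sum over the two intersections. A minimal sketch is given below; the names `flow_weighted_delay` and `fifth_sum_value` are hypothetical, and returning 0.0 when no vehicles arrive is an assumption. The text only states that the reward is obtained "according to" the fifth sum value, so any sign or scaling applied afterwards is left open here.

```python
from typing import Sequence, Tuple

# One (average vehicle delay time, vehicle arrival flow) pair per phase.
PhaseStats = Sequence[Tuple[float, float]]


def flow_weighted_delay(phases: PhaseStats) -> float:
    """Ratio of sum(delay * flow) to sum(flow) over all phases of one intersection."""
    total_flow = sum(flow for _, flow in phases)
    if total_flow == 0:
        return 0.0  # assumption: no arrivals means no delay contribution
    return sum(delay * flow for delay, flow in phases) / total_flow


def fifth_sum_value(target: PhaseStats, downstream: PhaseStats) -> float:
    """Sum of the first reward value (target) and the second (adjacent downstream)."""
    return flow_weighted_delay(target) + flow_weighted_delay(downstream)
```

For example, a target intersection with phases (10 s, 2 vehicles) and (20 s, 2 vehicles) has a weighted delay of 60 / 4 = 15 s.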
The communication bus of the electronic device mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface 502 is used for communication between the above electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
Example 8:
On the basis of the above embodiments, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program executable by a processor; when the program runs on the processor, it causes the processor to perform the following steps:
acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles are queued longest;
inputting the target characteristic values into a pre-trained actuator discriminant model to obtain the target probability value of each set;
and determining a target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.
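As a rough illustration of the input side of these steps, the target characteristic values could be assembled into one flat vector before being passed to the model. The sketch below is an assumption about layout only: `PhaseObservation`, its field names and `build_target_feature_values` are hypothetical, and the patent text does not prescribe any ordering of the values.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PhaseObservation:
    """First preset state characteristics of one phase over one control period."""
    arrival_flow_rate: float  # arrival flow rate in unit time
    max_queue_length: float   # lane length occupied when the queue is longest


def build_target_feature_values(target: Sequence[PhaseObservation],
                                downstream: Sequence[PhaseObservation]) -> List[float]:
    """Flatten per-phase observations, target intersection first, then downstream."""
    features: List[float] = []
    for obs in list(target) + list(downstream):
        features.extend([obs.arrival_flow_rate, obs.max_queue_length])
    return features
```

Including the adjacent downstream intersection in the same vector is what lets a single model output account for both intersections.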
Further, the determining the target set of the target intersection according to the target probability value of each set includes:
sampling based on the target probability value of each set to determine the target set of the target intersection, wherein the larger the target probability value of a set, the higher its probability of being sampled; or
and determining the set with the maximum target probability value as the target set of the target intersection.
Further, the training process of the actuator discriminant model includes:
acquiring a target first characteristic value acquired by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquiring, as collected by the simulator after the parameter value update, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment;
determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminant model according to the reward value;
training the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; after each training pass, calculating the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating the expected value of the sample-set probability value determining function corresponding to the original actuator discriminant model according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximal, thereby obtaining the trained actuator discriminant model.
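The training loop described above (state in, probability per sample set out, one set applied in the simulator, reward and new state back, parameters updated) can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the candidate sets, the toy delay model in `toy_simulator`, the linear softmax policy, and the REINFORCE-style update with a moving-average baseline, which is swapped in because the text above does not state the exact update rule.

```python
import math
import random
from typing import List, Sequence, Tuple

# Hypothetical sample sets: green duration in seconds for (phase 0, phase 1).
CANDIDATE_SETS: List[Tuple[float, float]] = [(20.0, 40.0), (30.0, 30.0), (40.0, 20.0)]


def toy_simulator(greens: Sequence[float]) -> Tuple[List[float], float]:
    """Stand-in for the traffic simulator: returns (next state, reward)."""
    flows = [3.0, 1.0]  # constant demand per phase, an assumption of this toy
    delays = [100.0 * f / g for f, g in zip(flows, greens)]  # made-up delay model
    weighted_delay = sum(d * f for d, f in zip(delays, flows)) / sum(flows)
    next_state = [f / sum(flows) for f in flows]
    return next_state, -weighted_delay  # less delay gives a larger reward


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def set_probabilities(weights: List[List[float]], state: Sequence[float]) -> List[float]:
    """First probability value of each sample set under a linear softmax policy."""
    return softmax([sum(w * x for w, x in zip(row, state)) for row in weights])


def train(episodes: int = 4000, lr: float = 0.05, seed: int = 0):
    rng = random.Random(seed)
    state, _ = toy_simulator(CANDIDATE_SETS[1])  # initial sample feature values
    weights = [[0.0] * len(state) for _ in CANDIDATE_SETS]
    baseline = None
    for _ in range(episodes):
        probs = set_probabilities(weights, state)
        chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        next_state, reward = toy_simulator(CANDIDATE_SETS[chosen])
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        baseline += 0.1 * advantage  # moving average of past rewards
        for k, row in enumerate(weights):
            # Gradient of log pi(chosen) with respect to logit k.
            grad_logit = (1.0 if k == chosen else 0.0) - probs[k]
            for j, x in enumerate(state):
                row[j] += lr * advantage * grad_logit * x
        state = next_state  # the new features become the updated input
    return weights, state
```

In this toy setting the heavier demand is on phase 0, so the policy is expected to shift probability toward the set that gives phase 0 the longest green.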
Further, the determining a reward value for the original actuator discriminant model from the third sample feature value comprises:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow for each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value over all phases and a second sum value of the first quantity values over all phases, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow for each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value over all phases and a fourth sum value of the second quantity values over all phases, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value;
and obtaining the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (8)

1. A traffic signal control method, characterized in that it is applied to an agent device of each intersection of a traffic trunk road, the method comprising:
acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles are queued longest;
inputting the target characteristic values into a pre-trained actuator discriminant model to obtain target probability values of each set;
determining a target set of the target intersection according to the target probability value of each set, and controlling traffic signals of each phase in a preset time period after the current moment of the target intersection according to a parameter value corresponding to the green light duration of each phase in the target set;
the training process of the actuator discriminant model comprises the following steps:
acquiring a target first characteristic value acquired by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;
inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set;
inputting the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquiring, as collected by the simulator after the parameter value update, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment;
determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminant model according to the reward value;
training the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; after each training pass, calculating the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating the expected value of the sample-set probability value determining function corresponding to the original actuator discriminant model according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximal, thereby obtaining the trained actuator discriminant model.
2. The method of claim 1, wherein determining the target set of target intersections from the target probability values for each set comprises:
sampling based on the target probability value of each set to determine the target set of the target intersection, wherein the larger the target probability value of a set, the higher its probability of being sampled; or
and determining the set with the maximum target probability value as the target set of the target intersection.
3. The method of claim 1, wherein determining a reward value for the original actuator discriminant model according to the third sample characteristic value comprises:
determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow for each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value over all phases and a second sum value of the first quantity values over all phases, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value;
determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow for each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value over all phases and a fourth sum value of the second quantity values over all phases, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value;
and obtaining the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
4. A traffic signal control apparatus, the apparatus comprising:
an acquisition module, configured to acquire target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles are queued longest;
a processing module, configured to input the target characteristic values into a pre-trained actuator discriminant model and acquire the target probability value of each set output by the model;
a control module, configured to determine a target set of the target intersection according to the target probability value of each set, and to control the traffic signals of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set;
a training module, configured to: acquire a target first characteristic value collected by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period; input the target first characteristic value into an original actuator discriminant model, acquire a first probability value of each sample set output by the original actuator discriminant model, and determine a target sample set according to the first probability value of each sample set; input the parameter value of each parameter in the target sample set into the simulator to update the parameter values that control the traffic signal lamp of the target intersection, and acquire, as collected by the simulator after the parameter value update, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment; determine the second sample characteristic value as an updated target first characteristic value, determine a reward value for the original actuator discriminant model according to the third sample characteristic value, and update a parameter value in the original actuator discriminant model according to the reward value; train the original actuator discriminant model with the updated parameter values according to the updated target first characteristic value; after each training pass, calculate the first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculate the expected value of the sample-set probability value determining function corresponding to the original actuator discriminant model according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximal, thereby obtaining the trained actuator discriminant model.
5. The apparatus of claim 4, wherein the control module is configured to sample and determine a target set of the target intersection based on the target probability value for each set, wherein the greater the target probability value of a set, the greater the likelihood of being sampled; or determining the set with the maximum target probability value as the target set of the target intersection.
6. The apparatus of claim 4, wherein the training module is specifically configured to: determine, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow for each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value over all phases and a second sum value of the first quantity values over all phases, and obtain a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value; determine, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow for each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value over all phases and a fourth sum value of the second quantity values over all phases, and obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value; and obtain the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
7. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory stores a computer program which, when executed by the processor, causes the processor to implement the steps of the traffic signal control method according to any one of claims 1-3.
8. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the traffic signal control method according to any one of claims 1-3.
CN202210314258.3A 2022-03-28 2022-03-28 Traffic signal control method, device, equipment and medium Active CN114639255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210314258.3A CN114639255B (en) 2022-03-28 2022-03-28 Traffic signal control method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN114639255A CN114639255A (en) 2022-06-17
CN114639255B true CN114639255B (en) 2023-06-09

Family

ID=81952690

Country Status (1)

Country Link
CN (1) CN114639255B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060475A (en) * 2019-04-17 2019-07-26 清华大学 A kind of multi-intersection signal lamp cooperative control method based on deeply study
CN112201060A (en) * 2020-09-27 2021-01-08 航天科工广信智能技术有限公司 Actor-critical-based single-intersection traffic signal control method
WO2021051870A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Reinforcement learning model-based information control method and apparatus, and computer device
CN113257016A (en) * 2021-06-21 2021-08-13 腾讯科技(深圳)有限公司 Traffic signal control method and device and readable storage medium
CN113487860A (en) * 2021-06-28 2021-10-08 南京云创大数据科技股份有限公司 Intelligent traffic signal control method
CN113643528A (en) * 2021-07-01 2021-11-12 腾讯科技(深圳)有限公司 Signal lamp control method, model training method, system, device and storage medium
WO2021232387A1 (en) * 2020-05-22 2021-11-25 南京云创大数据科技股份有限公司 Multifunctional intelligent signal control system
CN113763723A (en) * 2021-09-06 2021-12-07 武汉理工大学 Traffic signal lamp control system and method based on reinforcement learning and dynamic timing
CN114120670A (en) * 2021-11-25 2022-03-01 支付宝(杭州)信息技术有限公司 Method and system for traffic signal control

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102610094A (en) * 2012-04-05 2012-07-25 郭海锋 Traffic control method for dynamic coordination according to effective capacity of road section
CN103106801B (en) * 2013-01-14 2015-05-20 上海应用技术学院 Self-organizing traffic signal coordination control method
CN104966402B (en) * 2015-06-05 2017-03-01 吉林大学 Queue up and overflow preventing control method in a kind of supersaturation traffic flow crossing
CN110060480B (en) * 2019-05-29 2021-09-07 招商局重庆交通科研设计院有限公司 Method for controlling traffic flow running time of road section
KR102155055B1 (en) * 2019-10-28 2020-09-11 라온피플 주식회사 Apparatus and method for controlling traffic signal based on reinforcement learning
CN112632858A (en) * 2020-12-23 2021-04-09 浙江工业大学 Traffic light signal control method based on Actor-critical frame deep reinforcement learning algorithm
CN113299085A (en) * 2021-06-11 2021-08-24 昭通亮风台信息科技有限公司 Traffic signal lamp control method, equipment and storage medium
CN113628442B (en) * 2021-08-06 2022-10-14 成都信息工程大学 Traffic organization scheme optimization method based on multi-signal-lamp reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
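A review of research on adaptive traffic signal coordination and timing decisions for urban road networks under multi-agent reinforcement learning (多Agent强化学习下的城市路网自适应交通信号协调配时决策研究综述); Xia Xinhai (夏新海); Transport Research (交通运输研究), No. 02; full text *
A survey of the application of multi-agent reinforcement learning in signal control methods for urban traffic networks (多智能体强化学习在城市交通网络信号控制方法中的应用综述); Yang Wenchen (杨文臣), Zhang Lun (张轮), Zhu Feng; Application Research of Computers (计算机应用研究), No. 06; full text *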

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant