CN114639255A - Traffic signal control method, device, equipment and medium

Info

Publication number: CN114639255A
Application number: CN202210314258.3A
Authority: CN
Inventors: 相强强; 程兴硕; 王泽�; 伍召举
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-06-17
Anticipated expiration: 2042-03-28
Also published as: CN114639255B

Abstract

The invention discloses a traffic signal control method, a device, equipment and a medium, wherein target characteristic values acquired at a target intersection and adjacent downstream intersections are acquired, a target probability value of an action control parameter corresponding to the input target characteristic value is acquired based on an actuator discriminator model which is trained in advance, a target preset action control parameter of the target intersection is determined according to the target probability value, and a traffic signal of the target intersection is controlled. According to the invention, when the intelligent device corresponding to each intersection determines the target preset action control parameter of the intersection, the target characteristic value of the adjacent downstream intersection is considered, so that the competition among the intelligent devices of each intersection is reduced, and the overall control effect of the traffic trunk is improved.

Description

Traffic signal control method, device, equipment and medium

Technical Field

The present invention relates to the technical field of traffic signal control, and in particular, to a traffic signal control method, apparatus, device, and medium.

Background

With population growth and accelerating urbanization, urban travel demand has increased sharply, and the existing traffic infrastructure struggles to meet it, causing both recurrent and non-recurrent congestion in urban traffic. Traffic signal control is the core of urban traffic management and control: a scientific and reasonable signal control scheme can maximize intersection throughput, improve the operating efficiency of the urban road network and the traffic capacity of intersections, and reduce the frequency and intensity of traffic conflicts, thereby relieving urban traffic congestion.

In the prior art, adaptive traffic signal control schemes control traffic signals mainly through prediction by a fixed traffic model, selection among preset signal control schemes, or real-time prediction by a traffic simulation model. Such schemes are essentially driven by a simulation model, and require the parameters of the traffic simulation model to be calibrated and a predefined signal control scheme to be designed in advance for the actual traffic scene.

Most multi-intersection traffic signal control based on deep reinforcement learning in the prior art simply transplants single-intersection signal control to multiple intersections: the agent devices at the intersections all use the same neural network model obtained through deep reinforcement learning, but each agent device, when using the model, is only required to ensure the optimal control effect of the traffic signals at its own intersection. Because adjacent intersections affect one another, if the vehicle passing rate is used as the criterion for evaluating the control effect, an agent device is likely to set the traffic signal lamp of its own intersection to green for a preset time period after the current moment, regardless of the influence on the adjacent downstream intersection. Traffic congestion then occurs at the adjacent downstream intersection, and the overall control effect of the traffic trunk line is poor.

Disclosure of Invention

The invention provides a traffic signal control method, a traffic signal control device, traffic signal control equipment and a traffic signal control medium, which are used for solving the problem of poor overall control effect of a traffic trunk line in the prior art.

The invention provides a traffic signal control method, applied to the intelligent device corresponding to each intersection of a traffic trunk line, the method comprising the following steps:

acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles queue for the longest time;

inputting the target characteristic values into an actuator discriminator model which is trained in advance, and acquiring the target probability value of each output set;

and determining the target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current time of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.

Further, the determining the target set of the target intersection according to the target probability value of each set comprises:

sampling based on the target probability value of each set to determine the target set of the target intersection, wherein the higher the target probability value of a set, the higher the probability that the set is sampled; or

and determining the set with the maximum target probability value as the target set of the target intersection.

Further, the training process of the actuator discriminator model comprises:

acquiring a first target characteristic value acquired by a simulator simulating the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;

inputting the first target characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set;

inputting the parameter value of each parameter in the target sample set into the simulator to control the parameter value of a traffic signal lamp of the target intersection to update, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection collected by the simulator in a preset control period after the current moment;

determining the second sample characteristic value as an updated first target characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating the parameter values in the original actuator discriminator model according to the reward value;

training the original actuator discriminator model having the updated parameter values on the updated first target characteristic value; in each round of training, calculating, based on that round's updated parameter values and updated first target characteristic value, the first probability value of each sample set output by the original actuator discriminator model, and calculating, from the parameter value of each parameter in each sample set and the corresponding first probability value, an expected value of the sample set probability value determining function corresponding to the original actuator discriminator model; and obtaining the trained actuator discriminator model when the expected value reaches its maximum.

Further, the determining a reward value for the original actuator discriminator model from the third sample feature value comprises:

according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow, both contained in the third sample characteristic value and corresponding to each phase of the target intersection, determining a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of each phase, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum;

according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow, both contained in the third sample characteristic value and corresponding to each phase of the adjacent downstream intersection, determining a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of each phase, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum;

and obtaining the reward value of the original actuator discriminator model according to the fifth sum of the first reward value and the second reward value.
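The reward calculation described above can be sketched in a few lines. This is only an illustrative sketch: the function names, the handling of zero flow, and in particular the `sign` parameter are assumptions, since the text states only that the reward is obtained "according to" the ratios and their sum. A negative sign is a common choice when lower delay should yield a higher reward.

```python
from typing import Sequence


def weighted_mean_delay(delays: Sequence[float], flows: Sequence[float]) -> float:
    """Return sum(delay_i * flow_i) / sum(flow_i) over the phases of one intersection."""
    if len(delays) != len(flows):
        raise ValueError("delays and flows need one entry per phase")
    total_flow = sum(flows)
    if total_flow == 0:
        # Assumption: with no arriving vehicles there is no delay to account for.
        return 0.0
    return sum(d * q for d, q in zip(delays, flows)) / total_flow


def combined_reward(
    target_delays: Sequence[float],
    target_flows: Sequence[float],
    downstream_delays: Sequence[float],
    downstream_flows: Sequence[float],
    sign: float = -1.0,  # Assumption: the sign convention is not stated in the text.
) -> float:
    """Combine the target and downstream intersection terms into one reward."""
    first = weighted_mean_delay(target_delays, target_flows)           # first sum / second sum
    second = weighted_mean_delay(downstream_delays, downstream_flows)  # third sum / fourth sum
    return sign * (first + second)                                     # based on the fifth sum


if __name__ == "__main__":
    print(combined_reward([10.0, 20.0], [1.0, 3.0], [30.0], [2.0]))
```

Because the downstream term enters the same reward, a signal plan that clears the target intersection by flooding its neighbour is penalised rather than rewarded.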

Accordingly, the present invention provides a traffic signal control apparatus, said apparatus comprising:

an acquisition module, configured to acquire target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the lane length occupied by each vehicle when the vehicle queue is longest;

a processing module, configured to input the target characteristic values into an actuator discriminator model which is trained in advance, and to acquire the target probability value of each set output by the model;

and a control module, configured to determine the target set of the target intersection according to the target probability value of each set, and to control the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.

Further, the control module is specifically configured to sample and determine the target set of the target intersection based on the target probability value of each set, where the higher the target probability value of a set is, the higher the possibility of being sampled is; or determining the set with the maximum target probability value as the target set of the target intersection.

Further, the apparatus further comprises:

a training module, configured to: acquire a first target characteristic value collected by a simulator simulating the target intersection and the adjacent downstream intersection, wherein the first target characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period; input the first target characteristic value into an original actuator discriminator model, acquire a first probability value of each sample set output by the original actuator discriminator model, and determine a target sample set according to the first probability value of each sample set; input the parameter value of each parameter in the target sample set into the simulator to update the parameter values of the traffic signal lamp of the target intersection, and acquire a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in a preset control period after the current moment; determine the second sample characteristic value as an updated first target characteristic value, determine a reward value for the original actuator discriminator model according to the third sample characteristic value, and update the parameter values in the original actuator discriminator model according to the reward value; train the original actuator discriminator model having the updated parameter values on the updated first target characteristic value; in each round of training, calculate, based on that round's updated parameter values and updated first target characteristic value, the first probability value of each sample set output by the original actuator discriminator model, and calculate, from the parameter value of each parameter in each sample set and the corresponding first probability value, an expected value of the sample set probability value determining function corresponding to the original actuator discriminator model; and obtain the trained actuator discriminator model when the expected value reaches its maximum.

Further, the training module is specifically configured to: according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow, both contained in the third sample characteristic value and corresponding to each phase of the target intersection, determine a first sum of the products of the first time value and the first quantity value of each phase and a second sum of the first quantity values of each phase, and obtain a first reward value corresponding to the target intersection according to the ratio of the first sum to the second sum; according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow, both contained in the third sample characteristic value and corresponding to each phase of the adjacent downstream intersection, determine a third sum of the products of the second time value and the second quantity value of each phase and a fourth sum of the second quantity values of each phase, and obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum to the fourth sum; and obtain the reward value of the original actuator discriminator model according to a fifth sum of the first reward value and the second reward value.

Accordingly, the present invention provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;

the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of any of the above-described traffic signal control methods.

Accordingly, the present invention provides a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of any of the above-mentioned traffic signal control methods.

The invention provides a traffic signal control method, a device, equipment and a medium, wherein target characteristic values acquired at a target intersection and adjacent downstream intersections are acquired, a target probability value of an action control parameter corresponding to the input target characteristic value is acquired based on an actuator discriminator model which is trained in advance, a target preset action control parameter of the target intersection is determined according to the target probability value, and a traffic signal of the target intersection is controlled. According to the invention, when the intelligent device corresponding to each intersection determines the target preset action control parameter of the intersection, the target characteristic value of the adjacent downstream intersection is considered, so that the competition among the intelligent devices of each intersection is reduced, and the overall control effect of the traffic trunk is improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a traffic trunk line according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of an actuator discriminator model training process according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to improve the overall control effect of a traffic trunk line, the embodiment of the invention provides a traffic signal control method, a traffic signal control device, traffic signal control equipment and a traffic signal control medium.

Example 1:

Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention. The process includes the following steps:

S101: acquiring target characteristic values collected at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate in unit time and the lane length occupied by each vehicle when the vehicle queue is longest.

The traffic signal control method provided by the embodiment of the invention is applied to the intelligent device corresponding to each intersection of a traffic trunk line. The intelligent device may be an intelligent terminal for controlling traffic signal lamps, such as a PC (personal computer), a tablet computer or a mobile terminal, or it may be a server for controlling the traffic signal lamps; the server may be a local server, a cloud server, or a controller of a traffic signal lamp. The embodiment of the present invention does not limit this.

In order to improve the overall control effect of the traffic trunk line, the intelligent device acquires target characteristic values collected at a target intersection and an adjacent downstream intersection. The target intersection is the intersection corresponding to the intelligent device, namely the intersection where the traffic signal lamp controlled by the intelligent device is located; the adjacent downstream intersection is the intersection that is adjacent to the target intersection and located downstream of it in the vehicle passing direction. Fig. 2 is a schematic diagram of a traffic trunk line according to an embodiment of the present invention. Taking vehicles travelling from left to right in Fig. 2 as an example, if the intersection on the left side of Fig. 2 is the target intersection, the intersection in the middle of Fig. 2 is the adjacent downstream intersection.

Specifically, the intelligent device may obtain the target characteristic value of the target intersection collected by an image acquisition device connected to the intelligent device, and the target characteristic value of the downstream intersection collected by an image acquisition device that is connected to the intelligent device and corresponds to the adjacent downstream intersection. Alternatively, it may obtain the target characteristic value of the target intersection collected by its own image acquisition unit, and the target characteristic value of the downstream intersection sent by the intelligent device corresponding to the adjacent downstream intersection.

The target characteristic value is a characteristic value of a first preset state characteristic of each phase in a preset control period before the current moment. The preset control period is the largest of the optimal periods of the intersections in the traffic trunk line, each calculated according to the Webster optimal period formula. The first preset state characteristic comprises the arrival flow rate in unit time and the lane length occupied by each vehicle when the vehicle queue is longest. The arrival flow rate in unit time is the ratio of the arrival flow of vehicles passing through the intersection in the preset control period before the current moment to the duration of the preset control period. The lane length occupied by each vehicle when the vehicle queue is longest is the ratio of the maximum queuing length in the preset control period before the current moment to the number of vehicles corresponding to that length.
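As an illustration of how the preset control period might be chosen, the sketch below uses the commonly cited form of the Webster optimal cycle formula, C0 = (1.5 L + 5) / (1 - Y), where L is the total lost time per cycle in seconds and Y is the sum of the critical flow ratios of the phases. The function names and the input layout are assumptions made for this example and are not taken from the patent text.

```python
from typing import Iterable, Sequence, Tuple


def webster_optimal_cycle(lost_time_s: float, critical_flow_ratios: Sequence[float]) -> float:
    """Webster optimal cycle length in seconds: (1.5 * L + 5) / (1 - Y)."""
    y_total = sum(critical_flow_ratios)
    if not 0.0 <= y_total < 1.0:
        raise ValueError("sum of critical flow ratios must lie in [0, 1)")
    return (1.5 * lost_time_s + 5.0) / (1.0 - y_total)


def preset_control_period(intersections: Iterable[Tuple[float, Sequence[float]]]) -> float:
    """Largest optimal cycle over all intersections of the trunk line.

    Each item is (lost_time_s, critical_flow_ratios) for one intersection
    (hypothetical input layout).
    """
    return max(webster_optimal_cycle(lost, ratios) for lost, ratios in intersections)


if __name__ == "__main__":
    trunk = [(10.0, [0.25, 0.25]), (20.0, [0.5, 0.25])]
    print(preset_control_period(trunk))
```

Taking the maximum gives every intersection on the trunk line the same control period, which keeps the features of neighbouring intersections aligned in time.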

For example, the target intersection is denoted by m and the adjacent downstream intersection by n, and the target characteristic value consists of the characteristic values of the target intersection m together with those of the adjacent downstream intersection n.

For the ith phase of the target intersection m, within a preset control period before the current moment, the target characteristic value involves the following quantities:

the maximum queuing length;

the number of vehicles corresponding to the maximum queuing length;

the lane length occupied by each vehicle when the vehicle queue is longest;

the arrival flow rate per unit time.

For the ith phase of the adjacent downstream intersection n, within a preset control period before the current moment, the target characteristic value involves the corresponding quantities:

the maximum queuing length;

the number of vehicles corresponding to the maximum queuing length;

the lane length occupied by each vehicle when the vehicle queue is longest;

the arrival flow rate per unit time.
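A minimal sketch of how such a target characteristic value could be assembled from per-phase observations is given below. The class name, field names, units (metres, seconds), the ordering of the vector and the treatment of an empty queue are all assumptions for illustration; the text defines only the two ratios.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PhaseObservation:
    """Raw measurements for one phase over one preset control period (hypothetical layout)."""
    arrivals: int               # vehicles that arrived during the period
    max_queue_length_m: float   # longest queue observed, in metres (assumed unit)
    max_queue_vehicles: int     # number of vehicles in that longest queue


def phase_features(obs: PhaseObservation, period_s: float) -> List[float]:
    """Return [arrival flow rate per unit time, lane length occupied per vehicle]."""
    flow_rate = obs.arrivals / period_s
    # Assumption: report 0.0 when no queue formed, to avoid dividing by zero.
    lane_length = (
        obs.max_queue_length_m / obs.max_queue_vehicles if obs.max_queue_vehicles else 0.0
    )
    return [flow_rate, lane_length]


def target_feature_vector(
    target_phases: Sequence[PhaseObservation],
    downstream_phases: Sequence[PhaseObservation],
    period_s: float,
) -> List[float]:
    """Concatenate the features of intersection m followed by those of intersection n."""
    features: List[float] = []
    for obs in list(target_phases) + list(downstream_phases):
        features.extend(phase_features(obs, period_s))
    return features


if __name__ == "__main__":
    m = [PhaseObservation(30, 42.0, 6)]
    n = [PhaseObservation(0, 0.0, 0)]
    print(target_feature_vector(m, n, 120.0))
```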

S102: inputting the target characteristic values into a pre-trained actuator discriminator model, and acquiring the target probability value of each set output by the model.

In order to control the traffic signals, in the embodiment of the present invention the agent device stores an actuator discriminator model trained in advance. The model is trained to determine the target probability value of each set from the target characteristic values collected at the target intersection and the adjacent downstream intersection. Each set contains parameter values of the action control parameters of the traffic signal at the target intersection, and the action control parameters include the green signal ratio of each phase of the traffic signal, that is, the ratio of the green light duration to the duration of the preset control period.

For example, the action control parameters of the target intersection m consist of the green signal ratio of each phase, where the green signal ratio of the ith phase is the ratio of the green light duration of the ith phase of the target intersection m to the duration of the preset control period.

After the target characteristic values collected at the target intersection and the adjacent downstream intersection are obtained, they are input into the actuator discriminator model trained in advance, and the target probability value of each set output by the model is obtained.

S103: determining the target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.

After the target probability value of each set is determined, the target set of the target intersection is determined according to these values. Specifically, the set corresponding to the median of the target probability values may be determined as the target set, the set corresponding to the maximum of the target probability values may be determined as the target set, or another method may be used to select the target set from the sets; the embodiment of the present invention does not limit this.

After the target set is determined, the traffic signal lamp of each phase of the target intersection is controlled in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set. In the embodiment of the invention, the traffic signal lamp of the target intersection is controlled once every preset control period, and each time the target set for the next control period is determined from the target characteristic value collected in the previous preset control period.
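To make the last step concrete, the sketch below converts the green signal ratios of a chosen target set into green light durations for the next preset control period. The function name and the validity checks are assumptions; in particular, the requirement that the ratios sum to at most 1 (leaving the remainder for lost time such as yellow or all-red intervals) is an assumption of this example and is not stated in the text.

```python
from typing import List, Sequence


def green_durations(green_ratios: Sequence[float], period_s: float) -> List[float]:
    """Green light duration per phase: green signal ratio times the control period."""
    if any(r < 0.0 or r > 1.0 for r in green_ratios):
        raise ValueError("each green signal ratio must lie in [0, 1]")
    # Assumption: the phases share one cycle, so their ratios cannot exceed 1 in total.
    if sum(green_ratios) > 1.0 + 1e-9:
        raise ValueError("green signal ratios must sum to at most 1")
    return [ratio * period_s for ratio in green_ratios]


if __name__ == "__main__":
    # A two-phase target set applied to a 120-second control period.
    print(green_durations([0.5, 0.25], 120.0))
```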

In the embodiment of the invention, the traffic signal control method provided by the invention is characterized in that the target characteristic values collected at the target intersection and the adjacent downstream intersections are obtained, the target probability values of the action control parameters correspondingly output to the input target characteristic values are obtained based on the actuator discriminator model which is trained in advance, the target preset action control parameters of the target intersection are determined according to the target probability values, and the traffic signals of the target intersection are controlled. According to the invention, when the intelligent device corresponding to each intersection determines the target preset action control parameter of the intersection, the target characteristic value of the adjacent downstream intersection is considered, so that the competition among the intelligent devices of each intersection is reduced, and the overall control effect of the traffic trunk is improved.

Example 2:

In order to determine the target set of the target intersection, on the basis of the above embodiment, in an embodiment of the present invention, the determining the target set of the target intersection according to the target probability value of each set includes:

sampling based on the target probability value of each set to determine the target set of the target intersection, wherein the sets are not equally likely to be sampled: the higher the target probability value of a set, the more likely the set is to be sampled, and the lower the target probability value, the less likely.

As a possible implementation manner, in the embodiment of the present invention, according to the target probability value of each set, the set with the largest target probability value may also be determined as the target set of the target intersection.
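The two selection strategies can be sketched as follows, using only the Python standard library. The function names are hypothetical, and each function returns the index of the selected set rather than the set itself.

```python
import random
from typing import Optional, Sequence


def select_by_sampling(probabilities: Sequence[float], rng: Optional[random.Random] = None) -> int:
    """Draw one set index; a higher target probability value means a higher chance."""
    source = rng or random
    return source.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def select_by_argmax(probabilities: Sequence[float]) -> int:
    """Return the index of the set with the largest target probability value."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.3]
    print(select_by_argmax(probs))
    print(select_by_sampling(probs, random.Random(0)))
```

Sampling keeps some exploration in the choice of signal plan, while taking the maximum is deterministic; which is preferable in deployment is a design decision the text leaves open.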

Example 3:

In order to train the actuator discriminator model, on the basis of the above embodiments, in an embodiment of the present invention, the training process of the actuator discriminator model includes:

In the embodiment of the present invention, a simulator for simulating the target intersection and the adjacent downstream intersection is stored in advance; the simulator is specifically configured to simulate changes in the traffic state at the intersections.

A first target characteristic value collected by the simulator simulating the target intersection and the adjacent downstream intersection is acquired. The first target characteristic value is a first sample characteristic value of the first preset state characteristic of each phase in a preset control period, and comprises the first sample characteristic value corresponding to the target intersection and the first sample characteristic value corresponding to the adjacent downstream intersection.

Specifically, when the actuator discriminator model is trained for the first time, the first target characteristic value is an initial characteristic value pre-stored in the simulator; in each subsequent round of training, it is the characteristic value collected after the simulator has simulated the target intersection and the adjacent downstream intersection.

After the first target characteristic value collected by the simulator is acquired, it is input into the original actuator discriminator model, and a first probability value of each sample set output by the original actuator discriminator model is acquired, each sample set being preset. The target sample set is then determined by sampling according to the first probability value of each sample set, wherein the higher the first probability value of a sample set, the higher the probability that it is sampled.

The parameter value of each parameter in the target sample set is input into the simulator, so that the parameter values of the traffic signal lamp of the simulated target intersection are updated to the parameter values of the corresponding parameter of each phase for the preset control period.

After the simulator has simulated the target intersection and the adjacent downstream intersection for a preset control period after the current moment, a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in that preset control period, as collected by the simulator, are acquired. The second preset state characteristic comprises the average vehicle delay time and the vehicle arrival flow.

To prepare for the next round of training, the second sample characteristic value is determined as the updated target first characteristic value, the reward value for the original actuator discriminator model is determined according to the third sample characteristic value, and the parameter values in the original actuator discriminator model are updated by a temporal difference learning algorithm according to the reward value. The reward value combines the reward produced by the change in the environmental state of the target intersection, as described by the second preset state characteristic, with the reward produced by the second preset state characteristic in the environmental state of the adjacent downstream intersection.

The original actuator discriminator model with the updated parameter values is then trained according to the updated target first characteristic value; that is, the above steps are repeated with the updated target first characteristic value. In each round of training, the expected value of the probability value determining function corresponding to the original actuator discriminator model is calculated based on the parameter values and the target first characteristic value updated in that round, and the trained actuator discriminator model is obtained when the expected value reaches its maximum.

Specifically, for each parameter, the parameter value in each sample set output in a round of training is multiplied by the first probability value of that sample set, and the products are summed over all sample sets; the expected value comprises the sum obtained for each parameter.
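The probability-weighted sum described above can be sketched as follows. The sample sets, the probabilities and the function name `expected_parameter_values` are illustrative assumptions, not values taken from the patent.

```python
def expected_parameter_values(sample_sets, probabilities):
    """Return, for each parameter position, the sum over sample sets of value * probability."""
    if len(sample_sets) != len(probabilities):
        raise ValueError("one probability per sample set is required")
    n_params = len(sample_sets[0])
    return [
        sum(p * s[j] for s, p in zip(sample_sets, probabilities))
        for j in range(n_params)
    ]


# Hypothetical sample sets: green-light durations in seconds for two phases.
sample_sets = [(20, 30), (30, 20)]
probabilities = [0.25, 0.75]

expected = expected_parameter_values(sample_sets, probabilities)
print(expected)  # [27.5, 22.5]
```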

Example 4:

To train the actuator discriminator model, in the embodiment of the invention, each round of training corresponds to one preset control period. In each round, the target first characteristic value collected by the simulator simulating the target intersection and the adjacent downstream intersection is acquired and input into the policy network that serves as the actuator in the original actuator discriminator model. The policy network outputs a first probability value of each sample set according to its corresponding state value function, which is used to determine sample set probability values, and according to the first evaluation value provided by the value network of the discriminator in the previous round of training. A target sample set is then determined according to the first probability value of each sample set.

The target first characteristic value and the parameter value of each parameter in the target sample set are input into the value network that serves as the discriminator in the original actuator discriminator model, and a first evaluation value of the target sample set given by the value network is acquired.

The parameter value of each parameter in the target sample set is input into the simulator to update the parameter values of the traffic signal lamp of the target intersection, and the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in the preset control period after the current moment, are acquired. The reward value for the value network in the original actuator discriminator model is determined according to the third sample characteristic value.

The second sample characteristic value is input into the policy network that serves as the actuator in the original actuator discriminator model, a predicted first probability value of each sample set output by the policy network is acquired, and a prediction target sample set is determined according to the predicted first probability value of each sample set. The second sample characteristic value and the parameter value of each parameter in the prediction target sample set are input into the value network that serves as the discriminator in the original actuator discriminator model, and a second evaluation value of the prediction target sample set given by the value network is acquired.

A temporal difference error value is determined by a temporal difference algorithm according to the first evaluation value, the reward value and the second evaluation value. The derivative value of the action value function corresponding to the value network with respect to the first parameter is determined, the product value of this derivative value and the temporal difference error value is calculated, and the first parameter value is updated according to the product value. Specifically, the product of the product value and a preset first learning rate is subtracted from the first parameter value to obtain the updated first parameter value.

The second parameter value of the policy network is updated by stochastic gradient ascent according to the first evaluation value and the derivative value, with respect to the second parameter, of the state value function corresponding to the policy network. Specifically, the product value of the first evaluation value and the derivative value is determined, and the product of this product value and a preset second learning rate is added to the second parameter value to obtain the updated second parameter value.

The second sample characteristic value is determined as the updated target first characteristic value, and the original actuator discriminator model with the updated first parameter value and second parameter value is trained again. In each round of training, the first probability value of each sample set output by the original actuator discriminator model is calculated based on the parameter values and the target first characteristic value updated in that round, and the expected value of the state value function, which corresponds to the actuator in the original actuator discriminator model and is used to determine sample set probability values, is calculated according to the parameter value of each parameter in each sample set and the corresponding first probability value. The trained actuator discriminator model is obtained when the expected value reaches its maximum.

In the embodiment of the invention, the actuator discriminator model adopts an actor-critic algorithm to perform cooperative optimization control of the traffic signals of a traffic trunk. The actor-critic algorithm is a type of policy learning, in which a neural network is used to approximate the policy function $\pi(a \mid s)$. In the embodiment of the invention, the neural network approximating the policy function is the policy network, expressed as $\pi(a \mid s; \theta)$, which serves as the actuator in the actuator discriminator model.

The input of the policy network is the characteristic value $s$ of the first preset state characteristic of each phase, collected at the target intersection and the adjacent downstream intersection in the preset control period before the current time. Its output is a probability distribution over the sets $a$ of parameter values of the action control parameters, with $\sum_{a \in A} \pi(a \mid s; \theta) = 1$.
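A minimal sketch of such an output layer is given below, assuming a single linear layer followed by a softmax as a stand-in for the policy network; the patent does not specify the network architecture. The state vector, the weight matrix and the function names are hypothetical. The softmax guarantees that the probabilities of all candidate sets are positive and sum to 1.

```python
import math


def softmax(logits):
    """Convert arbitrary scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probabilities(state, theta):
    """Linear stand-in for pi(a | s; theta): one weight row per candidate set a."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)


# Hypothetical state: arrival flow and queue length for the target and downstream intersections.
state = [0.4, 0.7, 0.2, 0.5]
# Hypothetical parameters theta for three candidate sets of action control parameters.
theta = [
    [0.1, 0.3, -0.2, 0.0],
    [0.5, -0.1, 0.2, 0.4],
    [-0.3, 0.2, 0.1, 0.1],
]

probs = policy_probabilities(state, theta)
print(probs, sum(probs))
```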

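The state value function corresponding to the policy function can be approximately expressed as $V_\pi(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta)\, Q_\pi(s_t, a)$. In the embodiment of the invention, the parameter $\theta$ is updated by stochastic gradient ascent, where

$$\frac{\partial V_\pi(s_t; \theta)}{\partial \theta}$$

is referred to as the policy gradient.

The policy gradient may be represented by the following equation:

$$\frac{\partial V_\pi(s_t; \theta)}{\partial \theta} = \mathbb{E}_{A \sim \pi(\cdot \mid s_t; \theta)}\left[\frac{\partial \ln \pi(A \mid s_t; \theta)}{\partial \theta}\, Q_\pi(s_t, A)\right]$$

Since the policy gradient cannot be calculated directly, it can be approximated by a Monte Carlo method, namely

$$g(a_t, \theta) = \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\, Q_\pi(s_t, a_t), \qquad a_t \sim \pi(\cdot \mid s_t; \theta)$$

Since $Q_\pi(s_t, a_t)$ in the policy gradient is not known, the value network $q(s, a; \omega)$ can be used to approximate the action value function. The value network $q(s, a; \omega)$ is a neural network.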

The policy network $\pi(a \mid s; \theta)$ is called the actor, and the value network $q(s, a; \omega)$ is called the critic. When the policy network $\pi(a \mid s; \theta)$ is learned, the supervision signal comes from the evaluation value $q$ provided by the value network $q(s, a; \omega)$; when the value network $q(s, a; \omega)$ is learned, it comes from the reward value $r$.

The parameters $\theta$ and $\omega$ are updated simultaneously when the actuator discriminator model is trained. The parameter $\theta$ of the policy network is updated with the policy gradient in order to increase the state value function $V_\pi(s_t; \theta)$, and the parameter $\omega$ of the value network is updated with the temporal difference algorithm in order to make the output evaluation value $q$ more accurate.

Fig. 3 is a schematic diagram of the training of the actuator discriminator model according to an embodiment of the present invention. As shown in fig. 3, the actuator discriminator model includes a policy network serving as the actuator and a value network serving as the discriminator. The simulator simulates the traffic state in the environment of the target intersection and the adjacent downstream intersection, provides the reward value, and outputs the characteristic value of the first preset state characteristic of the environment. The value network provides the evaluation value $q$ to the policy network, and the policy network outputs the probability value of each set of action control parameters.

The training of the actuator discriminator model of the invention is described below through a specific embodiment. The policy network $\pi(a \mid s; \theta)$ and the value network $q(s, a; \omega)$ in the actuator discriminator model are initialized randomly.

The intelligent device acquires the target characteristic value $s_t$ of the target intersection and the adjacent downstream intersection collected in the simulator; in the first round of training, $s_t$ is the pre-stored initial characteristic value $s_0$. The target characteristic value $s_t$ is input into the policy network $\pi(a \mid s; \theta)$, and the target sample set $a_t$ is determined by sampling based on the first probability value of each sample set output by the policy network. The target characteristic value $s_t$ and the parameter value of each parameter in the target sample set $a_t$ are input into the value network $q(s, a; \omega)$ to obtain the first evaluation value $q_t = q(s_t, a_t; \omega_t)$ of the target sample set given by the value network.

The parameter value of each parameter in the target sample set $a_t$ is input into the simulator to update the parameter values of the traffic signal lamp of the target intersection, and the second sample characteristic value $s_{t+1}$ of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection, collected by the simulator in the preset control period after the current moment, are acquired. The reward value $r_t$ for the value network in the original actuator discriminator model is determined according to the third sample characteristic value.

The second sample characteristic value $s_{t+1}$ is input into the policy network $\pi(a \mid s; \theta)$ serving as the actuator in the original actuator discriminator model, and the prediction target sample set $\tilde{a}_{t+1}$ is determined by sampling based on the predicted first probability value of each sample set output by the policy network. The second sample characteristic value $s_{t+1}$ and the parameter value of each parameter in the prediction target sample set $\tilde{a}_{t+1}$ are input into the value network $q(s, a; \omega)$ to obtain the second evaluation value $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \omega_t)$ of the prediction target sample set given by the value network.

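According to the first evaluation value $q_t$, the reward value $r_t$ and the second evaluation value $q_{t+1}$, the temporal difference error value $\delta_t$ is calculated, where $\delta_t = q_t - (r_t + \gamma q_{t+1})$ and $\gamma$ is the discount factor. The action value function corresponding to the value network is differentiated to obtain the derivative value $d_{\omega,t}$, where

$$d_{\omega,t} = \left.\frac{\partial q(s_t, a_t; \omega)}{\partial \omega}\right|_{\omega = \omega_t}$$

The first parameter value $\omega_t$ of the value network is updated to obtain the updated first parameter value $\omega_{t+1}$, where $\omega_{t+1} = \omega_t - \alpha\, \delta_t\, d_{\omega,t}$ and $\alpha$ is the preset first learning rate.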
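The value network update can be sketched as below, assuming a linear value function $q(s, a; \omega) = \omega \cdot \phi(s, a)$ in place of a neural network, so that the derivative with respect to $\omega$ is simply the feature vector $\phi(s_t, a_t)$. The feature vectors, the reward, the learning rate and the discount factor are illustrative values, and the function names are hypothetical.

```python
def q_value(omega, features):
    """Linear stand-in for the value network: q(s, a; omega) = omega . phi(s, a)."""
    return sum(w * x for w, x in zip(omega, features))


def critic_update(omega, feat_t, feat_next, reward, alpha=0.1, gamma=0.9):
    """One temporal difference update of omega; returns (new_omega, td_error)."""
    q_t = q_value(omega, feat_t)
    q_next = q_value(omega, feat_next)
    # Temporal difference error: delta_t = q_t - (r_t + gamma * q_{t+1})
    td_error = q_t - (reward + gamma * q_next)
    # For a linear q, d q / d omega is feat_t, so omega_{t+1} = omega_t - alpha * delta_t * feat_t
    new_omega = [w - alpha * td_error * x for w, x in zip(omega, feat_t)]
    return new_omega, td_error


omega = [0.0, 0.0]
feat_t = [1.0, 0.0]     # hypothetical phi(s_t, a_t)
feat_next = [0.0, 1.0]  # hypothetical phi(s_{t+1}, predicted a_{t+1})
reward = -2.0           # rewards are negative because they are negated delays

omega_1, delta_1 = critic_update(omega, feat_t, feat_next, reward)
_, delta_2 = critic_update(omega_1, feat_t, feat_next, reward)
print(omega_1, delta_1, delta_2)
```

Applying the update once moves $q(s_t, a_t; \omega)$ toward the target $r_t + \gamma q_{t+1}$, so the error on the same transition shrinks from `delta_1` to `delta_2`.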

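The derivative value $d_{\theta,t}$ used in the policy gradient of the state value function is obtained from the policy network, where

$$d_{\theta,t} = \left.\frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$$

According to the first evaluation value $q_t$ and the derivative value $d_{\theta,t}$, the second parameter value $\theta_t$ of the policy network is updated by stochastic gradient ascent to obtain the updated second parameter value $\theta_{t+1}$, where $\theta_{t+1} = \theta_t + \beta\, q_t\, d_{\theta,t}$ and $\beta$ is the preset second learning rate.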
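The policy network update can be sketched under the assumption of a linear softmax policy, for which the derivative of $\ln \pi(a_t \mid s_t; \theta)$ with respect to the weight row of candidate set $k$ is $(\mathbb{1}[k = a_t] - \pi(k \mid s_t; \theta))\, s_t$. The patent itself uses a neural network, so this is only an illustration; the state, learning rate and function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, state):
    """Linear softmax stand-in for pi(a | s; theta): one weight row per candidate set."""
    return softmax([sum(w * x for w, x in zip(row, state)) for row in theta])


def actor_update(theta, state, action, q_t, beta=0.05):
    """One stochastic gradient ascent step: theta_{t+1} = theta_t + beta * q_t * d_theta."""
    probs = policy(theta, state)
    new_theta = []
    for k, row in enumerate(theta):
        indicator = 1.0 if k == action else 0.0
        # d ln pi(a | s; theta) / d theta_k = (1[k == a] - pi(k | s; theta)) * s
        grad_row = [(indicator - probs[k]) * x for x in state]
        new_theta.append([w + beta * q_t * g for w, g in zip(row, grad_row)])
    return new_theta


state = [1.0, 0.5]
theta_0 = [[0.0, 0.0] for _ in range(3)]  # three candidate sets, uniform policy at start
chosen = 1

p_before = policy(theta_0, state)[chosen]
p_up = policy(actor_update(theta_0, state, chosen, q_t=2.0), state)[chosen]
p_down = policy(actor_update(theta_0, state, chosen, q_t=-2.0), state)[chosen]
print(p_before, p_up, p_down)
```

A positive evaluation value raises the probability of the chosen set and a negative one lowers it, which is how the evaluation from the value network steers the policy network.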

Example 5:

In order to determine the reward value for the original actuator discriminator model in each round of training, on the basis of the above embodiments, in an embodiment of the present invention, the determining the reward value for the original actuator discriminator model according to the third sample characteristic value includes:

In the embodiment of the invention, according to the first time value of the average vehicle delay time and the first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, the product value of the first time value and the first quantity value is determined for each phase, and the product values of all phases are added to obtain a first sum value; the first quantity values of all phases are added to obtain a second sum value; and the negative of the ratio of the first sum value to the second sum value is determined as the first reward value corresponding to the target intersection.

According to the second time value of the average vehicle delay time and the second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, the product value of the second time value and the second quantity value is determined for each phase, and the product values of all phases are added to obtain a third sum value; the second quantity values of all phases are added to obtain a fourth sum value; and the negative of the ratio of the third sum value to the fourth sum value is determined as the second reward value corresponding to the adjacent downstream intersection.

The fifth sum value, obtained by adding the first reward value and the second reward value, is determined as the reward value of the original actuator discriminator model.
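The reward calculation can be sketched as follows. Each intersection's reward is the negative flow-weighted mean of the per-phase average delay, with the denominator taken as the sum of the arrival flows over all phases, as stated in claim 8. The function names, the sample numbers and the handling of zero total flow are assumptions made for illustration.

```python
def intersection_reward(delays, flows):
    """Negative flow-weighted mean delay over the phases of one intersection."""
    total_flow = sum(flows)
    if total_flow == 0:
        return 0.0  # assumption: no arriving vehicles means no delay penalty
    weighted_delay = sum(d * f for d, f in zip(delays, flows))
    return -weighted_delay / total_flow


def joint_reward(target_delays, target_flows, downstream_delays, downstream_flows):
    """Sum of the target intersection reward and the adjacent downstream intersection reward."""
    return (intersection_reward(target_delays, target_flows)
            + intersection_reward(downstream_delays, downstream_flows))


# Hypothetical per-phase values for one preset control period.
target_delays, target_flows = [10.0, 20.0], [30, 10]        # seconds, vehicles
downstream_delays, downstream_flows = [8.0, 4.0], [10, 10]  # seconds, vehicles

reward = joint_reward(target_delays, target_flows, downstream_delays, downstream_flows)
print(reward)  # -12.5 + -6.0 = -18.5
```

Because the downstream term enters the reward directly, a signal plan that clears the target intersection by flooding the next one is penalized.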

A process of determining the reward value of the original actuator discriminator model is described below through a specific embodiment. For each target intersection of the traffic trunk, when the actuator discriminator model is trained, not only the reward value of the target intersection $m$ but also the reward value of the adjacent downstream intersection $n$ needs to be considered.

In calculating the reward value for the actuator discriminator model, a joint reward function is used, where the joint reward function is

$$r_t = -\frac{\sum_i d_{i,t}^{m}\, f_{i,t}^{m}}{\sum_i f_{i,t}^{m}} - \frac{\sum_i d_{i,t}^{n}\, f_{i,t}^{n}}{\sum_i f_{i,t}^{n}}$$

where $d_{i,t}^{m}$ represents the first time value of the average vehicle delay time of the $i$-th phase of the target intersection $m$ in the preset control period after the current time $t$;

$f_{i,t}^{m}$ represents the first quantity value of the vehicle arrival flow of the $i$-th phase of the target intersection $m$ in the preset control period after the current time $t$;

$d_{i,t}^{n}$ represents the second time value of the average vehicle delay time of the $i$-th phase of the adjacent downstream intersection $n$ in the preset control period after the current time $t$; and

$f_{i,t}^{n}$ represents the second quantity value of the vehicle arrival flow of the $i$-th phase of the adjacent downstream intersection $n$ in the preset control period after the current time $t$.

Example 6:

fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention, where the traffic signal control device includes:

an obtaining module 401, configured to obtain target feature values collected at a target intersection and an adjacent downstream intersection, where the target feature value is a feature value of a first preset state feature of each phase in a preset control period before a current time, and the first preset state feature includes an arrival flow rate per unit time and a length of a lane occupied by each vehicle when the vehicle is queued for the longest time;

a processing module 402, configured to input the target feature value to a pre-trained actuator discriminator model, and obtain a target probability value of each output set;

the control module 403 is configured to determine a target set of the target intersection according to the target probability value of each set, and control a traffic signal lamp of each phase in a preset time period after the current time of the target intersection according to a parameter value corresponding to a green light duration of each phase in the target set.

Further, the control module is specifically configured to determine the target set of the target intersection by sampling based on the target probability value of each set, where a set with a higher target probability value is more likely to be sampled; or to determine the set with the maximum target probability value as the target set of the target intersection.
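The two selection modes of the control module can be sketched as below. The candidate sets, their green-light durations, the target probability values and the function name `select_target_set` are hypothetical; the standard-library call `random.choices` is used for the weighted draw.

```python
import random


def select_target_set(candidate_sets, target_probabilities, greedy=False, rng=random):
    """Return one candidate set: the most probable one if greedy, else a weighted random draw."""
    if len(candidate_sets) != len(target_probabilities):
        raise ValueError("one target probability per candidate set is required")
    if greedy:
        best = max(range(len(candidate_sets)), key=lambda k: target_probabilities[k])
        return candidate_sets[best]
    return rng.choices(candidate_sets, weights=target_probabilities, k=1)[0]


# Hypothetical candidate sets: green-light duration in seconds for each phase.
CANDIDATE_SETS = [
    {"phase_1": 20, "phase_2": 35},
    {"phase_1": 30, "phase_2": 25},
    {"phase_1": 40, "phase_2": 15},
]
TARGET_PROBABILITIES = [0.15, 0.60, 0.25]

greedy_plan = select_target_set(CANDIDATE_SETS, TARGET_PROBABILITIES, greedy=True)
sampled_plan = select_target_set(CANDIDATE_SETS, TARGET_PROBABILITIES, rng=random.Random(7))
print(greedy_plan, sampled_plan)
```

Taking the maximum gives a deterministic plan for a given state, while sampling occasionally selects a less probable plan.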

Further, the apparatus further comprises:

Example 7:

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and on the basis of the foregoing embodiments, the present application further provides an electronic device including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 complete communication with each other through the communication bus 504;

the memory 503 has stored therein a computer program which, when executed by the processor 501, causes the processor 501 to perform the steps of:

Further, the processor 501 is specifically configured such that determining the target set of the target intersection according to the target probability value of each set includes:

Further, the processor 501 is specifically configured such that the training process of the actuator discriminator model includes:

inputting the parameter value of each parameter in the target sample set into the simulator to control the parameter value of a traffic light of the target intersection to update, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection collected by the simulator in a preset control period after the current moment;

Further, the processor 501 is specifically configured to determine the reward value for the original actuator discriminator model according to the third sample feature value by:

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface 502 is used for communication between the above-described electronic apparatus and other apparatuses.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.

The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.

Example 8:

on the basis of the foregoing embodiments, the present invention further provides a computer-readable storage medium, in which a computer program executable by a processor is stored, and when the program runs on the processor, the processor is caused to execute the following steps:

Further, the training process of the actuator discriminator model comprises:

according to a first time value of average vehicle delay time corresponding to each phase of the target intersection and a first quantity value of vehicle arrival flow contained in the third sample characteristic value, determining a first sum value of a product value of the first time value corresponding to each phase and the first quantity value and a second sum value of the first quantity value corresponding to each phase, and obtaining a first reward value corresponding to the target intersection according to a ratio of the first sum value to the second sum value;

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A traffic signal control method, characterized in that the method is applied to an intelligent device corresponding to each intersection of a traffic trunk, the method comprising:

2. The method of claim 1, wherein determining the target set of the target intersection according to the target probability value of each set comprises:

3. The method of claim 1, wherein the training process of the actuator discriminator model comprises:

4. The method of claim 3, wherein said determining a reward value for the original actuator discriminator model according to the third sample characteristic value comprises:

5. A traffic signal control apparatus, characterized in that the apparatus comprises:

6. The device according to claim 5, wherein the control module is specifically configured to sample and determine the target set of the target intersection based on the target probability value of each set, wherein the higher the target probability value of a set is, the higher the probability of being sampled is; or determining the set with the maximum target probability value as the target set of the target intersection.

7. The apparatus of claim 5, further comprising:

the training module is used for acquiring a target first characteristic value collected by a simulator simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period; inputting the target first characteristic value into an original actuator discriminator model, acquiring a first probability value of each sample set output by the original actuator discriminator model, and determining a target sample set according to the first probability value of each sample set; inputting the parameter value of each parameter in the target sample set into the simulator to control the parameter value of a traffic signal lamp of the target intersection to update, and acquiring a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection collected by the simulator in a preset control period after the current moment; determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminator model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminator model according to the reward value; training the original actuator discriminator model with the updated parameter values according to the updated target first characteristic value, calculating a first probability value of each sample set output by the original actuator discriminator model during each training based on the updated parameter values and the updated target first characteristic value after each training, calculating an expected value of a sample set probability value determining function corresponding to the original actuator discriminator model according to the parameter values of each parameter in each sample set and the corresponding first probability values, and obtaining the trained actuator discriminator model until the expected value is maximum.

8. The device according to claim 7, wherein the training module is specifically configured to determine, according to a first time value of an average vehicle delay time corresponding to each phase of the intersection and a first quantity value of a vehicle arrival flow included in the third sample feature value, a first sum of a product value of the first time value and the first quantity value corresponding to each phase and a second sum of the first quantity value corresponding to each phase, and obtain, according to a ratio of the first sum and the second sum, a first reward value corresponding to the intersection; according to a second time value of the average vehicle delay time corresponding to each phase of the adjacent downstream intersection and a second numerical value of the vehicle arrival flow contained in the third sample characteristic value, determining a third sum of the product value of the second time value corresponding to each phase and the second numerical value and a fourth sum of the second numerical value corresponding to each phase, and according to the ratio of the third sum to the fourth sum, obtaining a second reward value corresponding to the adjacent downstream intersection; and obtaining the reward value of the original actuator discriminator model according to the fifth sum of the first reward value and the second reward value.

9. An electronic device, comprising: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;

the memory has stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the traffic signal control method according to any one of claims 1-4.

10. A computer-readable storage medium, characterized in that it stores a computer program which, when being executed by a processor, carries out the steps of the traffic signal control method according to any one of claims 1-4.