CN114639255B - Traffic signal control method, device, equipment and medium

Info

Publication number: CN114639255B
Application number: CN202210314258.3A
Authority: CN
Inventors: 相强强; 程兴硕; 王泽�; 伍召举
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2023-06-09
Anticipated expiration: 2042-03-28
Also published as: CN114639255A

Abstract

The invention discloses a traffic signal control method, a device, equipment and a medium, wherein the method is used for acquiring target characteristic values acquired at a target intersection and adjacent downstream intersections, acquiring a target probability value of an action control parameter corresponding to the input target characteristic value based on an actuator discriminant model which is trained in advance, determining a target preset action control parameter of the target intersection according to the target probability value, and controlling traffic signals of the target intersection. According to the invention, the target characteristic values of the adjacent downstream intersections are considered when the corresponding intelligent body devices of each intersection determine the target preset action control parameters of the intersection, so that competition among the intelligent body devices of each intersection is reduced, and the overall control effect of the traffic trunk is improved.

Description

Traffic signal control method, device, equipment and medium

Technical Field

The present invention relates to the field of traffic signal control technologies, and in particular, to a traffic signal control method, device, apparatus, and medium.

Background

With population growth and accelerating urbanization, urban travel demand is increasing rapidly, and the existing traffic infrastructure struggles to meet it, causing both periodic and aperiodic congestion of urban traffic. Traffic signal control is the core of urban traffic management and control. A scientific and reasonable signal control scheme can maximize the throughput of intersections, improve the running efficiency of the urban road network and the traffic capacity of the intersections, and reduce the frequency and intensity of traffic conflicts, thereby alleviating urban traffic congestion.

The adaptive traffic signal control schemes in the prior art mainly control traffic signals based on the prediction of a fixed traffic model, the selection of a preset signal control scheme, or the real-time prediction of a traffic simulation model. They are essentially driven by a simulation model: the parameters of the traffic simulation model must be calibrated, and the predefined signal control schemes designed, in advance according to the actual traffic scene. Because the actual traffic environment is dynamic, random and uncertain, such schemes adapt poorly to a dynamic traffic environment.

Most of the multi-point traffic control schemes based on deep reinforcement learning in the prior art simply transfer single-point signal control to a multi-point scene, that is, the intelligent body equipment of each intersection uses the same neural network model trained by deep reinforcement learning. When each intelligent body equipment uses the neural network model, it ensures only that the traffic signal control effect of its own intersection is optimal, yet adjacent intersections influence one another. If the vehicle passing rate is used as the evaluation standard of the control effect, an intelligent body equipment is likely to set the traffic signal lamp of its intersection to green for a preset time period after the current moment without considering the influence on the adjacent downstream intersection, so that the adjacent downstream intersection becomes congested and the overall control effect of the traffic trunk is poor.

Disclosure of Invention

The invention provides a traffic signal control method, a device, equipment and a medium, which are used for solving the problem of poor overall control effect of a traffic trunk in the prior art.

The invention provides a traffic signal control method, which is used for intelligent agent equipment corresponding to each intersection of a traffic trunk, and comprises the following steps:

acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate in unit time and the length of a lane occupied by each vehicle when the vehicles are queued longest;

inputting the target characteristic values into a pre-trained actuator discriminant model to obtain target probability values of each set;

and determining a target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set.

Further, the determining the target set of the target intersection according to the target probability value of each set includes:

sampling to determine the target set of the target intersection based on the target probability value of each set, wherein the larger the target probability value of a set, the higher its probability of being sampled; or,

and determining the set with the maximum target probability value as the target set of the target intersection.

Further, the training process of the actuator discriminant model includes:

acquiring a target first characteristic value acquired by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic value is a first sample characteristic value of a first preset state characteristic of each phase in a preset control period;

inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set;

inputting the parameter value of each parameter in the target sample set into the simulator to control the updating of the parameter values of the traffic signal lamp of the target intersection, and acquiring, from the simulator, a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of a second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment;

determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminant model according to the reward value;

training the original actuator discriminant model with updated parameter values according to the updated target first characteristic value, calculating, after each training, a first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic value, and calculating an expected value of the sample set probability value determining function corresponding to the original actuator discriminant model according to the parameter value of each parameter in each sample set and the corresponding first probability value, until the expected value is maximum, thereby obtaining the trained actuator discriminant model.

Further, the determining a reward value for the original actuator discriminant model from the third sample feature value comprises:

determining, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value of each phase and a second sum value of the first quantity values of each phase, and obtaining a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value;

determining, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value of each phase and a fourth sum value of the second quantity values of each phase, and obtaining a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value;

and obtaining the reward value for the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.
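The reward computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the zero-flow fallback are assumptions, and the text here does not say whether the final sum is negated before it is used as a reward, so the sketch returns the sum exactly as described.

```python
from typing import Sequence


def weighted_average_delay(delays: Sequence[float], flows: Sequence[float]) -> float:
    """Flow-weighted average delay over the phases of one intersection.

    Numerator: sum over phases of (average delay x arrival flow).
    Denominator: sum over phases of arrival flow.
    """
    if len(delays) != len(flows):
        raise ValueError("delays and flows must have one entry per phase")
    total_flow = sum(flows)
    if total_flow == 0:
        # Assumption (not in the source): no arrivals means a zero delay term.
        return 0.0
    return sum(d * f for d, f in zip(delays, flows)) / total_flow


def combined_reward(
    target_delays: Sequence[float],
    target_flows: Sequence[float],
    downstream_delays: Sequence[float],
    downstream_flows: Sequence[float],
) -> float:
    """Sum of the target-intersection term and the downstream-intersection term."""
    first = weighted_average_delay(target_delays, target_flows)
    second = weighted_average_delay(downstream_delays, downstream_flows)
    return first + second
```

Because the downstream term enters the same sum as the target term, a green-time choice that floods the downstream intersection changes the value seen by the upstream device.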

Accordingly, the present invention provides a traffic signal control apparatus, the apparatus comprising:

an acquisition module, configured to acquire target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise an arrival flow rate per unit time and the lane length occupied by each vehicle when the vehicle queue is longest;

a processing module, configured to input the target characteristic values into a pre-trained actuator discriminant model and acquire the target probability value of each set output by the model;

and a control module, configured to determine a target set of the target intersection according to the target probability value of each set, and to control the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.

Further, the control module is specifically configured to sample and determine a target set of the target intersection based on the target probability value of each set, where the greater the target probability value of the set, the higher the probability of being sampled; or determining the set with the maximum target probability value as the target set of the target intersection.

Further, the apparatus further comprises:

the training module is used for acquiring target first characteristic values acquired by a simulator for simulating the target intersection and the adjacent downstream intersection, wherein the target first characteristic values are first sample characteristic values of first preset state characteristics of each phase in a preset control period; inputting the target first characteristic value into an original actuator discriminant model, acquiring a first probability value of each sample set output by the original actuator discriminant model, and determining a target sample set according to the first probability value of each sample set; inputting the parameter value of each parameter in the target sample set into the simulator to control the updating of the parameter values of the traffic signal lamp of the target intersection, and acquiring, from the simulator, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in a preset control period after the current moment; determining the second sample characteristic value as an updated target first characteristic value, determining a reward value for the original actuator discriminant model according to the third sample characteristic value, and updating a parameter value in the original actuator discriminant model according to the reward value; training the original actuator discriminant model with updated parameter values according to the updated target first characteristic values, calculating, after each training, a first probability value of each sample set output by the original actuator discriminant model based on the updated parameter values and the updated target first characteristic values, and calculating expected values of the sample set probability value determining function corresponding to the original actuator discriminant model according to the parameter values of each parameter in each sample set and the corresponding first probability values until the expected values are maximum, thereby obtaining the trained actuator discriminant model.

Further, the training module is specifically configured to determine, according to a first time value of the average vehicle delay time and a first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, a first sum value of the products of the first time value and the first quantity value of each phase and a second sum value of the first quantity values of each phase, and to obtain a first reward value corresponding to the target intersection according to the ratio of the first sum value to the second sum value; to determine, according to a second time value of the average vehicle delay time and a second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, a third sum value of the products of the second time value and the second quantity value of each phase and a fourth sum value of the second quantity values of each phase, and to obtain a second reward value corresponding to the adjacent downstream intersection according to the ratio of the third sum value to the fourth sum value; and to obtain the reward value for the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.

Accordingly, the present invention provides an electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;

the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of any one of the above-described traffic signal control methods.

Accordingly, the present invention provides a computer readable storage medium storing a computer program which when executed by a processor performs the steps of any one of the above-described traffic signal control methods.

The invention provides a traffic signal control method, a device, equipment and a medium, wherein the method is used for acquiring target characteristic values acquired at a target intersection and adjacent downstream intersections, acquiring a target probability value of an action control parameter corresponding to the input target characteristic value based on an actuator discriminant model which is trained in advance, determining a target preset action control parameter of the target intersection according to the target probability value, and controlling traffic signals of the target intersection. According to the invention, the target characteristic values of the adjacent downstream intersections are considered when the corresponding intelligent body devices of each intersection determine the target preset action control parameters of the intersection, so that competition among the intelligent body devices of each intersection is reduced, and the overall control effect of the traffic trunk is improved.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention;

Fig. 2 is a schematic illustration of a traffic trunk provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of training an actuator discriminant model according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to improve the overall control effect of a traffic trunk, the embodiment of the invention provides a traffic signal control method, a device, equipment and a medium.

Example 1:

Fig. 1 is a schematic process diagram of a traffic signal control method according to an embodiment of the present invention, where the process includes the following steps:

S101: acquiring target characteristic values acquired at a target intersection and an adjacent downstream intersection, wherein the target characteristic values are characteristic values of first preset state characteristics of each phase in a preset control period before the current moment, and the first preset state characteristics comprise the arrival flow rate per unit time and the lane length occupied by each vehicle when the vehicle queue is longest.

The traffic signal control method provided by the embodiment of the invention is applied to the intelligent body equipment corresponding to each intersection of the traffic trunk. The intelligent body equipment can be an intelligent terminal for controlling the traffic signal lamp, such as a PC (personal computer), a tablet computer or a mobile terminal, or a server for controlling the traffic signal lamp; the server can be a local server, a cloud server or a traffic signal lamp controller. The embodiments of the present invention do not specifically limit this.

In order to improve the overall control effect of the traffic trunk, the intelligent body equipment acquires target characteristic values acquired at a target intersection and an adjacent downstream intersection. The target intersection is the intersection corresponding to the intelligent body equipment, that is, the intersection where the traffic signal lamp controlled by the intelligent body equipment is located, and the adjacent downstream intersection is the intersection adjacent to the target intersection and located downstream of it in the direction of traffic. Fig. 2 is a schematic diagram of a traffic trunk according to an embodiment of the present invention. Taking vehicles travelling from left to right in Fig. 2 as an example, if the intersection on the left of Fig. 2 is the target intersection, the intersection in the middle of Fig. 2 is the adjacent downstream intersection.

Specifically, the intelligent body equipment may acquire the target characteristic value corresponding to the target intersection from an image acquisition device connected with the intelligent body equipment, and the target characteristic value corresponding to the downstream intersection from an image acquisition device connected with the adjacent downstream intersection. Alternatively, it may acquire the target characteristic value corresponding to the target intersection through its own image acquisition unit, and receive the target characteristic value corresponding to the downstream intersection from the intelligent body equipment corresponding to the adjacent downstream intersection.

The target characteristic value is the characteristic value of the first preset state characteristic of each phase in the preset control period before the current moment. The preset control period is the maximum of the optimal periods determined for each intersection of the traffic trunk according to the Webster optimal period formula. The first preset state characteristic comprises the arrival flow rate per unit time and the lane length occupied by each vehicle when the vehicle queue is longest: the arrival flow rate per unit time is the ratio of the vehicle arrival flow through the intersection in the preset control period before the current moment to the preset control period, and the lane length occupied by each vehicle when the vehicle queue is longest is the ratio of the maximum queuing length in the preset control period before the current moment to the corresponding number of vehicles.
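The quantities defined above can be illustrated with a short sketch. It assumes the commonly cited textbook form of the Webster optimal cycle formula, C0 = (1.5 L + 5) / (1 - Y), where L is the total lost time per cycle in seconds and Y is the sum of the critical flow ratios; the source does not state which variant it uses. All function and parameter names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def webster_optimal_cycle(lost_time_s: float, critical_flow_ratios: Sequence[float]) -> float:
    """Webster optimal cycle length C0 = (1.5 * L + 5) / (1 - Y), in seconds."""
    y = sum(critical_flow_ratios)
    if not 0.0 <= y < 1.0:
        raise ValueError("sum of critical flow ratios must lie in [0, 1)")
    return (1.5 * lost_time_s + 5.0) / (1.0 - y)


def preset_control_period(intersections: Iterable[Tuple[float, Sequence[float]]]) -> float:
    """Largest optimal cycle among the intersections of the trunk.

    Each item is (lost_time_s, critical_flow_ratios) for one intersection.
    """
    return max(webster_optimal_cycle(lost, ratios) for lost, ratios in intersections)


def phase_features(
    arrival_count: float,
    control_period_s: float,
    max_queue_length_m: float,
    vehicles_in_max_queue: int,
) -> Tuple[float, float]:
    """Return (arrival flow rate per unit time, lane length per queued vehicle)."""
    flow_rate = arrival_count / control_period_s
    # Assumption: an empty queue contributes a per-vehicle length of 0.
    per_vehicle = max_queue_length_m / vehicles_in_max_queue if vehicles_in_max_queue else 0.0
    return flow_rate, per_vehicle
```

For instance, 30 arrivals in a 60 s period with a 42 m maximum queue of 6 vehicles would give a flow rate of 0.5 vehicles per second and 7 m of lane per vehicle.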

For example, denoting the target intersection as m and the adjacent downstream intersection as n, the target characteristic values of the target intersection and the adjacent downstream intersection are

$$s = [s_m, s_n]$$

where

$$s_m = [q_m^1, l_m^1, \ldots, q_m^i, l_m^i, \ldots], \qquad s_n = [q_n^1, l_n^1, \ldots, q_n^i, l_n^i, \ldots], \qquad l_m^i = \frac{L_m^i}{N_m^i}, \qquad l_n^i = \frac{L_n^i}{N_n^i}.$$

In the target characteristic value, $L_m^i$ represents the maximum queuing length of the ith phase of the target intersection m in the preset control period before the current moment, $N_m^i$ represents the number of vehicles corresponding to the maximum queuing length of the ith phase of the target intersection m in the preset control period before the current moment, $l_m^i$ represents the lane length occupied by each vehicle when the vehicle queue of the ith phase of the target intersection m is longest in the preset control period before the current moment, and $q_m^i$ represents the arrival flow rate per unit time of the ith phase of the target intersection m in the preset control period before the current moment.

Likewise, $L_n^i$ represents the maximum queuing length of the ith phase of the adjacent downstream intersection n in the preset control period before the current moment, $N_n^i$ represents the number of vehicles corresponding to the maximum queuing length of the ith phase of the adjacent downstream intersection n in the preset control period before the current moment, $l_n^i$ represents the lane length occupied by each vehicle when the vehicle queue of the ith phase of the adjacent downstream intersection n is longest in the preset control period before the current moment, and $q_n^i$ represents the arrival flow rate per unit time of the ith phase of the adjacent downstream intersection n in the preset control period before the current moment.

S102: inputting the target characteristic values into a pre-trained actuator discriminant model, and obtaining the target probability value of each set output by the model.

In order to realize the control of traffic signals, in the embodiment of the invention, the intelligent body equipment stores a pre-trained actuator discriminant model, wherein the actuator discriminant model is pre-trained to determine a target probability value of each set according to target characteristic values acquired by a target intersection and an adjacent downstream intersection, wherein the set comprises parameter values of action control parameters of traffic signals of the target intersection, and the action control parameters comprise green signal ratio of each phase of the traffic signals, namely the ratio of green light duration to duration of a preset control period.

For example, the action control parameters are denoted by $a_m$, with

$$a_m = [g_m^1, \ldots, g_m^i, \ldots]$$

where $g_m^i$ represents the ratio of the green light duration of the ith phase of the target intersection m to the duration of the preset control period.

After the target characteristic values acquired at the target intersection and the adjacent downstream intersection are obtained, the target characteristic values are input into the pre-trained actuator discriminant model, and the target probability value of each set is obtained through the processing of the actuator discriminant model.

S103: determining a target set of the target intersection according to the target probability value of each set, and controlling the traffic signal lamp of each phase of the target intersection in a preset time period after the current moment according to the parameter value corresponding to the green light duration of each phase in the target set.

After determining the target probability value of each set, determining the target set of the target intersection according to the target probability value of each set. Specifically, the set corresponding to the median value of the target probability values may be determined as the target set, the set corresponding to the maximum value of the target probability values may be determined as the target set, or the target set may be selected from each set by adopting other determining methods, which is not limited in the embodiment of the present invention.

After the target set is determined, controlling the traffic signal lamp of each phase in a preset time period after the current moment of the target intersection according to the parameter value corresponding to the green light duration of each phase in the target set. In the embodiment of the invention, the traffic signal lamp of the target intersection is controlled once every preset control period, and each control needs to determine the target set in the next control period according to the target characteristic value acquired in the previous preset control period.
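One control step per preset control period can be sketched as follows. This is an illustrative outline only: `model` stands in for the pre-trained actuator discriminant model and is assumed to be any callable that maps the target characteristic values to one probability per candidate set; the maximum-probability rule is just one of the selection options named above; and the check that the green ratios sum to at most 1 is an assumption.

```python
from typing import Callable, List, Sequence


def green_durations(green_ratios: Sequence[float], control_period_s: float) -> List[float]:
    """Convert per-phase green ratios into green light durations in seconds."""
    if any(r < 0 for r in green_ratios) or sum(green_ratios) > 1.0 + 1e-9:
        # Assumption: ratios are non-negative and together fit in one period.
        raise ValueError("green ratios must be non-negative and sum to at most 1")
    return [r * control_period_s for r in green_ratios]


def control_step(
    features: Sequence[float],
    model: Callable[[Sequence[float]], Sequence[float]],
    candidate_sets: Sequence[Sequence[float]],
    control_period_s: float,
) -> List[float]:
    """Pick the candidate set with the largest probability and return its green times."""
    probabilities = model(features)
    if len(probabilities) != len(candidate_sets):
        raise ValueError("model must return one probability per candidate set")
    best = max(range(len(candidate_sets)), key=lambda k: probabilities[k])
    return green_durations(candidate_sets[best], control_period_s)
```

The returned durations would then be applied to the signal lamps for the next control period, and the step repeated with freshly collected characteristic values.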

In the embodiment of the invention, the traffic signal control method provided by the invention acquires the target characteristic values acquired at the target intersection and the adjacent downstream intersection, acquires the target probability value of the action control parameter corresponding to the input target characteristic value based on the pre-trained actuator discriminant model, determines the target preset action control parameter of the target intersection according to the target probability value, and controls the traffic signal of the target intersection. According to the invention, the target characteristic values of the adjacent downstream intersections are considered when the corresponding intelligent body devices of each intersection determine the target preset action control parameters of the intersection, so that competition among the intelligent body devices of each intersection is reduced, and the overall control effect of the traffic trunk is improved.

Example 2:

In order to determine the target set of the target intersection, on the basis of the foregoing embodiment, in this embodiment of the present invention, determining the target set of the target intersection according to the target probability value of each set includes:

sampling to determine the target set of the target intersection based on the target probability value of each set. The sets are not sampled with equal probability: the larger the target probability value of a set, the higher its probability of being sampled, and the smaller the target probability value of a set, the lower its probability of being sampled.

As a possible implementation manner, in the embodiment of the present invention, a set with a maximum target probability value may also be determined as a target set of the target intersection according to the target probability value of each set.
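The two selection rules of this embodiment can be sketched side by side. The function names are hypothetical, and the sketch uses the Python standard library's weighted sampling as one possible way to make the sampling probability grow with the target probability value.

```python
import random
from typing import Optional, Sequence


def select_by_sampling(probabilities: Sequence[float], rng: Optional[random.Random] = None) -> int:
    """Sample a set index; sets with larger probability values are drawn more often."""
    rng = rng or random.Random()
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def select_greedy(probabilities: Sequence[float]) -> int:
    """Return the index of the set with the maximum probability value."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])
```

Sampling keeps some exploration among lower-probability sets, whereas the maximum rule always returns the same set for the same probabilities.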

Example 3:

In order to train the actuator discriminant model, on the basis of the above embodiments, in an embodiment of the present invention, a training process of the actuator discriminant model includes:

In order to train the actuator discriminant model, in the embodiment of the present invention, simulators simulating the target intersection and the adjacent downstream intersection are stored in advance, and the simulators are specifically used for simulating the change of the traffic state of the intersection.

The method comprises the steps of obtaining target first characteristic values collected by simulators for simulating a target intersection and adjacent downstream intersections, wherein the target first characteristic values are first sample characteristic values of first preset state characteristics of each phase in a preset control period, and the target first characteristic values comprise first sample characteristic values corresponding to the target intersection and first sample characteristic values corresponding to the adjacent downstream intersections.

Specifically, when the actuator discriminant model is trained for the first time, the target first characteristic value is an initial characteristic value pre-stored in the simulator; in each subsequent training, it is the characteristic value obtained after the simulator has simulated the target intersection and the adjacent downstream intersection.

After the target first characteristic value acquired by the simulator is obtained, it is input into the original actuator discriminant model, and the first probability value of each sample set output by the original actuator discriminant model is obtained, where each sample set is preset. Sampling is then performed according to the first probability value of each sample set to determine the target sample set, where the larger the first probability value of a sample set, the larger its probability of being sampled.

The parameter value of each parameter in the target sample set is input into the simulator to control the updating of the parameter values of the traffic signal lamp of the target intersection simulated in the simulator, so that they are updated to the parameter values of the corresponding parameters of each phase in the preset control period.

After the simulator has simulated one preset control period of the target intersection and the adjacent downstream intersection after the current moment, the second sample characteristic value of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic of each phase of the target intersection and the adjacent downstream intersection in that preset control period, as acquired by the simulator, are obtained. The second preset state characteristic includes the average vehicle delay time and the vehicle arrival flow.

To prepare for the next round of training, the second sample characteristic value is determined as the updated target first characteristic value, a reward value for the original actuator discriminant model is determined according to the third sample characteristic value, and the parameter values of the original actuator discriminant model are updated according to the reward value using a temporal-difference learning algorithm. The reward value comprises a reward value resulting from the transition of the second preset state in the environment state of the target intersection and a reward value resulting from the transition of the second preset state in the environment state of the adjacent downstream intersection.

The original actuator discriminant model with the updated parameter values is then trained according to the updated target first characteristic value, that is, the above steps are repeated with the updated target first characteristic value. In each round of training, the expected value of the probability-value determining function corresponding to the original actuator discriminant model is calculated based on the updated parameter values and the updated target first characteristic value. Training ends when the expected value reaches its maximum, yielding the trained actuator discriminant model.

Specifically, for each parameter, the products of the parameter value in each sample set output in a round of training and the corresponding first probability value are summed, and the expected value comprising the sum corresponding to each parameter is determined.
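As a small worked illustration of the expected-value calculation described above, the following sketch sums, for each parameter, the products of its value in every sample set and that set's probability. The sample sets, the probabilities and the function name `expected_parameter_values` are assumptions chosen for the example.

```python
def expected_parameter_values(sample_sets, probabilities):
    """Per-parameter expectation: sum over sets of probability * parameter value."""
    names = list(sample_sets[0].keys())
    return {
        name: sum(p * s[name] for s, p in zip(sample_sets, probabilities))
        for name in names
    }


# Hypothetical sample sets (green-light seconds per phase) and probabilities.
sample_sets = [
    {"phase_1": 20, "phase_2": 30},
    {"phase_1": 25, "phase_2": 25},
    {"phase_1": 30, "phase_2": 20},
]
probabilities = [0.2, 0.5, 0.3]

expected = expected_parameter_values(sample_sets, probabilities)
# phase_1: 0.2*20 + 0.5*25 + 0.3*30, phase_2: 0.2*30 + 0.5*25 + 0.3*20
print(expected)
```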

Example 4:

To train the actuator discriminant model, in the embodiment of the invention, for each round of training corresponding to each preset control period, the target first characteristic value collected in that round by the simulator simulating the target intersection and the adjacent downstream intersection is obtained and input into the policy network that serves as the actuator in the original actuator discriminant model. The policy network outputs a first probability value of each sample set according to its corresponding state value function, which determines the sample set probability values, and according to the first evaluation value provided by the value network, serving as the discriminator, in the previous round of training. The target sample set is determined according to the first probability value of each sample set.

The target first characteristic value and the parameter value of each parameter in the target sample set are input into the value network that serves as the discriminator in the original actuator discriminant model, and a first evaluation value of the value network for the target sample set is obtained.

The parameter value of each parameter in the target sample set is input into the simulator to update the parameter values of the traffic signal lamps of the target intersection, and a second sample characteristic value of the first preset state characteristic and a third sample characteristic value of the second preset state characteristic, collected by the simulator for each phase of the target intersection and the adjacent downstream intersection in the preset control period after the current moment, are obtained. A reward value for the value network in the original actuator discriminant model is determined according to the third sample characteristic value.

The second sample characteristic value is input into the policy network that serves as the actuator in the original actuator discriminant model, a predicted first probability value of each sample set output by the policy network is obtained, and a predicted target sample set is determined according to the predicted first probability value of each sample set. The second sample characteristic value and the parameter value of each parameter in the predicted target sample set are input into the value network that serves as the discriminator in the original actuator discriminant model, and a second evaluation value of the value network for the predicted target sample set is obtained.

A temporal-difference error value is determined using a temporal-difference algorithm according to the first evaluation value, the reward value and the second evaluation value. The derivative value of the action value function corresponding to the value network with respect to the first parameter value of the value network is determined, the product value of this derivative value and the temporal-difference error value is determined, and the first parameter value is updated according to that product value. Specifically, the product of that product value and a preset first learning rate is subtracted from the first parameter value to obtain the updated first parameter value.
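The value-network update described above can be illustrated with a deliberately simple stand-in. The sketch below assumes a linear value model q(s, a; ω) = ω · φ(s, a), so that the derivative with respect to ω is just the feature vector φ; the feature vectors, the discount factor `gamma` and the learning rate `alpha` are illustrative assumptions, and the patent's value network is a neural network rather than a linear model.

```python
def q_value(omega, features):
    """Linear stand-in for the value network: q = omega . features."""
    return sum(w * x for w, x in zip(omega, features))


def critic_update(omega, feat_t, feat_next, reward, alpha=0.1, gamma=0.9):
    """One temporal-difference step on the value parameters omega."""
    q_t = q_value(omega, feat_t)           # first evaluation value
    q_next = q_value(omega, feat_next)     # second evaluation value
    td_error = q_t - (reward + gamma * q_next)
    gradient = feat_t                      # dq/d(omega) for the linear model
    new_omega = [w - alpha * td_error * g for w, g in zip(omega, gradient)]
    return new_omega, td_error


# Made-up features for (state, sample set) pairs and a made-up reward.
omega = [0.0, 0.0]
feat_t, feat_next = [1.0, 0.0], [0.0, 1.0]
omega_1, delta_1 = critic_update(omega, feat_t, feat_next, reward=-2.0)
omega_2, delta_2 = critic_update(omega_1, feat_t, feat_next, reward=-2.0)
print(delta_1, delta_2)  # the error shrinks as q_t moves toward the target
```

Subtracting the scaled product moves the first evaluation value toward the target formed by the reward value plus the discounted second evaluation value.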

The second parameter value of the policy network is updated by stochastic gradient ascent according to the first evaluation value and the derivative value of the state value function corresponding to the policy network with respect to the second parameter of the policy network. Specifically, the product value of the first evaluation value and the derivative value is determined, and the product of that product value and a preset second learning rate is added to the second parameter value to obtain the updated second parameter value.
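A minimal sketch of this policy update follows, assuming a softmax policy whose logits are linear in the state features. That linear form, the learning rate `beta` and the example numbers are assumptions made for illustration; for a softmax policy, the derivative of ln π(a|s; θ) with respect to row k of θ is (1[k = a] − π_k) · s.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(theta, state):
    """Probability of each preset sample set under a linear-softmax policy."""
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)


def actor_update(theta, state, action, evaluation, beta=0.05):
    """theta <- theta + beta * evaluation * d ln pi(action|state) / d theta."""
    probs = policy_probs(theta, state)
    new_theta = []
    for k, row in enumerate(theta):
        coeff = (1.0 if k == action else 0.0) - probs[k]
        new_theta.append([w + beta * evaluation * coeff * x
                          for w, x in zip(row, state)])
    return new_theta


theta = [[0.0, 0.0] for _ in range(3)]    # three sample sets, two state features
state = [1.0, 0.5]
before = policy_probs(theta, state)[1]
after_pos = policy_probs(actor_update(theta, state, 1, evaluation=2.0), state)[1]
after_neg = policy_probs(actor_update(theta, state, 1, evaluation=-2.0), state)[1]
print(before, after_pos, after_neg)
```

A positive evaluation value raises the probability of the sampled set and a negative one lowers it, which is how the evaluation from the value network acts as the supervisory signal.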

The second sample characteristic value is determined as the updated target first characteristic value, and the original actuator discriminant model with the updated first and second parameter values is trained again. After each round of training, the first probability value of each sample set output by the original actuator discriminant model is calculated based on the updated parameter values and the updated target first characteristic value, and the expected value of the state value function, which determines the sample set probability values and corresponds to the actuator in the original actuator discriminant model, is calculated from the parameter value of each parameter in each sample set and the corresponding first probability value. Training ends when the expected value reaches its maximum, yielding the trained actuator discriminant model.

In the embodiment of the invention, the actuator discriminant model uses the Actor-Critic algorithm to perform cooperative optimization control of the traffic signals of a traffic trunk. The Actor-Critic algorithm is a policy learning method that approximates the policy function $\pi(a \mid s)$ with a neural network. In the embodiment of the invention, the neural network approximating the policy function is the policy network, denoted $\pi(a \mid s; \theta)$, which serves as the actuator in the actuator discriminant model.

The input of the policy network is the characteristic value $s$ of the first preset state characteristic of each phase in the preset control period before the current moment, collected at the target intersection and the adjacent downstream intersection. Its output is a probability distribution over the sets $a$ of parameter values of the action control parameters, which satisfies $\sum_{a \in A} \pi(a \mid s; \theta) = 1$.
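To make the input/output contract of the policy network concrete, the sketch below passes a made-up state vector through a tiny one-hidden-layer network and normalizes the output with softmax, so that the probabilities over the preset sets sum to 1. The layer sizes, the random initialization, the class name `PolicyNetwork` and the state values are all assumptions; the patent does not specify the network architecture.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class PolicyNetwork:
    """Tiny illustrative network: state features -> probability per sample set."""

    def __init__(self, n_features, n_hidden, n_sets, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_features)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_sets)]

    def forward(self, state):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, state)))
                  for row in self.w1]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
        return softmax(logits)


# Hypothetical normalized state: arrival flow and per-vehicle queue length for
# two phases at the target intersection, then at the downstream intersection.
state = [0.4, 0.7, 0.2, 0.9, 0.5, 0.3, 0.6, 0.1]
policy = PolicyNetwork(n_features=len(state), n_hidden=4, n_sets=3)
probs = policy.forward(state)
print(probs, sum(probs))
```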

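The state value function corresponding to the policy function can be approximately expressed as $V_\pi(s_t; \theta) = \sum_a \pi(a \mid s_t; \theta)\, Q_\pi(s_t, a)$. In the embodiment of the invention, the parameter $\theta$ is updated by stochastic gradient ascent, where $\frac{\partial V_\pi(s; \theta)}{\partial \theta}$ is known as the policy gradient.

The policy gradient may be represented by the following formula:

$$\frac{\partial V_\pi(s; \theta)}{\partial \theta} = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)}\left[\frac{\partial \ln \pi(A \mid s; \theta)}{\partial \theta}\, Q_\pi(s, A)\right]$$

Since the policy gradient cannot be calculated directly, it can be approximated by the Monte Carlo method, namely

$$g(a, \theta) = \frac{\partial \ln \pi(a \mid s; \theta)}{\partial \theta}\, Q_\pi(s, a),$$

where the set $a$ is sampled according to $\pi(\cdot \mid s; \theta)$.

Since $Q_\pi(s, a)$ in the policy gradient is unknown, the value network $q(s, a; \omega)$ can be used to approximate the action value function, where the value network $q(s, a; \omega)$ is a neural network.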
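The Monte Carlo approximation of the policy gradient can be checked numerically on a toy case. The sketch below assumes a single state, a softmax policy with one logit per set, and known action values `q_values` (all made up); it compares the exact gradient of V = Σ π(a) Q(a) with the average of sampled terms (∂ ln π(a)/∂θ) · Q(a).

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_policy_gradient(theta, q_values):
    """dV/d(theta_k) = pi_k * (Q_k - V) for pi = softmax(theta)."""
    pi = softmax(theta)
    v = sum(p * q for p, q in zip(pi, q_values))
    return [p * (q - v) for p, q in zip(pi, q_values)]


def sampled_gradient(theta, q_values, action):
    """One Monte Carlo term: d ln pi(action)/d(theta) * Q(action)."""
    pi = softmax(theta)
    return [((1.0 if k == action else 0.0) - pi[k]) * q_values[action]
            for k in range(len(theta))]


theta = [0.1, -0.2, 0.3]          # made-up policy parameters
q_values = [-3.0, -1.0, -2.0]     # made-up action values (negative delays)
rng = random.Random(42)
n_samples = 20_000

pi = softmax(theta)
estimate = [0.0] * len(theta)
for _ in range(n_samples):
    a = rng.choices(range(len(theta)), weights=pi, k=1)[0]
    g = sampled_gradient(theta, q_values, a)
    estimate = [e + x / n_samples for e, x in zip(estimate, g)]

exact = exact_policy_gradient(theta, q_values)
print(exact)
print(estimate)
```

A single sampled term is noisy, but its average approaches the exact gradient, which is why one sample per training round can be used for stochastic gradient ascent.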

The policy network $\pi(a \mid s; \theta)$ is called the actor, and the value network $q(s, a; \omega)$ is called the critic. When the policy network $\pi(a \mid s; \theta)$ is learned, the supervisory signal is the evaluation value $q$ provided by the value network $q(s, a; \omega)$; when the value network $q(s, a; \omega)$ is learned, the supervisory signal is the reward value $r$.

When the actuator discriminant model is trained, the parameters $\theta$ and $\omega$ are updated simultaneously. The policy gradient is used to update the parameter $\theta$ of the policy network so as to increase the state value function $V_\pi(s_t; \theta)$ and thereby improve the accuracy of the output probability values. The parameter $\omega$ of the value network is updated using a temporal-difference algorithm so that the output evaluation value $q$ becomes more accurate.

Fig. 3 is a schematic diagram of training the actuator discriminant model according to an embodiment of the present invention. As shown in Fig. 3, the actuator discriminant model includes a policy network serving as the actuator and a value network serving as the discriminator, and a simulator is used to simulate the traffic state in the environment of the target intersection and the adjacent downstream intersection. The value network provides the evaluation value $q$ to the policy network, the simulator provides the reward value and outputs the characteristic value of the first preset state characteristic of the environment, and the policy network outputs the probability value of each set of action control parameters.

The training of the actuator discriminant model of the present invention is described below with a specific example. The policy network $\pi(a \mid s; \theta)$ and the value network $q(s, a; \omega)$ in the actuator discriminant model are randomly initialized.

The agent device obtains the target characteristic value $s_t$ of the target intersection and the adjacent downstream intersection collected in the simulator; in the first round of training, the target characteristic value $s_t$ is the pre-stored initial characteristic value $s_0$. The target characteristic value $s_t$ is input into the policy network $\pi(a \mid s; \theta)$, and the target sample set $a_t$ is determined by sampling based on the first probability value of each sample set output by the policy network $\pi(a \mid s; \theta)$. The target characteristic value $s_t$ and the parameter value of each parameter in the target sample set $a_t$ are input into the value network $q(s, a; \omega)$ to obtain the first evaluation value $q_t = q(s_t, a_t; \omega_t)$ of the value network for the target sample set.

The parameter value of each parameter in the target sample set $a_t$ is input into the simulator to update the parameter values of the traffic signal lamps of the target intersection, and the second sample characteristic value $s_{t+1}$ of the first preset state characteristic and the third sample characteristic value of the second preset state characteristic, collected by the simulator for each phase of the target intersection and the adjacent downstream intersection in the preset control period after the current moment, are obtained. The reward value $r_t$ for the value network in the original actuator discriminant model is determined from the third sample characteristic value.

The second sample characteristic value $s_{t+1}$ is input into the policy network $\pi(a \mid s; \theta)$ serving as the actuator in the original actuator discriminant model, and the predicted target sample set $\tilde{a}_{t+1}$ is determined by sampling based on the predicted first probability value of each sample set output by the policy network $\pi(a \mid s; \theta)$.

The second sample characteristic value $s_{t+1}$ and the parameter value of each parameter in the predicted target sample set $\tilde{a}_{t+1}$ are input into the value network $q(s, a; \omega)$ to obtain the second evaluation value $q_{t+1} = q(s_{t+1}, \tilde{a}_{t+1}; \omega_t)$ of the value network for the predicted target sample set.

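According to the first evaluation value $q_t$, the reward value $r_t$ and the second evaluation value $q_{t+1}$, the temporal-difference error value $\delta_t$ is calculated, where $\delta_t = q_t - (r_t + \gamma q_{t+1})$ and $\gamma$ is the discount factor. The action value function corresponding to the value network is differentiated to obtain the derivative value $d_{\omega,t}$, where $d_{\omega,t} = \left.\frac{\partial q(s_t, a_t; \omega)}{\partial \omega}\right|_{\omega = \omega_t}$.

The first parameter value $\omega_t$ of the value network is updated to obtain the updated first parameter value $\omega_{t+1}$, where $\omega_{t+1} = \omega_t - \alpha\, \delta_t\, d_{\omega,t}$ and $\alpha$ is the preset first learning rate.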

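The state value function corresponding to the policy network is differentiated to obtain the derivative value $d_{\theta,t}$, where $d_{\theta,t} = \left.\frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}\right|_{\theta = \theta_t}$.

According to the first evaluation value $q_t$ and the derivative value $d_{\theta,t}$, the second parameter value $\theta_t$ of the policy network is updated by stochastic gradient ascent to obtain the updated second parameter value $\theta_{t+1}$, where $\theta_{t+1} = \theta_t + \beta\, q_t\, d_{\theta,t}$ and $\beta$ is the preset second learning rate.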
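The training round walked through above can be tied together in one loop. The following is only a sketch under strong simplifying assumptions: `ToySimulator` is a hypothetical stand-in for the traffic simulator with made-up rewards, both networks are replaced by linear models over two state features plus a bias, and the constants `GAMMA`, `ALPHA`, `BETA` and `N_SETS` are illustrative.

```python
import math
import random

GAMMA, ALPHA, BETA = 0.9, 0.05, 0.05   # discount factor, first and second learning rates (assumed)
N_SETS = 3                             # number of preset sample sets (assumed)


class ToySimulator:
    """Hypothetical stand-in for the traffic simulator."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return [self.rng.random(), self.rng.random()]      # initial state s_0

    def step(self, action):
        reward = -1.0 if action == 1 else -3.0             # made-up negative delay
        return [self.rng.random(), self.rng.random()], reward


def features(state):
    return list(state) + [1.0]                             # append a bias term


def policy_probs(theta, state):
    x = features(state)
    logits = [sum(w * v for w, v in zip(row, x)) for row in theta]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def q(omega, state, action):
    return sum(w * v for w, v in zip(omega[action], features(state)))


def train(steps=200, seed=0):
    rng = random.Random(seed)
    sim = ToySimulator(seed)
    theta = [[0.0] * 3 for _ in range(N_SETS)]             # policy parameters
    omega = [[0.0] * 3 for _ in range(N_SETS)]             # value parameters
    s_t = sim.reset()
    for _ in range(steps):
        p = policy_probs(theta, s_t)
        a_t = rng.choices(range(N_SETS), weights=p, k=1)[0]
        q_t = q(omega, s_t, a_t)                           # first evaluation value
        s_next, r_t = sim.step(a_t)
        # The predicted set is used only to score the next state; it is not executed.
        a_next = rng.choices(range(N_SETS), weights=policy_probs(theta, s_next), k=1)[0]
        q_next = q(omega, s_next, a_next)                  # second evaluation value
        delta = q_t - (r_t + GAMMA * q_next)               # temporal-difference error
        x = features(s_t)
        # Value update: omega <- omega - alpha * delta * dq/d(omega)
        omega[a_t] = [w - ALPHA * delta * v for w, v in zip(omega[a_t], x)]
        # Policy update: theta <- theta + beta * q_t * d ln pi / d(theta)
        for k in range(N_SETS):
            coeff = (1.0 if k == a_t else 0.0) - p[k]
            theta[k] = [w + BETA * q_t * coeff * v for w, v in zip(theta[k], x)]
        s_t = s_next
    return theta, omega


trained_theta, trained_omega = train()
print(policy_probs(trained_theta, [0.5, 0.5]))
```

Because the toy reward is less negative for set 1, the policy would be expected to shift probability toward it over many rounds, although the outcome for any particular seed and step count is not guaranteed.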

Example 5:

To determine the reward value of the original actuator discriminant model in each round of training, in the embodiments of the present invention, determining the reward value of the original actuator discriminant model according to the third sample characteristic value includes:

According to the first time value of the average vehicle delay time and the first quantity value of the vehicle arrival flow corresponding to each phase of the target intersection, both contained in the third sample characteristic value, the product value of the first time value and the first quantity value of each phase is determined, and the product values of all phases are added to obtain a first sum value. The first quantity values of all phases are added to obtain a second sum value, and the negative of the ratio of the first sum value to the second sum value is determined as the first reward value corresponding to the target intersection.

According to the second time value of the average vehicle delay time and the second quantity value of the vehicle arrival flow corresponding to each phase of the adjacent downstream intersection, both contained in the third sample characteristic value, the product value of the second time value and the second quantity value of each phase is determined, and the product values of all phases are added to obtain a third sum value. The second quantity values of all phases are added to obtain a fourth sum value, and the negative of the ratio of the third sum value to the fourth sum value is determined as the second reward value corresponding to the adjacent downstream intersection.

A fifth sum value of the first reward value and the second reward value is determined, and the reward value of the original actuator discriminant model is determined according to the fifth sum value.

The process of determining the reward value of the original actuator discriminant model according to the present invention is described below with a specific embodiment. For each target intersection of the traffic trunk, not only the reward value of the target intersection m but also the reward value of the adjacent downstream intersection n is considered when training the actuator discriminant model.

When the reward value for the actuator discriminant model is calculated, a joint reward function is used:

$$r_t = -\frac{\sum_i d_{i,t}^{m}\, v_{i,t}^{m}}{\sum_i v_{i,t}^{m}} - \frac{\sum_i d_{i,t}^{n}\, v_{i,t}^{n}}{\sum_i v_{i,t}^{n}}$$

where $d_{i,t}^{m}$ represents the first time value of the average delay time of the $i$-th phase of the target intersection m in the preset control period after the current time $t$; $v_{i,t}^{m}$ represents the first quantity value of the vehicle arrival flow of the $i$-th phase of the target intersection m in the preset control period after the current time $t$; $d_{i,t}^{n}$ represents the second time value of the average delay time of the $i$-th phase of the adjacent downstream intersection n in the preset control period after the current time $t$; and $v_{i,t}^{n}$ represents the second quantity value of the vehicle arrival flow of the $i$-th phase of the adjacent downstream intersection n in the preset control period after the current time $t$.
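The joint reward of this example is the negative flow-weighted average delay of the target intersection plus that of the adjacent downstream intersection. A minimal sketch follows; the function names, the handling of zero total flow and the sample numbers are assumptions made for illustration.

```python
def intersection_reward(delays, flows):
    """Negative flow-weighted average delay over all phases of one intersection."""
    if len(delays) != len(flows):
        raise ValueError("one delay and one flow value per phase are required")
    total_flow = sum(flows)
    if total_flow == 0:
        return 0.0  # assumption: no arriving vehicles means no delay penalty
    weighted_delay = sum(d * f for d, f in zip(delays, flows))
    return -weighted_delay / total_flow


def joint_reward(target_delays, target_flows, downstream_delays, downstream_flows):
    """Sum of the first (target) and second (downstream) reward values."""
    return (intersection_reward(target_delays, target_flows)
            + intersection_reward(downstream_delays, downstream_flows))


# Made-up per-phase values: average delay in seconds, arrival flow in vehicles.
target_reward = intersection_reward([10.0, 20.0], [30, 10])      # -(300 + 200) / 40
downstream_reward = intersection_reward([5.0, 15.0], [20, 20])   # -(100 + 300) / 40
total_reward = joint_reward([10.0, 20.0], [30, 10], [5.0, 15.0], [20, 20])
print(target_reward, downstream_reward, total_reward)
```

Weighting each phase's delay by its arrival flow means that heavily used phases dominate the reward, and including the downstream term penalizes timing choices that merely push delay onto the next intersection.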

Example 6:

Fig. 4 is a schematic structural diagram of a traffic signal control device according to an embodiment of the present invention, where the device includes:

an obtaining module 401, configured to obtain a target feature value collected at a target intersection and an adjacent downstream intersection, where the target feature value is a feature value of a first preset state feature of each phase in a preset control period before a current moment, where the first preset state feature includes an arrival flow rate per unit time and a lane length occupied by each vehicle when a vehicle queues longest;

a processing module 402, configured to input the target feature value to a pre-trained actuator discriminant model, and obtain a target probability value of each set output;

the control module 403 is configured to determine a target set of the target intersection according to the target probability value of each set, and control traffic signals of each phase in a preset time period after the current time of the target intersection according to a parameter value corresponding to a green light duration of each phase in the target set.
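The behaviour of the control module can be sketched as two steps: choosing the target set from the target probability values, then turning the green light duration of each phase into a timed plan. The sketch shows both selection options named in the claims (taking the set with the maximum probability, or sampling). The function names, the plan format and the example values are assumptions, and real signal plans would also include yellow and all-red intervals, which are omitted here.

```python
import random


def select_target_set(probabilities, greedy=True, rng=random):
    """Return the index of the target set: maximum probability, or sampled."""
    if greedy:
        return max(range(len(probabilities)), key=probabilities.__getitem__)
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def build_signal_plan(green_durations, start_time=0):
    """List (phase, green_start, green_end) tuples, phases served in order."""
    plan = []
    t = start_time
    for phase, duration in green_durations.items():
        plan.append((phase, t, t + duration))
        t += duration
    return plan


# Hypothetical preset sets (green seconds per phase) and model output.
preset_sets = [
    {"phase_1": 20, "phase_2": 30},
    {"phase_1": 25, "phase_2": 25},
    {"phase_1": 30, "phase_2": 20},
]
target_probabilities = [0.2, 0.5, 0.3]

target_index = select_target_set(target_probabilities)
plan = build_signal_plan(preset_sets[target_index])
print(target_index, plan)
```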

Further, the apparatus further comprises:

Example 7:

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. On the basis of the foregoing embodiments, the present application further provides an electronic device including a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504;

the memory 503 has stored therein a computer program which, when executed by the processor 501, causes the processor 501 to perform the steps of:

Further, the processor 501 is specifically configured to determine, according to the target probability value of each set, a target set of the target intersection, where the determining includes:

Further, the training process of the actuator discriminant model performed by the processor 501 includes:

Further, the determining, by the processor 501, a reward value for the original actuator discriminant model according to the third sample feature value includes:

The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or only one type of bus.

The communication interface 502 is used for communication between the above electronic device and other devices.

The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.

Example 8:

On the basis of the above embodiments, the embodiments of the present invention further provide a computer readable storage medium having stored therein a computer program executable by a processor, which when run on the processor, causes the processor to perform the steps of:

Further, the training process of the actuator discriminant model includes:

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims

1. A traffic signal control method, characterized in that the method is applied to an agent device corresponding to each intersection of a traffic trunk, the method comprising:

determining a target set of the target intersection according to the target probability value of each set, and controlling traffic signals of each phase in a preset time period after the current moment of the target intersection according to a parameter value corresponding to the green light duration of each phase in the target set;

the training process of the actuator discriminant model comprises the following steps:

2. The method of claim 1, wherein determining the target set of target intersections from the target probability values for each set comprises:

3. The method of claim 1, wherein determining a reward value for the original actuator discriminant model based on the third sample feature value comprises:

4. A traffic signal control apparatus, the apparatus comprising:

The control module is used for determining a target set of the target intersection according to the target probability value of each set, and controlling traffic signals of each phase in a preset time period after the current moment of the target intersection according to a parameter value corresponding to the green light duration of each phase in the target set;

5. The apparatus of claim 4, wherein the control module is configured to sample and determine a target set of the target intersection based on the target probability value for each set, wherein the greater the target probability value of a set, the greater the likelihood of being sampled; or determining the set with the maximum target probability value as the target set of the target intersection.

6. The apparatus of claim 4, wherein the training module is specifically configured to determine a first sum of product values of the first time value and the first quantity value corresponding to each phase and a second sum of the first quantity value corresponding to each phase according to a first time value of average vehicle delay time corresponding to each phase of the target intersection and a first quantity value of vehicle arrival flow included in the third sample feature value, and obtain a first reward value corresponding to the target intersection according to a ratio of the first sum to the second sum; according to a second time value of the average vehicle delay time corresponding to each phase of the adjacent downstream intersection and a second quantity value of the vehicle arrival flow, which are contained in the third sample feature value, determine a third sum value of product values of the second time value and the second quantity value corresponding to each phase and a fourth sum value of the second quantity value corresponding to each phase, and obtain a second reward value corresponding to the adjacent downstream intersection according to a ratio of the third sum value to the fourth sum value; and obtain the reward value of the original actuator discriminant model according to a fifth sum value of the first reward value and the second reward value.

7. An electronic device, comprising: the device comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;

the memory has stored therein a computer program which, when executed by the processor, causes the processor to execute the computer program stored in the memory to implement the steps of the traffic signal control method according to any one of claims 1-3.

8. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the traffic signal control method according to any one of claims 1-3.