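WO2021051870A1 - Information control method, apparatus and computer device based on a reinforcement learning model - Google Patents
Information control method, apparatus and computer device based on a reinforcement learning model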
- Publication number
- WO2021051870A1 (application PCT/CN2020/093432)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- signal light
- action
- historical
- intersection
- prediction model
- Prior art date
Classifications
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/07—Controlling traffic signals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/584—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/0104—Measuring and analyzing of parameters relative to traffic conditions
- G08G1/0125—Traffic data processing
Definitions
- This application relates to the field of artificial intelligence, and in particular to an information control method, device, computer equipment and storage medium based on a reinforcement learning model.
- Intelligent traffic light control responds to traffic changes by adjusting signal parameters, which is an effective way to reduce congestion.
- Traditional signal light control mostly adopts pre-timed signal light control or actuated traffic signal light control.
- Pre-timed signal light control uses historical data to calculate a fixed signal timing plan; this control method cannot adapt to fluctuating traffic flow and therefore cannot relieve congestion.
- Actuated traffic signal light control can adjust the signal light duration according to traffic demand, but it cannot provide real-time optimization.
- A signal light control method based on reinforcement learning can improve traffic conditions. However, the inventor realized that an ordinary reinforcement learning model is trained on samples held in an ordinary storage structure, in which the state (the traffic status) and the action (whether the signal light switches, and how it switches) are stored together without distinction, so that the stored phase-decision data are pooled centrally. During learning, more attention is then paid to high-frequency phase-decision combinations, while low-frequency phase-decision combinations are neglected. This leads to wrong decisions for the low-frequency phase-decision combinations and reduces the performance of adaptive traffic light control.
- The main purpose of this application is to provide an information control method, device, computer equipment, and storage medium based on a reinforcement learning model, aiming to improve the adaptability of signal light control and thereby achieve better robustness.
- this application proposes an information control method based on a reinforcement learning model, which includes the following steps:
- the signal light action prediction model is based on a reinforcement learning model and is obtained through sample data training with a specified data structure
- the specified data structure is composed of multiple data blocks, wherein sample data with the same signal light phase and the same prediction action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display;
- the signal lamp is controlled according to the predicted action.
- This application provides an information control device based on a reinforcement learning model, including:
- the image acquisition unit is used to acquire the current time and the image of the intersection where the signal light is located;
- the use condition determination unit is configured to determine whether the current time and the image of the intersection where the signal light is located meet the use conditions of the preset signal light action prediction model;
- the state feature extraction unit is configured to extract a specified state feature from the image of the intersection where the traffic light is located if the current time and the image of the intersection where the traffic light is located meet the preset usage conditions of the traffic light action prediction model;
- the predictive action acquisition unit is configured to input the specified state feature into the signal light action prediction model to obtain the predicted action output by the signal light action prediction model; wherein the signal light action prediction model is based on a reinforcement learning model and is obtained by training with sample data having a specified data structure, the specified data structure is composed of multiple data blocks, sample data with the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display;
- the signal light control unit is configured to control the signal light according to the predicted action.
- the present application provides a computer device including a memory and a processor, the memory stores a computer program, and when the processor executes the computer program, a method for information control based on a reinforcement learning model is implemented, including the following steps:
- the signal light action prediction model is based on a reinforcement learning model and is obtained through sample data training with a specified data structure
- the specified data structure is composed of multiple data blocks, wherein sample data with the same signal light phase and the same prediction action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display;
- the signal lamp is controlled according to the predicted action.
- the present application provides a computer-readable storage medium on which a computer program is stored.
- a method for information control based on a reinforcement learning model is implemented, which includes the following steps:
- the signal light action prediction model is based on a reinforcement learning model and is obtained through sample data training with a specified data structure
- the specified data structure is composed of multiple data blocks, wherein sample data with the same signal light phase and the same prediction action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display;
- the signal lamp is controlled according to the predicted action.
- The information control method, device, computer equipment and storage medium based on the reinforcement learning model of the present application acquire the current time and the image of the intersection where the signal light is located; if the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model, extract the specified state feature from the image of the intersection where the signal light is located; input the specified state feature into the signal light action prediction model to obtain the predicted action output by the signal light action prediction model, where the signal light action prediction model is based on a reinforcement learning model and is obtained by training with sample data having a specified data structure, the specified data structure is composed of multiple data blocks, sample data with the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display; and control the signal light according to the predicted action. As a result, the control of the signal light suits a wider range of traffic conditions and is more robust.
- FIG. 1 is a schematic flowchart of an information control method based on a reinforcement learning model according to an embodiment of this application;
- FIG. 2 is a schematic block diagram of the structure of an information control device based on a reinforcement learning model according to an embodiment of the application;
- FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
- an embodiment of the present application provides an information control method based on a reinforcement learning model, including the following steps:
- the information control method based on the reinforcement learning model involved in this application is aimed at the control of a single signal light, that is, the control of a single signal light at a certain intersection. Therefore, the signal lights mentioned in this application are all the same signal light.
- the current time and the image of the intersection where the signal light is located are acquired.
- The signal light refers to a traffic light, which may also be called a red, yellow, and green light.
- The image of the intersection where the signal light is located may be a single image that reflects the traffic conditions of the entire intersection, or multiple images that each reflect the traffic conditions of part of the intersection (for example, an image of one road at the intersection) and together reflect the traffic situation of the entire intersection.
- The image of the intersection where the signal light is located may be acquired by one image acquisition device or by multiple image acquisition devices. The signal light is used to direct traffic at the intersection, and therefore generally comprises multiple signal lights set at the intersection.
- In step S2, it is determined whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model. Since the purpose of the reinforcement learning model is to improve the traffic situation, there is no need to use the signal light control method of the reinforcement learning model if the traffic situation at the intersection does not need to be improved (for example, at midnight, when there are few cars and no possibility of congestion) or cannot be improved by signal control (for example, when a lane is blocked by a car accident and traffic police are needed to direct traffic).
- The specific judging process is, for example: judging whether the current time belongs to the usage period of the preset signal light action prediction model; if it does, analyzing the image of the intersection where the signal light is located to determine whether there is a vehicle with a suspended wheel in any lane of the intersection; and if there is no vehicle with a suspended wheel in any lane of the intersection, determining that the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model.
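- The two-stage check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the usage periods in `USAGE_PERIODS`, the function names, and the per-lane dictionary are assumptions, and the image analysis that detects a vehicle with a suspended wheel is taken as already done.

```python
from datetime import time

# Hypothetical usage periods (assumed rush-hour windows; the source gives no values).
USAGE_PERIODS = [(time(7, 0), time(10, 0)), (time(17, 0), time(20, 0))]


def in_usage_period(now, periods=USAGE_PERIODS):
    """Return True if `now` falls inside any preset usage period."""
    return any(start <= now <= end for start, end in periods)


def meets_use_conditions(now, suspended_wheel_by_lane, periods=USAGE_PERIODS):
    """Apply the two checks in order: usage period first, then every lane.

    `suspended_wheel_by_lane` maps a lane id to True when image analysis
    (not shown here) found a vehicle with a suspended wheel in that lane.
    """
    if not in_usage_period(now, periods):
        return False
    return not any(suspended_wheel_by_lane.values())
```

- Checking the time first means the image only needs to be analyzed during the usage period, which matches the order of the judging process.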
- In step S3, if the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model, the specified state feature is extracted from the image of the intersection where the signal light is located. When the usage conditions are met, the image of the intersection where the signal light is located serves as the basis on which the information control method based on the reinforcement learning model of this application determines the corresponding action.
- The process of extracting the specified state feature is, for example: extracting designated image features from the image of the intersection where the signal light is located according to a preset image feature acquisition method, where the designated image features include at least screenshots of the areas of multiple lanes; analyzing the image of the intersection where the signal light is located to obtain designated digital features, where the designated digital features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane; acquiring the current phase of each signal light indicating each lane at the intersection; and recording the designated image features, the designated digital features, and the current phase as the specified state feature.
- The designated digital feature may also be acquired with a preset sensor (such as an infrared sensor or a laser sensor), where the designated digital feature includes at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane.
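- One way to picture the specified state feature is as a record with three parts: image features, digital features, and the current phase. The sketch below is only an illustration; the class names, field names, the use of nested lists for lane crops, and the units are all assumptions not found in the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LaneStats:
    vehicle_count: int      # number of vehicles in the lane
    queue_length: float     # queue length (unit assumed: metres)
    occupancy_rate: float   # lane occupancy rate, assumed to lie in [0, 1]


@dataclass
class StateFeature:
    lane_crops: Dict[str, List[List[int]]]   # designated image feature: one crop per lane
    lane_stats: Dict[str, LaneStats]         # designated digital feature
    current_phase: Dict[str, str]            # current phase of the light for each lane


def build_state_feature(lane_crops, lane_stats, current_phase):
    """Record the three parts together as the specified state feature."""
    if not (lane_crops.keys() == lane_stats.keys() == current_phase.keys()):
        raise ValueError("every lane needs a crop, statistics and a phase")
    return StateFeature(lane_crops, lane_stats, current_phase)
```

- Keeping the image part and the digital part in separate fields mirrors the separation of image features from digital features, so that a downstream model could route each part to a different kind of layer.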
- The specified state feature is input into the signal light action prediction model to obtain the predicted action output by the signal light action prediction model; wherein the signal light action prediction model is based on a reinforcement learning model and is obtained by training with sample data having a specified data structure, the specified data structure is composed of multiple data blocks, sample data with the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display.
- This application trains with sample data having a specified data structure composed of multiple data blocks, where sample data with the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display.
- The phases of a signal light depend on the signal light. For example, for a red and green signal light, the phase is red or green; for a red, yellow, and green light, the phase is red, yellow, or green; a signal light with more colors has more phases. Because the sample data are stored by phase and predicted action, the low-frequency phase-decision combinations are not neglected when training the signal light action prediction model, so that traffic can still be cleared effectively when special intersection states (that is, the states corresponding to low-frequency phase-decision combinations) are encountered.
- The specific process of obtaining the predicted action is, for example: inputting the specified state feature into the signal light action prediction model and processing it with the hidden layers, so as to obtain, as the output of the last hidden layer, the hidden values corresponding to multiple initial predicted actions; and calculating the predicted probability values with a formula in which y(action_i) is the predicted probability value corresponding to the i-th initial predicted action, action_i is the hidden value corresponding to the i-th initial predicted action, and there are Na initial predicted actions in total;
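- The passage above names the symbols of the probability formula without giving its form. One form consistent with those definitions, offered here purely as an assumption, is the softmax, which turns Na hidden values into probabilities that sum to 1:

```latex
y(\mathrm{action}_i) = \frac{\exp(\mathrm{action}_i)}{\sum_{j=1}^{N_a} \exp(\mathrm{action}_j)}, \qquad i = 1, \dots, N_a
```

- Under this assumed form, a larger hidden value always yields a larger predicted probability, so choosing the largest probability is the same as choosing the largest hidden value.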
- the signal lamp is controlled according to the predicted action.
- The predicted action is, for example, whether to switch the signal light, how the signal light should be switched, or how long the current phase should be maintained if the signal light is not switched. Since the predicted action is regarded as the most effective way to keep traffic flowing, information control based on the reinforcement learning model can be realized by controlling the signal light according to the predicted action.
- The step S2 of judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model includes:
- S201: determining whether the current time belongs to a preset usage period of the signal light action prediction model;
- The purpose here is to determine whether it is a busy period and whether a car accident has occurred, so as to decide whether to use the signal light action prediction model based on the reinforcement learning model. Specifically, if the current time belongs to the usage period of the preset signal light action prediction model, and there is no vehicle with a suspended wheel in any lane of the intersection, the usage conditions of the preset signal light action prediction model are met.
- Judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model may also include: analyzing the image of the intersection where the signal light is located to determine whether any lane of the intersection contains a red area larger than a preset area; if such a red area exists, determining whether its shape is irregular; and if the shape of the red area is irregular, concluding that a car accident has occurred and that the usage conditions of the preset signal light action prediction model are not met.
- The red area represents a blood area. Since a large area of blood is rarely seen in ordinary car accidents, a large area of blood is judged to indicate a major traffic accident, and it is therefore determined that the usage conditions of the preset signal light action prediction model are not met.
- the step S3 of extracting the specified state feature from the image of the intersection where the signal light is located includes:
- according to a preset image feature acquisition method, extracting a designated image feature from the image of the intersection where the signal light is located, where the designated image feature includes at least screenshots of the areas of multiple lanes;
- the specified state feature is extracted from the image of the intersection where the signal light is located.
- This application takes the specified image feature, the specified digital feature, and the current phase of each signal light as the specified state feature and uses them as the calculation basis of the signal light action prediction model. Separating the image feature from the digital feature lets the subsequent signal light action prediction model process each in a targeted way, so as to obtain more accurate results.
- The specified image feature may be any image feature, such as a screenshot of a specified area, the image of the intersection after gray-scale processing, or multiple sub-images that each reflect the traffic state of one of multiple lanes. A convolutional layer may subsequently be used to process the specified image feature.
- the designated state feature is extracted from the image of the intersection where the signal light is located, and the designated state feature can effectively reflect the current traffic state.
- the number of vehicles in each lane and the queue length of each lane can be obtained by recognizing the image of the intersection where the signal light is located through an image recognition method.
- the vehicle occupation time (that is, the time difference between when the vehicle enters the lane and the time when the vehicle leaves the lane, that is, the time difference between acquiring the corresponding two images) and the length of the vehicle can be obtained by analyzing the image of the intersection where the signal light is located.
- Before step S4 of inputting the specified state feature into the signal light action prediction model to obtain the predicted action output by the signal light action prediction model (where the signal light action prediction model is based on a reinforcement learning model and is obtained by training with sample data having a specified data structure, the specified data structure is composed of multiple data blocks, sample data with the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display), the method includes:
- each historical data includes the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time;
- the multiple data blocks P11, P12, ..., Pik, ..., Pmn form the designated data structure.
- This application constructs multiple data blocks P11, P12, ..., Pik, ..., Pmn and combines them into the specified data structure.
- In this way, the low-frequency phase-decision combinations are set apart in their own data blocks and given the same standing as the other data blocks.
- During training, the existence of the multiple data blocks makes it possible to select training samples so that the share of the total training samples taken by sample data for the low-frequency phase-decision combinations is similar or equal to the share taken by the other phase-decision combinations. The trained signal light action prediction model can then also cope with the traffic states corresponding to low-frequency phase-decision combinations and direct traffic effectively.
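- The block structure can be sketched as a dictionary keyed by (phase, action), so that block Pik holds every record whose historical phase is i and whose historical action is k. The `Transition` tuple, the function name, and the sample values below are illustrative assumptions; only the five record fields and the grouping rule come from the text.

```python
from collections import defaultdict, namedtuple

# One historical record: the five fields listed for each piece of historical data.
Transition = namedtuple(
    "Transition", ["phase", "action", "state", "reward", "next_state"]
)


def build_data_blocks(history):
    """Group records so block (i, k) holds phase i together with action k."""
    blocks = defaultdict(list)
    for record in history:
        blocks[(record.phase, record.action)].append(record)
    return dict(blocks)


# Made-up sample records, used only to show the grouping.
history = [
    Transition(1, 1, "s0", -4.0, "s1"),
    Transition(1, 1, "s1", -9.0, "s2"),
    Transition(1, 2, "s2", -1.0, "s3"),
    Transition(2, 1, "s3", -16.0, "s4"),
]
blocks = build_data_blocks(history)
block_sizes = {key: len(records) for key, records in blocks.items()}
```

- Because each combination has its own block, a rare combination such as (1, 2) stays visible as a separate entry instead of being buried among the frequent ones.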
- the method includes:
- according to a preset multi-round training sequence, using the historical training data to train the signal light action prediction model based on the reinforcement learning model, and updating the network parameters of the signal light action prediction model by gradient descent, wherein the number of rounds of the multi-round training sequence is the same as the specified number, and the historical training data used in each round of training come from different data blocks.
- The network parameters of the signal light action prediction model include, for example, the network parameters of the decision network and of the evaluation network. The signal light action prediction model includes a decision network and an evaluation network; the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, and the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters. The network parameters are updated, for example, by minimizing a preset loss function using backpropagation.
- In the loss function formula, Loss is the loss function, there are N decision moments in total, t denotes the t-th decision moment, Q is the expected value output by the evaluation network, S_t is the state feature of the intersection where the signal light is located at the t-th decision moment, A_t is the output of the first prediction network at the t-th decision moment, and R_{t+1} is, for the (t+1)-th decision moment, the negative of the sum of the squares of the queue lengths of the lanes at the intersection where the signal light is located. The remaining terms of the formula are the network parameters of the first target network, the network parameters of the second target network, a preset parameter, and the output of the first target network.
- The updated network parameters include, among others, the network parameters of the first target network.
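- The loss formula is described here only through its terms. As a hedged sketch, the reward definition can be coded directly from the text, while the loss below assumes a common temporal-difference form (mean squared gap between Q and a bootstrapped target). That form, the function names, and treating the preset parameter as a discount factor `gamma` are assumptions, not the patent's stated formula.

```python
def reward(queue_lengths):
    """R: negative of the sum of squared lane queue lengths."""
    return -sum(q * q for q in queue_lengths)


def td_loss(q_pred, rewards_next, q_target_next, gamma=0.9):
    """Mean squared difference between Q(S_t, A_t) and a bootstrapped target.

    q_pred[t]        : evaluation-network value for (S_t, A_t)
    rewards_next[t]  : R_{t+1}
    q_target_next[t] : target-network value for S_{t+1} and the target action
    gamma            : assumed discount factor standing in for the preset parameter
    """
    n = len(q_pred)
    total = 0.0
    for q, r, q_next in zip(q_pred, rewards_next, q_target_next):
        target = r + gamma * q_next
        total += (target - q) ** 2
    return total / n
```

- Squaring the queue lengths penalizes one long queue more heavily than several short ones of the same total length, which pushes the policy toward balancing the lanes.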
- This application extracts a specified amount of historical training data from each of the multiple data blocks P11, P12, ..., Pik, ..., Pmn, so that every data block contributes the same specified amount of training sample data, which helps the trained signal light action prediction model suit all traffic conditions.
- a multi-round training sequence is adopted to make the training process more uniform, so as to further ensure that the signal light action prediction model obtained by training is suitable for all traffic conditions.
- the historical training data used in each round of training comes from different data blocks, that is to say, each round of training uses the sample data from the first data block, ...
- The preset sample extraction rule can be any rule, provided the extracted quantity is the specified quantity; for example, the specified quantity of top-ranked odd-numbered data may be extracted.
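- The extraction and the multi-round sequence can be sketched as follows, under one reading of the text: every block contributes the same specified number of records, and round r draws the r-th extracted record from each block, so the number of rounds equals the specified number. The function names, the "take the first records" rule, and the sample block contents are assumptions made for illustration.

```python
def extract_per_block(blocks, specified_number):
    """Take the first `specified_number` records from every block.

    Taking the first records is a stand-in for the preset extraction rule,
    which the source leaves open.
    """
    extracted = {}
    for key, records in blocks.items():
        if len(records) < specified_number:
            raise ValueError(f"block {key} holds fewer than {specified_number} records")
        extracted[key] = records[:specified_number]
    return extracted


def training_rounds(extracted, specified_number):
    """Yield one batch per round; round r takes the r-th record of each block."""
    for r in range(specified_number):
        yield [records[r] for _, records in sorted(extracted.items())]


# Made-up blocks keyed by (phase, action); the strings stand in for records.
blocks = {
    ("red", "keep"): ["a1", "a2", "a3"],
    ("red", "switch"): ["b1", "b2"],
    ("green", "keep"): ["c1", "c2", "c3", "c4"],
}
rounds = list(training_rounds(extract_per_block(blocks, 2), 2))
```

- With this layout every round sees each phase-action combination exactly once, regardless of how often that combination occurred in the raw history.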
- The signal light action prediction model includes a decision network, the decision network includes a plurality of hidden layers, and step S4 of inputting the specified state feature into the signal light action prediction model to obtain the predicted action output by the signal light action prediction model includes:
- calculating the predicted probability values with a formula in which y(action_i) is the predicted probability value corresponding to the i-th initial predicted action, action_i is the hidden value corresponding to the i-th initial predicted action, and there are Na initial predicted actions in total;
- S403 Obtain the designated prediction probability value with the largest value among the multiple prediction probability values, record the initial prediction action corresponding to the designated prediction probability value as the final prediction action, and output the final prediction action.
- the specified state feature is input into the traffic light action prediction model to obtain the predicted action output by the traffic light action prediction model.
- This application uses multiple hidden layers to obtain the hidden values corresponding to multiple initial predicted actions and then calculates the corresponding predicted probability values, where each predicted probability value reflects the degree to which the corresponding initial predicted action matches the current traffic conditions. Therefore, the largest of the multiple predicted probability values is recorded as the designated predicted probability value, the initial predicted action corresponding to the designated predicted probability value is recorded as the final predicted action, and the final predicted action is output.
- For example, if the predicted probability value of not switching the signal light is 80% and the sum of the predicted probability values of the other actions is only 20%, the predicted action of not switching the signal light is output.
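- The selection step up to S403 can be sketched as: turn the hidden values into probabilities, then output the action with the largest probability. The softmax used here is an assumed form of the probability formula, and the action names and hidden values are made up for the example.

```python
import math


def softmax(hidden_values):
    """Convert hidden values into probabilities that sum to 1 (assumed form)."""
    peak = max(hidden_values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in hidden_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(actions, hidden_values):
    """Return the action with the largest predicted probability, and that probability."""
    probs = softmax(hidden_values)
    best = max(range(len(actions)), key=probs.__getitem__)
    return actions[best], probs[best]


# Hypothetical action names and hidden values.
actions = ["keep current phase", "switch to next phase", "extend current phase"]
action, prob = select_action(actions, [2.0, 0.1, 0.3])
```

- Here `action` is "keep current phase", because its hidden value is the largest and the softmax preserves the ordering of its inputs.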
- the traffic light action prediction model includes a decision network and an evaluation network.
- the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters
- the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters
- the signal light action prediction model used in this application includes a decision network and an evaluation network.
- the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters.
- the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters.
- The first prediction network is used to predict and output the predicted actions to meet the needs of traffic control. However, because reinforcement learning is by nature trial and error, other mechanisms are needed for feedback and correction.
- The first target network, the second prediction network, and the second target network are used to feed back and correct the network parameters, specifically by minimizing the preset loss function and updating the parameters with backpropagation.
- an embodiment of the present application provides an information control device based on a reinforcement learning model, including:
- the image acquisition unit 10 is used to acquire the current time and the image of the intersection where the signal light is located;
- the use condition judging unit 20 is configured to judge whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model;
- the state feature extraction unit 30 is configured to extract a specified state feature from the image of the intersection where the signal light is located if the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model;
- the predictive action acquisition unit 40 is configured to input the specified state feature into the signal light action prediction model to obtain the predicted action output by the signal light action prediction model; wherein the signal light action prediction model is based on a reinforcement learning model and is obtained by training with sample data having a specified data structure, the specified data structure is composed of multiple data blocks, sample data with the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the type of color that the signal light can display;
- the signal light control unit 50 is configured to control the signal light according to the predicted action.
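- The five units form a pipeline in which each unit feeds the next. The sketch below wires five callables together in that order; the class name, the method name, and the stub functions are illustrative assumptions, and the real units would wrap cameras, image analysis, and the trained model.

```python
class SignalLightController:
    """Chains five callables in the order of units 10 to 50."""

    def __init__(self, acquire, meets_conditions, extract_state, predict, apply_action):
        self.acquire = acquire                    # unit 10
        self.meets_conditions = meets_conditions  # unit 20
        self.extract_state = extract_state        # unit 30
        self.predict = predict                    # unit 40
        self.apply_action = apply_action          # unit 50

    def step(self):
        """Run one control cycle; return the applied action, or None if skipped."""
        now, image = self.acquire()
        if not self.meets_conditions(now, image):
            return None
        state = self.extract_state(image)
        action = self.predict(state)
        self.apply_action(action)
        return action


# Stub units, used only to show the flow of one cycle.
applied = []
controller = SignalLightController(
    acquire=lambda: ("08:00", "image"),
    meets_conditions=lambda now, image: True,
    extract_state=lambda image: {"queue": [3, 1]},
    predict=lambda state: "keep",
    apply_action=applied.append,
)
result = controller.step()
```

- Returning None when the usage conditions fail leaves room for a fallback, such as a fixed timing plan, outside this controller.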
- the use condition judging unit 20 includes:
- the use period judging subunit is used to judge whether the current time belongs to the usage period of the preset signal light action prediction model;
- the vehicle judging subunit is used to analyze the image of the intersection where the signal light is located if the current time belongs to the usage period of the preset signal light action prediction model, so as to determine whether there is a vehicle with a suspended wheel in any lane of the intersection;
- the use condition determination subunit is used to determine that the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model if there is no vehicle with a suspended wheel in any lane of the intersection.
- the state feature extraction unit 30 includes:
- the specified image feature acquisition subunit is configured to extract specified image features from the image of the intersection where the signal light is located according to a preset image feature acquisition method, where the specified image features include at least region screenshots of multiple lanes;
- the specified numeric feature acquisition subunit is used to analyze the image of the intersection where the signal light is located to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane;
- the current phase acquisition subunit is used to obtain the current phase of each signal light that directs a lane at the intersection;
- the specified state feature acquisition subunit is used to record the specified image features, the specified numeric features, and the current phases as the specified state features.
- the device includes:
- the historical data acquisition unit is configured to acquire multiple historical data of the signal light, each historical data includes the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time;
- the data block generation unit is used to generate multiple data blocks P11, P12, …, Pik, …, Pmn, where data block P11 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 1, data block P12 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 2, data block Pik stores historical data whose historical phase is numbered i and whose historical action is numbered k, and data block Pmn stores historical data whose historical phase is numbered m and whose historical action is numbered n, wherein the historical phases have m numbers in total, the historical actions have n numbers in total, i is a positive integer less than m, and k is a positive integer less than n;
- the designated data structure forming unit is used to form the designated data structure by the plurality of data blocks P11, P12, ..., Pik, ..., Pmn.
- the device includes:
- the historical data extraction unit for training is used to extract a specified number of historical data for training from each of the multiple data blocks P11, P12,...,Pik,...,Pmn according to preset sample extraction rules;
- the multi-round training unit is configured to use the training historical data to train the signal light action prediction model based on the reinforcement learning model according to a preset multi-round training sequence, and to update the network parameters of the signal light action prediction model by gradient descent;
- the number of rounds in the multi-round training sequence is the same as the specified number, and the training historical data used in each round of training come from different data blocks.
- the signal light action prediction model includes a decision network
- the decision network includes a plurality of hidden layers
- the predicted action acquisition unit 40 includes:
- the hidden value acquisition subunit is used to input the specified state features into the signal light action prediction model and process them with the hidden layers, so as to obtain the hidden values, output by the last hidden layer, that correspond to multiple initial predicted actions;
- the predicted probability value acquisition subunit is used to adopt the formula:
- the predicted probability values are calculated, where y(action_i) is the predicted probability value corresponding to the i-th initial predicted action, action_i is the hidden value corresponding to the i-th initial predicted action, and there are Na initial predicted actions in total;
- the final prediction action output subunit is used to obtain the specified prediction probability value with the largest value among the multiple prediction probability values, record the initial prediction action corresponding to the specified prediction probability value as the final prediction action, and output the final prediction action.
- the traffic light action prediction model includes a decision network and an evaluation network.
- the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters
- the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters
- the device includes:
- the network parameter update unit is used to update the network parameters in the signal light action prediction model by minimizing a preset loss function and using backpropagation, wherein the formula of the loss function is: wherein Loss is the loss function, there are N decision moments in total, t denotes the t-th decision moment, Q is the expected value output by the evaluation network, S_t is the state feature of the intersection where the signal light is located at the t-th decision moment, a_t is the output of the first prediction network at the t-th decision moment, ω is the network parameter of the first target network, ω⁻ is the network parameter of the second target network, R_{t+1} is the negative of the sum of the squares of the queue lengths of the lanes at the intersection where the signal light is located at the (t+1)-th decision moment, γ is a preset parameter, π is the output of the first target network, and θ⁻ is the network parameter of the first target network.
- an embodiment of the present application also provides a computer device.
- the computer device may be a server, and its internal structure may be as shown in the figure.
- the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, a computer program, and a database.
- the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
- the database of the computer equipment is used to store the data used in the information control method based on the reinforcement learning model.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer program is executed by the processor to realize an information control method based on the reinforcement learning model.
- the above-mentioned processor executes the above-mentioned information control method based on the reinforcement learning model, wherein the steps of the method correspond one-to-one to the steps of the information control method based on the reinforcement learning model of the foregoing embodiments, and are not repeated here.
- the information control method includes: acquiring the current time and the image of the intersection where the signal light is located; judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of a preset signal light action prediction model; if the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the image of the intersection where the signal light is located; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and controlling the signal light according to the predicted action.
- An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, an information control method based on a reinforcement learning model is implemented. The storage medium is a volatile storage medium or a non-volatile storage medium, and the steps of the method correspond one-to-one to the steps of the information control method based on the reinforcement learning model of the foregoing embodiments, and are not repeated here.
- the information control method includes: acquiring the current time and the image of the intersection where the signal light is located; judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of a preset signal light action prediction model; if the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the image of the intersection where the signal light is located; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, and the specified data structure consists of multiple data blocks.
Landscapes
- Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Analytical Chemistry (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Chemical & Material Sciences (AREA)
- Multimedia (AREA)
- Traffic Control Systems (AREA)
Abstract
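An information control method, device, computer device, and storage medium based on a reinforcement learning model, relating to the field of artificial intelligence. The method includes: acquiring the current time and an image of the intersection where the signal light is located (S1); if the current time and the intersection image meet the usage conditions, extracting specified state features from the intersection image (S3); inputting the specified state features into a signal light action prediction model to obtain a predicted action, where the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, and sample data having the same signal light phase and the same predicted action are stored in the same data block (S4); and controlling the signal light according to the predicted action (S5). The control of the signal light is thereby applicable to more traffic conditions and is more robust.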
Description
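This application claims priority to Chinese patent application No. 201910882718.0, filed with the Chinese Patent Office on September 18, 2019 and entitled "Information control method, device and computer device based on reinforcement learning model", the entire contents of which are incorporated herein by reference.
This application relates to the field of artificial intelligence, and in particular to an information control method, device, computer device, and storage medium based on a reinforcement learning model.
As the number of vehicles grows, traffic congestion is becoming increasingly serious. Congestion brings social problems such as longer travel times, higher fuel consumption, and air pollution, and urgently needs to be addressed. Intelligent traffic signal control, which responds to traffic changes by adjusting signal parameters, is an effective way to reduce congestion. Traditional signal control mostly uses pre-timed signal control and actuated signal control. Pre-timed control computes a fixed signal timing plan from historical data; it cannot cope with fluctuating traffic flow and cannot relieve congestion. Actuated control can adjust signal durations according to traffic demand, but cannot provide real-time optimization. A signal control method based on reinforcement learning can therefore improve traffic conditions. However, the inventors realized that an ordinary reinforcement learning model is trained on samples kept in an ordinary storage structure: states (traffic conditions) and actions (whether to switch the signal light, and how) are stored together without distinction, so the stored data consist mostly of a concentrated set of phases and decisions. During learning, the model pays more attention to high-frequency phase-decision combinations and neglects low-frequency ones. It then makes wrong decisions for the low-frequency phase-decision combinations, which degrades the performance of adaptive traffic light control.
The main purpose of this application is to provide an information control method, device, computer device, and storage medium based on a reinforcement learning model, aiming to improve the adaptability of signal light control and thereby achieve better robustness.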
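To achieve the above purpose, this application proposes an information control method based on a reinforcement learning model, including the following steps:
acquiring the current time and an image of the intersection where the signal light is located;
judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model;
if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image;
inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display;
controlling the signal light according to the predicted action.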
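This application provides an information control device based on a reinforcement learning model, including:
an image acquisition unit, configured to acquire the current time and an image of the intersection where the signal light is located;
a usage condition judging unit, configured to judge whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model;
a state feature extraction unit, configured to extract specified state features from the intersection image if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model;
a predicted action acquisition unit, configured to input the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display;
a signal light control unit, configured to control the signal light according to the predicted action.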
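This application provides a computer device, including a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an information control method based on a reinforcement learning model, including the following steps:
acquiring the current time and an image of the intersection where the signal light is located;
judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model;
if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image;
inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display;
controlling the signal light according to the predicted action.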
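This application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements an information control method based on a reinforcement learning model, including the following steps:
acquiring the current time and an image of the intersection where the signal light is located;
judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model;
if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image;
inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display;
controlling the signal light according to the predicted action.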
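With the information control method, device, computer device, and storage medium based on a reinforcement learning model of this application, the current time and an image of the intersection where the signal light is located are acquired; if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, specified state features are extracted from the intersection image; the specified state features are input into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and the signal light is controlled according to the predicted action. The control of the signal light is thereby applicable to more traffic conditions and is more robust.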
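FIG. 1 is a schematic flowchart of an information control method based on a reinforcement learning model according to an embodiment of this application;
FIG. 2 is a schematic structural block diagram of an information control device based on a reinforcement learning model according to an embodiment of this application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
Best Mode of the Application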
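Referring to FIG. 1, an embodiment of this application provides an information control method based on a reinforcement learning model, including the following steps:
S1. acquiring the current time and an image of the intersection where the signal light is located;
S2. judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model;
S3. if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image;
S4. inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display;
S5. controlling the signal light according to the predicted action.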
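Steps S2 to S5 can be read as one control cycle that runs on the time and image acquired in S1. The following is a minimal sketch of that cycle, not the applicant's implementation; the function name `control_step` and the four injected callables are hypothetical placeholders for the condition check, feature extractor, trained model, and signal actuator.

```python
from datetime import datetime
from typing import Any, Callable, Optional


def control_step(
    now: datetime,
    image: Any,
    meets_usage_conditions: Callable[[datetime, Any], bool],
    extract_state: Callable[[Any], Any],
    predict_action: Callable[[Any], str],
    apply_action: Callable[[str], None],
) -> Optional[str]:
    """Run S2-S5 once for a time and image already acquired in S1.

    All four callables are hypothetical stand-ins supplied by the caller.
    Returns the applied action, or None when the model is not used.
    """
    # S2: skip the model when the usage conditions are not met.
    if not meets_usage_conditions(now, image):
        return None
    # S3: turn the intersection image into the specified state features.
    state = extract_state(image)
    # S4: ask the trained model for the predicted action.
    action = predict_action(state)
    # S5: drive the signal light with that action.
    apply_action(action)
    return action
```

Passing the stages in as callables keeps the sketch independent of any particular camera, model, or signal controller.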
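The information control method based on a reinforcement learning model of this application targets the control of a single signal light, that is, it controls a single signal light at a given intersection; the signal light referred to throughout this application is therefore the same signal light.
As described in step S1 above, the current time and an image of the intersection where the signal light is located are acquired. The signal light is a red-green light, and may also be a red-yellow-green light. The intersection image may be a single image that reflects the traffic conditions of the whole intersection, or multiple images that each reflect part of the traffic conditions of the intersection (for example, an image of one lane of the intersection) and together reflect the traffic conditions of the whole intersection. Accordingly, the intersection image may be acquired by one image acquisition device or by multiple image acquisition devices. The signal light is used to direct traffic at its intersection, and therefore generally includes multiple signal lights installed at that intersection.
As described in step S2 above, it is judged whether the current time and the intersection image meet the usage conditions of the preset signal light action prediction model. The reinforcement learning model is intended to improve traffic conditions. If the traffic conditions at the intersection do not need improvement (for example, at midnight, when there are few vehicles and no possibility of congestion), or can no longer be improved (for example, when a lane is blocked by a car accident and traffic police are needed to direct traffic), the signal control method using the reinforcement learning model is unnecessary. The specific judgment process is, for example: judging whether the current time falls within the usage period of the preset signal light action prediction model; if so, analyzing the intersection image to judge whether any lane of the intersection contains a vehicle whose driving wheels are off the ground; and if no lane of the intersection contains such a vehicle, determining that the current time and the intersection image meet the usage conditions of the preset signal light action prediction model.
As described in step S3 above, if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, specified state features are extracted from the intersection image. When the usage conditions are met, the intersection image serves as the basis on which the information control method of this application decides the corresponding action. The specified state features are extracted, for example, as follows: according to a preset image feature acquisition method, extracting specified image features from the intersection image, where the specified image features include at least region screenshots of multiple lanes; analyzing the intersection image to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane; obtaining the current phase of each signal light that directs a lane at the intersection; and recording the specified image features, the specified numeric features, and the current phases as the specified state features. Further, the specified numeric features may also be obtained with preset sensors (for example, infrared sensors or laser sensors), where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane.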
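To make the three-part state concrete, the sketch below bundles lane crops, per-lane numeric features, and current phases in one container and flattens the numeric part. The class names, field names, and the one-hot phase encoding are illustrative assumptions; the text does not prescribe any particular encoding.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class LaneStats:
    vehicle_count: int   # number of vehicles in the lane
    queue_length: float  # queue length of the lane
    occupancy: float     # occupancy rate of the lane


@dataclass
class StateFeatures:
    lane_crops: List[Any]        # specified image features: one crop per lane
    lane_stats: List[LaneStats]  # specified numeric features
    phases: List[str]            # current phase of each signal light


def numeric_vector(state: StateFeatures, phase_names: Sequence[str]) -> List[float]:
    """Flatten numeric features and one-hot phases (assumed encoding)."""
    vec: List[float] = []
    for stats in state.lane_stats:
        vec += [float(stats.vehicle_count), stats.queue_length, stats.occupancy]
    for phase in state.phases:
        vec += [1.0 if phase == name else 0.0 for name in phase_names]
    return vec
```

The image crops are kept separate from the flattened vector because the description processes image features with convolutional layers.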
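As described in step S4 above, the specified state features are input into the signal light action prediction model to obtain the predicted action output by the model. Unlike an ordinary reinforcement learning model, the model of this application is trained on sample data having a specified data structure: the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display. The signal light phase depends on the signal light: for a two-color red-green light, the phase is red or green; for a red-yellow-green light, the phase is red, yellow, or green; a signal light with more colors has more phases. As a result, low-frequency phase-decision combinations are not neglected when the model is trained, and traffic can still be directed effectively in unusual intersection states (that is, the states corresponding to low-frequency phase-decision combinations). The predicted action is obtained, for example, as follows: inputting the specified state features into the signal light action prediction model and processing them with the hidden layers, so as to obtain the hidden values, output by the last hidden layer, that correspond to multiple initial predicted actions; using the formula:
calculating the predicted probability values, where y(action_i) is the predicted probability value corresponding to the i-th initial predicted action, action_i is the hidden value corresponding to the i-th initial predicted action, and there are Na initial predicted actions in total; and obtaining the specified predicted probability value, which is the largest of the predicted probability values, recording the initial predicted action corresponding to it as the final predicted action, and outputting the final predicted action.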
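As described in step S5 above, the signal light is controlled according to the predicted action. The predicted action is, for example, whether to switch the signal light, how to switch it, or how long to hold the current phase if it is not switched. Because the predicted action is regarded as the one that directs traffic most effectively, controlling the signal light according to it achieves information control based on the reinforcement learning model.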
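In one embodiment, step S2 of judging whether the current time and the intersection image meet the usage conditions of the preset signal light action prediction model includes:
S201. judging whether the current time falls within the usage period of the preset signal light action prediction model;
S202. if the current time falls within the usage period of the preset signal light action prediction model, analyzing the intersection image to judge whether any lane of the intersection contains a vehicle whose driving wheels are off the ground;
S203. if no lane of the intersection contains a vehicle whose driving wheels are off the ground, determining that the current time and the intersection image meet the usage conditions of the preset signal light action prediction model.
As described above, it is judged whether the current time and the intersection image meet the usage conditions of the preset signal light action prediction model. The purpose is to decide whether to use the reinforcement-learning-based information control model by judging whether it is a busy traffic period and whether a car accident has occurred. Specifically, if the current time falls within the usage period of the preset signal light action prediction model and no lane of the intersection contains a vehicle whose driving wheels are off the ground, the usage conditions are met. Further, the judgment may also include: analyzing the intersection image to judge whether any lane of the intersection contains a red region whose area exceeds a preset area; if so, judging whether the shape of the red region is irregular; and if the shape of the red region is irregular, concluding that a car accident has occurred and determining that the usage conditions of the preset signal light action prediction model are not met. The red region represents a region of blood. Because ordinary car accidents rarely produce a large region of blood, the presence of a large region of blood is judged to be a major traffic accident, which further confirms that the usage conditions of the preset signal light action prediction model are not met.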
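The usage-condition check combines a time window with per-lane accident cues. The sketch below assumes an upstream image-analysis step has already produced the per-lane flags; `LaneObservation`, its fields, and the `max_red_area` threshold are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Tuple


@dataclass
class LaneObservation:
    # Hypothetical outputs of an upstream image-analysis step.
    has_vehicle_with_wheels_off_ground: bool
    largest_irregular_red_area: float  # 0.0 when no irregular red region was found


def meets_usage_conditions(
    now: time,
    usage_periods: List[Tuple[time, time]],
    lanes: List[LaneObservation],
    max_red_area: float,
) -> bool:
    """Return True only inside a usage period and with no accident cue in any lane."""
    # S201: the current time must fall within one of the preset usage periods.
    if not any(start <= now <= end for start, end in usage_periods):
        return False
    for lane in lanes:
        # S202/S203: a vehicle with its driving wheels off the ground signals an accident.
        if lane.has_vehicle_with_wheels_off_ground:
            return False
        # Optional further check: a large irregular red region is treated as an accident.
        if lane.largest_irregular_red_area > max_red_area:
            return False
    return True
```

The periods are compared within a single day; a usage period that spans midnight would need to be split into two entries.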
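In one embodiment, step S3 of extracting specified state features from the intersection image includes:
S301. according to a preset image feature acquisition method, extracting specified image features from the image of the intersection where the signal light is located, where the specified image features include at least region screenshots of multiple lanes;
S302. analyzing the intersection image to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane;
S303. obtaining the current phase of each signal light that directs a lane at the intersection;
S304. recording the specified image features, the specified numeric features, and the current phases as the specified state features.
As described above, the specified state features are extracted from the intersection image. This application takes the specified image features, the specified numeric features, and the current phase of each signal light as the specified state features, which form the basis of the model's computation. Separating image features from numeric features lets the subsequent signal light action prediction model treat each kind of input in a more targeted way and produce more accurate results. Further, the specified image features may be any image features, for example a screenshot of a specified region, the intersection image after grayscale conversion, or sub-images split out to reflect the traffic states of multiple lanes. Convolutional layers may subsequently be used to process the specified image features. The specified state features extracted in this way effectively reflect the current traffic state. The number of vehicles in each lane and the queue length of each lane can be obtained by applying image recognition to the intersection image. The lane occupancy rate can be obtained in any way, for example with the formula: lane occupancy rate = [(occupancy time of the first vehicle × length of the first vehicle) + (occupancy time of the second vehicle × length of the second vehicle) + ... + (occupancy time of the last vehicle × length of the last vehicle)] / (lane length × total time). The occupancy time of a vehicle (the difference between the time it enters the lane and the time it leaves the lane, that is, the time difference between the two corresponding images) and the length of the vehicle can both be obtained by analyzing the intersection image.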
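The lane occupancy formula is a sum of time-length products divided by the lane's length-time capacity. A minimal sketch follows; the function name and the `(occupancy_time, vehicle_length)` tuple layout are assumptions, and units only need to be consistent (for example seconds and meters).

```python
from typing import Iterable, Tuple


def lane_occupancy(
    vehicles: Iterable[Tuple[float, float]],
    lane_length: float,
    total_time: float,
) -> float:
    """Occupancy = sum(occupancy_time * vehicle_length) / (lane_length * total_time).

    Each item in `vehicles` is (occupancy_time, vehicle_length).
    """
    if lane_length <= 0 or total_time <= 0:
        raise ValueError("lane_length and total_time must be positive")
    occupied = sum(t * length for t, length in vehicles)
    return occupied / (lane_length * total_time)
```

For instance, two vehicles of 4 m and 5 m that stay 10 s and 20 s in a 100 m lane observed for 60 s give (10·4 + 20·5) / (100·60) = 140/6000.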
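In one embodiment, before step S4 of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model (wherein the model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display), the method includes:
S31. acquiring multiple historical data of the signal light, each historical datum including the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time;
S32. generating multiple data blocks P11, P12, …, Pik, …, Pmn, where data block P11 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 1, data block P12 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 2, data block Pik stores historical data whose historical phase is numbered i and whose historical action is numbered k, and data block Pmn stores historical data whose historical phase is numbered m and whose historical action is numbered n, wherein the historical phases have m numbers in total, the historical actions have n numbers in total, i is a positive integer less than m, and k is a positive integer less than n;
S33. forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn.
As described above, sample data having the specified data structure are constructed. To prevent low-frequency phase-decision pairs (a decision being a historical action) from being neglected, this application builds the data blocks P11, P12, …, Pik, …, Pmn and forms the specified data structure from them. Because the data in one block share the same historical phase number and the same historical action number, low-frequency phase-decision pairs are labeled explicitly and treated on an equal footing with the other data blocks. During training, the data blocks make it possible to select training samples so that the share of the training set taken by samples of low-frequency phase-decision pairs is close or equal to the share taken by the other phase-decision pairs. The trained signal light action prediction model can then also handle the traffic states that correspond to low-frequency phase-decision pairs and direct traffic effectively.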
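The block structure amounts to a replay store partitioned by (phase number, action number). A minimal sketch, assuming a hypothetical `Transition` record with the five fields listed in S31 and a plain dictionary as the container:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple


class Transition(NamedTuple):
    phase: int       # historical phase number, 1..m
    action: int      # historical action number, 1..n
    state: Any       # historical state
    reward: float    # historical reward
    next_state: Any  # next historical state


def build_blocks(history: Iterable[Transition]) -> Dict[Tuple[int, int], List[Transition]]:
    """Group transitions so that block (i, k) plays the role of data block Pik."""
    blocks: Dict[Tuple[int, int], List[Transition]] = defaultdict(list)
    for transition in history:
        blocks[(transition.phase, transition.action)].append(transition)
    return dict(blocks)
```

Keying by the pair keeps a rare combination such as (2, 1) visible as its own block even when it holds only a handful of records.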
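In one embodiment, after step S33 of forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn, the method includes:
S331. according to a preset sample extraction rule, extracting a specified number of training historical data from each of the data blocks P11, P12, …, Pik, …, Pmn;
S332. according to a preset multi-round training sequence, training the signal light action prediction model based on the reinforcement learning model with the training historical data, and updating the network parameters of the model by gradient descent, wherein the number of rounds in the multi-round training sequence equals the specified number, and the training historical data used in each round come from different data blocks.
As described above, the signal light action prediction model is trained. The network parameters of the model include, for example, the network parameters of the decision network and the evaluation network (the model includes a decision network and an evaluation network). They are updated, for example, by minimizing a preset loss function and using backpropagation, where the formula of the loss function is as follows (the model includes a decision network and an evaluation network, the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, and the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters):
where Loss is the loss function, there are N decision moments in total, t denotes the t-th decision moment, Q is the expected value output by the evaluation network, S_t is the state feature of the intersection where the signal light is located at the t-th decision moment, a_t is the output of the first prediction network at the t-th decision moment, ω is the network parameter of the first target network, ω⁻ is the network parameter of the second target network, R_{t+1} is the negative of the sum of the squares of the queue lengths of the lanes at the intersection at the (t+1)-th decision moment, γ is a preset parameter, π is the output of the first target network, and θ⁻ is the network parameter of the first target network. Correspondingly, the updated network parameters include the network parameter ω of the first target network mentioned above, among others. By extracting the specified number of training historical data from each of the data blocks P11, P12, …, Pik, …, Pmn, this application makes every data block contribute the same amount of training sample data, namely the specified number, which ensures that the trained signal light action prediction model applies to all traffic conditions. The multi-round training sequence makes the training process more even, which further ensures that the trained model applies to all traffic conditions. The training historical data used in each round come from different data blocks; that is, every round uses sample data from the first data block, …, through to sample data from the last data block. The preset sample extraction rule may be any rule, as long as the number extracted equals the specified number, for example extracting the specified number of top-ranked items among the odd-numbered data.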
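One way to read the balanced extraction and the round structure is: draw the same number of samples from every block, then let round r consist of the r-th sample of each block. The sketch below follows that reading; uniform random sampling stands in for the "preset sample extraction rule", and both function names are hypothetical.

```python
import random
from typing import Any, Dict, Hashable, List


def sample_per_block(
    blocks: Dict[Hashable, List[Any]], count: int, rng: random.Random
) -> Dict[Hashable, List[Any]]:
    """Draw `count` items from every block (assumed rule: uniform without replacement).

    Raises ValueError if any block holds fewer than `count` items.
    """
    return {key: rng.sample(items, count) for key, items in blocks.items()}


def training_rounds(sampled: Dict[Hashable, List[Any]], count: int) -> List[List[Any]]:
    """Build `count` rounds; round r takes the r-th sample of each block."""
    keys = sorted(sampled)
    return [[sampled[key][r] for key in keys] for r in range(count)]
```

With this layout every round sees exactly one sample per (phase, action) block, so rare combinations carry the same weight as frequent ones.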
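In one embodiment, the signal light action prediction model includes a decision network, the decision network includes multiple hidden layers, and step S4 of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model includes:
S401. inputting the specified state features into the signal light action prediction model and processing them with the hidden layers, so as to obtain the hidden values, output by the last hidden layer, that correspond to multiple initial predicted actions;
S403. obtaining the specified predicted probability value, which is the largest of the predicted probability values, recording the initial predicted action corresponding to it as the final predicted action, and outputting the final predicted action.
As described above, the specified state features are input into the signal light action prediction model to obtain the predicted action output by the model. This application uses multiple hidden layers to obtain the hidden values corresponding to multiple initial predicted actions and computes the corresponding predicted probability values from them. A predicted probability value reflects how well the corresponding initial predicted action fits the current traffic conditions, so the largest of the predicted probability values is taken as the specified predicted probability value, the initial predicted action corresponding to it is recorded as the final predicted action, and the final predicted action is output. For example, if the predicted probability of not switching the signal light is 80% while the predicted probabilities of all other actions sum to only 20%, the predicted action of not switching the signal light is output.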
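The probability formula is not reproduced in this text. Mapping Na hidden values to probabilities that sum to one is most commonly done with a softmax, y(action_i) = exp(action_i) / Σ_j exp(action_j); the sketch below assumes that form, which may differ from the applicant's formula, and then picks the most probable action.

```python
import math
from typing import List, Sequence, Tuple


def softmax(hidden_values: Sequence[float]) -> List[float]:
    """Assumed form of the probability formula: exp(x_i) / sum_j exp(x_j)."""
    peak = max(hidden_values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in hidden_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(actions: Sequence[str], hidden_values: Sequence[float]) -> Tuple[str, float]:
    """Return the action with the largest predicted probability and that probability."""
    probs = softmax(hidden_values)
    best = max(range(len(probs)), key=probs.__getitem__)
    return actions[best], probs[best]
```

Because softmax preserves ordering, the selected action is also the one with the largest hidden value; the probabilities matter when a confidence figure such as the 80% in the example above is needed.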
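In one embodiment, the signal light action prediction model includes a decision network and an evaluation network, the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters, and after step S5 of controlling the signal light according to the predicted action, the method includes:
S51. updating the network parameters in the signal light action prediction model by minimizing a preset loss function and using backpropagation, wherein the formula of the loss function is:
where Loss is the loss function, there are N decision moments in total, t denotes the t-th decision moment, Q is the expected value output by the evaluation network, S_t is the state feature of the intersection where the signal light is located at the t-th decision moment, a_t is the output of the first prediction network at the t-th decision moment, ω is the network parameter of the first target network, ω⁻ is the network parameter of the second target network, R_{t+1} is the negative of the sum of the squares of the queue lengths of the lanes at the intersection at the (t+1)-th decision moment, γ is a preset parameter, π is the output of the first target network, and θ⁻ is the network parameter of the first target network.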
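As described above, the network parameters in the signal light action prediction model are updated by minimizing a preset loss function and using backpropagation. The model used in this application includes a decision network and an evaluation network; the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, and the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters. The first prediction network predicts and outputs the predicted action to meet the needs of traffic control. Because reinforcement learning is in essence trial and error, other means are needed to provide feedback and correction. This application uses the first target network, the second prediction network, and the second target network to feed back and correct the network parameters, specifically by minimizing the preset loss function and updating by backpropagation, where the formula of the loss function is: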
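The loss formula itself is not reproduced in this text. The listed symbols (a reward R_{t+1}, a discount-like parameter γ, target-network outputs π and Q with parameters θ⁻ and ω⁻) resemble the critic loss of an actor-critic method with target networks, Loss = (1/N) Σ_t [R_{t+1} + γ·Q(S_{t+1}, π(S_{t+1}; θ⁻); ω⁻) − Q(S_t, a_t; ω)]². The sketch below assumes that form and the usual network roles, which may differ from the roles the text assigns to ω; all function names are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple


def queue_reward(queue_lengths: Sequence[float]) -> float:
    """R: the negative of the sum of squared per-lane queue lengths (as defined in the text)."""
    return -sum(q * q for q in queue_lengths)


def critic_loss(
    batch: Sequence[Tuple[Any, Any, float, Any]],
    q: Callable[[Any, Any], float],         # evaluation (prediction) network, assumed Q(.; omega)
    q_target: Callable[[Any, Any], float],  # evaluation target network, assumed Q(.; omega^-)
    pi_target: Callable[[Any], Any],        # decision target network, assumed pi(.; theta^-)
    gamma: float,
) -> float:
    """Mean squared TD error over (S_t, a_t, R_{t+1}, S_{t+1}) tuples (assumed loss form)."""
    total = 0.0
    for state, action, reward_next, next_state in batch:
        target = reward_next + gamma * q_target(next_state, pi_target(next_state))
        total += (target - q(state, action)) ** 2
    return total / len(batch)
```

Squaring the queue lengths in the reward penalizes one long queue more than several short ones, which pushes the policy toward balancing lanes rather than only minimizing the total number of waiting vehicles.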
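Referring to FIG. 2, an embodiment of this application provides an information control device based on a reinforcement learning model, including:
an image acquisition unit 10, configured to acquire the current time and an image of the intersection where the signal light is located;
a usage condition judging unit 20, configured to judge whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model;
a state feature extraction unit 30, configured to extract specified state features from the intersection image if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model;
a predicted action acquisition unit 40, configured to input the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display;
a signal light control unit 50, configured to control the signal light according to the predicted action.
The operations performed by the above units correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.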
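In one embodiment, the usage condition judging unit 20 includes:
a usage period judging subunit, configured to judge whether the current time falls within the usage period of the preset signal light action prediction model;
a vehicle judging subunit, configured to analyze the image of the intersection where the signal light is located if the current time falls within the usage period of the preset signal light action prediction model, so as to judge whether any lane of the intersection contains a vehicle whose driving wheels are off the ground;
a usage condition determination subunit, configured to determine that the current time and the intersection image meet the usage conditions of the preset signal light action prediction model if no lane of the intersection contains a vehicle whose driving wheels are off the ground.
The operations performed by the above subunits correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.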
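In one embodiment, the state feature extraction unit 30 includes:
a specified image feature acquisition subunit, configured to extract specified image features from the image of the intersection where the signal light is located according to a preset image feature acquisition method, where the specified image features include at least region screenshots of multiple lanes;
a specified numeric feature acquisition subunit, configured to analyze the intersection image to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane;
a current phase acquisition subunit, configured to obtain the current phase of each signal light that directs a lane at the intersection;
a specified state feature acquisition subunit, configured to record the specified image features, the specified numeric features, and the current phases as the specified state features.
The operations performed by the above subunits correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.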
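In one embodiment, the device includes:
a historical data acquisition unit, configured to acquire multiple historical data of the signal light, each historical datum including the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time;
a data block generation unit, configured to generate multiple data blocks P11, P12, …, Pik, …, Pmn, where data block P11 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 1, data block P12 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 2, data block Pik stores historical data whose historical phase is numbered i and whose historical action is numbered k, and data block Pmn stores historical data whose historical phase is numbered m and whose historical action is numbered n, wherein the historical phases have m numbers in total, the historical actions have n numbers in total, i is a positive integer less than m, and k is a positive integer less than n;
a specified data structure forming unit, configured to form the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn.
The operations performed by the above units correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.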
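In one embodiment, the device includes:
a training historical data extraction unit, configured to extract a specified number of training historical data from each of the data blocks P11, P12, …, Pik, …, Pmn according to a preset sample extraction rule;
a multi-round training unit, configured to train the signal light action prediction model based on the reinforcement learning model with the training historical data according to a preset multi-round training sequence, and to update the network parameters of the model by gradient descent, wherein the number of rounds in the multi-round training sequence equals the specified number, and the training historical data used in each round come from different data blocks.
The operations performed by the above units correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.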
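In one embodiment, the signal light action prediction model includes a decision network, the decision network includes multiple hidden layers, and the predicted action acquisition unit 40 includes:
a hidden value acquisition subunit, configured to input the specified state features into the signal light action prediction model and process them with the hidden layers, so as to obtain the hidden values, output by the last hidden layer, that correspond to multiple initial predicted actions;
a predicted probability value acquisition subunit, configured to use the formula:
to calculate the predicted probability values, where y(action_i) is the predicted probability value corresponding to the i-th initial predicted action, action_i is the hidden value corresponding to the i-th initial predicted action, and there are Na initial predicted actions in total;
a final predicted action output subunit, configured to obtain the specified predicted probability value, which is the largest of the predicted probability values, record the initial predicted action corresponding to it as the final predicted action, and output the final predicted action.
The operations performed by the above subunits correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.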
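In one embodiment, the signal light action prediction model includes a decision network and an evaluation network, the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters, and the device includes:
a network parameter update unit, configured to update the network parameters in the signal light action prediction model by minimizing a preset loss function and using backpropagation, wherein the formula of the loss function is:
where Loss is the loss function, there are N decision moments in total, t denotes the t-th decision moment, Q is the expected value output by the evaluation network, S_t is the state feature of the intersection where the signal light is located at the t-th decision moment, a_t is the output of the first prediction network at the t-th decision moment, ω is the network parameter of the first target network, ω⁻ is the network parameter of the second target network, R_{t+1} is the negative of the sum of the squares of the queue lengths of the lanes at the intersection at the (t+1)-th decision moment, γ is a preset parameter, π is the output of the first target network, and θ⁻ is the network parameter of the first target network.
The operations performed by the above unit correspond one-to-one to the steps of the information control method based on a reinforcement learning model of the foregoing embodiments, and are not repeated here.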
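Referring to FIG. 3, an embodiment of this application further provides a computer device. The computer device may be a server, and its internal structure may be as shown in the figure. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data used by the information control method based on a reinforcement learning model. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program implements an information control method based on a reinforcement learning model.
The above processor executes the above information control method based on a reinforcement learning model, wherein the steps of the method correspond one-to-one to the steps of the information control method of the foregoing embodiments, and are not repeated here. The information control method includes: acquiring the current time and an image of the intersection where the signal light is located; judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model; if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and controlling the signal light according to the predicted action.
An embodiment of this application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements an information control method based on a reinforcement learning model. The storage medium is a volatile storage medium or a non-volatile storage medium, and the steps of the method correspond one-to-one to the steps of the information control method of the foregoing embodiments, and are not repeated here. The information control method includes: acquiring the current time and an image of the intersection where the signal light is located; judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model; if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and controlling the signal light according to the predicted action.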
Claims (20)
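- An information control method based on a reinforcement learning model, comprising: acquiring the current time and an image of the intersection where the signal light is located; judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model; if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and controlling the signal light according to the predicted action.
- The information control method based on a reinforcement learning model according to claim 1, wherein the step of judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model comprises: judging whether the current time falls within the usage period of the preset signal light action prediction model; if the current time falls within the usage period, analyzing the intersection image to judge whether any lane of the intersection contains a vehicle whose driving wheels are off the ground; and if no lane of the intersection contains a vehicle whose driving wheels are off the ground, determining that the current time and the intersection image meet the usage conditions of the preset signal light action prediction model.
- The information control method based on a reinforcement learning model according to claim 1, wherein the step of extracting specified state features from the image of the intersection where the signal light is located comprises: according to a preset image feature acquisition method, extracting specified image features from the intersection image, where the specified image features include at least region screenshots of multiple lanes; analyzing the intersection image to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane; obtaining the current phase of each signal light that directs a lane at the intersection; and recording the specified image features, the specified numeric features, and the current phases as the specified state features.
- The information control method based on a reinforcement learning model according to claim 1, wherein before the step of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model (wherein the model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display), the method comprises: acquiring multiple historical data of the signal light, each historical datum including the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time; generating multiple data blocks P11, P12, …, Pik, …, Pmn, where data block P11 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 1, data block P12 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 2, data block Pik stores historical data whose historical phase is numbered i and whose historical action is numbered k, and data block Pmn stores historical data whose historical phase is numbered m and whose historical action is numbered n, wherein the historical phases have m numbers in total, the historical actions have n numbers in total, i is a positive integer less than m, and k is a positive integer less than n; and forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn.
- The information control method based on a reinforcement learning model according to claim 4, wherein after the step of forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn, the method comprises: according to a preset sample extraction rule, extracting a specified number of training historical data from each of the data blocks P11, P12, …, Pik, …, Pmn; and according to a preset multi-round training sequence, training the signal light action prediction model based on the reinforcement learning model with the training historical data, and updating the network parameters of the model by gradient descent, wherein the number of rounds in the multi-round training sequence equals the specified number, and the training historical data used in each round come from different data blocks.
- The information control method based on a reinforcement learning model according to claim 1, wherein the signal light action prediction model includes a decision network, the decision network includes multiple hidden layers, and the step of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model comprises: inputting the specified state features into the signal light action prediction model and processing them with the hidden layers, so as to obtain the hidden values, output by the last hidden layer, that correspond to multiple initial predicted actions; and obtaining the specified predicted probability value, which is the largest of the predicted probability values, recording the initial predicted action corresponding to it as the final predicted action, and outputting the final predicted action.
- The information control method based on a reinforcement learning model according to claim 1, wherein the signal light action prediction model includes a decision network and an evaluation network, the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters, and after the step of controlling the signal light according to the predicted action, the method comprises: updating the network parameters in the signal light action prediction model by minimizing a preset loss function and using backpropagation, wherein the formula of the loss function is: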
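- An information control device based on a reinforcement learning model, comprising: an image acquisition unit, configured to acquire the current time and an image of the intersection where the signal light is located; a usage condition judging unit, configured to judge whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model; a state feature extraction unit, configured to extract specified state features from the intersection image if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model; a predicted action acquisition unit, configured to input the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and a signal light control unit, configured to control the signal light according to the predicted action.
- A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an information control method based on a reinforcement learning model, the method comprising: acquiring the current time and an image of the intersection where the signal light is located; judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model; if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and controlling the signal light according to the predicted action.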
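- The computer device according to claim 9, wherein the step of judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model comprises: judging whether the current time falls within the usage period of the preset signal light action prediction model; if the current time falls within the usage period, analyzing the intersection image to judge whether any lane of the intersection contains a vehicle whose driving wheels are off the ground; and if no lane of the intersection contains a vehicle whose driving wheels are off the ground, determining that the current time and the intersection image meet the usage conditions of the preset signal light action prediction model.
- The computer device according to claim 9, wherein the step of extracting specified state features from the image of the intersection where the signal light is located comprises: according to a preset image feature acquisition method, extracting specified image features from the intersection image, where the specified image features include at least region screenshots of multiple lanes; analyzing the intersection image to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane; obtaining the current phase of each signal light that directs a lane at the intersection; and recording the specified image features, the specified numeric features, and the current phases as the specified state features.
- The computer device according to claim 9, wherein before the step of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model (wherein the model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display), the method comprises: acquiring multiple historical data of the signal light, each historical datum including the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time; generating multiple data blocks P11, P12, …, Pik, …, Pmn, where data block P11 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 1, data block P12 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 2, data block Pik stores historical data whose historical phase is numbered i and whose historical action is numbered k, and data block Pmn stores historical data whose historical phase is numbered m and whose historical action is numbered n, wherein the historical phases have m numbers in total, the historical actions have n numbers in total, i is a positive integer less than m, and k is a positive integer less than n; and forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn.
- The computer device according to claim 12, wherein after the step of forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn, the method comprises: according to a preset sample extraction rule, extracting a specified number of training historical data from each of the data blocks P11, P12, …, Pik, …, Pmn; and according to a preset multi-round training sequence, training the signal light action prediction model based on the reinforcement learning model with the training historical data, and updating the network parameters of the model by gradient descent, wherein the number of rounds in the multi-round training sequence equals the specified number, and the training historical data used in each round come from different data blocks.
- The computer device according to claim 9, wherein the signal light action prediction model includes a decision network, the decision network includes multiple hidden layers, and the step of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model comprises: inputting the specified state features into the signal light action prediction model and processing them with the hidden layers, so as to obtain the hidden values, output by the last hidden layer, that correspond to multiple initial predicted actions; and obtaining the specified predicted probability value, which is the largest of the predicted probability values, recording the initial predicted action corresponding to it as the final predicted action, and outputting the final predicted action.
- The computer device according to claim 9, wherein the signal light action prediction model includes a decision network and an evaluation network, the decision network includes a first prediction network and a first target network that have the same network structure but different network parameters, the evaluation network includes a second prediction network and a second target network that have the same network structure but different network parameters, and after the step of controlling the signal light according to the predicted action, the method comprises: updating the network parameters in the signal light action prediction model by minimizing a preset loss function and using backpropagation, wherein the formula of the loss function is: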
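- A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements an information control method based on a reinforcement learning model, the method comprising: acquiring the current time and an image of the intersection where the signal light is located; judging whether the current time and the intersection image meet the usage conditions of a preset signal light action prediction model; if the current time and the intersection image meet the usage conditions of the preset signal light action prediction model, extracting specified state features from the intersection image; inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model, wherein the signal light action prediction model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display; and controlling the signal light according to the predicted action.
- The computer-readable storage medium according to claim 16, wherein the step of judging whether the current time and the image of the intersection where the signal light is located meet the usage conditions of the preset signal light action prediction model comprises: judging whether the current time falls within the usage period of the preset signal light action prediction model; if the current time falls within the usage period, analyzing the intersection image to judge whether any lane of the intersection contains a vehicle whose driving wheels are off the ground; and if no lane of the intersection contains a vehicle whose driving wheels are off the ground, determining that the current time and the intersection image meet the usage conditions of the preset signal light action prediction model.
- The computer-readable storage medium according to claim 16, wherein the step of extracting specified state features from the image of the intersection where the signal light is located comprises: according to a preset image feature acquisition method, extracting specified image features from the intersection image, where the specified image features include at least region screenshots of multiple lanes; analyzing the intersection image to obtain specified numeric features, where the specified numeric features include at least the number of vehicles in each lane, the queue length of each lane, and the occupancy rate of each lane; obtaining the current phase of each signal light that directs a lane at the intersection; and recording the specified image features, the specified numeric features, and the current phases as the specified state features.
- The computer-readable storage medium according to claim 16, wherein before the step of inputting the specified state features into the signal light action prediction model to obtain the predicted action output by the model (wherein the model is based on a reinforcement learning model and is trained on sample data having a specified data structure, the specified data structure consists of multiple data blocks, sample data having the same signal light phase and the same predicted action are stored in the same data block, and the signal light phase refers to the kinds of color that the signal light can display), the method comprises: acquiring multiple historical data of the signal light, each historical datum including the historical phase, historical action, historical state, historical reward, and next historical state of the signal light at the same time; generating multiple data blocks P11, P12, …, Pik, …, Pmn, where data block P11 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 1, data block P12 stores historical data whose historical phase is numbered 1 and whose historical action is numbered 2, data block Pik stores historical data whose historical phase is numbered i and whose historical action is numbered k, and data block Pmn stores historical data whose historical phase is numbered m and whose historical action is numbered n, wherein the historical phases have m numbers in total, the historical actions have n numbers in total, i is a positive integer less than m, and k is a positive integer less than n; and forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn.
- The computer-readable storage medium according to claim 19, wherein after the step of forming the specified data structure from the data blocks P11, P12, …, Pik, …, Pmn, the method comprises: according to a preset sample extraction rule, extracting a specified number of training historical data from each of the data blocks P11, P12, …, Pik, …, Pmn; and according to a preset multi-round training sequence, training the signal light action prediction model based on the reinforcement learning model with the training historical data, and updating the network parameters of the model by gradient descent, wherein the number of rounds in the multi-round training sequence equals the specified number, and the training historical data used in each round come from different data blocks.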
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910882718.0A CN110738860B (zh) | 2019-09-18 | 2019-09-18 | 基于强化学习模型的信息控制方法、装置和计算机设备 |
CN201910882718.0 | 2019-09-18 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021051870A1 true WO2021051870A1 (zh) | 2021-03-25 |
Family
ID=69268192
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/093432 WO2021051870A1 (zh) | 2019-09-18 | 2020-05-29 | 基于强化学习模型的信息控制方法、装置和计算机设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110738860B (zh) |
WO (1) | WO2021051870A1 (zh) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113643528A (zh) * | 2021-07-01 | 2021-11-12 | 腾讯科技(深圳)有限公司 | 信号灯控制方法、模型训练方法、系统、装置及存储介质 |
CN113763723A (zh) * | 2021-09-06 | 2021-12-07 | 武汉理工大学 | 基于强化学习与动态配时的交通信号灯控制系统及方法 |
CN114548298A (zh) * | 2022-02-25 | 2022-05-27 | 阿波罗智联(北京)科技有限公司 | 模型训练、交通信息处理方法、装置、设备和存储介质 |
CN114639255A (zh) * | 2022-03-28 | 2022-06-17 | 浙江大华技术股份有限公司 | 一种交通信号控制方法、装置、设备和介质 |
US20230098014A1 (en) * | 2021-09-24 | 2023-03-30 | Autonomous A2Z | Method for Predicting Traffic Light Information by Using Lidar and Server Using the Same |
CN117746303A (zh) * | 2024-02-20 | 2024-03-22 | 山东大学 | 一种基于感知相关性网络的零样本视觉导航方法及系统 |
CN118153175A (zh) * | 2024-05-09 | 2024-06-07 | 中建科工集团绿色科技有限公司 | 基于强化学习的模块单元拆分方法、系统、设备及介质 |
CN118609386A (zh) * | 2024-08-08 | 2024-09-06 | 香港科技大学(广州) | 基于强化学习的交通信号控制方法、装置、设备、介质及产品 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
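CN110738860B (zh) * | 2019-09-18 | 2021-11-23 | 平安科技(深圳)有限公司 | Reinforcement learning model-based information control method and apparatus, and computer device |
CN111696370B (zh) * | 2020-06-16 | 2021-09-03 | 西安电子科技大学 | Traffic light control method based on a heuristic deep Q-network |
CN111753855B (zh) * | 2020-07-30 | 2021-06-08 | 腾讯科技(深圳)有限公司 | Data processing method, apparatus, device, and medium |
CN112863206B (zh) * | 2021-01-07 | 2022-08-09 | 北京大学 | Reinforcement learning-based traffic signal light control method and system |
CN114926980B (zh) * | 2022-04-22 | 2023-04-14 | 阿里巴巴(中国)有限公司 | Traffic data mining method, apparatus, electronic device, and computer program product |
CN115512554B (zh) * | 2022-09-02 | 2023-07-28 | 北京百度网讯科技有限公司 | Parameter model training and traffic signal control methods, apparatus, device, and medium |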
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
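JP2006113682A (ja) * | 2004-10-12 | 2006-04-27 | Toyota Motor Corp | Traffic signal control device |
CN104269064A (zh) * | 2014-09-26 | 2015-01-07 | 张久明 | Traffic signal light control method |
CN106530762A (zh) * | 2016-12-26 | 2017-03-22 | 东软集团股份有限公司 | Traffic signal control method and apparatus |
CN106971563A (zh) * | 2017-04-01 | 2017-07-21 | 中国科学院深圳先进技术研究院 | Intelligent traffic signal light control method and system |
CN107134156A (zh) * | 2017-06-16 | 2017-09-05 | 上海集成电路研发中心有限公司 | Deep learning-based intelligent traffic light system and method for controlling traffic lights with it |
CN109035808A (zh) * | 2018-07-20 | 2018-12-18 | 上海斐讯数据通信技术有限公司 | Deep learning-based traffic light switching method and system |
CN110246345A (zh) * | 2019-05-31 | 2019-09-17 | 闽南师范大学 | HydraCNN-based intelligent signal light control method and system |
CN110738860A (zh) * | 2019-09-18 | 2020-01-31 | 平安科技(深圳)有限公司 | Reinforcement learning model-based information control method and apparatus, and computer device |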
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
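CN101472366A (zh) * | 2007-12-26 | 2009-07-01 | 奥城同立科技开发(北京)有限公司 | Traffic signal light control system suitable for congested intersection control |
CN102142197B (zh) * | 2011-03-31 | 2013-11-20 | 汤一平 | Intelligent traffic signal light control device based on omnidirectional computer vision |
CN105046987B (zh) * | 2015-06-17 | 2017-07-07 | 苏州大学 | Reinforcement learning-based coordinated control method for road traffic signal lights |
CN117910543A (zh) * | 2015-11-12 | 2024-04-19 | 渊慧科技有限公司 | Training neural networks using a prioritized experience memory |
CN106355905B (zh) * | 2016-10-28 | 2018-11-30 | 银江股份有限公司 | Elevated road signal control method based on checkpoint data |
CN107241213B (zh) * | 2017-04-28 | 2020-05-05 | 东南大学 | Web service composition method based on deep reinforcement learning |
WO2019165616A1 (zh) * | 2018-02-28 | 2019-09-06 | 华为技术有限公司 | Signal light control method, related device, and system |
CN109035812B (zh) * | 2018-09-05 | 2021-07-27 | 平安科技(深圳)有限公司 | Traffic signal light control method and apparatus, computer device, and storage medium |
CN109948054A (zh) * | 2019-03-11 | 2019-06-28 | 北京航空航天大学 | Reinforcement learning-based adaptive learning path planning system |
CN109947931B (zh) * | 2019-03-20 | 2021-05-14 | 华南理工大学 | Unsupervised learning-based automatic text summarization method, system, device, and medium |
CN110047278B (zh) * | 2019-03-30 | 2021-06-08 | 北京交通大学 | Adaptive traffic signal control system and method based on deep reinforcement learning |
CN110164151A (zh) * | 2019-06-21 | 2019-08-23 | 西安电子科技大学 | Traffic light control method based on a distributed deep recurrent Q-network |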
- 2019-09-18: CN application CN201910882718.0A, patent CN110738860B (zh), status: Active
- 2020-05-29: WO application PCT/CN2020/093432, publication WO2021051870A1 (zh), status: Application Filing
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
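CN113643528A (zh) * | 2021-07-01 | 2021-11-12 | 腾讯科技(深圳)有限公司 | Signal light control method, model training method, system, apparatus, and storage medium |
CN113763723A (zh) * | 2021-09-06 | 2021-12-07 | 武汉理工大学 | Traffic signal light control system and method based on reinforcement learning and dynamic timing |
CN113763723B (zh) * | 2021-09-06 | 2023-01-17 | 武汉理工大学 | Traffic signal light control system and method based on reinforcement learning and dynamic timing |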
US20230098014A1 (en) * | 2021-09-24 | 2023-03-30 | Autonomous A2Z | Method for Predicting Traffic Light Information by Using Lidar and Server Using the Same |
US11643093B2 (en) * | 2021-09-24 | 2023-05-09 | Autonmous A2Z | Method for predicting traffic light information by using lidar and server using the same |
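CN114548298A (zh) * | 2022-02-25 | 2022-05-27 | 阿波罗智联(北京)科技有限公司 | Model training and traffic information processing methods, apparatus, device, and storage medium |
CN114639255A (zh) * | 2022-03-28 | 2022-06-17 | 浙江大华技术股份有限公司 | Traffic signal control method, apparatus, device, and medium |
CN114639255B (zh) * | 2022-03-28 | 2023-06-09 | 浙江大华技术股份有限公司 | Traffic signal control method, apparatus, device, and medium |
CN117746303A (zh) * | 2024-02-20 | 2024-03-22 | 山东大学 | Zero-shot visual navigation method and system based on a perceptual correlation network |
CN117746303B (zh) * | 2024-02-20 | 2024-05-17 | 山东大学 | Zero-shot visual navigation method and system based on a perceptual correlation network |
CN118153175A (zh) * | 2024-05-09 | 2024-06-07 | 中建科工集团绿色科技有限公司 | Reinforcement learning-based modular unit splitting method, system, device, and medium |
CN118609386A (zh) * | 2024-08-08 | 2024-09-06 | 香港科技大学(广州) | Reinforcement learning-based traffic signal control method, apparatus, device, medium, and product |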
Also Published As
Publication number | Publication date |
---|---|
CN110738860B (zh) | 2021-11-23 |
CN110738860A (zh) | 2020-01-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
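WO2021051870A1 (zh) | | Reinforcement learning model-based information control method and apparatus, and computer device |
WO2022121510A1 (zh) | | Traffic signal control method and system based on stochastic policy gradient, and electronic device |
WO2021227502A1 (zh) | | Method for controlling traffic signal lights and vehicle trajectories at a signalized intersection |
CN111260937B (zh) | | Reinforcement learning-based traffic signal light control method for crossroads |
WO2021249071A1 (zh) | | Lane line detection method and related device |
CN112400192B (zh) | | Method and system for multimodal deep traffic signal control |
CN108597235B (zh) | | Intersection signal parameter optimization and effect evaluation method based on traffic video data |
CN108831168B (zh) | | Traffic signal light control method and system based on visual recognition of associated intersections |
CN109360429B (zh) | | Urban road traffic scheduling method and system based on simulation optimization |
WO2021051930A1 (zh) | | Signal adjustment method and apparatus based on an action prediction model, and computer device |
CN106710215B (zh) | | Lane-level traffic state prediction system upstream of a bottleneck and implementation method |
CN105528587B (zh) | | Target detection method and apparatus |
CN111613070B (zh) | | Traffic signal light control method and apparatus, electronic device, and computer storage medium |
CN113643528A (zh) | | Signal light control method, model training method, system, apparatus, and storage medium |
CN113257016B (zh) | | Traffic signal control method and apparatus, and readable storage medium |
CN111951575B (zh) | | Adaptive traffic signal light control method based on advance reinforcement learning |
CN111341107A (zh) | | Shared traffic control method based on cloud platform data |
CN112435473A (zh) | | Expressway traffic flow tracing and ramp regulation method combining historical data |
CN113780624A (zh) | | Coordinated signal control method for urban road networks based on game equilibrium theory |
CN113160585A (zh) | | Traffic light timing optimization method, system, and storage medium |
CN111126687A (zh) | | Single-point offline optimization system and method for traffic signals |
WO2023206248A1 (zh) | | Traffic light control method and apparatus, road network system, electronic device, and medium |
CN113487857A (zh) | | Cooperative control decision method for variable lanes at multiple intersections in a region |
CN110619340A (zh) | | Method for generating lane-change rules for autonomous vehicles |
Huang et al. | | Traffic congestion level prediction based on recurrent neural networks |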
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20865801; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | EP: PCT application non-entry in European phase | Ref document number: 20865801; Country of ref document: EP; Kind code of ref document: A1 |