CN115062926A - Congestion relief strategy determination method and device based on reinforcement learning and digital twin

Congestion relief strategy determination method and device based on reinforcement learning and digital twin

Info

Publication number
CN115062926A
Authority
CN
China
Prior art keywords
action
target
congestion
area
reinforcement learning
Prior art date
Legal status
Pending
Application number
CN202210605848.1A
Other languages
Chinese (zh)
Inventor
胡铮
张春红
聂凌云
温志刚
张宸宇
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210605848.1A
Publication of CN115062926A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0637Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising
    • G06Q30/0207Discounts or incentives, e.g. coupons or rebates
    • G06Q30/0238Discounts or incentives, e.g. coupons or rebates at point-of-sale [POS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Finance (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Business, Economics & Management (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a congestion relief strategy determination method and device based on reinforcement learning and a digital twin. The method comprises the following steps: in a virtual simulation environment constructed based on a digital twin, inputting the people-flow data of each area within a target range into a target agent model to obtain a target action set for the current moment, wherein the target action set is the optimal action sequence output by the target agent model at the current moment; and obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas. The method and device can rapidly obtain the optimal congestion relief strategy and thereby reduce the crowding degree in each area.

Description

Congestion relief strategy determination method and device based on reinforcement learning and digital twin
Technical Field
The invention relates to the technical field of reinforcement learning, and in particular to a congestion relief strategy determination method and device based on reinforcement learning and digital twins.
Background
Crowd-flow management is of significant research interest for service operation in public places. For pedestrians, the degree of congestion strongly affects the daily travel experience, so pedestrian flow in public places needs to be properly regulated during normal travel.
Existing simulation designs are built ad hoc for specific events and requirements. Although strategies for controlling and regulating pedestrian flow attract growing interest, no comprehensive framework exists for studying and integrating pedestrian-specific dynamic management systems and dynamic control strategies.
Simulation and prediction of pedestrian traffic have been widely covered by many different modeling schemes, but control and guidance strategies remain to be explored. In public places, traditional dynamic management algorithms aimed at reducing pedestrian delay time do not meet the need to keep each area of a public place within a safe, controllable capacity. Most such control strategies take shortening pedestrian delay time as the final optimization goal and therefore cannot relieve the congestion of individual areas.
Disclosure of Invention
The invention provides a congestion relief strategy determination method and device based on reinforcement learning and digital twins, to solve the technical problem that prior-art pedestrian control strategies cannot rapidly and effectively relieve congestion.
The invention provides a congestion relief strategy determination method based on reinforcement learning and digital twinning, which comprises the following steps:
in a virtual simulation environment constructed based on a digital twin, inputting the people-flow data of each area within a target range into a target agent model to obtain a target action set for the current moment, wherein the target action set is the optimal action sequence output by the target agent model at the current moment;
obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of people counts in the areas, the action space is the set of crowd diversion actions executable in the areas, the reward value function indicates the congestion degree of the areas within the action interval time, and the action interval time is the time difference between executing two adjacent crowd diversion actions;
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined from demonstration data, and the third action is an action determined by the agent model.
In some embodiments, before the training of the agent model based on the first action, the second action and the third action, the method further comprises:
determining the first action when the exploration probability is greater than 0 and less than a preset exploration weight;
determining the second action when the exploration probability is greater than or equal to the preset exploration weight and less than a dynamic influence factor, wherein the dynamic influence factor is a parameter selected in each training round of the agent model;
and determining the third action when the exploration probability is greater than or equal to the dynamic influence factor and less than or equal to 1.
In some embodiments, the dynamic influence factor is determined by:

[equation image in the original]

where $y_{ep}$ is the dynamic influence factor, $\epsilon$ is the preset exploration weight, $\beta$ is a preset hyper-parameter, $ep$ is the current training round of the agent model, $ep\_all$ is the total number of training rounds of the agent model, and $r_{ep}$ is the average reward of the $ep$-th training round.
In some embodiments, the state space is $S_t = \{S_{1,t}, S_{2,t}, S_{3,t}, \ldots, S_{I,t}\}$, where $S_t$ is the set of people counts in the areas at time $t$ and $S_{i,t}$ is determined by:

$$S_{i,t} = \sum_{j=1}^{N_i} n_{i,j,t}$$

where $S_{i,t}$ denotes the number of people in the $i$-th area at time $t$, $N_i$ denotes the number of sections in the $i$-th area, and $n_{i,j,t}$ denotes the number of people in the $j$-th section of the $i$-th area at time $t$.
In some embodiments, the crowd diversion action comprises at least one of:
issuing an electronic coupon to terminals located in each area, the electronic coupon indicating a consumption area;
setting a route indication tool in each area;
setting an obstacle guidance tool in each area.
In some embodiments, the reward value function is:

$$r_t = \sum_{m=t}^{t+w} \sum_{n=1}^{I} r_{m,n}, \qquad r_{m,n} = \begin{cases} 1, & S_{n,m} < S_{n\_max} \\ -1, & S_{n,m} \ge S_{n\_max} \end{cases}$$

where $r_t$ is the reward value function and indicates the congestion degree of the areas over the period $t$ to $t+w$, $w$ denotes the action interval time, $S_{n\_max}$ denotes the crowding threshold of the $n$-th area, $S_{n,m}$ denotes the number of people in the $n$-th area at the $m$-th time, and $r_{m,n}$ denotes the reward of the $n$-th area at the $m$-th time.
The invention also provides a congestion relief strategy determination device based on reinforcement learning and digital twinning, which comprises:
a first determination module, configured to, in a virtual simulation environment constructed based on a digital twin, input the people-flow data of each area within a target range into a target agent model to obtain a target action set for the current moment, wherein the target action set is the optimal action sequence output by the target agent model at the current moment;
a second determination module, configured to obtain an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of people counts in the areas, the action space is the set of crowd diversion actions executable in the areas, the reward value function indicates the congestion degree of the areas within the action interval time, and the action interval time is the time difference between executing two adjacent crowd diversion actions;
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined from demonstration data, and the third action is an action determined by the agent model.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the congestion relief strategy determination method based on reinforcement learning and digital twinning as described in any of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a congestion relief policy determination method based on reinforcement learning and digital twinning as any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a congestion mitigation strategy determination method based on reinforcement learning and digital twinning as described in any of the above.
The congestion relief strategy determination method and device based on reinforcement learning and digital twinning obtain the optimal congestion relief strategy stably and quickly through the design of the reinforcement learning algorithm, the optimized exploration space, the neural network training framework, and the model training and learning scheme.
Drawings
In order to more clearly illustrate the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of a congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the present invention;
FIG. 2 is a schematic diagram of model training using the reinforcement learning and digital twin-based congestion relief strategy determination method provided by the present invention;
FIG. 3 is a schematic diagram of a simulation system applying the reinforcement learning and digital twin-based congestion relief strategy determination method provided by the present invention;
fig. 4 is a schematic flow chart of a congestion relief strategy model to which the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the invention is applied;
FIG. 5 is one of the simulation diagrams to which the reinforcement learning and digital twin-based congestion relief strategy determination method provided by the present invention is applied;
fig. 6 is a second simulation schematic diagram of the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the present invention;
fig. 7 is a schematic structural diagram of a congestion relief strategy determination device based on reinforcement learning and digital twinning, provided by the invention;
fig. 8 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The execution subject of the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the invention can be an electronic device, a component in the electronic device, an integrated circuit, or a chip. The electronic device may be mobile or non-mobile. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA); the non-mobile electronic device may be a server, a network attached storage (NAS), a personal computer (PC), a teller machine, a self-service machine, or the like. The invention is not particularly limited in this respect.
The technical solution of the present invention will be described in detail below by taking an example in which a computer executes the congestion relief policy determination method based on reinforcement learning and digital twinning provided by the present invention.
Fig. 1 is a schematic flow chart of the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the invention. Referring to fig. 1, the method may include steps 110 and 120.
Step 110, in a virtual simulation environment constructed based on a digital twin, inputting the people-flow data of each area within a target range into a target agent model to obtain a target action set for the current moment, wherein the target action set is the optimal action sequence output by the target agent model at the current moment.
Step 120, obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas.
Before step 110, the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of people counts in the areas, the action space is the set of crowd diversion actions executable in the areas, the reward value function indicates the congestion degree of the areas within the action interval time, and the action interval time is the time difference between executing two adjacent crowd diversion actions. It is understood that the number of people in each area within the simulation cycle may be real data or simulated data; this is not particularly limited.
The congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the invention constructs the agent model based on a reinforcement learning algorithm.
It should be noted that reinforcement learning is inspired by trial-and-error in animal learning: the agent is trained using the reward values obtained from its interaction with the environment as a feedback signal. Reinforcement learning is generally formulated as a Markov Decision Process (MDP), whose main elements are $(S, A, R, T, \gamma)$.
Here $S$ represents the environment state, $A$ the actions taken by the agent, $R$ the resulting reward values, $T$ the state transition probability, and $\gamma$ the discount factor.
The policy $\pi$ of the agent is a mapping from the state space to the action space. When the agent is in state $S_t$, it takes an action $a_t$ ($a_t \in A$) according to the policy $\pi$, transitions to the next state $S_{t+1}$ according to the state transition probability $T$, and receives the reward value $r_t$ ($r_t \in R$) fed back by the environment.
The goal of reinforcement learning is to continually optimize the agent's policy so as to maximize the reward. The state value function $V(S_t)$ and the action value function $Q(S_t, a_t)$ of the agent evaluate the expectation of the long-term reward achievable from state $S_t$. The optimal policy of the agent can be obtained by optimizing the value function.
Deep Reinforcement Learning (DRL) algorithms fall mainly into two categories: value function algorithms and policy gradient algorithms.
A value function algorithm obtains the agent's policy indirectly through iterative updating of the value function; when the value function iteration reaches the optimum, the optimal policy is derived from the optimal value function.
A policy gradient algorithm directly builds a policy network by function approximation, obtains reward values through actions selected by the policy network, and optimizes the policy network parameters along the gradient direction to maximize the reward value.
The invention establishes a pedestrian movement simulation scheme for public places and designs a simulation environment based on a pedestrian movement model; at the same time, it provides a congestion relief optimization scheme for public places through design optimization of a Deep Q-Network (DQN) based reinforcement learning algorithm, the design of the DQN training framework, and network training. The congestion relief strategy can be learned rapidly in the public-place simulation environment, and the DQN model converges better and faster.
The environment is modeled prior to building the agent model. The environment model is a multi-layer map containing static map information and crowd-related information. The static map information describes physical attributes of the environment, and the crowd-related information includes crowd location information, crowd quantity information, crowd density information, and the like.
The physical properties of the environment are described abstractly through a map. The map can be represented as a grid of cells, with cells corresponding to different areas, and the agent can move across the cells of the areas. For example, the area corresponding to a public place can be divided into a plurality of cells, and pedestrians move within each cell.
Pedestrians flow through the public place according to the crowd movement model; without interference or guidance measures, crowding occurs, specifically, the number of people in some areas exceeds the safe and controllable range.
In this embodiment, the following definitions are first made for the agent model:
defining states of an agent model
The states of the agent model are defined based on the number of people in each region, and all the state sets may constitute a state space.
In some embodiments, the state space is $S_t = \{S_{1,t}, S_{2,t}, S_{3,t}, \ldots, S_{I,t}\}$, where $S_t$ is the set of people counts in the areas at time $t$ and $S_{i,t}$ is determined by:

$$S_{i,t} = \sum_{j=1}^{N_i} n_{i,j,t}$$

where $S_{i,t}$ denotes the number of people in the $i$-th area at time $t$, $N_i$ denotes the number of sections in the $i$-th area, and $n_{i,j,t}$ denotes the number of people in the $j$-th section of the $i$-th area at time $t$.
In actual execution, the state $S_t$ represents the set of people counts in the areas at time $t$. The total number of areas $I$ can be set as required, and the time $t$ ranges over $[0, T-1]$, where $T$ is the total number of time steps in one simulation run.
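As an illustration, the state assembly can be sketched as follows (a minimal sketch; the function and variable names are not from the patent, and per-section head counts are assumed to be available):

```python
import numpy as np

def build_state(section_counts: list[list[int]]) -> np.ndarray:
    """Assemble S_t: section_counts[i][j] holds n_{i,j,t}, the head count
    of the j-th section of area i; S_{i,t} sums over that area's sections."""
    return np.array([sum(sections) for sections in section_counts])

# Example: I = 3 areas with 2, 3 and 1 sections respectively.
s_t = build_state([[12, 8], [5, 9, 4], [30]])   # -> array([20, 18, 30])
```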
The congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the invention supports flexible pedestrian movement data and offers a degree of openness and flexibility.
Second, define the actions of the agent model.
The crowd diversion actions performed in each area are defined as the actions of the agent model, and the set of all actions constitutes the action space. A crowd diversion action is an excitation signal to the agent model.
In some embodiments, the crowd diversion action comprises at least one of:
issuing an electronic coupon to terminals located in each area, the electronic coupon indicating a consumption area;
setting a route indication tool in each area;
setting an obstacle guidance tool in each area.
In actual execution, the action $a_t$ of the agent model is the action executed before the next action $a_{t+w}$ at time $t+w$; within $[t, t+w-1]$, only action $a_t$ is executed.
The action space is denoted $A$ and comprises the set of all actions, each action representing a crowd diversion measure executed in the areas. $w$ denotes the action interval time and is a fixed value.
One crowd diversion measure is to issue electronic coupons to terminals located in the areas. The electronic coupon may be a consumption coupon, a scenic-spot fast-track pass, a shop coupon or the like, and probabilistically guides visitors toward designated areas.
A consumption coupon or shop coupon can guide pedestrians toward a target consumption area, and a scenic-spot fast-track pass can guide pedestrians toward a target scenic spot. The invention considers only the diversion efficacy of the electronic coupon and disregards economic cost.
For example, consumption coupons are issued to pedestrians selected within a physical neighborhood; after receiving a coupon, a pedestrian chooses with probability p whether to travel to the destination to use it before the next action selection. The simulation time is set to 540 minutes, and coupons are issued every thirty minutes. Each issuance requires setting the issuing area, the number and the type of coupons; a coupon of type $C_i$ indicates that a pedestrian traveling to the area with serial number $i$ will obtain a certain benefit. The action space is formed by the combinations of these setting parameters.
The crowd diversion measure may also be to set route indication tools in the areas, for example crowd evacuation route signs or robot guides.
The crowd diversion measure may also be to set obstacle guidance tools in the areas, for example road blocks or barriers.
It will be appreciated that crowd flow may of course also be regulated by guide personnel.
The congestion relief method provided by the invention addresses the problem of guiding pedestrian flow with diversion measures in the field of pedestrian traffic control, and reduces the congestion degree of each area in public-place scenarios.
Third, define the reward value function of the agent model.
The reward value function of the agent model may be defined as the congestion degree of each area within the action interval time.
In some embodiments, the reward value function is:

$$r_t = \sum_{m=t}^{t+w} \sum_{n=1}^{I} r_{m,n}, \qquad r_{m,n} = \begin{cases} 1, & S_{n,m} < S_{n\_max} \\ -1, & S_{n,m} \ge S_{n\_max} \end{cases}$$

where $r_t$ is the reward value function and indicates the congestion degree of the areas over the period $t$ to $t+w$, $w$ denotes the action interval time, $S_{n\_max}$ denotes the crowding threshold of the $n$-th area, $S_{n,m}$ denotes the number of people in the $n$-th area at the $m$-th time, and $r_{m,n}$ denotes the reward of the $n$-th area at the $m$-th time.
In actual execution, when $S_{n,m} < S_{n\_max}$, the reward $r_{m,n}$ is 1; when $S_{n,m} \ge S_{n\_max}$, the reward $r_{m,n}$ is $-1$.
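A minimal sketch of this reward, assuming the interval reward simply accumulates the per-area terms $r_{m,n}$ (the exact aggregation is given by the patent's formula image):

```python
import numpy as np

def step_reward(counts: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """r_{m,n}: +1 where S_{n,m} < S_{n_max}, -1 where S_{n,m} >= S_{n_max}."""
    return np.where(counts < thresholds, 1.0, -1.0)

def interval_reward(count_series: np.ndarray, thresholds: np.ndarray) -> float:
    """r_t over [t, t+w]: accumulate per-area rewards over the interval.
    count_series has one row of per-area counts per time step."""
    return float(sum(step_reward(c, thresholds).sum() for c in count_series))
```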
Fourth, define the reinforcement learning selection algorithm.
After the state space, action space and reward value function are defined, the agent model can be constructed. A value-function-based DQN algorithm is adopted, a three-layer convolutional neural network is used to build the value network, and the sizes of the memory pool and the training pool are initialized.
For action selection during training, a prior exploration strategy is introduced; that is, the action selection function used in the agent's exploration is optimized, and the agent model is optimized and updated through a loss function, specifically:

$$Loss = \left(r_t + \gamma \max_{a'} Q(a', s') - Q(a, s)\right)^2$$

where $r_t$ is the reward value function, $\gamma$ is the discount factor, $Q(a, s)$ is the expected return of taking action $a$ ($a \in A$) in state $s$ ($s \in S$) at some time $t$, and $Q(a', s')$ is the expected return of taking action $a'$ ($a' \in A$) in state $s'$ ($s' \in S$) at the time step following $t$.
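The loss above is the standard DQN temporal-difference objective; a PyTorch sketch for illustration (the batch format and network shapes are assumptions, not from the patent):

```python
import torch

def dqn_loss(q_net, target_net, s, a, r, s_next, gamma=0.99):
    """Loss = (r_t + gamma * max_a' Q(a', s') - Q(a, s))^2, averaged over a batch."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(a, s)
    with torch.no_grad():                                  # target network is frozen
        q_max = target_net(s_next).max(dim=1).values       # max_a' Q(a', s')
    return ((r + gamma * q_max - q_sa) ** 2).mean()
```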
The congestion relief strategy determination method based on reinforcement learning and digital twins improves the training speed and training quality of the agent model through exploration and optimization based on demonstration data.
The agent model is trained based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined from demonstration data, and the third action is an action determined by the agent model.
It can be appreciated that reinforcement learning algorithms learn a good policy for different tasks in a trial-and-error manner through the agent's interaction with the environment.
However, exploration and exploitation must be balanced: blindly choosing a policy without sufficient exploration may cause the model to fall into a local optimum or even fail to converge at all.
Most traditional reinforcement learning methods rely on heuristic exploration strategies such as the ε-greedy algorithm and parameter-noise methods. When the task's return is low and rewards are sparse, the agent must visit a large number of states through repeated trials to obtain any meaningful information; this long-standing problem leads to slow convergence or even task failure.
Random exploration is therefore not a suitable solution for tasks with complex exploration spaces.
When the DQN algorithm explores the policy space, it uses a greedy strategy; but when facing a huge policy space, the agent cannot easily find the optimal solution, and in the course of agent-environment exploration it easily falls into a mere local optimum. A large number of random explorations is then required, which is time- and labor-consuming.
A known scheme fixes the exploration space with demonstration data, selecting actions according to:

[equation image in the original]

where $P(s_t)$ denotes the action probabilities of the demonstration data learned by a Sequence Generative Adversarial Network with policy gradient (SeqGAN). This approach confines exploration to the demonstration data, so the agent cannot explore further possibilities through open exploration.
In some embodiments, before training the agent model based on the first action, the second action and the third action, the method further comprises:
determining the first action when the exploration probability is greater than 0 and less than a preset exploration weight;
determining the second action when the exploration probability is greater than or equal to the preset exploration weight and less than a dynamic influence factor, wherein the dynamic influence factor is a parameter selected in each training round of the agent model;
determining the third action when the exploration probability is greater than or equal to the dynamic influence factor and less than or equal to 1.
The invention proposes using a dynamic influence factor to dynamically regulate the weight between open exploration and optimized-exploration-space exploration, combining the traditional greedy strategy with the demonstration-data-based optimized exploration space. The interactive action selection is:

$$a_t = \begin{cases} a_{rd}, & 0 < p < \epsilon \\ a_{gd}, & \epsilon \le p < y_{ep} \\ a_{ag}, & y_{ep} \le p \le 1 \end{cases}$$

$$a_{ag} = \arg\max_{a \in A} Q(s_t, a), \qquad a_{gd} = \pi^{*}(a|s)$$

where $a_t$ is the action at time $t$, $p$ is the exploration probability, $\epsilon$ is the preset exploration weight, $y_{ep}$ is the dynamic influence factor, $a_{rd}$ is the first action, $a_{gd}$ is the second action, $a_{ag}$ is the third action, and $s_{t+1}$ is the state at the next time $t+1$.
$a_{rd}$ is an action selected at random from the action space.
$a_{gd}$ is the action determined from the demonstration data, i.e., the tutoring action given by the demonstration data. Specifically, $a_{gd}$ refers to actions derived from human management experience that positively affect congestion relief.
$a_{ag}$ is the action determined by the agent model. $\pi^{*}$ is the general strategy learned from the demonstration data.
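The three-branch selection can be sketched as follows (a minimal sketch; the callables standing in for $a_{rd}$, $a_{gd}$ and $a_{ag}$ are placeholders):

```python
import random

def select_action(epsilon: float, y_ep: float,
                  random_action, demo_action, agent_action):
    """Draw p in [0,1]: random action below epsilon, demonstration-guided
    action between epsilon and y_ep, agent-greedy action above y_ep."""
    p = random.random()
    if p < epsilon:
        return random_action()   # a_rd: uniform over the action space
    if p < y_ep:
        return demo_action()     # a_gd: pi*(a|s) learned from demonstrations
    return agent_action()        # a_ag: argmax_a Q(s_t, a)
```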
The invention adopts policy exploration optimization based on prior knowledge; compared with a DQN model using the traditional strategy, the agent model converges better. Meanwhile, the reinforcement learning method and the settings of states, rewards, actions and the like are realized for crowded scenes.
According to the congestion relief strategy determination method based on reinforcement learning and digital twins, the exploration space is optimized by balancing demonstration-data-guided action exploration against random actions, so the agent adapts to the environment better and faster.
In this step, states, feedback actions and the like are obtained through multiple training rounds so as to optimize the agent model and finally obtain the target agent model.
In step 110, in the virtual simulation environment constructed based on the digital twin, the people-flow data of each area within the target range is input into the target agent model; that is, the state at a given moment is input into the target agent model. A single interaction yields the optimal action for the current moment output by the target agent model, and multiple interactions yield the optimal action sequence for the current moment output by the target agent model, i.e., the target action set.
The target range is a preselected range of crowd movement and can be divided into a plurality of areas.
In step 120, the optimal congestion relief strategy can be obtained based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas.
In actual execution, the people-flow data of each area is obtained first and input into the target agent model. The people-flow data can be the number of people, the crowd density, the crowd positions, or the crowd movement tracks in each area.
Based on the target agent model, the optimal congestion relief strategy under different scenarios can be obtained. For example, consumption coupons are issued every thirty minutes, and the guided regulation minimizes the average congestion degree during the regulation process; each issuance requires setting the issuing area, the number and the type of coupons.
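At inference time, the target action set is collected through repeated single-step interactions, roughly as below (the environment and model interfaces are assumptions for illustration):

```python
def plan_actions(model, env, horizon: int) -> list:
    """Roll the trained agent forward to collect the optimal action
    sequence (the target action set) over `horizon` action intervals."""
    state = env.observe()                  # people-flow data per area
    actions = []
    for _ in range(horizon):
        action = model.best_action(state)  # one interaction, one optimal action
        state = env.step(action)           # simulate one action interval
        actions.append(action)
    return actions
```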
The congestion relief strategy determination method based on reinforcement learning and digital twinning obtains the optimal congestion relief strategy stably and quickly through the design of the reinforcement learning algorithm, the optimized exploration space, the neural network training framework, and the model training and learning scheme.
In some embodiments, the dynamic influence factor is determined by:

[equation image in the original]

where $y_{ep}$ is the dynamic influence factor, $\epsilon$ is the preset exploration weight, $\beta$ is a preset hyper-parameter, $ep$ is the current training round of the agent model, $ep\_all$ is the total number of training rounds of the agent model, and $r_{ep}$ is the average reward of the $ep$-th training round.
In actual implementation, the discriminator weights are updated from $w_i$ to $w_{i+1}$ via the gradient of the adversarial imitation loss:

$$\hat{\mathbb{E}}_{\tau_i}\left[\nabla_w \log D_w(s, a)\right] + \hat{\mathbb{E}}_{\tau_E}\left[\nabla_w \log\left(1 - D_w(s, a)\right)\right]$$

where $\hat{\mathbb{E}}$ denotes the empirical expectation, $\tau_i$ denotes the strategy to be learned, $\tau_E$ denotes the demonstration-data strategy, and $D_w(s, a)$ denotes the probability that the discriminator attributes the state-action pair to the strategy to be learned.
The generator parameters are then optimized from $\theta_t$ to $\theta_{t+1}$ by a policy-gradient step (equation image in the original).
As shown in fig. 2, $\hat{\tau}^N$ denotes the $N$-th real trajectory in the format $\langle s, a \rangle$, and $\tau^N$ denotes the $N$-th trajectory generated by the generator in the same format $\langle s, a \rangle$.
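Read as the usual adversarial imitation objective, the discriminator step might be sketched in PyTorch as follows (our reading; the patent's exact update is given only in its formula images):

```python
import torch
import torch.nn.functional as F

def discriminator_step(disc, optimizer, learner_sa, expert_sa):
    """One w_i -> w_{i+1} update: D_w(s,a) is pushed toward 1 on the strategy
    to be learned (tau_i) and toward 0 on demonstration data (tau_E)."""
    pred_learner = disc(learner_sa)            # probabilities in (0, 1)
    pred_expert = disc(expert_sa)
    loss = F.binary_cross_entropy(pred_learner, torch.ones_like(pred_learner)) \
         + F.binary_cross_entropy(pred_expert, torch.zeros_like(pred_expert))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```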
Since the action exploration guidance strategy given by the demonstration data is a general strategy, the agent's policy model gradually surpasses it during training; the reference value that demonstration-guided actions provide to the agent slowly decreases, so the frequency of guided actions within the exploration space should decrease accordingly. A dynamic influence factor formula based on the average reward is therefore proposed:

[equation image in the original]

where $y_{ep}$ is the dynamic influence factor, $\epsilon$ is the preset exploration weight, $\beta$ is a preset hyper-parameter, $ep$ is the current training round of the agent model, $ep\_all$ is the total number of training rounds of the agent model, and $r_{ep}$ is the average reward of the $ep$-th training round.
The congestion relief strategy determination method based on reinforcement learning and digital twinning enables the agent model to learn the optimal congestion relief strategy quickly.
In some embodiments, based on the digital twin concept, a passenger-flow simulation system oriented to public places is constructed: the real world is mapped into a virtual simulation world, and the study of congestion relief strategies is carried out in that virtual world.
Fig. 3 is a schematic diagram of a simulation system applying the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the invention. Referring to fig. 3, the simulation system designed by the invention mainly comprises a data acquisition module, a model construction module, a virtual world simulation module, a service configuration module and a strategy learning module.
Various strategy simulation scenarios for congestion relief are configured through the independently developed model construction module and service configuration module; for relieving passenger-flow congestion in public places, reinforcement learning is applied to explore strategies by interacting with the simulation system.
The data acquisition module mainly comprises devices for acquiring people-flow movement information and environment information, such as the personal trajectory of each person and a two-dimensional plane map of the public place. The environment data also includes the hardware information required for subsequent strategy learning.
Within the overall simulation system, owing to data privacy, collecting real people-flow data in the data acquisition module is not mandatory; an encrypted people-flow movement model generator provided by a data company can be used instead.
The model construction module mainly comprises people-flow movement modeling and environment modeling, both performed with the data provided by the data acquisition module.
In environment modeling, the movable locations of the areas, the attributes of buildings such as service facilities, and other attachable attributes are defined. People-flow modeling is not mandatory; owing to data privacy, an encrypted third-party people-flow generation model can be used directly.
The virtual world simulation module uses the people-flow movement model and the environment model as the driving engines of the virtual simulation world and drives the autonomous operation of the simulation system. By programming the simulation environment entities and the entity environment, it accommodates model access for various driving logics, such as imports of historical data, Markov transition probability matrices, and LSTM model predictions.
The public-place-oriented passenger-flow simulation environment provides a flexible pedestrian movement driving interface, supports configuring a city road network and constructing service facility models, has a degree of openness and flexibility, and offers data interfaces beyond the attributes of travelers and service facilities.
The service configuration module builds services on top of the virtual world simulation model. It is an open module: the required services are configured by programming against the digital stream data of the virtual simulation world. For example, a data monitoring service can display the number of people in each area of the virtual public place and can also generate a people-flow distribution heat map.
The strategy interface service module transmits the action instructions fed back from the strategy learning module to the simulation module to regulate the crowd.
The visualization service module renders the virtual public place on a terminal. High extensibility for relieving congestion in public places is achieved through the configuration of services and the coordination and docking of the virtual simulation modules.
Extensibility here refers to two aspects. First, the service configuration module is highly extensible: various services can be generated through programming, and the digital virtual simulation model mapped from the real world can in theory provide any data stream a service configuration requires. Second, the optimization of congestion relief strategies is highly extensible.
Through environment modeling, various diversion measures can in theory be configured, such as consumption-coupon diversion, roadblocks, and guide personnel; the strategy interface service docks with the coordinated people-flow and environment modeling through configured interface parameters to realize the simulation.
The strategy learning module, i.e., the decision optimization module, completes decision optimization by calling the services of the service configuration module. Taking reinforcement learning as an example, the agent interacts with the simulation environment through the data monitoring service and the strategy interface service to obtain states, feedback actions and the like, and optimizes the decision model over multiple rounds of iteration.
The simulation system provided by the invention has good extensibility and adaptability to algorithms such as reinforcement learning. The invention designs a digital-twin-based public place simulation system comprising a data acquisition module, a model construction module, a virtual world simulation module, a service configuration module and a strategy learning module. The programmability of the service configuration module and of environment modeling gives the simulation system high extensibility. The configuration of various strategy simulation scenarios under the congestion relief scenario is completed through the independently developed model construction module and service configuration module.
Fig. 4 is a schematic flow chart of the construction of a congestion relief strategy model using the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the present invention. Referring to fig. 4, the present invention provides a method for constructing a congestion mitigation strategy model, which may include the following steps:
and step 410, generating a pedestrian movement model and a public place physical model according to the real world data.
And step 420, importing the pedestrian movement model and the physical model into the simulation environment designed based on Menge. The initialized scene simulation view is shown in fig. 5, and the running scene simulation view is shown in fig. 6. In actual implementation, the pedestrian movement model and the physical model can also be visualized and simulated based on Unity.
Step 430, designing a DQN reinforcement learning algorithm, and defining a state, an action, a reward and a loss function.
And 440, designing a strategy exploration method based on the prior knowledge.
And step 450, the model is in butt joint with the simulation environment through an interface, and interaction between the intelligent agent and the environment is realized.
And 460, obtaining a congestion relief strategy model corresponding to the public place.
In actual implementation, the preparation phase requires the following steps:
acquiring people stream movement data, environmental information or scene information of a public place;
constructing a scene physical model of a public place and a pedestrian movement model of the public place;
loading the model to a simulation module;
configuring the services by programming and defining each service interface parameter;
initializing the reinforcement learning memory pool, the training pool size, and the other reinforcement learning parameters;
defining the settings of the state, action, reward and loss functions in the DQN algorithm.
In actual implementation, the training phase requires the following steps:
initializing the Q network and initializing the target Q network.
Cycle epoch 1, 2.., M; initialization State S 1 (ii) a Circularly traversing a time step T, which is 1, 2,. and T; wherein a time step represents a certain time instant.
When the time step is divisible by 30, an action a interacting with the environment is generated using a greedy algorithm based on priors t . Generating a range of [0,1]The random number p. 0<p<At 0.1, randomly selecting an action from the action space, 0.1<p<y, selecting the expert action given by the prior knowledge as y<p<1, selecting the action with the maximum output value of the target Q network。
And executing the selected action, and calculating the returned reward and the state under the next action time step by counting the reward and the state between the current moment and the next action time step.
Will data pair (S) t ,a t ,r t ,S t+1 ) Stored in a memory pool.
And randomly extracting a batch of data from the memory pool at intervals of c time steps to train and update the network.
The states acquired and the actions issued are exchanged through the interface provided by the strategy service module within the service configuration module.
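Putting the training-phase steps together (a sketch only; the 30-step action interval, the 0.1 exploration bound and the update period $c$ follow the description above, while the agent and environment interfaces are assumed):

```python
import random

def train(agent, env, epochs: int, total_steps: int, c: int = 10):
    for ep in range(1, epochs + 1):
        state = env.reset()                      # S_1
        y_ep = agent.dynamic_factor(ep)          # influence factor for this round
        for t in range(1, total_steps + 1):
            if t % 30 == 0:                      # act once per action interval
                p = random.random()
                if p < 0.1:
                    action = agent.random_action()
                elif p < y_ep:
                    action = agent.demo_action(state)
                else:
                    action = agent.greedy_action(state)  # max target-Q output
                reward, next_state = env.step_interval(action)
                agent.memory.store((state, action, reward, next_state))
                state = next_state
            if t % c == 0:                       # periodic batch update
                agent.update_from_memory()
```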
The congestion relief strategy determination device based on reinforcement learning and digital twin according to the present invention is described below, and the congestion relief strategy determination device based on reinforcement learning and digital twin described below and the congestion relief strategy determination method based on reinforcement learning and digital twin described above may be referred to in correspondence with each other.
Fig. 7 is a schematic structural diagram of a congestion relief strategy determination device based on reinforcement learning and digital twinning, provided by the invention. Referring to fig. 7, the present invention provides a congestion relief policy determination apparatus based on reinforcement learning and digital twinning, which may include: a first determination module 710 and a second determination module 720.
A first determination module 710, configured to, in a virtual simulation environment constructed based on a digital twin, input the people-flow data of each area within a target range into a target agent model to obtain a target action set for the current moment, wherein the target action set is the optimal action sequence output by the target agent model at the current moment;
a second determination module 720, configured to obtain an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of people counts in the areas, the action space is the set of crowd diversion actions executable in the areas, the reward value function indicates the congestion degree of the areas within the action interval time, and the action interval time is the time difference between executing two adjacent crowd diversion actions;
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined from demonstration data, and the third action is an action determined by the agent model.
The congestion relief strategy determination device based on reinforcement learning and digital twinning obtains the optimal congestion relief strategy stably and quickly through the design of the reinforcement learning algorithm, the optimized exploration space, the neural network training framework, and the model training and learning scheme.
In some embodiments, the apparatus further comprises:
the third determining module is used for determining the first action under the condition that the exploration probability is greater than 0 and smaller than the preset exploration weight;
determining the second action under the condition that the exploration probability is greater than or equal to the preset exploration weight and smaller than a dynamic influence factor, wherein the dynamic influence factor is a parameter selected during each round of training of the intelligent agent model;
and determining the third action when the search probability is equal to or greater than the dynamic influence factor and equal to or less than 1.
In some embodiments, the dynamic influence factor is determined by:

[equation image in the original]

where $y_{ep}$ is the dynamic influence factor, $\epsilon$ is the preset exploration weight, $\beta$ is a preset hyper-parameter, $ep$ is the current training round of the agent model, $ep\_all$ is the total number of training rounds of the agent model, and $r_{ep}$ is the average reward of the $ep$-th training round.
In some embodiments, the state space is $S_t = \{S_{1,t}, S_{2,t}, S_{3,t}, \ldots, S_{I,t}\}$, where $S_t$ is the set of people counts in the areas at time $t$ and $S_{i,t}$ is determined by:

$$S_{i,t} = \sum_{j=1}^{N_i} n_{i,j,t}$$

where $S_{i,t}$ denotes the number of people in the $i$-th area at time $t$, $N_i$ denotes the number of sections in the $i$-th area, and $n_{i,j,t}$ denotes the number of people in the $j$-th section of the $i$-th area at time $t$.
In some embodiments, the crowd diversion action comprises at least one of:
issuing an electronic coupon to terminals located in each area, the electronic coupon indicating a consumption area;
setting a route indication tool in each area;
setting an obstacle guidance tool in each area.
In some embodiments, the reward value function is:

$$r_t = \sum_{m=t}^{t+w} \sum_{n=1}^{I} r_{m,n}, \qquad r_{m,n} = \begin{cases} 1, & S_{n,m} < S_{n\_max} \\ -1, & S_{n,m} \ge S_{n\_max} \end{cases}$$

where $r_t$ is the reward value function and indicates the congestion degree of the areas over the period $t$ to $t+w$, $w$ denotes the action interval time, $S_{n\_max}$ denotes the crowding threshold of the $n$-th area, $S_{n,m}$ denotes the number of people in the $n$-th area at the $m$-th time, and $r_{m,n}$ denotes the reward of the $n$-th area at the $m$-th time.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a processor (processor)810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. Processor 810 may invoke logic instructions in memory 830 to perform a reinforcement learning and digital twin based congestion mitigation strategy determination method comprising:
in a virtual simulation environment constructed based on a digital twin, inputting the people-flow data of each area within a target range into a target agent model to obtain a target action set for the current moment, wherein the target action set is the optimal action sequence output by the target agent model at the current moment;
obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of people counts in the areas, the action space is the set of crowd diversion actions executable in the areas, the reward value function indicates the congestion degree of the areas within the action interval time, and the action interval time is the time difference between executing two adjacent crowd diversion actions;
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined from demonstration data, and the third action is an action determined by the agent model.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention further provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, wherein the computer program, when executed by a processor, causes a computer to perform the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the above embodiments, the method comprising:
in a virtual simulation environment constructed based on a digital twin, inputting the people flow data of each area within a target range into a target agent model to obtain a target action set at the current moment, wherein the target action set is the optimal action sequence output by the target agent model for the current moment;
obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of the numbers of people in each area, the action space is the set of crowd diversion actions executable in each area, the reward value function indicates the congestion degree of each area within the action interval time, and the action interval time is the time difference between two adjacent executions of crowd diversion actions; and
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined based on demonstration data, and the third action is an action determined based on the agent model.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the congestion relief strategy determination method based on reinforcement learning and digital twinning provided by the above embodiments, the method comprising:
in a virtual simulation environment constructed based on a digital twin, inputting the people flow data of each area within a target range into a target agent model to obtain a target action set at the current moment, wherein the target action set is the optimal action sequence output by the target agent model for the current moment;
obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of the numbers of people in each area, the action space is the set of crowd diversion actions executable in each area, the reward value function indicates the congestion degree of each area within the action interval time, and the action interval time is the time difference between two adjacent executions of crowd diversion actions; and
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined based on demonstration data, and the third action is an action determined based on the agent model.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A congestion relief strategy determination method based on reinforcement learning and digital twinning is characterized by comprising the following steps:
in a virtual simulation environment constructed based on a digital twin, inputting the people flow data of each area within a target range into a target agent model to obtain a target action set at the current moment, wherein the target action set is the optimal action sequence output by the target agent model for the current moment;
obtaining an optimal congestion relief strategy based on the target action set, wherein the optimal congestion relief strategy is the strategy that minimizes the average congestion degree of the areas;
wherein the target agent model is determined by:
constructing an agent model based on a state space, an action space and a reward value function within a simulation cycle, wherein the state space is the set of the numbers of people in each area, the action space is the set of crowd diversion actions executable in each area, the reward value function indicates the congestion degree of each area within the action interval time, and the action interval time is the time difference between two adjacent executions of crowd diversion actions; and
training the agent model based on a first action, a second action and a third action to obtain the trained target agent model, wherein the first action is any action in the action space, the second action is an action determined based on demonstration data, and the third action is an action determined based on the agent model.
2. The reinforcement learning and digital twin based congestion relief strategy determination method of claim 1, wherein, prior to the training of the agent model based on the first action, the second action and the third action, the method further comprises:
determining the first action when the exploration probability is greater than 0 and less than a preset exploration weight;
determining the second action when the exploration probability is greater than or equal to the preset exploration weight and less than a dynamic influence factor, wherein the dynamic influence factor is a parameter selected in each training round of the agent model; and
determining the third action when the exploration probability is greater than or equal to the dynamic influence factor and less than or equal to 1.
3. The reinforcement learning and digital twin based congestion relief strategy determination method of claim 2, wherein the dynamic impact factor is determined by the following equation:
y_ep = ε + (1 − ε) · e^(−β · r̄_ep · ep / ep_all)
wherein y_ep is the dynamic influence factor, ε is the preset exploration weight, β is a preset hyper-parameter, ep is the current training round of the agent model, ep_all is the total number of training rounds of the agent model, and r̄_ep is the average reward of the ep-th training round.
4. The reinforcement learning and digital twin based congestion relief strategy determination method according to any one of claims 1-3, wherein the state space is S_t = {S_{1,t}, S_{2,t}, S_{3,t}, ..., S_{i,t}}, where S_t represents the set of the numbers of people in each area at time t, and S_{i,t} is determined by the following formula:
S_{i,t} = Σ_{j=1}^{N_i} n_{i,j,t}
wherein S_{i,t} indicates the number of people in the i-th area at time t, N_i denotes the number of sections in the i-th area, and n_{i,j,t} denotes the number of people in the j-th section of the i-th area at time t.
5. The reinforcement learning and digital twin based congestion relief strategy determination method according to any one of claims 1-3, wherein the crowd diversion actions comprise at least one of:
issuing an electronic ticket to terminals located in each area, the electronic ticket indicating a consumption area;
providing route indication means in each area; and
providing obstacle guidance means in each area.
6. The reinforcement learning and digital twin based congestion relief strategy determination method according to any of claims 1-3, wherein the reward value function is:
r_t = Σ_{m=t}^{t+w} Σ_{n=1}^{N} r_{m,n}, with r_{m,n} = 0 if S_{n,m} < S_{n_max}, and r_{m,n} = −(S_{n,m} − S_{n_max}) / S_{n_max} otherwise
wherein r_t is the reward value function and indicates the congestion degree of each area in the period t to t+w, w is the action interval time, S_{n_max} is the threshold number of people at which the n-th area is considered crowded, S_{n,m} is the number of people in the n-th area at the m-th time, r_{m,n} is the reward of the n-th area at the m-th time, and N is the number of areas.
7. A congestion relief strategy determination apparatus based on reinforcement learning and digital twinning, comprising:
the system comprises a first determining module, a second determining module and a third determining module, wherein the first determining module is used for inputting people stream data of each area in a target range into a target intelligent body model in a virtual simulation environment constructed based on a digital twin to obtain a target action set at the current moment, and the target action set is an optimal action sequence of the current moment output by the target intelligent body model;
a second determining module, configured to obtain an optimal congestion relief policy based on the target action set, where the optimal congestion relief policy is a policy that minimizes an average congestion degree of each area;
wherein the target agent model is determined by:
constructing an intelligent agent model based on a state space, an action space and a reward value function in a simulation cycle, wherein the state space is a set of people in each area, the action space is a set of crowd drainage actions executed in each area, the reward value function is used for indicating the crowding degree of each area in action interval time, and the action interval time is a time difference value of executing the crowd drainage actions of two adjacent people;
training the agent model based on a first action, a second action and a third action to obtain a trained target agent model, wherein the first action is any action in the action space, the second action is an action determined based on demonstration data, and the third action is an action determined based on the agent model.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the reinforcement learning and digital twin based congestion relief strategy determination method according to any one of claims 1 to 6 when executing the program.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the reinforcement learning and digital twin based congestion mitigation strategy determination method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a reinforcement learning and digital twinning based congestion mitigation strategy determination method according to any of claims 1 to 6.
CN202210605848.1A 2022-05-30 2022-05-30 Congestion relieving strategy determination method and device based on reinforcement learning and digital twinning Pending CN115062926A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210605848.1A CN115062926A (en) 2022-05-30 2022-05-30 Congestion relieving strategy determination method and device based on reinforcement learning and digital twinning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210605848.1A CN115062926A (en) 2022-05-30 2022-05-30 Congestion relieving strategy determination method and device based on reinforcement learning and digital twinning

Publications (1)

Publication Number Publication Date
CN115062926A (en) 2022-09-16

Family

ID=83198728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210605848.1A Pending CN115062926A (en) 2022-05-30 2022-05-30 Congestion relieving strategy determination method and device based on reinforcement learning and digital twinning

Country Status (1)

Country Link
CN (1) CN115062926A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116506352A (en) * 2023-06-26 2023-07-28 安世亚太科技股份有限公司 Network data continuing forwarding selection method based on centralized reinforcement learning
CN116506352B (en) * 2023-06-26 2023-08-29 安世亚太科技股份有限公司 Network data continuing forwarding selection method based on centralized reinforcement learning
CN117831254A (en) * 2024-03-06 2024-04-05 山东捷瑞信息技术产业研究院有限公司 Trampling accident early warning method, device, equipment and medium based on digital twinning

Similar Documents

Publication Publication Date Title
Mao et al. Dispatch of autonomous vehicles for taxi services: A deep reinforcement learning approach
CN115062926A (en) Congestion relieving strategy determination method and device based on reinforcement learning and digital twinning
US20190347371A1 (en) Method and system for orchestrating multi-party services using semi-cooperative nash equilibrium based on artificial intelligence, neural network models,reinforcement learning and finite-state automata
CN104662526B (en) Apparatus and method for efficiently updating spiking neuron network
Qin et al. Reinforcement learning for ridesharing: An extended survey
CN109493599A (en) A kind of Short-time Traffic Flow Forecasting Methods based on production confrontation network
CN110383298A (en) Data efficient intensified learning for continuous control task
Yoon et al. Transferable traffic signal control: Reinforcement learning with graph centric state representation
CN1987906A (en) Method for dynamicaly predicting land use change
Chen et al. Can sophisticated dispatching strategy acquired by reinforcement learning?-a case study in dynamic courier dispatching system
CN109597965A (en) Data processing method, system, terminal and medium based on deep neural network
Fridman et al. Deeptraffic: Driving fast through dense traffic with deep reinforcement learning
WO2022142574A1 (en) Traffic prediction model training method and apparatus, and electronic device
Chatterjee et al. Real time traffic delay optimization using shadowed type-2 fuzzy rule base
Kadri et al. An integrated Petri net and GA-based approach for performance optimisation of bicycle sharing systems
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN109345373A (en) Check and write off method for prewarning risk, device, electronic equipment and computer-readable medium
Shihab et al. A deep reinforcement learning approach to seat inventory control for airline revenue management
CN115170006B (en) Dispatching method, device, equipment and storage medium
CN116843069A (en) Commuting flow estimation method and system based on crowd activity intensity characteristics
Ghandeharioun et al. Real-time ridesharing operations for on-demand capacitated systems considering dynamic travel time information
Zhou et al. A robust deep reinforcement learning approach to driverless taxi dispatching under uncertain demand
Chen et al. A vehicle-assisted computation offloading algorithm based on proximal policy optimization in vehicle edge networks
Mason et al. Watershed management using neuroevolution
CN115330556A (en) Training method and device for information adjustment model of charging station and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination