CN112231967A - Crowd evacuation simulation method and system based on deep reinforcement learning - Google Patents
- Publication number
- CN112231967A (application number CN202010942444.2A)
- Authority
- CN
- China
- Prior art keywords
- evacuation
- leader
- crowd
- path
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
- G06Q50/265—Personal security, identity or safety
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Abstract
The disclosed crowd evacuation simulation method and system based on deep reinforcement learning comprise: initializing a constructed evacuation scene simulation model according to scene information and crowd parameter information; grouping the crowd and dividing each group into a leader and followers; and obtaining the crowd's evacuation path with a hierarchical path planning method, in which the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, while the followers in each lower-layer group evacuate along that path, avoiding obstacles and following their leader. A learning curve and a high-priority experience replay strategy are introduced into the traditional MADDPG algorithm to form the E-MADDPG algorithm, improving its learning efficiency. A hierarchical path planning method built on E-MADDPG plans the crowd's evacuation paths, effectively shortening path planning time, guiding the crowd more effectively, and improving crowd evacuation efficiency.
Description
Technical Field
The disclosure relates to a crowd evacuation simulation method and system based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
As public safety incidents become more frequent, large-scale crowd evacuation has become an essential part of emergency response. In densely crowded places, once a dangerous accident occurs, people rush to escape the scene, causing congestion during evacuation. If the crowd cannot be evacuated in time, collisions and trampling accidents may occur, causing secondary injuries to the evacuees. At the same time, large-scale crowd evacuation is a complex process, and real large-scale evacuation experiments are difficult to conduct because of organizational difficulty, high cost, and personnel safety concerns. Computer simulation has therefore become the main means of analyzing the evacuation process and evaluating evacuation efficiency.
How to improve crowd evacuation efficiency and avoid secondary injuries has long concerned researchers. Reinforcement learning has been one of the research hotspots of artificial intelligence in recent years, and combining it with path planning offers a new way to improve crowd evacuation efficiency. Path planning algorithms based on multi-agent reinforcement learning greatly improve planning efficiency and, through continuous learning, adapt to dynamic environments, making them highly practical. However, most real evacuation scenes are complex, which traditional reinforcement learning methods struggle to handle; deep learning, by contrast, can effectively process high-dimensional input and thus cope with complex real scenes. Combining reinforcement learning's learning strategies with deep learning's ability to handle high-dimensional input therefore suits crowd evacuation simulation well. The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm proposed by Lowe et al. is a recent multi-agent deep reinforcement learning algorithm, but it suffers from a fixed state space, random experience replay, and similar problems that seriously reduce its learning efficiency. Moreover, as the number of agents guiding evacuation grows and the environment becomes more complex, the state space inevitably becomes huge, and these problems seriously limit the algorithm's usefulness in the crowd evacuation field.
Disclosure of Invention
To solve these problems, the invention provides a crowd evacuation simulation method and system based on deep reinforcement learning. A learning curve and a high-priority experience replay strategy are introduced into the traditional MADDPG algorithm to form the Efficient Multi-Agent Deep Deterministic Policy Gradient (E-MADDPG) algorithm, improving its learning efficiency. A hierarchical path planning method built on E-MADDPG plans the crowd's evacuation paths, effectively shortening path planning time, guiding the crowd to evacuate more effectively, and improving crowd evacuation efficiency.
In order to achieve the purpose, the following technical scheme is adopted in the disclosure:
in one or more embodiments, a crowd evacuation simulation method based on deep reinforcement learning is provided, including:
initializing the constructed evacuation scene simulation model according to the scene information and the crowd parameter information;
grouping the crowd, and dividing each group into a leader and followers;
and obtaining the crowd's evacuation path with a hierarchical path planning method, wherein the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, and the followers in each lower-layer group evacuate along the optimal evacuation path while avoiding obstacles and following the leader.
Further, a real scene database of a shopping mall is received, and pedestrian movement stopping points are acquired as the state space of the E-MADDPG algorithm.
Furthermore, variation parameters are added to the experience pool capacity and the number of sampled transitions in the MADDPG algorithm, forming the experience-pool curve and sampling curve of the E-MADDPG algorithm; the pool size and the number of sampled transitions are adjusted through the variation parameters, so that the state space of the E-MADDPG algorithm becomes dynamically variable.
Furthermore, during network training of the E-MADDPG algorithm, samples with high value are selected for experience replay.
In one or more embodiments, a deep reinforcement learning crowd evacuation simulation system based on experience pool optimization is provided, comprising:
the initialization setting module is used for carrying out initialization setting on parameters in the evacuation scene simulation model according to the scene information and the crowd parameter information;
the in-group guidance selection module, used for grouping the crowd and selecting a leader for each group;
and the evacuation simulation module, which obtains the crowd's evacuation path with a hierarchical path planning method, wherein the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, and the followers in each lower-layer group evacuate along the optimal evacuation path while avoiding obstacles and following the leader.
In one or more embodiments, an electronic device is provided, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor; when executed by the processor, the computer instructions perform the steps of the deep reinforcement learning-based crowd evacuation simulation method.
In one or more embodiments, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, perform the steps of the deep reinforcement learning-based crowd evacuation simulation method.
Compared with the prior art, the beneficial effects of the present disclosure are:
1. the multi-agent deep reinforcement learning algorithm is applied to the path planning of crowd evacuation, and the crowd evacuation efficiency is improved.
2. Considering the defects of multi-agent deep reinforcement learning algorithms, an E-MADDPG algorithm is proposed on the basis of MADDPG: a learning curve makes the experience pool dynamically variable to improve learning efficiency; the algorithm's random sampling mode is improved to increase learning effectiveness; and the state space is improved by extracting motion stopping points from pedestrian video, effectively solving the curse-of-dimensionality problem.
3. A hierarchical path planning method is adopted to obtain the crowd's evacuation paths. Considering herd psychology, the crowd is divided into leaders and followers, splitting the large-scale crowd evacuation simulation problem into a set of sub-problems. Guiding evacuation through crowd grouping and leaders can effectively improve the evacuation efficiency of public places and ensure people's safety in emergencies.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
Fig. 1 is a flow chart of example 1 of the present disclosure;
fig. 2 is a pedestrian motion trajectory extracted by a YOLO V3 method in embodiment 1 of the present disclosure;
fig. 3 is an evacuation scenario diagram constructed in embodiment 1 of the present disclosure;
FIG. 4 is a schematic diagram of a crowd grouping in accordance with embodiment 1 of the present disclosure;
fig. 5 is a schematic view of crowd evacuation in embodiment 1 of the present disclosure;
fig. 6 is a schematic diagram of the evacuation end time of the crowd according to embodiment 1 of the disclosure.
Detailed Description of Embodiments:
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In the present disclosure, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only relational terms determined for convenience in describing structural relationships of the parts or elements of the present disclosure, and do not refer to any parts or elements of the present disclosure, and are not to be construed as limiting the present disclosure.
In the present disclosure, terms such as "fixedly connected", "connected", and the like are to be understood in a broad sense, and mean either a fixed connection or an integrally connected or detachable connection; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present disclosure can be determined on a case-by-case basis by persons skilled in the relevant art or technicians, and are not to be construed as limitations of the present disclosure.
Example 1
The embodiment discloses a crowd evacuation simulation method based on deep reinforcement learning, which comprises the following steps:
initializing the constructed evacuation scene simulation model according to the scene information and the crowd parameter information;
grouping the crowd, and dividing each group into a leader and followers;
and obtaining the crowd's evacuation path with a hierarchical path planning method, wherein the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, and the followers in each lower-layer group evacuate along the optimal evacuation path while avoiding obstacles and following the leader.
Further, a real scene database of a shopping mall is received, and the pedestrians' movement stopping points are extracted from pedestrian video with the YOLO V3 method and used as the state space of the E-MADDPG algorithm.
Furthermore, variation parameters are added to the experience pool capacity and the number of sampled transitions in the MADDPG algorithm, forming the experience-pool curve and sampling curve of the E-MADDPG algorithm; the pool size and the number of sampled transitions are adjusted through the variation parameters, so that the state space of the E-MADDPG algorithm becomes dynamically variable.
Furthermore, during network training of the E-MADDPG algorithm, samples with high value are selected for experience replay.
Further, the group leader performs global path planning through an E-MADDPG algorithm to obtain an optimal evacuation path, specifically:
acquiring all evacuation paths of the leader according to the exit position and the initial position of the leader;
calculating a reward value for each evacuation path;
and selecting the evacuation path with the maximum reward value as the optimal evacuation path.
Further, the intra-group followers avoid obstacles based on the RVO (Reciprocal Velocity Obstacles) algorithm and follow the leader to evacuate along the optimal evacuation path, specifically:
calculating all velocities at which a follower would collide, and the follower's optimal collision-free velocity, whose direction is the direction in which the group leader moves along the optimal evacuation path;
acquiring the current position of the follower;
and updating the follower's position once its optimal collision-free velocity is obtained.
The method for simulating crowd evacuation based on deep reinforcement learning is specifically described with reference to fig. 1 to 6, and includes the following steps:
step 1: receiving a real scene database of a shopping mall, and extracting a pedestrian movement parking point in a video as a state space by using a YOLO V3 method;
the state information of the deep reinforcement learning represents the environment information perceived by the intelligent agent and the change caused by the change of the self action, the state information is the basis for the intelligent agent to make a decision and evaluate the long-term benefit of the intelligent agent, the condition of the state design directly determines whether the deep reinforcement learning algorithm can be converged and the convergence speed is high or low, and as the evacuation scene is enlarged and refined, the explosion of the state space is inevitably caused, which is called a dimension disaster, in order to solve the problem, the embodiment provides a new state representation method, and the stationary point of the pedestrian motion track is extracted from the real pedestrian video by adopting a YOLO V3 method to obtain the corresponding state change point, on the basis, all the state change points in the scene can be used as the state space, and the process is shown in FIG. 2.
Step 2: creating an evacuation scene model and a character model according to preset evacuation scene parameter information, as shown in fig. 3, introducing the character model into the evacuation scene model, initializing the crowd parameter information as preset evacuation crowd parameter information, grouping the crowds, and dividing a leader and a follower for each group of crowds as shown in fig. 4;
the initialized population is grouped, the leader and the followers are divided, and each population in the space has one leader to evacuate the followers, as shown in fig. 5. The leader and follower should have the following characteristics:
(1) the leader needs to know the location of the exits.
(2) The follower should always follow the leader during evacuation.
The result after grouping is shown in fig. 4.
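A minimal sketch of one plausible grouping rule follows. The patent does not fix a grouping metric, so a nearest-leader assignment is assumed here; all names are illustrative:

```python
def group_crowd(positions, leader_ids):
    """Assign every non-leader to its nearest leader (squared Euclidean
    distance); `positions` maps pedestrian id -> (x, y). The nearest-leader
    rule is an assumption, not the patent's prescribed grouping method."""
    groups = {lid: [] for lid in leader_ids}
    for pid, (x, y) in positions.items():
        if pid in leader_ids:
            continue
        nearest = min(leader_ids,
                      key=lambda lid: (positions[lid][0] - x) ** 2
                                      + (positions[lid][1] - y) ** 2)
        groups[nearest].append(pid)
    return groups
```

Each resulting group then evacuates as one sub-problem: the leader plans, the members follow.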
Step 3: during evacuation, a hierarchical path planning method is adopted to obtain the crowd's evacuation paths: the upper layer uses the E-MADDPG algorithm to perform global path planning for each group's leader and obtain an optimal evacuation path, while the bottom layer uses the RVO algorithm to realize collision avoidance for the followers in each group, who follow the leader along the optimal evacuation path;
in the top layer: when the exit position p is knownjAnd the initial position of leader iWhen the operation is executed, the next position is reachedThis position then continues to perform the operation as the current position and repeats the operation until he reaches the exit position pjAnd grouping the bitsThe sequence is regarded as the evacuation path of the leader i, and the successfully evacuated k paths are temporarily stored in a path BufferiIn (1). But the reward per path may be very different due to the simultaneous influence of other agents. The disclosure derives reward values for a set of sequences of positions in k paths in a bufferWhen k candidate paths in the Path buffer area are traversed, the Path with the maximum reward R in the Path buffer area is selected as the optimal evacuation Path of the leader i, and a Path set Path is outputi。
In the bottom layer: pedestrian obstacle avoidance with RVO is realized in two steps. First, all velocities v_c at which individual i would collide with a neighboring individual j are computed; then individual i selects, among its collision-free velocities, the optimal collision-free velocity v_b. The optimal velocity v_b is a vector with both direction and magnitude; its direction points from the individual's current position p_i^t toward the next waypoint of the global path p_i obtained from the upper layer. Once the individual's collision-free velocity is obtained, its position is updated as p_i^{t+1} = p_i^t + v_b·Δt, where p_i^t is the current position. In this way, all pedestrians move during evacuation along the optimal path produced by the upper-layer route-planning training.
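A toy stand-in for the two bottom-layer steps — filtering out colliding velocities and picking the one best aligned with the leader's path — is sketched below. A real RVO implementation reasons over continuous velocity cones; this discrete candidate set only approximates that, and all names are illustrative:

```python
def optimal_velocity(pos, waypoint, candidates, colliding):
    """Among collision-free candidate velocities, pick the one best aligned
    with the direction toward the next waypoint on the leader's path."""
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    free = [v for v in candidates if v not in colliding]
    # maximize the projection of the velocity onto the goal direction
    return max(free, key=lambda v: v[0] * dx + v[1] * dy)


def step(pos, v, dt):
    # position update: p_i(t+1) = p_i(t) + v_b * dt
    return (pos[0] + v[0] * dt, pos[1] + v[1] * dt)
```

When the preferred velocity (straight toward the waypoint) is in the colliding set, the follower sidesteps along the best remaining direction instead of stopping.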
Step 4: the evacuation process is finished when the number of people who have left through the exits equals the total number of people.
That is, the simulation ends when the number of people who have come out of the final exits equals the total number of people, as shown in fig. 6.
A learning curve and a high-priority experience playback strategy are introduced on the basis of a traditional MADDPG algorithm, so that an Efficient Multi-Agent Deep Deterministic Policy Gradient (E-MADDPG) algorithm is formed.
(1) Combining the learning curve
Because the external environment provides little information, reinforcement learning learns by "trial and error": through this mechanism the reinforcement learning system gains experience from action-evaluation interactions with the environment and improves its action policy to suit the environment. Learning-curve theory shows that learning efficiency increases as knowledge accumulates; the fixed-capacity experience pool in the MADDPG algorithm therefore inevitably limits learning efficiency.
Among the many learning curves, the Wright learning curve is the most widely used; its equation is as follows:
y(x) = k·x^α  (1)
α = lg a / lg 2  (2)
where x is the number of repetitions, k is the first learning effect (the time of the first repetition), y(x) is the time used for the x-th repetition, a is the learning rate, and α is the learning coefficient.
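Wright's curve can be evaluated directly. A small sketch, using the conventional form in which α is computed from a learning rate — here an assumed 80% curve, meaning each doubling of experience cuts the time to 80%:

```python
import math


def wright_time(x, k, alpha):
    """Wright learning curve: y(x) = k * x**alpha, the time used for the
    x-th repetition; alpha < 0 means experience reduces the time needed."""
    return k * x ** alpha


# assumed 80% learning curve: doubling experience cuts the time to 80%
alpha_80 = math.log(0.8, 2)
```

With k = 10 time units for the first repetition, the second repetition takes 8 under the 80% curve.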
Referring to the relation between production capacity and time in learning-curve theory, variation parameters are added to the experience pool capacity and the number of sampled transitions in the algorithm, giving an experience-pool curve and a sampling curve. The experience pool of the MADDPG algorithm is improved by combining the experience-pool curve: a variation parameter is added, and the pool size is adjusted through it so that the experience pool becomes dynamically variable, eliminating the impact of an undersized or oversized pool on learning efficiency during the learning process. After the improvement, the variation function is:
where R(t) is the current experience-pool size and t is the number of learning iterations.
Likewise, as the number of samples greatly increases, a fixed number of sampled transitions may hurt learning efficiency; the disclosure adjusts the sampling number through the variation parameter. After the improvement, the variation function is:
where N(t) is the current number of sampled transitions and t is the number of learning iterations.
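Since the variation functions themselves are not reproduced in this text, the following is only an illustrative stand-in: monotone power-law schedules for pool capacity and sample count, controlled by a variation parameter `beta`; all constants are assumptions:

```python
def pool_capacity(t, base=10_000, beta=0.5, cap=100_000):
    """Illustrative R(t): experience-pool size grows with learning step t,
    up to a cap; beta is the variation parameter (all constants assumed)."""
    return min(cap, int(base * (1 + t) ** beta))


def sample_count(t, base=64, beta=0.25, cap=1024):
    """Illustrative N(t): number of transitions sampled per update step."""
    return min(cap, int(base * (1 + t) ** beta))
```

The point of any such schedule is the same as the patent's: early in training the small pool keeps experiences fresh, and as knowledge accumulates both the pool and the batch grow.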
(2) Prioritized experience replay
In the conventional experience replay mechanism, random sampling makes the experiences fed to network training completely random, so the network trains inefficiently. To address this, the disclosure selects valuable samples in the replay buffer. The core idea of prioritized experience replay is to replay very successful — or extremely unsuccessful — attempts more frequently, thereby increasing learning efficiency. The idea originates from prioritized sweeping, which replays the samples most useful to learning at high frequency; the disclosure uses TD-error to measure this usefulness. TD-error is the difference between the estimated value of an action and the value output by the current value function: a larger TD-error indicates a more valuable sample, and replaying high-value samples more frequently helps the agent learn more from them, improving the effectiveness of learning and overall performance.
The present disclosure ensures that a new sample, whose TD-error is temporarily unknown, is replayed at least once by placing it first; thereafter, the sample with the largest TD-error is replayed each time.
The present disclosure selects the absolute value |δ_t| of a sample's TD-error as the criterion for evaluating the sample's value. The equation for |δ_t| is as follows:
δ_t = r(s_i, a_i) + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′) − Q(s_i, a_i|θ^Q)  (6)
where r(s_i, a_i) is the reward function, γ ∈ (0, 1) is the discount factor, Q′(s, a|θ^Q′) is the target action-value network, Q(s, a|θ^Q) is the action-value network, μ(s|θ^μ) is the actor network, and θ^Q and θ^μ are network parameters.
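Equation (6) reduces to a one-line computation once the network outputs are available; a minimal sketch with the Q-values passed in as plain numbers rather than computed by actual networks:

```python
def td_error(r, gamma, q_target_next, q_current):
    """delta_t = r + gamma * Q'(s_{t+1}, mu'(s_{t+1})) - Q(s_t, a_t).
    q_target_next stands for the target critic's value of the target actor's
    next action; |delta_t| then serves as the sample's replay priority."""
    return r + gamma * q_target_next - q_current
```

A sample whose estimated value disagrees strongly with the bootstrapped target gets a large |δ_t| and is replayed more often.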
The present disclosure defines the standard reinforcement learning elements for multi-agent reinforcement learning as follows:
definition 1 (state): is denoted by S, StEs can be expressed as the pedestrian' S position at time t. In the learning process, S includes the leader' S current location and the set of waypoints for the path plan.
Definition 2 (action): denoted A; a_t ∈ A represents the action by which the agent selects the next state from the current state.
Definition 3 (reward function): denoted R; it represents the environment's reward for executing action a. Multi-agent path planning mainly involves two tasks: reaching the destination and avoiding collisions. The reward function should be closely related to both tasks, and is defined in this disclosure as follows:
where r is an adaptation function defined as:
r = μ_1·(d_i − d_{i+1}) + μ_2·(d_k − d_{k+1}) + μ_3·(c_j − c_{j+1})  (8)
where μ_1, μ_2, μ_3 are usually positive values with μ_1 + μ_2 + μ_3 = 1; d_i denotes the minimum distance from the current position to the exit, d_obs the distance from the current position to the nearest obstacle, and c_j the congestion level of exit j. The definitions are as follows:
d_i = min √((x_i − x_j)² + (y_i − y_j)²), j ∈ (1, m)  (9)
d_obs = min √((x_i − x_k)² + (y_i − y_k)²), k ∈ (1, n)  (10)
c_j = p_j / b_j  (11)
where (x_i, y_i) is the leader's current location; (x_j, y_j), j ∈ (1, m), is the exit position; (x_k, y_k), k ∈ (1, n), is the position of an obstacle; p_j represents the number of people at target point j; and b_j represents the number of people passing target point j per unit time.
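The adaptation function (8) and congestion measure (11) can be sketched directly; the μ weights below are assumed values satisfying μ_1 + μ_2 + μ_3 = 1, and the argument names are illustrative:

```python
def adaptation_reward(d_i, d_i_next, d_k, d_k_next, c_j, c_j_next,
                      mu1=0.5, mu2=0.3, mu3=0.2):
    """Equation (8): reward is positive when the exit distance, the obstacle
    term, and the exit congestion all improve step over step; the mu weights
    here are assumed, not the patent's values."""
    assert abs(mu1 + mu2 + mu3 - 1.0) < 1e-9  # the weights must sum to 1
    return (mu1 * (d_i - d_i_next)
            + mu2 * (d_k - d_k_next)
            + mu3 * (c_j - c_j_next))


def congestion(p_j, b_j):
    """Equation (11): people waiting at exit j over exit j's throughput."""
    return p_j / b_j
```

For instance, moving one unit closer to the exit while the other terms stay flat yields a reward of μ_1 × 1.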
The proposed E-MADDPG algorithm specifically comprises the following steps:
Randomly initialize the actor network parameters θ^μ and the critic network parameters θ^Q;
Initialize the target network parameters θ^μ′ and θ^Q′;
Initialize the replay buffer D, the number of sampled samples N, the minimum TD-error absolute value |δ_min|, and the pointer P = 1;
for episode = 1, M do
    Initialize a random process φ for behavior exploration
    Receive the initial observation state s_1
    for t = 1, T do
        Execute action a_t, obtain the reward value r_t and the new state s_{t+1}
        Store experience e_t = (s_t, a_t, r_t, s_{t+1}) in replay buffer D
        Calculate the TD-error absolute value |δ_t| of sample e_t
        If |δ_t| > |δ_min| then
            Insert e_t, query the minimum TD-error, and update |δ_min|;
            P = P + 1;
        End if
        For agent i = 1, X do
            Set y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
            Train the critic network by minimizing the loss function L;
            Update the actor policy using the policy gradient;
        End for
        Update the target networks;
    End for
End for
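The TD-error-based insertion rule in the E-MADDPG steps above (keep a sample only if its |δ| exceeds the smallest |δ_min| currently stored) can be sketched as follows; the class and method names are illustrative, and the dynamic-capacity and sampling curves of the full algorithm are omitted for brevity:

```python
import random
from collections import namedtuple

Experience = namedtuple("Experience", "s a r s_next")

class TDFilteredBuffer:
    """Sketch of a TD-error-filtered replay buffer (simplified)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []  # list of (|TD-error|, Experience)

    def add(self, exp, abs_td_error):
        """Insert exp; once full, replace the entry with the smallest |TD-error|
        only when the new sample is more informative. Returns True if stored."""
        if len(self.data) < self.capacity:
            self.data.append((abs_td_error, exp))
            return True
        # |delta_min|: smallest TD-error absolute value currently stored
        min_idx = min(range(len(self.data)), key=lambda i: self.data[i][0])
        if abs_td_error > self.data[min_idx][0]:
            self.data[min_idx] = (abs_td_error, exp)
            return True
        return False

    def sample(self, n):
        """Draw up to n experiences uniformly from the retained samples."""
        return [e for _, e in random.sample(self.data, min(n, len(self.data)))]
```

With this rule, low-value experiences are discarded once the buffer is full, so gradient updates are computed from the samples with the largest TD errors retained so far.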
In this embodiment, a multi-agent deep reinforcement learning algorithm is applied to the path planning of crowd evacuation, improving crowd evacuation efficiency.
In this embodiment, the shortcomings of existing multi-agent deep reinforcement learning algorithms are addressed: the E-MADDPG algorithm is proposed on the basis of the MADDPG algorithm, learning efficiency is improved by making the experience pool dynamically variable according to the learning curve, and learning effectiveness is further improved by refining the algorithm's random-sampling mode. The state space of the algorithm is also improved: motion stopping points extracted from pedestrian video are used as the state space, which effectively alleviates the problem of dimension disaster.
In this embodiment, a hierarchical path planning method is adopted: considering crowd psychology, the crowd is divided into leaders and followers, and the large-scale crowd evacuation simulation problem is decomposed into a group of sub-problems. Grouping the crowd and guiding evacuation through leaders can effectively improve evacuation efficiency in public places and ensure people's safety in emergencies.
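The two-level scheme described above can be illustrated with a minimal kinematic sketch, assuming the leader already holds a planned waypoint path; this is not the disclosed E-MADDPG/RVO implementation (obstacle avoidance is omitted), and all names are illustrative:

```python
import math

def step_leader(leader_pos, waypoints, speed):
    """Upper layer: move the leader toward the next waypoint of its
    planned evacuation path; returns (new position, remaining waypoints)."""
    if not waypoints:
        return leader_pos, waypoints
    tx, ty = waypoints[0]
    dx, dy = tx - leader_pos[0], ty - leader_pos[1]
    dist = math.hypot(dx, dy)
    if dist <= speed:  # waypoint reached this step
        return (tx, ty), waypoints[1:]
    return (leader_pos[0] + speed * dx / dist,
            leader_pos[1] + speed * dy / dist), waypoints

def step_follower(follower_pos, leader_pos, speed):
    """Lower layer: the follower steers toward its group leader
    (collision avoidance, e.g. RVO, is omitted in this sketch)."""
    dx, dy = leader_pos[0] - follower_pos[0], leader_pos[1] - follower_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return follower_pos
    step = min(speed, dist)
    return (follower_pos[0] + step * dx / dist,
            follower_pos[1] + step * dy / dist)
```

Each simulation tick advances the leader along its global path and then updates every follower toward the leader, mirroring the decomposition of the evacuation problem into per-group sub-problems.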
Example 2
In this embodiment, a crowd evacuation simulation system based on deep reinforcement learning with experience pool optimization is disclosed, which includes:
the initialization setting module is used for carrying out initialization setting on parameters in the evacuation scene simulation model according to the scene information and the crowd parameter information;
the in-group guidance selection module is used for grouping the crowds; selecting a leader in the group;
and the evacuation simulation module acquires the evacuation path of the crowd by adopting a hierarchical path planning method, wherein the leader in the upper-layer group carries out global path planning through an E-MADDPG algorithm to acquire an optimal evacuation path, and the follower in the lower-layer group carries out evacuation along the optimal evacuation path by avoiding obstacles and following the leader.
Example 3
In this embodiment, an electronic device is disclosed, comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the deep reinforcement learning-based crowd evacuation simulation method.
Example 4
In this embodiment, a computer readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of the deep reinforcement learning based crowd evacuation simulation method.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.
Claims (10)
1. The crowd evacuation simulation method based on deep reinforcement learning is characterized by comprising the following steps:
initializing the constructed evacuation scene simulation model according to the scene information and the crowd parameter information;
grouping the crowds, and dividing a leader and a follower of each group;
and obtaining the evacuation path of the crowd by adopting a hierarchical path planning method, wherein the leader in the upper-layer group carries out global path planning through an E-MADDPG algorithm to obtain an optimal evacuation path, and the follower in the lower-layer group carries out evacuation along the optimal evacuation path by avoiding obstacles and following the leader.
2. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein a database of real shopping-mall scenes is received, and the YOLO V3 method is used to obtain pedestrian motion stopping points from pedestrian video as the state space of the E-MADDPG algorithm.
3. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein variation parameters are added to the experience pool capacity and the number of sampling samples in the MADDPG algorithm to form an experience pool curve and a sampling sample curve of the E-MADDPG algorithm, and the state space of the E-MADDPG algorithm is dynamically variable by adjusting the experience pool size and the number of sampling samples through the variation parameters.
4. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein during network training of the E-MADDPG algorithm, samples with high value are selected for experience replay.
5. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein the group leader performs global path planning by using the E-MADDPG algorithm to obtain the optimal evacuation path, specifically:
acquiring all evacuation paths of the leader according to the exit position and the initial position of the leader;
calculating a reward value for each evacuation path;
and selecting the evacuation path with the maximum reward value as the optimal evacuation path.
6. The deep reinforcement learning-based crowd evacuation simulation method according to claim 5, wherein the exit selected by the leader is rewarded according to whether the leader reaches the exit and whether a collision occurs, so as to obtain the reward value of the evacuation path.
7. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein the group followers avoid obstacles based on the RVO algorithm and follow the leader to evacuate along the optimal evacuation path, specifically comprising:
calculating all the speeds of the followers in collision and the optimal collision-free speed, wherein the direction of the optimal collision-free speed is the direction of the leader in the group moving along the optimal evacuation path;
acquiring the current position of a follower;
when the optimal collision-free speed of the follower is obtained, the position of the follower is updated.
8. Crowd evacuation simulation system of degree of depth reinforcement study based on experience pond is optimized, its characterized in that includes:
the initialization setting module is used for carrying out initialization setting on parameters in the evacuation scene simulation model according to the scene information and the crowd parameter information;
the intra-group leader selection module is used for grouping all the individuals; selecting a leader in the group;
and the evacuation simulation module acquires the evacuation path of the crowd by adopting a hierarchical path planning method, wherein the leader in the upper-layer group carries out global path planning through an E-MADDPG algorithm to acquire an optimal evacuation path, and the follower in the lower-layer group carries out evacuation along the optimal evacuation path by avoiding obstacles and following the leader.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010942444.2A CN112231967B (en) | 2020-09-09 | 2020-09-09 | Crowd evacuation simulation method and system based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112231967A true CN112231967A (en) | 2021-01-15 |
CN112231967B CN112231967B (en) | 2023-05-26 |
Family
ID=74117069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010942444.2A Active CN112231967B (en) | 2020-09-09 | 2020-09-09 | Crowd evacuation simulation method and system based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231967B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113156979A (en) * | 2021-05-27 | 2021-07-23 | 浙江农林大学 | Forest guard patrol path planning method and device based on improved MADDPG algorithm |
CN113359859A (en) * | 2021-07-16 | 2021-09-07 | 广东电网有限责任公司 | Combined navigation obstacle avoidance method and system, terminal device and storage medium |
CN114518771A (en) * | 2022-02-23 | 2022-05-20 | 深圳大漠大智控技术有限公司 | Multi-unmanned aerial vehicle path planning method and device and related components |
KR20220141576A (en) * | 2021-04-13 | 2022-10-20 | 한기성 | Evacuation route simulation device using machine learning and learning method |
CN115454074A (en) * | 2022-09-16 | 2022-12-09 | 北京华电力拓能源科技有限公司 | Evacuation path planning method and device, computer equipment and storage medium |
CN116167145A (en) * | 2023-04-23 | 2023-05-26 | 中铁第四勘察设计院集团有限公司 | Method and system for constructing space three-dimensional safety evacuation system of under-road complex |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670270A (en) * | 2019-01-11 | 2019-04-23 | 山东师范大学 | Crowd evacuation emulation method and system based on the study of multiple agent deeply |
CN110491132A (en) * | 2019-07-11 | 2019-11-22 | 平安科技(深圳)有限公司 | Vehicle based on video frame picture analyzing, which is disobeyed, stops detection method and device |
CN111414681A (en) * | 2020-03-13 | 2020-07-14 | 山东师范大学 | In-building evacuation simulation method and system based on shared deep reinforcement learning |
Non-Patent Citations (2)
Title |
---|
许诺等: "稀疏奖励下基于MADDPG算法的多智能体协同", 《现代计算机》 * |
郑尚菲: "基于深度强化学习的路径规划方法及应用", 《中国优秀硕士学位论文全文数据库 社会科学Ⅰ辑》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20220141576A (en) * | 2021-04-13 | 2022-10-20 | 한기성 | Evacuation route simulation device using machine learning and learning method |
KR102521990B1 (en) | 2021-04-13 | 2023-04-14 | 한기성 | Evacuation route simulation device using machine learning and learning method |
CN113156979A (en) * | 2021-05-27 | 2021-07-23 | 浙江农林大学 | Forest guard patrol path planning method and device based on improved MADDPG algorithm |
CN113156979B (en) * | 2021-05-27 | 2022-09-06 | 浙江农林大学 | Forest guard patrol path planning method and device based on improved MADDPG algorithm |
CN113359859A (en) * | 2021-07-16 | 2021-09-07 | 广东电网有限责任公司 | Combined navigation obstacle avoidance method and system, terminal device and storage medium |
CN113359859B (en) * | 2021-07-16 | 2023-09-08 | 广东电网有限责任公司 | Combined navigation obstacle avoidance method, system, terminal equipment and storage medium |
CN114518771A (en) * | 2022-02-23 | 2022-05-20 | 深圳大漠大智控技术有限公司 | Multi-unmanned aerial vehicle path planning method and device and related components |
CN115454074A (en) * | 2022-09-16 | 2022-12-09 | 北京华电力拓能源科技有限公司 | Evacuation path planning method and device, computer equipment and storage medium |
CN116167145A (en) * | 2023-04-23 | 2023-05-26 | 中铁第四勘察设计院集团有限公司 | Method and system for constructing space three-dimensional safety evacuation system of under-road complex |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112231967A (en) | Crowd evacuation simulation method and system based on deep reinforcement learning | |
CN111142522B (en) | Method for controlling agent of hierarchical reinforcement learning | |
CN110276765B (en) | Image panorama segmentation method based on multitask learning deep neural network | |
CN106970615B (en) | A kind of real-time online paths planning method of deeply study | |
CN110874578B (en) | Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning | |
CN111766782B (en) | Strategy selection method based on Actor-Critic framework in deep reinforcement learning | |
JP2022516383A (en) | Autonomous vehicle planning | |
JP2020126619A (en) | Method and device for short-term path planning of autonomous driving through information fusion by using v2x communication and image processing | |
EP3772710A1 (en) | Artificial intelligence server | |
CN111274438B (en) | Language description guided video time sequence positioning method | |
CN111476771B (en) | Domain self-adaption method and system based on distance countermeasure generation network | |
CN112231968A (en) | Crowd evacuation simulation method and system based on deep reinforcement learning algorithm | |
WO2022007867A1 (en) | Method and device for constructing neural network | |
CN110795833A (en) | Crowd evacuation simulation method, system, medium and equipment based on cat swarm algorithm | |
Szep et al. | Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion. | |
CN113888638A (en) | Pedestrian trajectory prediction method based on attention mechanism and through graph neural network | |
Hoy et al. | Learning to predict pedestrian intention via variational tracking networks | |
WO2020099854A1 (en) | Image classification, generation and application of neural networks | |
CN109508686A (en) | A kind of Human bodys' response method based on the study of stratification proper subspace | |
CN112121419A (en) | Virtual object control method, device, electronic equipment and storage medium | |
CN117455553B (en) | Subway station passenger flow volume prediction method | |
JP2021197184A (en) | Device and method for training and testing classifier | |
CN114548497B (en) | Crowd motion path planning method and system for realizing scene self-adaption | |
CN112947466B (en) | Parallel planning method and equipment for automatic driving and storage medium | |
CN112330043B (en) | Evacuation path planning method and system combining Q-learning and multi-swarm algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||