CN112231967A - Crowd evacuation simulation method and system based on deep reinforcement learning - Google Patents
- Publication number
- CN112231967A (application number CN202010942444.2A)
- Authority
- CN
- China
- Prior art keywords
- evacuation
- leader
- crowd
- path
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
- G06Q50/265—Personal security, identity or safety
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Abstract
The disclosed crowd evacuation simulation method and system based on deep reinforcement learning comprise: initializing a constructed evacuation scene simulation model according to scene information and crowd parameter information; grouping the crowd and dividing each group into a leader and followers; and obtaining the crowd's evacuation path with a hierarchical path planning method, in which the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, while the followers in each lower-layer group evacuate along that path, avoiding obstacles and following their leader. A learning curve and a high-priority experience replay strategy are introduced into the traditional MADDPG algorithm to form the E-MADDPG algorithm, improving its learning efficiency. A hierarchical path planning method built on E-MADDPG plans the crowd's evacuation paths, effectively shortening path planning time, guiding the crowd more effectively, and improving crowd evacuation efficiency.
Description
Technical Field
The disclosure relates to a crowd evacuation simulation method and system based on deep reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
As public safety incidents become more frequent, large-scale crowd evacuation has become an essential part of emergency response. In densely crowded places, once a dangerous accident occurs, people rush to escape the scene, causing congestion during evacuation. If the crowd cannot be evacuated in time, collisions and trampling accidents may occur, causing secondary injuries to the evacuees. At the same time, large-scale crowd evacuation is a complex process, and real large-scale evacuation experiments are difficult to conduct because of organizational difficulty, high cost, and personnel safety concerns. Computer simulation has therefore become the main means of analyzing the evacuation process and evaluating evacuation efficiency.
How to improve crowd evacuation efficiency and avoid secondary injuries has long concerned researchers. Reinforcement learning has been one of the research hotspots of artificial intelligence in recent years, and combining it with path planning offers a new way to improve crowd evacuation efficiency. Path planning algorithms based on multi-agent reinforcement learning greatly improve planning efficiency and, through continuous learning, adapt to dynamic environments, making them highly practical. However, most real evacuation scenes are complex, which traditional reinforcement learning methods struggle to handle; deep learning, by contrast, can effectively process high-dimensional input and thus cope with complex real scenes. Combining reinforcement learning's learning strategies with deep learning's ability to handle high-dimensional input therefore suits crowd evacuation simulation well. The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm proposed by Lowe et al. is a recent multi-agent deep reinforcement learning algorithm, but it suffers from a fixed state space, random experience replay, and similar problems that seriously reduce its learning efficiency. Moreover, as the number of agents guiding evacuation grows and the environment becomes more complex, the state space inevitably becomes huge, and these problems seriously limit the algorithm's usefulness in the crowd evacuation field.
Disclosure of Invention
To solve these problems, the invention provides a crowd evacuation simulation method and system based on deep reinforcement learning. A learning curve and a high-priority experience replay strategy are introduced into the traditional MADDPG algorithm to form the Efficient Multi-Agent Deep Deterministic Policy Gradient (E-MADDPG) algorithm, improving its learning efficiency. A hierarchical path planning method built on E-MADDPG plans the crowd's evacuation paths, effectively shortening path planning time, guiding the crowd to evacuate more effectively, and improving crowd evacuation efficiency.
In order to achieve the purpose, the following technical scheme is adopted in the disclosure:
in one or more embodiments, a crowd evacuation simulation method based on deep reinforcement learning is provided, including:
initializing the constructed evacuation scene simulation model according to the scene information and the crowd parameter information;
grouping the crowd, and dividing each group into a leader and followers;
and obtaining the crowd's evacuation path with a hierarchical path planning method, wherein the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, and the followers in each lower-layer group evacuate along the optimal evacuation path while avoiding obstacles and following the leader.
Further, a real scene database of a shopping mall is received, and pedestrian movement stopping points are acquired as the state space of the E-MADDPG algorithm.
Furthermore, variation parameters are added to the experience pool capacity and the number of sampled transitions in the MADDPG algorithm, forming the experience-pool curve and sampling curve of the E-MADDPG algorithm; the pool size and the number of sampled transitions are adjusted through the variation parameters, so that the state space of the E-MADDPG algorithm becomes dynamically variable.
Furthermore, during network training of the E-MADDPG algorithm, samples with high value are selected for experience replay.
In one or more embodiments, a deep reinforcement learning crowd evacuation simulation system based on experience pool optimization is provided, comprising:
the initialization setting module is used for carrying out initialization setting on parameters in the evacuation scene simulation model according to the scene information and the crowd parameter information;
the in-group guidance selection module, used for grouping the crowd and selecting a leader for each group;
and the evacuation simulation module, which obtains the crowd's evacuation path with a hierarchical path planning method, wherein the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, and the followers in each lower-layer group evacuate along the optimal evacuation path while avoiding obstacles and following the leader.
In one or more embodiments, an electronic device is provided, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor; when executed by the processor, the computer instructions perform the steps of the deep reinforcement learning-based crowd evacuation simulation method.
In one or more embodiments, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, perform the steps of the deep reinforcement learning-based crowd evacuation simulation method.
Compared with the prior art, the beneficial effects of the present disclosure are:
1. the multi-agent deep reinforcement learning algorithm is applied to the path planning of crowd evacuation, and the crowd evacuation efficiency is improved.
2. Considering the defects of multi-agent deep reinforcement learning algorithms, an E-MADDPG algorithm is proposed on the basis of MADDPG: a learning curve makes the experience pool dynamically variable to improve learning efficiency; the algorithm's random sampling mode is improved to increase learning effectiveness; and the state space is improved by extracting motion stopping points from pedestrian video, effectively solving the curse-of-dimensionality problem.
3. A hierarchical path planning method is adopted to obtain the crowd's evacuation paths. Considering herd psychology, the crowd is divided into leaders and followers, splitting the large-scale crowd evacuation simulation problem into a set of sub-problems. Guiding evacuation through crowd grouping and leaders can effectively improve the evacuation efficiency of public places and ensure people's safety in emergencies.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
Fig. 1 is a flow chart of example 1 of the present disclosure;
fig. 2 is a pedestrian motion trajectory extracted by a YOLO V3 method in embodiment 1 of the present disclosure;
fig. 3 is an evacuation scenario diagram constructed in embodiment 1 of the present disclosure;
FIG. 4 is a schematic diagram of a crowd grouping in accordance with embodiment 1 of the present disclosure;
fig. 5 is a schematic view of crowd evacuation in embodiment 1 of the present disclosure;
fig. 6 is a schematic diagram of the evacuation end time of the crowd according to embodiment 1 of the disclosure.
Detailed Description of Embodiments:
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In the present disclosure, terms such as "upper", "lower", "left", "right", "front", "rear", "vertical", "horizontal", "side", "bottom", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only relational terms determined for convenience in describing structural relationships of the parts or elements of the present disclosure, and do not refer to any parts or elements of the present disclosure, and are not to be construed as limiting the present disclosure.
In the present disclosure, terms such as "fixedly connected", "connected", and the like are to be understood in a broad sense, and mean either a fixed connection or an integrally connected or detachable connection; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present disclosure can be determined on a case-by-case basis by persons skilled in the relevant art or technicians, and are not to be construed as limitations of the present disclosure.
Example 1
The embodiment discloses a crowd evacuation simulation method based on deep reinforcement learning, which comprises the following steps:
initializing the constructed evacuation scene simulation model according to the scene information and the crowd parameter information;
grouping the crowd, and dividing each group into a leader and followers;
and obtaining the crowd's evacuation path with a hierarchical path planning method, wherein the leader of each upper-layer group performs global path planning through the E-MADDPG algorithm to obtain an optimal evacuation path, and the followers in each lower-layer group evacuate along the optimal evacuation path while avoiding obstacles and following the leader.
Further, a real scene database of a shopping mall is received, and the pedestrians' movement stopping points are extracted from pedestrian video with the YOLO V3 method and used as the state space of the E-MADDPG algorithm.
Furthermore, variation parameters are added to the experience pool capacity and the number of sampled transitions in the MADDPG algorithm, forming the experience-pool curve and sampling curve of the E-MADDPG algorithm; the pool size and the number of sampled transitions are adjusted through the variation parameters, so that the state space of the E-MADDPG algorithm becomes dynamically variable.
Furthermore, during network training of the E-MADDPG algorithm, samples with high value are selected for experience replay.
Further, the group leader performs global path planning through an E-MADDPG algorithm to obtain an optimal evacuation path, specifically:
acquiring all evacuation paths of the leader according to the exit position and the initial position of the leader;
calculating a reward value for each evacuation path;
and selecting the evacuation path with the maximum reward value as the optimal evacuation path.
Further, the intra-group followers avoid obstacles based on the RVO (Reciprocal Velocity Obstacles) algorithm and follow the leader to evacuate along the optimal evacuation path, specifically:
calculating all velocities at which a follower would collide, and the follower's optimal collision-free velocity, whose direction is the direction in which the group leader moves along the optimal evacuation path;
acquiring the current position of the follower;
and updating the follower's position once its optimal collision-free velocity is obtained.
The method for simulating crowd evacuation based on deep reinforcement learning is specifically described with reference to fig. 1 to 6, and includes the following steps:
step 1: receiving a real scene database of a shopping mall, and extracting a pedestrian movement parking point in a video as a state space by using a YOLO V3 method;
the state information of the deep reinforcement learning represents the environment information perceived by the intelligent agent and the change caused by the change of the self action, the state information is the basis for the intelligent agent to make a decision and evaluate the long-term benefit of the intelligent agent, the condition of the state design directly determines whether the deep reinforcement learning algorithm can be converged and the convergence speed is high or low, and as the evacuation scene is enlarged and refined, the explosion of the state space is inevitably caused, which is called a dimension disaster, in order to solve the problem, the embodiment provides a new state representation method, and the stationary point of the pedestrian motion track is extracted from the real pedestrian video by adopting a YOLO V3 method to obtain the corresponding state change point, on the basis, all the state change points in the scene can be used as the state space, and the process is shown in FIG. 2.
Step 2: creating an evacuation scene model and a character model according to preset evacuation scene parameter information, as shown in fig. 3, introducing the character model into the evacuation scene model, initializing the crowd parameter information as preset evacuation crowd parameter information, grouping the crowds, and dividing a leader and a follower for each group of crowds as shown in fig. 4;
the initialized population is grouped, the leader and the followers are divided, and each population in the space has one leader to evacuate the followers, as shown in fig. 5. The leader and follower should have the following characteristics:
(1) the leader needs to know the location of the exits.
(2) The follower should always follow the leader during evacuation.
The result after grouping is shown in fig. 4.
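A minimal sketch of one plausible grouping rule follows. The patent does not fix a grouping metric, so a nearest-leader assignment is assumed here; all names are illustrative:

```python
def group_crowd(positions, leader_ids):
    """Assign every non-leader to its nearest leader (squared Euclidean
    distance); `positions` maps pedestrian id -> (x, y). The nearest-leader
    rule is an assumption, not the patent's prescribed grouping method."""
    groups = {lid: [] for lid in leader_ids}
    for pid, (x, y) in positions.items():
        if pid in leader_ids:
            continue
        nearest = min(leader_ids,
                      key=lambda lid: (positions[lid][0] - x) ** 2
                                      + (positions[lid][1] - y) ** 2)
        groups[nearest].append(pid)
    return groups
```

Each resulting group then evacuates as one sub-problem: the leader plans, the members follow.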
Step 3: during evacuation, a hierarchical path planning method is adopted to obtain the crowd's evacuation paths: the upper layer uses the E-MADDPG algorithm to perform global path planning for each group's leader and obtain an optimal evacuation path, while the bottom layer uses the RVO algorithm to realize collision avoidance for the followers in each group, who follow the leader along the optimal evacuation path;
in the top layer: when the exit position p is knownjAnd the initial position of leader iWhen the operation is executed, the next position is reachedThis position then continues to perform the operation as the current position and repeats the operation until he reaches the exit position pjAnd grouping the bitsThe sequence is regarded as the evacuation path of the leader i, and the successfully evacuated k paths are temporarily stored in a path BufferiIn (1). But the reward per path may be very different due to the simultaneous influence of other agents. The disclosure derives reward values for a set of sequences of positions in k paths in a bufferWhen k candidate paths in the Path buffer area are traversed, the Path with the maximum reward R in the Path buffer area is selected as the optimal evacuation Path of the leader i, and a Path set Path is outputi。
In the bottom layer: pedestrian obstacle avoidance with RVO is realized in two steps. First, all velocities v_c at which individual i would collide with a neighboring individual j are computed; then individual i selects, among its collision-free velocities, the optimal collision-free velocity v_b. The optimal velocity v_b is a vector with both direction and magnitude; its direction points from the individual's current position p_i^t toward the next waypoint of the global path p_i obtained from the upper layer. Once the individual's collision-free velocity is obtained, its position is updated as p_i^{t+1} = p_i^t + v_b·Δt, where p_i^t is the current position. In this way, all pedestrians move during evacuation along the optimal path produced by the upper-layer route-planning training.
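A toy stand-in for the two bottom-layer steps — filtering out colliding velocities and picking the one best aligned with the leader's path — is sketched below. A real RVO implementation reasons over continuous velocity cones; this discrete candidate set only approximates that, and all names are illustrative:

```python
def optimal_velocity(pos, waypoint, candidates, colliding):
    """Among collision-free candidate velocities, pick the one best aligned
    with the direction toward the next waypoint on the leader's path."""
    dx, dy = waypoint[0] - pos[0], waypoint[1] - pos[1]
    free = [v for v in candidates if v not in colliding]
    # maximize the projection of the velocity onto the goal direction
    return max(free, key=lambda v: v[0] * dx + v[1] * dy)


def step(pos, v, dt):
    # position update: p_i(t+1) = p_i(t) + v_b * dt
    return (pos[0] + v[0] * dt, pos[1] + v[1] * dt)
```

When the preferred velocity (straight toward the waypoint) is in the colliding set, the follower sidesteps along the best remaining direction instead of stopping.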
Step 4: the evacuation process is finished when the number of people who have left through the exits equals the total number of people.
That is, the simulation ends when the number of people who have come out of the final exits equals the total number of people, as shown in fig. 6.
A learning curve and a high-priority experience playback strategy are introduced on the basis of a traditional MADDPG algorithm, so that an Efficient Multi-Agent Deep Deterministic Policy Gradient (E-MADDPG) algorithm is formed.
(1) Combining the learning curve
Because the external environment provides little information, reinforcement learning learns by "trial and error": through this mechanism the reinforcement learning system gains experience from action-evaluation interactions with the environment and improves its action policy to suit the environment. Learning-curve theory shows that learning efficiency increases as knowledge accumulates; the fixed-capacity experience pool in the MADDPG algorithm therefore inevitably limits learning efficiency.
Among the many learning curves, the Wright learning curve is the most widely used; its equation is as follows:
y(x) = k·x^α  (1)
α = lg a / lg 2  (2)
where x is the number of repetitions, k is the first learning effect (the time of the first repetition), y(x) is the time used for the x-th repetition, a is the learning rate, and α is the learning coefficient.
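Wright's curve can be evaluated directly. A small sketch, using the conventional form in which α is computed from a learning rate — here an assumed 80% curve, meaning each doubling of experience cuts the time to 80%:

```python
import math


def wright_time(x, k, alpha):
    """Wright learning curve: y(x) = k * x**alpha, the time used for the
    x-th repetition; alpha < 0 means experience reduces the time needed."""
    return k * x ** alpha


# assumed 80% learning curve: doubling experience cuts the time to 80%
alpha_80 = math.log(0.8, 2)
```

With k = 10 time units for the first repetition, the second repetition takes 8 under the 80% curve.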
Referring to the relation between production capacity and time in learning-curve theory, variation parameters are added to the experience pool capacity and the number of sampled transitions in the algorithm, giving an experience-pool curve and a sampling curve. The experience pool of the MADDPG algorithm is improved by combining the experience-pool curve: a variation parameter is added, and the pool size is adjusted through it so that the experience pool becomes dynamically variable, eliminating the impact of an undersized or oversized pool on learning efficiency during the learning process. After the improvement, the variation function is:
where R(t) is the current experience-pool size and t is the number of learning iterations.
Likewise, as the number of samples greatly increases, a fixed number of sampled transitions may hurt learning efficiency; the disclosure adjusts the sampling number through the variation parameter. After the improvement, the variation function is:
where N(t) is the current number of sampled transitions and t is the number of learning iterations.
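Since the variation functions themselves are not reproduced in this text, the following is only an illustrative stand-in: monotone power-law schedules for pool capacity and sample count, controlled by a variation parameter `beta`; all constants are assumptions:

```python
def pool_capacity(t, base=10_000, beta=0.5, cap=100_000):
    """Illustrative R(t): experience-pool size grows with learning step t,
    up to a cap; beta is the variation parameter (all constants assumed)."""
    return min(cap, int(base * (1 + t) ** beta))


def sample_count(t, base=64, beta=0.25, cap=1024):
    """Illustrative N(t): number of transitions sampled per update step."""
    return min(cap, int(base * (1 + t) ** beta))
```

The point of any such schedule is the same as the patent's: early in training the small pool keeps experiences fresh, and as knowledge accumulates both the pool and the batch grow.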
(2) Prioritized experience replay
In the conventional experience replay mechanism, random sampling makes the experiences fed to network training completely random, so the network trains inefficiently. To address this, the disclosure selects valuable samples in the replay buffer. The core idea of prioritized experience replay is to replay very successful — or extremely unsuccessful — attempts more frequently, thereby increasing learning efficiency. The idea originates from prioritized sweeping, which replays the samples most useful to learning at high frequency; the disclosure uses TD-error to measure this usefulness. TD-error is the difference between the estimated value of an action and the value output by the current value function: a larger TD-error indicates a more valuable sample, and replaying high-value samples more frequently helps the agent learn more from them, improving the effectiveness of learning and overall performance.
The present disclosure ensures that a new sample, whose TD-error is temporarily unknown, is replayed at least once by placing it first; thereafter, the sample with the largest TD-error is replayed each time.
The present disclosure selects the absolute value |δ_t| of a sample's TD-error as the criterion for evaluating the sample's value. The equation for |δ_t| is as follows:
δ_t = r(s_i, a_i) + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′) − Q(s_i, a_i|θ^Q)  (6)
where r(s_i, a_i) is the reward function, γ ∈ (0, 1) is the discount factor, Q′(s, a|θ^Q′) is the target action-value network, Q(s, a|θ^Q) is the action-value network, μ(s|θ^μ) is the actor network, and θ^Q and θ^μ are network parameters.
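Equation (6) reduces to a one-line computation once the network outputs are available; a minimal sketch with the Q-values passed in as plain numbers rather than computed by actual networks:

```python
def td_error(r, gamma, q_target_next, q_current):
    """delta_t = r + gamma * Q'(s_{t+1}, mu'(s_{t+1})) - Q(s_t, a_t).
    q_target_next stands for the target critic's value of the target actor's
    next action; |delta_t| then serves as the sample's replay priority."""
    return r + gamma * q_target_next - q_current
```

A sample whose estimated value disagrees strongly with the bootstrapped target gets a large |δ_t| and is replayed more often.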
The present disclosure defines the standard reinforcement learning elements for multi-agent reinforcement learning as follows:
definition 1 (state): is denoted by S, StEs can be expressed as the pedestrian' S position at time t. In the learning process, S includes the leader' S current location and the set of waypoints for the path plan.
Definition 2 (action): denoted A; a_t ∈ A represents the action by which the agent selects the next state from the current state.
Definition 3 (reward function): denoted R; it represents the environment's reward for executing action a. Multi-agent path planning mainly involves two tasks: reaching the destination and avoiding collisions. The reward function should be closely related to both tasks, and is defined in this disclosure as follows:
where r is an adaptation function defined as:
r = μ_1·(d_i − d_{i+1}) + μ_2·(d_k − d_{k+1}) + μ_3·(c_j − c_{j+1})  (8)
where μ_1, μ_2, μ_3 are usually positive values with μ_1 + μ_2 + μ_3 = 1; d_i denotes the minimum distance from the current position to the exit, d_obs the distance from the current position to the nearest obstacle, and c_j the congestion level of exit j. The definitions are as follows:
d_i = min √((x_i − x_j)² + (y_i − y_j)²), j ∈ (1, m)  (9)
d_obs = min √((x_i − x_k)² + (y_i − y_k)²), k ∈ (1, n)  (10)
c_j = p_j / b_j  (11)
where (x_i, y_i) is the leader's current location; (x_j, y_j), j ∈ (1, m), is the exit position; (x_k, y_k), k ∈ (1, n), is the position of an obstacle; p_j represents the number of people at target point j; and b_j represents the number of people passing target point j per unit time.
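The adaptation function (8) and congestion measure (11) can be sketched directly; the μ weights below are assumed values satisfying μ_1 + μ_2 + μ_3 = 1, and the argument names are illustrative:

```python
def adaptation_reward(d_i, d_i_next, d_k, d_k_next, c_j, c_j_next,
                      mu1=0.5, mu2=0.3, mu3=0.2):
    """Equation (8): reward is positive when the exit distance, the obstacle
    term, and the exit congestion all improve step over step; the mu weights
    here are assumed, not the patent's values."""
    assert abs(mu1 + mu2 + mu3 - 1.0) < 1e-9  # the weights must sum to 1
    return (mu1 * (d_i - d_i_next)
            + mu2 * (d_k - d_k_next)
            + mu3 * (c_j - c_j_next))


def congestion(p_j, b_j):
    """Equation (11): people waiting at exit j over exit j's throughput."""
    return p_j / b_j
```

For instance, moving one unit closer to the exit while the other terms stay flat yields a reward of μ_1 × 1.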
The proposed E-MADDPG algorithm specifically comprises the following steps:
Randomly initialize the actor network parameters θ^μ and the critic network parameters θ^Q;
Initialize the target network parameters θ^μ′ and θ^Q′;
Initialize the replay buffer D, the number of sampled samples N, the minimum TD-error absolute value |δ_min|, and the pointer P = 1;
for episode = 1, M do
    Initialize a random process φ for behavior exploration
    Receive the initial observation state s_1
    for t = 1, T do
        Execute action a_t, obtain the reward value r_t and the new state s_{t+1}
        Store experience e_t = (s_t, a_t, r_t, s_{t+1}) in replay buffer D
        Calculate the TD-error absolute value |δ_t| of sample e_t
        If |δ_t| > |δ_min| then
            Insert e_t, query the minimum TD-error, and update |δ_min|;
            P = P + 1;
        End if
        For agent i = 1, X do
            Set y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
            Train the critic network by minimizing the loss function L;
            Update the actor policy using the policy gradient;
        End for
        Update the target networks;
    End for
End for
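The TD-error-based insertion rule in the E-MADDPG steps above (keep a sample only if its |δ| exceeds the smallest |δ_min| currently stored) can be sketched as follows; the class and method names are illustrative, and the dynamic-capacity and sampling curves of the full algorithm are omitted for brevity:

```python
import random
from collections import namedtuple

Experience = namedtuple("Experience", "s a r s_next")

class TDFilteredBuffer:
    """Sketch of a TD-error-filtered replay buffer (simplified)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []  # list of (|TD-error|, Experience)

    def add(self, exp, abs_td_error):
        """Insert exp; once full, replace the entry with the smallest |TD-error|
        only when the new sample is more informative. Returns True if stored."""
        if len(self.data) < self.capacity:
            self.data.append((abs_td_error, exp))
            return True
        # |delta_min|: smallest TD-error absolute value currently stored
        min_idx = min(range(len(self.data)), key=lambda i: self.data[i][0])
        if abs_td_error > self.data[min_idx][0]:
            self.data[min_idx] = (abs_td_error, exp)
            return True
        return False

    def sample(self, n):
        """Draw up to n experiences uniformly from the retained samples."""
        return [e for _, e in random.sample(self.data, min(n, len(self.data)))]
```

With this rule, low-value experiences are discarded once the buffer is full, so gradient updates are computed from the samples with the largest TD errors retained so far.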
In this embodiment, a multi-agent deep reinforcement learning algorithm is applied to the path planning of crowd evacuation, improving crowd evacuation efficiency.
In this embodiment, the shortcomings of existing multi-agent deep reinforcement learning algorithms are addressed: the E-MADDPG algorithm is proposed on the basis of the MADDPG algorithm, learning efficiency is improved by making the experience pool dynamically variable according to the learning curve, and learning effectiveness is further improved by refining the algorithm's random-sampling mode. The state space of the algorithm is also improved: motion stopping points extracted from pedestrian video are used as the state space, which effectively alleviates the problem of dimension disaster.
In this embodiment, a hierarchical path planning method is adopted: considering crowd psychology, the crowd is divided into leaders and followers, and the large-scale crowd evacuation simulation problem is decomposed into a group of sub-problems. Grouping the crowd and guiding evacuation through leaders can effectively improve evacuation efficiency in public places and ensure people's safety in emergencies.
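The two-level scheme described above can be illustrated with a minimal kinematic sketch, assuming the leader already holds a planned waypoint path; this is not the disclosed E-MADDPG/RVO implementation (obstacle avoidance is omitted), and all names are illustrative:

```python
import math

def step_leader(leader_pos, waypoints, speed):
    """Upper layer: move the leader toward the next waypoint of its
    planned evacuation path; returns (new position, remaining waypoints)."""
    if not waypoints:
        return leader_pos, waypoints
    tx, ty = waypoints[0]
    dx, dy = tx - leader_pos[0], ty - leader_pos[1]
    dist = math.hypot(dx, dy)
    if dist <= speed:  # waypoint reached this step
        return (tx, ty), waypoints[1:]
    return (leader_pos[0] + speed * dx / dist,
            leader_pos[1] + speed * dy / dist), waypoints

def step_follower(follower_pos, leader_pos, speed):
    """Lower layer: the follower steers toward its group leader
    (collision avoidance, e.g. RVO, is omitted in this sketch)."""
    dx, dy = leader_pos[0] - follower_pos[0], leader_pos[1] - follower_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return follower_pos
    step = min(speed, dist)
    return (follower_pos[0] + step * dx / dist,
            follower_pos[1] + step * dy / dist)
```

Each simulation tick advances the leader along its global path and then updates every follower toward the leader, mirroring the decomposition of the evacuation problem into per-group sub-problems.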
Example 2
In this embodiment, a crowd evacuation simulation system based on deep reinforcement learning with experience pool optimization is disclosed, which includes:
the initialization setting module is used for carrying out initialization setting on parameters in the evacuation scene simulation model according to the scene information and the crowd parameter information;
the in-group guidance selection module is used for grouping the crowds; selecting a leader in the group;
and the evacuation simulation module acquires the evacuation path of the crowd by adopting a hierarchical path planning method, wherein the leader in the upper-layer group carries out global path planning through an E-MADDPG algorithm to acquire an optimal evacuation path, and the follower in the lower-layer group carries out evacuation along the optimal evacuation path by avoiding obstacles and following the leader.
Example 3
In this embodiment, an electronic device is disclosed, comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the deep reinforcement learning-based crowd evacuation simulation method.
Example 4
In this embodiment, a computer readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of the deep reinforcement learning based crowd evacuation simulation method.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.
Claims (10)
1. The crowd evacuation simulation method based on deep reinforcement learning is characterized by comprising the following steps:
initializing the constructed evacuation scene simulation model according to the scene information and the crowd parameter information;
grouping the crowds, and dividing a leader and a follower of each group;
and obtaining the evacuation path of the crowd by adopting a hierarchical path planning method, wherein the leader in the upper-layer group carries out global path planning through an E-MADDPG algorithm to obtain an optimal evacuation path, and the follower in the lower-layer group carries out evacuation along the optimal evacuation path by avoiding obstacles and following the leader.
2. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein a database of real shopping-mall scenes is received, and the YOLO V3 method is used to obtain pedestrian motion stopping points from pedestrian video as the state space of the E-MADDPG algorithm.
3. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein variation parameters are added to the experience pool capacity and the number of sampling samples in the MADDPG algorithm to form an experience pool curve and a sampling sample curve of the E-MADDPG algorithm, and the state space of the E-MADDPG algorithm is dynamically variable by adjusting the experience pool size and the number of sampling samples through the variation parameters.
4. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein during network training of the E-MADDPG algorithm, samples with high value are selected for experience replay.
5. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein the group leader performs global path planning by using the E-MADDPG algorithm to obtain the optimal evacuation path, specifically:
acquiring all evacuation paths of the leader according to the exit position and the initial position of the leader;
calculating a reward value for each evacuation path;
and selecting the evacuation path with the maximum reward value as the optimal evacuation path.
6. The deep reinforcement learning-based crowd evacuation simulation method according to claim 5, wherein the exit selected by the leader is rewarded according to whether the leader reaches the exit and whether a collision occurs, so as to obtain the reward value of the evacuation path.
7. The deep reinforcement learning-based crowd evacuation simulation method according to claim 1, wherein the group followers avoid obstacles based on the RVO algorithm and follow the leader to evacuate along the optimal evacuation path, specifically comprising:
calculating all the speeds of the followers in collision and the optimal collision-free speed, wherein the direction of the optimal collision-free speed is the direction of the leader in the group moving along the optimal evacuation path;
acquiring the current position of a follower;
when the optimal collision-free speed of the follower is obtained, the position of the follower is updated.
8. Crowd evacuation simulation system of degree of depth reinforcement study based on experience pond is optimized, its characterized in that includes:
the initialization setting module is used for carrying out initialization setting on parameters in the evacuation scene simulation model according to the scene information and the crowd parameter information;
the intra-group leader selection module is used for grouping all the individuals; selecting a leader in the group;
and the evacuation simulation module acquires the evacuation path of the crowd by adopting a hierarchical path planning method, wherein the leader in the upper-layer group carries out global path planning through an E-MADDPG algorithm to acquire an optimal evacuation path, and the follower in the lower-layer group carries out evacuation along the optimal evacuation path by avoiding obstacles and following the leader.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010942444.2A CN112231967B (en) | 2020-09-09 | 2020-09-09 | Crowd evacuation simulation method and system based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112231967A true CN112231967A (en) | 2021-01-15 |
CN112231967B CN112231967B (en) | 2023-05-26 |
Family
ID=74117069
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010942444.2A Active CN112231967B (en) | 2020-09-09 | 2020-09-09 | Crowd evacuation simulation method and system based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231967B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113156979A (en) * | 2021-05-27 | 2021-07-23 | 浙江农林大学 | Forest guard patrol path planning method and device based on improved MADDPG algorithm |
CN113359859A (en) * | 2021-07-16 | 2021-09-07 | 广东电网有限责任公司 | Combined navigation obstacle avoidance method and system, terminal device and storage medium |
CN114518771A (en) * | 2022-02-23 | 2022-05-20 | 深圳大漠大智控技术有限公司 | Multi-unmanned aerial vehicle path planning method and device and related components |
KR20220141576A (en) * | 2021-04-13 | 2022-10-20 | 한기성 | Evacuation route simulation device using machine learning and learning method |
CN115454074A (en) * | 2022-09-16 | 2022-12-09 | 北京华电力拓能源科技有限公司 | Evacuation path planning method and device, computer equipment and storage medium |
CN116167145A (en) * | 2023-04-23 | 2023-05-26 | 中铁第四勘察设计院集团有限公司 | Method and system for constructing space three-dimensional safety evacuation system of under-road complex |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670270A (en) * | 2019-01-11 | 2019-04-23 | 山东师范大学 | Crowd evacuation emulation method and system based on the study of multiple agent deeply |
CN110491132A (en) * | 2019-07-11 | 2019-11-22 | 平安科技(深圳)有限公司 | Vehicle based on video frame picture analyzing, which is disobeyed, stops detection method and device |
CN111414681A (en) * | 2020-03-13 | 2020-07-14 | 山东师范大学 | In-building evacuation simulation method and system based on shared deep reinforcement learning |
Non-Patent Citations (2)
Title |
---|
许诺等: "稀疏奖励下基于MADDPG算法的多智能体协同", 《现代计算机》 * |
郑尚菲: "基于深度强化学习的路径规划方法及应用", 《中国优秀硕士学位论文全文数据库 社会科学Ⅰ辑》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20220141576A (en) * | 2021-04-13 | 2022-10-20 | 한기성 | Evacuation route simulation device using machine learning and learning method |
KR102521990B1 (en) | 2021-04-13 | 2023-04-14 | 한기성 | Evacuation route simulation device using machine learning and learning method |
CN113156979A (en) * | 2021-05-27 | 2021-07-23 | 浙江农林大学 | Forest guard patrol path planning method and device based on improved MADDPG algorithm |
CN113156979B (en) * | 2021-05-27 | 2022-09-06 | 浙江农林大学 | Forest guard patrol path planning method and device based on improved MADDPG algorithm |
CN113359859A (en) * | 2021-07-16 | 2021-09-07 | 广东电网有限责任公司 | Combined navigation obstacle avoidance method and system, terminal device and storage medium |
CN113359859B (en) * | 2021-07-16 | 2023-09-08 | 广东电网有限责任公司 | Combined navigation obstacle avoidance method, system, terminal equipment and storage medium |
CN114518771A (en) * | 2022-02-23 | 2022-05-20 | 深圳大漠大智控技术有限公司 | Multi-unmanned aerial vehicle path planning method and device and related components |
CN115454074A (en) * | 2022-09-16 | 2022-12-09 | 北京华电力拓能源科技有限公司 | Evacuation path planning method and device, computer equipment and storage medium |
CN116167145A (en) * | 2023-04-23 | 2023-05-26 | 中铁第四勘察设计院集团有限公司 | Method and system for constructing space three-dimensional safety evacuation system of under-road complex |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112231967A (en) | Crowd evacuation simulation method and system based on deep reinforcement learning | |
CN111142522B (en) | Method for controlling agent of hierarchical reinforcement learning | |
CN110276765B (en) | Image panorama segmentation method based on multitask learning deep neural network | |
CN106970615B (en) | A kind of real-time online paths planning method of deeply study | |
CN110874578B (en) | Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning | |
CN111766782B (en) | Strategy selection method based on Actor-Critic framework in deep reinforcement learning | |
JP2022516383A (en) | Autonomous vehicle planning | |
JP2020126619A (en) | Method and device for short-term path planning of autonomous driving through information fusion by using v2x communication and image processing | |
EP3772710A1 (en) | Artificial intelligence server | |
CN111274438B (en) | Language description guided video time sequence positioning method | |
CN111476771B (en) | Domain self-adaption method and system based on distance countermeasure generation network | |
CN112231968A (en) | Crowd evacuation simulation method and system based on deep reinforcement learning algorithm | |
WO2022007867A1 (en) | Method and device for constructing neural network | |
CN110795833A (en) | Crowd evacuation simulation method, system, medium and equipment based on cat swarm algorithm | |
Szep et al. | Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion. | |
CN113888638A (en) | Pedestrian trajectory prediction method based on attention mechanism and through graph neural network | |
Hoy et al. | Learning to predict pedestrian intention via variational tracking networks | |
WO2020099854A1 (en) | Image classification, generation and application of neural networks | |
CN109508686A (en) | A kind of Human bodys' response method based on the study of stratification proper subspace | |
CN112121419A (en) | Virtual object control method, device, electronic equipment and storage medium | |
CN117455553B (en) | Subway station passenger flow volume prediction method | |
JP2021197184A (en) | Device and method for training and testing classifier | |
CN114548497B (en) | Crowd motion path planning method and system for realizing scene self-adaption | |
CN112947466B (en) | Parallel planning method and equipment for automatic driving and storage medium | |
CN112330043B (en) | Evacuation path planning method and system combining Q-learning and multi-swarm algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||