CN114626175A - Multi-agent simulation method and platform adopting same - Google Patents
- Publication number
- CN114626175A CN114626175A CN202011442726.2A CN202011442726A CN114626175A CN 114626175 A CN114626175 A CN 114626175A CN 202011442726 A CN202011442726 A CN 202011442726A CN 114626175 A CN114626175 A CN 114626175A
- Authority
- CN
- China
- Prior art keywords
- algorithm
- agent
- tested
- unit
- intelligent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
Abstract
The invention relates to the technical field of agent simulation, in particular to simulation for verifying multi-agent reinforcement learning algorithms, and provides a multi-agent simulation method and a platform adopting the method. The method limits the search space to a finite 10 × 10 grid, which ensures the effectiveness and timeliness of the algorithm under test and reduces the waiting time of the training process. Each square simultaneously corresponds to one pixel and one unit of physical length, which makes pixel-based and non-pixel-based observation states compatible during training and allows the user to test, adjust, and optimize the algorithm conveniently. A simulation platform adopting the method achieves the same technical effects.
Description
Technical Field
The invention belongs to the technical field of agent simulation, is suitable for verifying multi-agent reinforcement learning algorithms through simulation, and particularly relates to a multi-agent simulation method and a platform adopting the method.
Background
A multi-agent simulation system consists of a group of agents that share, sense, and interact with a common environment; each agent interacts with the environment independently, taking actions according to its individual goals and thereby influencing the environment. Real-world examples of such systems include traffic congestion control, resource scheduling and management, and base-station communication.
Existing multi-agent simulation platforms are mainly built around game-interaction scenes, with state spaces based on images. Although pixel information provides a partial observation of the environment, its dimensionality is large and it spans many channels. Even after image preprocessing such as cropping, downscaling, or channel reduction, the state dimension remains large, the hardware requirements are high, and verifying and testing an algorithm takes a long time. Moreover, the state information provided by each existing multi-agent simulation environment takes a single form: game-based interactive environments typically observe the pixels of one image frame, while motion-based environments typically observe physical quantities such as position, velocity, and relative distance. An algorithm must design different network input dimensions according to the form of state information the environment provides, and this adjustment process is complicated and error-prone.
Disclosure of Invention
The invention aims to provide a multi-agent simulation method compatible with both pixel and non-pixel observation, and a platform adopting the method, so as to solve the technical problems that large input dimensions make adjusting an algorithm under test complicated, and that switching between pixel and non-pixel observation is error-prone.
In one aspect, the present invention provides a multi-agent simulation method, comprising the steps of:
s1, constructing a search space with the length of 10 grids and the width of 10 grids;
s2, setting a preset number of agents and obstacles in the search space;
s3, the intelligent agent uploads observation information formed by the search space, the obstacles and other intelligent agents to an algorithm to be tested;
s4, the algorithm to be tested guides the intelligent agent to move gradually according to task requirements and observation information; the scoring system scores each movement of the intelligent agent according to the task requirement of the algorithm to be tested until the intelligent agent completes the task;
s5, feeding back the total task score to the algorithm to be tested; the intelligent agent executes a plurality of tasks, and after the algorithm to be tested is continuously tested and debugged, an optimal strategy is obtained;
in the search space, one square corresponds to one pixel and to one unit of physical length; each agent occupies one square, and each unit time step of its movement covers one square; each obstacle occupies one square, and an agent cannot enter the square in which an obstacle is located.
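The grid semantics above (one square serving as one pixel and one unit of physical length, one occupant per square) can be sketched as follows. This is a hedched illustration under assumed names, not the patent's implementation:

```python
import numpy as np

# A minimal sketch of steps S1-S2: a 10 x 10 search space in which each
# square is simultaneously one pixel and one unit of physical length.
# Class and method names are illustrative, not taken from the patent.
class GridWorld:
    SIZE = 10  # the search space is 10 squares long and 10 squares wide

    def __init__(self, agents, obstacles):
        self.agents = dict(agents)       # name -> (row, col), one square each
        self.obstacles = set(obstacles)  # squares no agent may enter

    def pixel_map(self):
        # Pixel view of the same grid: 0 = passable, 1 = obstacle, 2 = agent.
        grid = np.zeros((self.SIZE, self.SIZE), dtype=np.uint8)
        for r, c in self.obstacles:
            grid[r, c] = 1
        for r, c in self.agents.values():
            grid[r, c] = 2
        return grid

env = GridWorld(agents={"a0": (0, 0)}, obstacles=[(5, 5)])
print(env.pixel_map().shape)  # (10, 10)
```

The same `(row, col)` coordinates can be read either as pixel indices or as physical positions in unit lengths, which is the dual attribute the method relies on.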
In another aspect, the present invention further provides a multi-agent simulation platform, comprising:
the interactive environment unit is used for constructing a search space with a length of 10 grids and a width of 10 grids, and for setting a preset number of agents and obstacles in the search space;
the scene unit is electrically connected with the interactive environment unit and is used for providing the number and positions of the agents and obstacles in the search space when a preset scene is used;
the algorithm unit is electrically connected with the interactive environment unit and used for loading the algorithm to be tested and receiving the observation information fed back to the algorithm to be tested by the intelligent agent; outputting decision information of the algorithm to be tested on the motion direction of the intelligent agent;
the scoring unit is respectively electrically connected with the algorithm unit and the interactive environment unit and is used for scoring each movement of the intelligent agent according to the task requirement of the algorithm to be tested until the intelligent agent completes the task; feeding back the total task score to the algorithm to be tested; and the intelligent agent executes a plurality of tasks, and the optimal strategy is obtained after the algorithm to be tested is continuously tested and debugged.
In the search space, one square corresponds to one pixel and to one unit of physical length; each agent occupies one square, and each unit time step of its movement covers one square; each obstacle occupies one square, and an agent cannot enter the square in which an obstacle is located.
In the multi-agent simulation method of the invention and the platform adopting it, the minimum unit corresponds to both one pixel and one unit of physical length, so each grid square can serve either as a pixel (with different pixel values for target points, passable areas, and obstacles) or as a minimum unit of distance (describing the agent's position and its distances to obstacles and to target points). The grid square thus carries the dual attributes of pixel and unit length. On one simulation platform, different forms of environment state information can be provided to the user; that is, pixel and non-pixel observation are compatible, which makes algorithm testing convenient.
Drawings
FIG. 1 is a flow chart of an implementation of a multi-agent simulation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a preset simulation scenario provided by the present invention;
FIG. 3 is a schematic diagram of pixel-based observed states of a multi-agent in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a layout of a user-defined scene in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a scene with an increased number of agents in an embodiment of the invention;
FIG. 6 is a functional block diagram of a multi-agent simulation platform provided in the second embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to fig. 1 to 6 and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
Embodiment one:
FIG. 1 shows a multi-agent simulation method according to an embodiment of the present invention, which includes the following steps:
s1, constructing a search space with the length of 10 grids and the width of 10 grids;
this step limits the search space to a finite region, which reduces the pixel dimensionality and the waiting time of the algorithm training process.
S2, setting a preset number of agents and obstacles in the search space;
S3, the agents upload observation information formed by the search space, the obstacles, and the other agents to the algorithm to be tested;
S4, the algorithm to be tested guides the agents to move step by step according to the task requirements and the observation information; the scoring system scores each movement of an agent according to the task requirements of the algorithm to be tested until the agent completes the task;
S5, feeding back the total task score to the algorithm to be tested; the agents execute a plurality of tasks, and after the algorithm to be tested is continuously tested and debugged, an optimal strategy is obtained;
as shown in fig. 2-5, in the search space, one square corresponds to one pixel point and the unit length of one physical quantity; an agent occupies a square grid, and the unit time step of the movement of the agent is a square grid; an obstacle occupies a square grid and the agent cannot enter the square grid in which the obstacle is located.
In actual operation, the self-optimization and trial-and-error schemes of the algorithm under test are set by the algorithm itself; the total score of each task, together with each agent's movements and the record of score rewards and penalties, is provided to the algorithm under test for reference.
Preferably, the multi-agent simulation method is not limited to reinforcement learning algorithms and can also be applied to other algorithms capable of self-learning.
Preferably, in step S4, each movement of the agent is targeted to one of the upper, lower, left and right squares of its current square.
Preferably, in step S4, the scoring comprises: a score reward or a score penalty.
Preferably, in step S3, the observation range of the agent is a 3 × 3 square area centered on the square where the agent is located.
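Extracting such a 3 × 3 window can be sketched as below. How out-of-bounds squares are encoded is not specified in the text; padding them like obstacles is an assumption made here:

```python
import numpy as np

# Sketch of the 3 x 3 observation window centered on the agent's square.
# Out-of-bounds squares are padded with 1 (treated like obstacles) as an
# illustrative assumption; `depth` generalizes the window radius.
def observe(grid, pos, depth=1):
    padded = np.pad(grid, depth, constant_values=1)
    r, c = pos[0] + depth, pos[1] + depth
    return padded[r - depth:r + depth + 1, c - depth:c + depth + 1]

grid = np.zeros((10, 10), dtype=np.uint8)
grid[5, 5] = 1  # one obstacle
print(observe(grid, (4, 5)).shape)  # (3, 3)
```

An agent standing next to the obstacle sees it inside its window, while an agent on the boundary sees the padded border squares.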
Preferably, in step S2, the number and positions of the agents and the obstacles are determined according to the requirements of the algorithm to be tested or the preset scenario.
This step provides a user-defined interface through which the user can modify the scene to a certain extent according to his or her own needs, such as changing the number of agents or the distribution of obstacles.
In the embodiment of the invention, limiting the search space to a finite 10 × 10 region ensures the effectiveness and timeliness of the algorithm under test and reduces the waiting time of the training process. Each square simultaneously corresponds to one pixel and one unit of physical length, which makes pixel and non-pixel observation compatible during training and facilitates user testing.
Embodiment two:
fig. 6 shows a multi-agent simulation platform using the simulation method according to the second embodiment of the present invention, which includes:
the interactive environment unit is used for constructing a search space with the length of 10 grids and the width of 10 grids; and sets a predetermined number of agents and obstacles in the search space.
Specifically, the interactive environment unit is responsible for passing the agents' observations of the environment to the algorithm and for expressing the algorithm's decided behavior on the visual interface. An agent observes environmental information from the search-space scene and passes the observation state to the interaction module. The interaction module uploads the observation state to the algorithm; the algorithm predicts a behavior through its trained model; the agent executes the behavior in the interaction module, obtains a new observation from the scene, and receives information such as the reward for the behavior and whether training has ended.
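The loop just described (observe, predict, execute, reward, check for the end) can be sketched as follows; `ToyEnv` and the fixed policy are illustrative stand-ins, not the patent's actual interface:

```python
# A minimal sketch of the interaction loop: the observation goes to the
# algorithm, the algorithm predicts an action, the environment executes
# it and returns a new observation, a reward, and a done flag.
def run_episode(env, policy, max_steps=50):
    total = 0.0
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)                  # algorithm predicts the behavior
        obs, reward, done = env.step(action)  # agent executes it in the scene
        total += reward                       # accumulate the task total score
        if done:
            break
    return total

class ToyEnv:
    # One-dimensional stand-in: reach position 3 by moving "right".
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        if action == "right":
            self.pos += 1
        done = self.pos >= 3
        reward = 1.0 if done else -0.1
        return self.pos, reward, done

score = run_episode(ToyEnv(), lambda obs: "right")
print(round(score, 2))  # 0.8: two -0.1 step penalties plus the 1.0 completion reward
```

The total returned at the end plays the role of the task total score that the scoring unit feeds back to the algorithm under test.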
The scene unit is electrically connected with the interactive environment unit and is used for providing the number and positions of the agents and obstacles in the search space when a preset scene is used.
As shown in fig. 2, three common multi-agent experimental scenes, "pursuit-escape", "multi-target navigation", and "location exchange", are preset in the scene unit of the simulation platform, and a user-defined interface is provided. A solid polygon represents an agent, and a hollow polygon represents that agent's target. Agents with different tasks are represented by solid polygons of different shapes, such as the solid circle and the solid diamond in the figure; their movement targets are the corresponding hollow circle and hollow diamond.
In the scene unit, the scene is first initialized: the number of agents and attributes such as color are determined, along with the positions of the start and end points in the scene and the number and positions of the obstacles. Then the agents' observation information is determined; in different scenes, image-based and non-image-based observation information differ. Image-based observation information consists of the pixels of the squares around the agent, while non-image-based observation information consists of physical quantities reflecting environmental information (e.g., position, relative distance between agents, relative distance between an agent and its target). Finally, whether the training process has ended is judged: when all agents in the scene have completed their tasks, the training process ends.
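The final judgment in that sequence, that training ends once every agent has finished, can be sketched as below; the function name and the "reached its target square" completion criterion are assumptions:

```python
# Sketch of the end-of-training check the scene unit performs: the
# training process ends when every agent in the scene has completed its
# task, here modeled as having reached its target square.
def all_done(agent_pos, targets):
    return all(agent_pos[name] == targets[name] for name in targets)

targets = {"a0": (9, 9), "a1": (0, 9)}
print(all_done({"a0": (9, 9), "a1": (0, 9)}, targets))  # True
print(all_done({"a0": (9, 9), "a1": (0, 8)}, targets))  # False
```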
The algorithm unit is electrically connected with the interactive environment unit and is used for loading the algorithm to be tested and receiving the observation information fed back to the algorithm to be tested by the intelligent agent; and outputting the decision information of the algorithm to be tested on the motion direction of the intelligent agent.
Specifically, the user adds the algorithm under test in the algorithm unit, obtains the scene observation state fed back by the agents through the interface of the interactive environment unit, predicts a behavior, and returns it to the agents to guide their movement.
The scoring unit is electrically connected with the algorithm unit and the interactive environment unit, respectively, and is used for scoring each movement of an agent according to the task requirements of the algorithm under test until the agent completes the task. After each task ends, the total task score is fed back to the algorithm under test; as the agents execute a plurality of tasks, the algorithm under test is continuously tested and debugged to obtain an optimal strategy.
Specifically, the scoring unit designs the agents' instant reward, i.e., a numerical reward for each behavior of an agent. Whether the reward design reflects the quality of the agent's behavior determines the effect of the optimization strategy.
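An instant-reward design of this kind might be sketched as follows; the numeric values and function name are assumptions, not taken from the patent:

```python
# Sketch of an instant reward: a small per-step penalty, a larger
# penalty for a blocked move, and a completion bonus. A well-shaped
# instant reward of this form reflects the quality of each behavior.
def instant_reward(moved, reached_target, step_cost=-0.05,
                   blocked_penalty=-0.2, goal_bonus=1.0):
    if reached_target:
        return goal_bonus
    if not moved:  # bumped into an obstacle or the boundary
        return blocked_penalty
    return step_cost

print(instant_reward(moved=True, reached_target=False))   # -0.05
print(instant_reward(moved=False, reached_target=False))  # -0.2
print(instant_reward(moved=True, reached_target=True))    # 1.0
```

The step cost pushes agents toward short paths, while the blocked-move penalty discourages bumping into obstacles; both are design choices the scoring unit is free to tune.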
In the search space, one grid square corresponds to one pixel and to one unit of physical length; by giving each grid square the dual attributes of pixel and unit length, different forms of environment state information can be provided to the user on one simulation platform, facilitating user testing.
An agent occupies a square grid, and the unit time step of the movement of the agent is a square grid; an obstacle occupies a square grid and the agent cannot enter the square grid in which the obstacle is located.
Preferably, in the interactive environment unit, each moving target of the agent is one of the upper, lower, left and right squares of the current square.
Preferably, the scoring unit's score for each movement result of an agent comprises: a score reward or a score penalty.
Preferably, in the interactive environment unit, the observation range of the agent is 3 × 3 grid areas centered on the grid in which the agent is located.
In the observation state based on image pixels, as shown in fig. 3, the solid circle and the solid diamond are agents with different tasks. The hollow diamond is the target of the agent represented by the solid diamond, and the hollow square is the target of the agent represented by the solid circle. The 3 × 3 hollow boxes around an agent represent its observation range: with the agent as the center, the observation range is determined by the observation depth set by the user, and the pixels in each square (three RGB channels) are taken as the agent's observation state. Fig. 3 shows a 3 × 3 observation state. Observation states based on image pixels are commonly used on game-related multi-agent reinforcement learning simulation platforms.
Physical quantities commonly used as the observation state include the agent's position and the relative position between the agent and its target point. For different scenes, the algorithm under test may select different physical quantities to represent the observation state.
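A non-pixel observation built from such physical quantities might look like the sketch below; the exact choice of fields is an assumption consistent with the quantities listed above:

```python
import math

# Sketch of a non-pixel observation: the agent's own position plus the
# relative offsets to its target and to the nearest other agent, all in
# grid units (one square = one unit length).
def physical_observation(pos, target, others):
    rel_target = (target[0] - pos[0], target[1] - pos[1])
    nearest = min(others, key=lambda o: math.dist(pos, o)) if others else pos
    rel_nearest = (nearest[0] - pos[0], nearest[1] - pos[1])
    return [*pos, *rel_target, *rel_nearest]

print(physical_observation((2, 3), (9, 9), [(5, 3), (0, 0)]))
# [2, 3, 7, 6, 3, 0]
```

Because the same grid squares back both views, this vector and the pixel window describe one underlying state, which is what makes the two observation forms interchangeable for the algorithm under test.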
Preferably, the configuration of the number and positions of the agents and the obstacles by the scene unit is determined according to the requirements of the algorithm to be tested or a preset scene.
Specifically, the user may change the scene settings of the search space according to the algorithm's test requirements, mainly in two ways: (1) changing the distribution of obstacles in the scene; (2) changing the number of agents in the scene. Changing the obstacle distribution increases the difficulty of the test. As shown in fig. 4, the solid circle and the solid diamond represent agents with different tasks; the hollow diamond is the target of the solid circle, and the hollow square is the target of the solid diamond. There is an area that the agents must pass through, and it is the same area in which they need to exchange positions. To complete the task, the agents must learn collaborative navigation: one agent passes through the area first, while the other waits and passes through once the area becomes passable. Changing the number of agents in the scene tests the performance of the algorithm. As shown in fig. 5, the solid hexagon, solid triangle, solid circle, and solid diamond represent agents with different tasks; the hollow diamond is the target of the solid circle, the hollow square is the target of the solid diamond, the hollow hexagon is the target of the solid hexagon, and the hollow triangle is the target of the solid triangle. The number of agents that need to exchange positions increases from two to four, so the algorithm must coordinate the strategies of four agents and complete the navigation task in a narrower space. The global observation state and the joint action space become more complex, the task becomes harder to complete, and the performance requirements on the algorithm are higher. This customizability allows the simulation platform to meet more user requirements.
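A user-defined scene generator along these lines can be sketched as below; the scene dictionary layout, function name, and wall-with-a-gap construction are illustrative assumptions, not the patent's actual interface:

```python
# Sketch of the customization interface: the user varies the number of
# agents and the obstacle distribution to change the test difficulty.
def make_scene(n_agents, gap_row=None, size=10):
    # Agents start on the left edge; each target mirrors it on the right edge.
    agents = {f"a{i}": (i, 0) for i in range(n_agents)}
    targets = {f"a{i}": (i, size - 1) for i in range(n_agents)}
    obstacles = set()
    if gap_row is not None:
        # A wall down the middle column with a single gap forces the
        # agents to coordinate: only one can pass through at a time.
        obstacles = {(r, size // 2) for r in range(size) if r != gap_row}
    return {"agents": agents, "targets": targets, "obstacles": obstacles}

scene = make_scene(4, gap_row=5)
print(len(scene["agents"]), len(scene["obstacles"]))  # 4 9
```

Raising `n_agents` or narrowing the gap reproduces the two difficulty knobs described above: more agents to coordinate, and a tighter shared passage.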
Embodiment three:
the following table shows the behavior states of the agents and the associated rewards applied in the three preset scenes in the search space.
The multi-agent simulation platform provided by the embodiment of the invention limits the search space to a fixed region (10 × 10 squares), reducing the waiting time of the algorithm training process. The pixel and unit-length attributes are fused into the minimum unit square of the simulation environment, overcoming the drawback that existing methods support only one form of state information (image pixels or physical quantities). A user-defined interface is also provided, so the user can modify the scene to a certain extent to meet the test requirements of the algorithm.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (10)
1. A multi-agent simulation method, characterized in that the method comprises the steps of:
s1, constructing a search space with the length of 10 grids and the width of 10 grids;
s2, setting a preset number of agents and obstacles in the search space;
s3, the intelligent agent uploads observation information formed by the search space, the obstacles and other intelligent agents to an algorithm to be tested;
s4, the algorithm to be tested guides the intelligent agent to move gradually according to task requirements and observation information; the scoring system scores each movement of the intelligent agent according to the task requirement of the algorithm to be tested until the intelligent agent completes the task;
s5, feeding back the total task score to the algorithm to be tested; the intelligent agent executes a plurality of tasks, and after the algorithm to be tested is continuously tested and debugged, an optimal strategy is obtained;
in the search space, one square corresponds to one pixel and to one unit of physical length; each agent occupies one square, and each unit time step of its movement covers one square; each obstacle occupies one square, and an agent cannot enter the square in which an obstacle is located.
2. The method of claim 1, wherein in the step S4, each moving target of the agent is one of the upper, lower, left and right squares of its current square.
3. The method of claim 1, wherein in step S4, the scoring comprises: a score reward or a score penalty.
4. The method according to claim 1, wherein in step S3, the observation range of the agent is 3 × 3 square areas centered on the square where the agent is located.
5. The method according to claim 1, wherein in the step S2, the number and positions of the agents and the obstacles are determined according to the requirements of the algorithm to be tested or a preset scenario.
6. A multi-agent simulation platform, the platform comprising:
the interactive environment unit is used for constructing a search space with a length of 10 grids and a width of 10 grids, and for setting a preset number of agents and obstacles in the search space;
the scene unit is electrically connected with the interactive environment unit and is used for providing the number and positions of the agents and obstacles in the search space when a preset scene is used;
the algorithm unit is electrically connected with the interactive environment unit and is used for loading the algorithm to be tested and receiving the observation information fed back to the algorithm to be tested by the intelligent agent; outputting decision information of the algorithm to be tested on the motion direction of the intelligent agent;
the scoring unit is respectively electrically connected with the algorithm unit and the interactive environment unit and is used for scoring each movement of the intelligent agent according to the task requirement of the algorithm to be tested until the intelligent agent completes the task; feeding back the total task score to the algorithm to be tested; the intelligent agent executes a plurality of tasks, and after the algorithm to be tested is continuously tested and debugged, an optimal strategy is obtained;
in the search space, one square corresponds to one pixel and to one unit of physical length; each agent occupies one square, and each unit time step of its movement covers one square; each obstacle occupies one square, and an agent cannot enter the square in which an obstacle is located.
7. The platform of claim 6, wherein, in the interactive environment unit, each movement target of the agent is one of the squares above, below, to the left of, and to the right of its current square.
8. The platform of claim 6, wherein the scoring unit's score for an agent's movement result comprises: a score reward or a score penalty.
9. The platform of claim 6, wherein the observation scope of the agent in the interactive environment unit is 3x3 square areas centered on the square in which the agent is located.
10. The platform of claim 6, wherein the configuration of the number and location of agents and obstacles by the context unit is determined according to the requirements of the algorithm under test or a preset context.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011442726.2A CN114626175A (en) | 2020-12-11 | 2020-12-11 | Multi-agent simulation method and platform adopting same |
PCT/CN2020/138782 WO2022120955A1 (en) | 2020-12-11 | 2020-12-24 | Multi-agent simulation method and platform using method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114626175A true CN114626175A (en) | 2022-06-14 |
Family
ID=81895881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011442726.2A Pending CN114626175A (en) | 2020-12-11 | 2020-12-11 | Multi-agent simulation method and platform adopting same |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114626175A (en) |
WO (1) | WO2022120955A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116912356B (en) * | 2023-09-13 | 2024-01-09 | 深圳大学 | Hexagonal set visualization method and related device |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090299929A1 (en) * | 2008-05-30 | 2009-12-03 | Robert Kozma | Methods of improved learning in simultaneous recurrent neural networks |
US10977551B2 (en) * | 2016-12-14 | 2021-04-13 | Microsoft Technology Licensing, Llc | Hybrid reward architecture for reinforcement learning |
CN108563112A (en) * | 2018-03-30 | 2018-09-21 | 南京邮电大学 | Control method for emulating Soccer robot ball-handling |
CN110135644B (en) * | 2019-05-17 | 2020-04-17 | 北京洛必德科技有限公司 | Robot path planning method for target search |
CN110977967A (en) * | 2019-11-29 | 2020-04-10 | 天津博诺智创机器人技术有限公司 | Robot path planning method based on deep reinforcement learning |
CN111612126A (en) * | 2020-04-18 | 2020-09-01 | 华为技术有限公司 | Method and device for reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
WO2022120955A1 (en) | 2022-06-16 |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||