WO2022120955A1 - Multi-agent simulation method and platform using method - Google Patents

Multi-agent simulation method and platform using method

Info

Publication number
WO2022120955A1
Authority
WO
WIPO (PCT)
Prior art keywords
agent
algorithm
squares
tested
unit
Prior art date
Application number
PCT/CN2020/138782
Other languages
French (fr)
Chinese (zh)
Inventor
刘延东
韩东
王鲁佳
须成忠
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Publication of WO2022120955A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation

Definitions

  • the invention belongs to the technical field of agent simulation and is suitable for verifying the simulation of multi-agent reinforcement learning algorithms, in particular to a multi-agent simulation method and a platform using the method.
  • a multi-agent simulation system is composed of a group of agents that share the environment, perceive the environment, and interact with the environment. Each agent interacts with the environment independently, takes actions according to individual goals, and affects the environment.
  • multi-agent simulation systems such as traffic congestion control, resource scheduling management, base station communication transmission, etc.
  • the existing multi-agent simulation platforms are mainly based on game interaction scenes, and the state space is based on images.
  • although pixel information provides a partial observation of the environment, its dimensionality and channel count are large. Even after image preprocessing such as cropping, downscaling or changing the number of channels, the state dimension remains large, which places high demands on computer hardware and makes verifying and testing an algorithm time-consuming.
  • at the same time, the state information of a multi-agent simulation environment is of a single kind. For example, a game-based interactive environment observes the pixels of one image frame, whereas an environment based on object motion states generally observes information such as the position, velocity and relative distance of the objects.
  • the algorithm therefore needs different network input dimensions depending on the form of state information the environment provides, and this adjustment process is cumbersome and error-prone.
  • the purpose of the present invention is to provide a multi-agent simulation method compatible with both pixel and non-pixel observations, and a platform using the method, aiming to solve the technical problems that adjusting the algorithm to be tested is complicated by its many input dimensions and that switching between pixel and non-pixel observations is error-prone.
  • the present invention provides a multi-agent simulation method, the method includes the following steps:
  • the agents upload the observation information formed by the search space, the obstacles and the other agents to the algorithm to be tested;
  • the algorithm to be tested guides the agents to move, step by step, according to the task requirements and the observation information; the scoring system scores each movement of an agent according to the task requirements of the algorithm to be tested, until the agent completes the task;
  • one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and the agent cannot enter the square where an obstacle is located.
  • the present invention also provides a multi-agent simulation platform, comprising:
  • An interactive environment unit constructing a search space with a length of 10 squares and a width of 10 squares; and setting a predetermined number of agents and obstacles in the search space;
  • a scene unit electrically connected to the interactive environment unit, for providing the number and position of the agent and the obstacle in the search space in a preset scene;
  • an algorithm unit electrically connected to the interactive environment unit, for loading the algorithm to be tested, receiving the observation information that the agents feed back to that algorithm, and outputting the algorithm's decision information on the agents' movement directions;
  • a scoring unit which is electrically connected to the algorithm unit and the interactive environment unit respectively, and is used to score each movement of the agent according to the task requirements of the algorithm to be tested, until the agent completes the task; the total task score is fed back to the algorithm to be tested; the agents execute the task many times, and the algorithm to be tested obtains the optimal strategy after continuous trial and error.
  • one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and the agent cannot enter the square where an obstacle is located.
  • a square corresponds to both a pixel and a unit length of a physical quantity, so that a square can be used as a pixel (target points, passable areas and obstacles have different pixel values) and also as the unit minimum distance (describing the position of an agent, the distance between an agent and an obstacle, and the distance between an agent and a target point).
  • Fig. 1 is the realization flow chart of the multi-agent simulation method provided by the first embodiment of the present invention
  • FIG. 2 is a schematic diagram of a preset simulation scene provided by the present invention.
  • FIG. 3 is a schematic diagram of a pixel-based observation state of a multi-agent in an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a user-defined scene layout in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the number of user-defined scene agents in an embodiment of the present invention.
  • FIG. 6 is a functional block diagram of a multi-agent simulation platform provided by Embodiment 2 of the present application.
  • FIG. 1 shows a multi-agent simulation method provided by Embodiment 1 of the present invention, and the method includes the following steps:
  • the search space is limited to a certain space, which reduces the dimension of the pixel and reduces the waiting time of the algorithm training process.
  • the agent uploads the observation information composed of the search space, obstacles and other agents to the algorithm to be tested;
  • the algorithm to be tested guides the agent to move one by one according to the task requirements and observation information; the scoring system scores each movement of the agent according to the task requirements of the algorithm to be tested, until the agent completes the task;
  • a square corresponds to one pixel and to one unit length of a physical quantity; an agent occupies a square, and the unit time step of its movement is one square; an obstacle occupies a square, and the agent cannot enter the square where the obstacle is located.
  • the self-tuning and trial-and-error scheme of the algorithm to be tested is set by the algorithm itself.
  • This application provides, for reference by the algorithm to be tested, the total score of each task, the movement pattern of each agent, and the record of individual rewards or penalties.
  • the multi-agent simulation method of the present application is not limited to reinforcement learning algorithms, but can also be applied to other algorithms that can realize self-learning.
  • each movement target of the agent is one of the upper, lower, left and right squares of the current square.
  • the scoring includes: a reward for adding points or a penalty for deducting points.
  • the observation range of the agent is the area of 3*3 squares centered on the square where it is located.
  • step S2 the number and positions of the agents and obstacles are determined according to the requirements of the algorithm to be tested or the preset scene.
  • This step provides a user-defined interface, and users can modify the scene to a certain extent according to their own needs. Such as changing the number of agents or the distribution of obstacles.
  • setting the search space to a limited space of 10*10 can ensure the validity and timeliness of the algorithm to be tested and reduce the waiting time of the algorithm training process.
  • Each square corresponds to a pixel point and a unit length of a physical quantity at the same time, which helps to be compatible with pixel observations and non-pixel observations in the training process and facilitates user testing.
  • Embodiment 2:
  • FIG. 6 shows a multi-agent simulation platform using the above simulation method provided by Embodiment 2 of the present invention, including:
  • the interactive environment unit constructs a search space with a length of 10 squares and a width of 10 squares; and sets a predetermined number of agents and obstacles in the search space.
  • the environment interaction unit is responsible for transferring the observation of the agent from the environment to the algorithm, and displaying the behavior of the algorithm decision on the visual interface.
  • the agent observes the environment information from the scene in the search space and transmits the observed state to the interaction module.
  • the interaction module uploads the observation state to the algorithm, the algorithm predicts the behavior through the trained model, the agent executes the behavior in the interaction module, and obtains new observations from the scene, the reward for the behavior, and whether the training is completed.
  • the scene unit is electrically connected to the interactive environment unit, and is used to provide the number and position of the agent and the obstacle in the search space in the preset scene.
  • the scene unit of the simulation platform presets three experimental scenes commonly used for multiple agents, namely "pursuit-escape", "multi-target navigation" and "exchange positions", and provides a user-defined interface.
  • the solid polygon represents the agent
  • the hollow polygon represents the target of the agent.
  • Agents with different tasks are represented by solid polygons with different shapes, such as solid circles and solid diamonds in the figure. Its moving targets are corresponding hollow circles and hollow diamonds.
  • the scene is initialized first, and the properties such as the number and color of the agents are determined.
  • the location of the start and end points in the scene. Determine the number and location of obstacles.
  • the observation information of the agent is determined.
  • the observation information based on images and non-images is different.
  • the image-based observation information is the pixels of the grid around the agent, and the non-image-based observation information is the physical quantity that reflects the environmental information (such as position, relative distance between agents, and relative distance between the agent and the target).
  • the algorithm unit is electrically connected with the interactive environment unit, and is used for loading the algorithm to be tested, receiving observation information fed back to the algorithm to be tested by the agent, and outputting the decision information of the algorithm to be tested on the moving direction of the agent.
  • the specific user adds the algorithm to be tested in the algorithm unit, obtains the observation state of the scene fed back by the agent through the interface of the environment interaction unit, predicts the behavior and returns the behavior to the agent, and guides the agent to move.
  • the scoring unit is electrically connected with the algorithm unit and the interactive environment unit respectively, and is used for scoring each movement of the agent according to the task requirements of the algorithm to be tested, until the agent completes the task. After each task is completed, the total score of the task is fed back to the algorithm to be tested; when the agent performs multiple tasks, the algorithm to be tested will get the optimal strategy after continuous trial and error.
  • the scoring unit should design the instant reward of the agent, that is, the numerical reward for each behavior of the agent. Whether the reward design can reflect the behavior of the agent determines the effect of the optimization strategy.
  • a square corresponds to a pixel and a unit length of a physical quantity; by using a small square as a dual attribute of a pixel and a unit length, on a simulation platform, it can easily provide users with different environmental states information to facilitate user testing.
  • An agent occupies a square, and the unit time step of its movement is one square; an obstacle occupies a square, and the agent cannot enter the square where the obstacle is located.
  • each movement target of the agent is one of the upper, lower, left and right squares of the current square.
  • the scoring of the intelligent body movement result by the scoring unit includes: a reward for adding points or a penalty for deducting points.
  • the observation range of the agent is an area of 3*3 squares centered on the square where it is located.
  • the solid circle and the solid diamond are agents with different tasks.
  • the hollow diamond is the target of the agent represented by the solid diamond
  • the hollow circle is the target of the agent represented by the solid circle.
  • the 3*3 hollow boxes around the agent represent the observation range of the agent. Taking the agent as the center, the observation range is determined according to the observation depth set by the user. Take the pixels in each grid (RGB three-channel) as the observed state of the agent.
  • Figure 3 shows the observed state of 3x3.
  • Image pixel-based observation states are often used in game-related multi-agent reinforcement learning simulation platforms.
  • the physical quantities commonly used for the observation state include the position of the agent, the relative position between the agent and the agent, and the relative position between the agent and the target point, etc.
  • the algorithm to be tested can choose different physical quantities to represent the observed state.
  • the configuration of the number and positions of the agents and obstacles by the scene unit is determined according to the requirements of the algorithm to be tested or a preset scene.
  • the user can change the scene setting of the search space according to the test requirements of the algorithm. It is mainly divided into two aspects: (1) changing the distribution of obstacles in the scene; (2) changing the number of agents in the scene.
  • the difficulty of the test can be increased by changing the distribution of obstacles in the scene.
  • the solid circles and solid diamonds shown in Figure 4 represent agents for different tasks, respectively.
  • the hollow rhombus is the target of the solid circle, and the hollow square is the target of the solid rhombus.
  • Agents that need to swap positions must each pass through one and the same stretch of the route. To complete the task, the agents must learn to navigate cooperatively: one agent passes through the area first, and the other waits and passes once the area becomes traversable.
  • solid hexagons, solid triangles, solid circles, and solid diamonds represent agents for different tasks, respectively.
  • the hollow rhombus is the target of the solid circle
  • the hollow square is the target of the solid rhombus
  • the hollow hexagon is the target of the solid hexagon
  • the hollow triangle is the target of the solid triangle.
  • the number of agents that need to swap positions has increased from two to four.
  • the algorithm needs to coordinate the strategies of the four agents to complete the navigation task in a narrower space.
  • the global observation state and joint action space become complex, the task completion becomes more difficult, and the performance requirements of the algorithm are higher.
  • the customizability of the simulation platform can meet more needs of users.
  • the following table shows the agent behavior states and related rewards of the three preset scenarios applied in the search space.
  • the multi-agent simulation platform provided by the embodiment of the present invention is limited to a certain space (10*10 squares), which reduces the waiting time of the algorithm training process.
  • the pixel point and unit length attributes are integrated into the smallest unit of the simulation environment, one square, which solves the insufficiency that the existing methods can only be based on one kind of state information (image pixel or physical quantity).
  • a user-defined interface is also provided, and users can modify the scene to a certain extent according to their own needs to meet the testing needs of the algorithm.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention belongs to the technical field of agent simulation, in particular to simulation technologies for verifying multi-agent reinforcement learning algorithms, and relates to a multi-agent simulation method and a platform using the method. In the method, setting the search space to a limited 10*10 space ensures the validity and timeliness of the algorithm to be tested, thereby shortening the waiting time during algorithm training. Each square corresponds to one pixel and to one unit length of a physical quantity, so that pixel and non-pixel observation states are compatible during training and a user can conveniently test and fine-tune an algorithm. The simulation platform using the method has the same technical effect.

Description

Multi-agent simulation method and platform using the method

Technical Field

The invention belongs to the technical field of agent simulation, is suitable for simulation that verifies multi-agent reinforcement learning algorithms, and in particular relates to a multi-agent simulation method and a platform using the method.

Background Art

A multi-agent simulation system is composed of a group of agents that share an environment, perceive it and interact with it. Each agent interacts with the environment independently, takes actions according to its individual goals and affects the environment. In the real world there are many examples of multi-agent simulation systems, such as traffic congestion control, resource scheduling and management, and base station communication.

Existing multi-agent simulation platforms are mainly built around game interaction scenes, and their state space is image-based. Although pixel information provides a partial observation of the environment, its dimensionality and channel count are large. Even after image preprocessing such as cropping, downscaling or changing the number of channels, the state dimension remains large, which places high demands on computer hardware and makes verifying and testing an algorithm time-consuming. At the same time, the state information of a multi-agent simulation environment is of a single kind: a game-based interactive environment observes the pixels of one image frame, whereas an environment based on object motion states generally observes information such as the position, velocity and relative distance of the objects. The algorithm must therefore design different network input dimensions according to the form of state information the environment provides, and this adjustment process is cumbersome and error-prone.

Technical Problem

The purpose of the present invention is to provide a multi-agent simulation method compatible with both pixel and non-pixel observations, and a platform using the method, aiming to solve the technical problems that adjusting the algorithm to be tested is complicated by its many input dimensions and that switching between pixel and non-pixel observations is error-prone.

Technical Solution
In one aspect, the present invention provides a multi-agent simulation method, the method comprising the following steps:

S1. Construct a search space 10 squares long and 10 squares wide;

S2. Set a predetermined number of agents and obstacles in the search space;

S3. The agents upload the observation information formed by the search space, the obstacles and the other agents to the algorithm to be tested;

S4. The algorithm to be tested guides the agents to move, step by step, according to the task requirements and the observation information; a scoring system scores each movement of an agent according to the task requirements of the algorithm to be tested, until the agent completes the task;

S5. The total task score is fed back to the algorithm to be tested; the agents execute the task many times, and after continuous trial and error the algorithm to be tested obtains the optimal strategy.

In the search space, one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and an agent cannot enter the square where an obstacle is located.
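The following is a minimal sketch, in Python, of how steps S1-S5 and the grid rules above might be realized. It is an illustration only: the class name GridWorld, the action encoding and the numerical rewards are assumptions made for this example and are not part of the disclosed method.

```python
# Minimal sketch (not the patent's reference implementation) of a 10x10 grid world
# following steps S1-S5. Class name, action encoding and reward values are assumptions.
from dataclasses import dataclass, field

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

@dataclass
class GridWorld:
    size: int = 10                               # S1: 10x10 search space
    obstacles: set = field(default_factory=set)  # S2: squares an agent cannot enter
    agents: dict = field(default_factory=dict)   # S2: agent id -> (row, col)
    targets: dict = field(default_factory=dict)  # agent id -> goal square

    def in_bounds(self, cell):
        r, c = cell
        return 0 <= r < self.size and 0 <= c < self.size

    def step(self, agent_id, action):
        """Move one agent by one square (S4) and return (new_cell, reward, done)."""
        r, c = self.agents[agent_id]
        dr, dc = ACTIONS[action]
        nxt = (r + dr, c + dc)
        if not self.in_bounds(nxt) or nxt in self.obstacles:
            reward = -1.0            # illustrative penalty for an invalid move
        else:
            self.agents[agent_id] = nxt
            reward = -0.1            # illustrative small step cost
        done = self.agents[agent_id] == self.targets[agent_id]
        if done:
            reward = 10.0            # illustrative reward for reaching the goal
        return self.agents[agent_id], reward, done

if __name__ == "__main__":
    env = GridWorld(obstacles={(4, 4), (4, 5)},
                    agents={"a0": (0, 0)}, targets={"a0": (9, 9)})
    print(env.step("a0", "right"))   # ((0, 1), -0.1, False)
```

An algorithm under test would call step() once per unit time step, receiving the new square, the immediate score and a completion flag.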
In another aspect, the present invention also provides a multi-agent simulation platform, comprising:

an interactive environment unit, which constructs a search space 10 squares long and 10 squares wide and sets a predetermined number of agents and obstacles in the search space;

a scene unit, electrically connected to the interactive environment unit, for providing the number and positions of the agents and obstacles in the search space for a preset scene;

an algorithm unit, electrically connected to the interactive environment unit, for loading the algorithm to be tested, receiving the observation information that the agents feed back to that algorithm, and outputting the algorithm's decision information on the agents' movement directions;

a scoring unit, electrically connected to the algorithm unit and the interactive environment unit respectively, for scoring each movement of an agent according to the task requirements of the algorithm to be tested until the agent completes the task, and for feeding the total task score back to the algorithm to be tested; the agents execute the task many times, and after continuous trial and error the algorithm to be tested obtains the optimal strategy.

In the search space, one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and an agent cannot enter the square where an obstacle is located.

Beneficial Effects

In the multi-agent simulation method of the present invention and the platform using it, the smallest unit, a single square, corresponds both to one pixel and to one unit length of a physical quantity. A square can therefore be used as a pixel (target points, passable areas and obstacles have different pixel values) and also as the unit minimum distance (describing the position of an agent, the distance between an agent and an obstacle, and the distance between an agent and a target point). By giving the small square this dual attribute of pixel and unit length, a single simulation platform can provide users with different kinds of environment state information, i.e. it is directly compatible with both pixel and non-pixel observations, which makes it convenient for users to test their algorithms.
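As an illustration of this dual role, the sketch below (reusing the GridWorld example above) reads the same scene out in both forms; the function names render_pixels/render_physical and the particular pixel values are assumptions made for this example.

```python
# Illustrative sketch only: the same square acts as a pixel and as a unit length.
import numpy as np

def render_pixels(env):
    """One square = one pixel: return a size x size image of the scene."""
    img = np.zeros((env.size, env.size), dtype=np.uint8)     # passable area = 0
    for cell in env.obstacles:
        img[cell] = 255                                       # obstacles
    for cell in env.targets.values():
        img[cell] = 128                                       # target points
    for cell in env.agents.values():
        img[cell] = 64                                        # agents
    return img

def render_physical(env, agent_id):
    """One square = one unit length: positions and distances measured in squares."""
    r, c = env.agents[agent_id]
    tr, tc = env.targets[agent_id]
    return {
        "position": (r, c),
        "distance_to_target": abs(tr - r) + abs(tc - c),      # Manhattan distance
        "distance_to_nearest_obstacle": min(
            (abs(orr - r) + abs(occ - c) for orr, occ in env.obstacles),
            default=None),
    }
```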
Brief Description of the Drawings

Fig. 1 is a flow chart of the implementation of the multi-agent simulation method provided by Embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the preset simulation scenes provided by the present invention;

Fig. 3 is a schematic diagram of the pixel-based observation state of multiple agents in an embodiment of the present invention;

Fig. 4 is a schematic diagram of a user-defined scene layout in an embodiment of the present invention;

Fig. 5 is a schematic diagram of a user-defined number of agents in a scene in an embodiment of the present invention;

Fig. 6 is a functional block diagram of the multi-agent simulation platform provided by Embodiment 2 of the present application.
本发明的实施方式Embodiments of the present invention
为了使本发明的目的、技术方案及优点更加清楚明白,以下结合附图1-6及实施例,对本发明进行进一步详细说明。应当理解,此处所描述的具体实施例仅仅用以解释本发明,并不用于限定本发明。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings 1-6 and the embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.
以下结合具体实施例对本发明的具体实现进行详细描述:The specific implementation of the present invention is described in detail below in conjunction with specific embodiments:
实施例一:Example 1:
Fig. 1 shows a multi-agent simulation method provided by Embodiment 1 of the present invention. The method includes the following steps:

S1. Construct a search space 10 squares long and 10 squares wide.

This step confines the search space to a limited area, which lowers the pixel dimensionality and reduces the waiting time of the algorithm training process.

S2. Set a predetermined number of agents and obstacles in the search space.

S3. The agents upload the observation information composed of the search space, the obstacles and the other agents to the algorithm to be tested.

S4. The algorithm to be tested guides the agents to move, step by step, according to the task requirements and the observation information; the scoring system scores each movement of an agent according to the task requirements of the algorithm to be tested, until the agent completes the task.

S5. The total task score is fed back to the algorithm to be tested; the agents execute the task many times, and after continuous trial and error the algorithm to be tested obtains the optimal strategy.

As shown in Figs. 2-5, in the search space one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and an agent cannot enter the square where an obstacle is located.
In practice, the self-tuning and trial-and-error scheme of the algorithm to be tested is set by the algorithm itself; the present application provides, for reference, the total score of each task, the movement pattern of each agent, and the record of individual rewards or penalties.

Preferably, the multi-agent simulation method of the present application is not limited to reinforcement learning algorithms and can also be applied to other algorithms capable of self-learning.

Preferably, in step S4, the target of each movement of an agent is one of the squares above, below, to the left of or to the right of its current square.

Preferably, in step S4, the scoring includes a reward that adds points or a penalty that deducts points.

Preferably, in step S3, the observation range of an agent is the 3*3 area of squares centered on the square it occupies.

Preferably, in step S2, the number and positions of the agents and obstacles are determined according to the requirements of the algorithm to be tested or according to a preset scene.

This step provides a user-defined interface through which users can modify the scene to a certain extent according to their own needs, for example by changing the number of agents or the distribution of obstacles.

In the embodiment of the present invention, setting the search space to a limited 10*10 space ensures the validity and timeliness of the algorithm to be tested and reduces the waiting time of the training process. Each square corresponds simultaneously to one pixel and to one unit length of a physical quantity, which helps make pixel and non-pixel observations compatible during training and is convenient for user testing.
Embodiment 2:

Fig. 6 shows a multi-agent simulation platform using the above simulation method, provided by Embodiment 2 of the present invention and comprising:

an interactive environment unit, which constructs a search space 10 squares long and 10 squares wide and sets a predetermined number of agents and obstacles in the search space.

Specifically, the environment interaction unit is responsible for passing the agents' observations of the environment to the algorithm and for displaying the behavior decided by the algorithm on the visual interface. An agent observes environment information from the scene in the search space and passes the observed state to the interaction module. The interaction module uploads the observed state to the algorithm, the algorithm predicts a behavior through the trained model, the agent executes the behavior in the interaction module, and the scene returns a new observation, the reward for the behavior, and whether this training episode is finished.
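A minimal sketch of this observe, predict, execute, reward loop, assuming the GridWorld and render_pixels examples above and a hypothetical policy object with act() and learn() methods, might look as follows:

```python
# Sketch of the interaction loop between environment, agents and the algorithm under
# test. The policy interface (act/learn) is an assumption made for this illustration.
def run_episode(env, policies, max_steps=200):
    """One training episode: observe -> predict behavior -> execute -> reward/done."""
    total_reward = {agent_id: 0.0 for agent_id in env.agents}
    done = {agent_id: False for agent_id in env.agents}
    for _ in range(max_steps):
        for agent_id, policy in policies.items():
            if done[agent_id]:
                continue
            obs = render_pixels(env)              # or render_physical(env, agent_id)
            action = policy.act(obs)              # algorithm predicts the behavior
            _, reward, finished = env.step(agent_id, action)
            policy.learn(obs, action, reward)     # reward fed back to the algorithm
            total_reward[agent_id] += reward
            done[agent_id] = finished
        if all(done.values()):                    # episode ends when all agents finish
            break
    return total_reward
```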
a scene unit, electrically connected to the interactive environment unit, for providing the number and positions of the agents and obstacles in the search space for a preset scene.

As shown in Fig. 2, the scene unit of the simulation platform presets three experimental scenes commonly used for multiple agents, namely "pursuit-escape", "multi-target navigation" and "exchange positions", and provides a user-defined interface. A solid polygon represents an agent and a hollow polygon represents an agent's target. Agents with different tasks are represented by solid polygons of different shapes, such as the solid circle and the solid diamond in the figure; their movement targets are the corresponding hollow circle and hollow diamond.

In the scene unit, the scene is first initialized: the number, colors and other attributes of the agents, the positions of the start and end points in the scene, and the number and positions of the obstacles are determined. Next, the observation information of the agents is determined; the image-based and non-image-based observation information differ between scenes. Image-based observation information consists of the pixels of the squares around an agent, while non-image-based observation information consists of physical quantities reflecting the environment (such as positions, relative distances between agents, and relative distances between an agent and its target). Finally, it is judged whether a training episode has ended; note that an episode ends only when all agents in the scene have completed their tasks.
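An illustrative scene configuration for such initialization is sketched below; the dictionary layout, key names and coordinates are assumptions chosen for this example rather than values taken from the patent.

```python
# Hypothetical configuration of an "exchange positions" scene and a helper that
# builds the GridWorld sketch from it. All coordinates are arbitrary example values.
exchange_positions_scene = {
    "size": 10,
    "agents":  {"a0": {"start": (2, 0), "goal": (2, 9), "color": "red"},
                "a1": {"start": (2, 9), "goal": (2, 0), "color": "blue"}},
    "obstacles": [(1, 4), (1, 5), (3, 4), (3, 5)],   # narrow corridor both agents share
    "observation": "pixels",                         # or "physical"
    "observation_depth": 1,                          # 1 -> a 3*3 window around each agent
}

def build_env(cfg):
    """Instantiate the GridWorld sketch from the earlier example from a scene config."""
    return GridWorld(size=cfg["size"],
                     obstacles=set(cfg["obstacles"]),
                     agents={k: v["start"] for k, v in cfg["agents"].items()},
                     targets={k: v["goal"] for k, v in cfg["agents"].items()})
```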
an algorithm unit, electrically connected to the interactive environment unit, for loading the algorithm to be tested, receiving the observation information that the agents feed back to that algorithm, and outputting the algorithm's decision information on the agents' movement directions.

Specifically, the user adds the algorithm to be tested in the algorithm unit; through the interface of the environment interaction unit the algorithm obtains the observed state of the scene fed back by the agents, predicts behaviors and returns them to the agents, thereby guiding the agents' movement.

a scoring unit, electrically connected to the algorithm unit and the interactive environment unit respectively, for scoring each movement of an agent according to the task requirements of the algorithm to be tested until the agent completes its task. After each task is completed, the total task score is fed back to the algorithm to be tested; when the agents execute the task many times, the algorithm to be tested obtains the optimal strategy after continuous trial and error.

Specifically, the scoring unit has to design the agents' immediate reward, i.e. the numerical reward for each behavior of an agent. Whether the reward design reflects how good or bad an agent's behavior is determines the effectiveness of the optimized strategy.
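A possible shape for such an immediate-reward design is sketched below; the event names and numerical values are assumptions made for illustration, since the concrete rewards are left to the preset scenes (see the table referenced in Embodiment 3).

```python
# Illustrative immediate-reward design for the scoring unit; values are assumptions.
def immediate_reward(moved, hit_obstacle, reached_goal, collided_with_agent):
    """Return the numerical reward for one behavior of one agent."""
    if reached_goal:
        return 10.0           # large bonus: the task-completing behavior
    if hit_obstacle or collided_with_agent:
        return -1.0           # penalty: the move was invalid or unsafe
    if moved:
        return -0.1           # small step cost encourages short paths
    return -0.2               # idling is discouraged slightly more than moving
```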
In the search space, one square corresponds to one pixel and to one unit length of a physical quantity. By using the small square with this dual attribute of pixel and unit length, a single simulation platform can conveniently provide users with different kinds of environment state information, which facilitates user testing.

One agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and an agent cannot enter the square where an obstacle is located.

Preferably, in the interactive environment unit, the target of each movement of an agent is one of the squares above, below, to the left of or to the right of its current square.

Preferably, the scoring unit's scoring of an agent's movement result includes a reward that adds points or a penalty that deducts points.

Preferably, in the interactive environment unit, the observation range of an agent is the 3*3 area of squares centered on the square it occupies.

For the observation state based on image pixels, as shown in Fig. 3, the solid circle and the solid diamond are agents with different tasks. The hollow diamond is the target of the agent represented by the solid diamond, and the hollow square enclosing a hollow circle is the target of the agent represented by the solid circle. The 3*3 hollow boxes around an agent represent its observation range: centered on the agent, the observation range is determined by the observation depth set by the user, and the pixels (three RGB channels) in each square of that range are taken as the agent's observed state. Fig. 3 shows a 3*3 observation state. Observation states based on image pixels are commonly used in game-related multi-agent reinforcement learning simulation platforms.
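A sketch of extracting this pixel-based observation from the GridWorld example, assuming an observation depth d (d = 1 gives the 3*3 window) and an illustrative RGB coding of the squares:

```python
# Sketch only: returns a (2d+1) x (2d+1) x 3 RGB window centered on the agent.
import numpy as np

def pixel_observation(env, agent_id, depth=1):
    def rgb_of(cell):
        # illustrative color coding; squares outside the space count as blocked
        if not env.in_bounds(cell) or cell in env.obstacles:
            return (0, 0, 0)          # obstacle or out of bounds
        if cell in env.targets.values():
            return (0, 255, 0)        # target point
        if cell in env.agents.values():
            return (255, 0, 0)        # an agent (including this one)
        return (255, 255, 255)        # passable square
    r, c = env.agents[agent_id]
    window = [[rgb_of((r + dr, c + dc)) for dc in range(-depth, depth + 1)]
              for dr in range(-depth, depth + 1)]
    return np.asarray(window, dtype=np.uint8)     # shape (3, 3, 3) for depth=1
```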
For the observation state based on physical quantities, the physical quantities commonly used as the observed state include the position of an agent, the relative positions between agents, and the relative position between an agent and its target point. For different scenes, the algorithm to be tested can choose different physical quantities to represent the observed state.
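A corresponding sketch of a physical-quantity observation vector; which quantities are included, and in what order, is a choice left to the algorithm to be tested, and the selection below is only an example:

```python
# Sketch only: a non-pixel observation built from positions measured in squares.
def physical_observation(env, agent_id):
    r, c = env.agents[agent_id]
    tr, tc = env.targets[agent_id]
    obs = [r, c, tr - r, tc - c]                       # own position, offset to target
    for other_id, (orr, occ) in env.agents.items():    # relative positions of other agents
        if other_id != agent_id:
            obs.extend([orr - r, occ - c])
    return obs
```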
Preferably, the scene unit's configuration of the number and positions of the agents and obstacles is determined according to the requirements of the algorithm to be tested or according to a preset scene.

Specifically, the user can change the scene settings of the search space according to the testing needs of the algorithm, mainly in two respects: (1) changing the distribution of obstacles in the scene; (2) changing the number of agents in the scene. Changing the distribution of obstacles increases the difficulty of the test. The solid circle and solid diamond shown in Fig. 4 represent agents with different tasks; the hollow diamond is the target of the solid circle, and the hollow square enclosing a hollow circle is the target of the solid diamond. The agents that need to exchange positions share a stretch of the route that both must pass through, so to complete the task they must learn cooperative navigation: one agent passes through the area first, and the other waits and passes once the area becomes traversable. Changing the number of agents in the scene tests the performance of the algorithm. As shown in Fig. 5, the solid hexagon, solid triangle, solid circle and solid diamond represent agents with different tasks; the hollow diamond is the target of the solid circle, the hollow square enclosing a hollow circle is the target of the solid diamond, the hollow hexagon is the target of the solid hexagon, and the hollow triangle is the target of the solid triangle. The number of agents that need to exchange positions increases from two to four, so the algorithm has to coordinate the strategies of four agents to complete the navigation task in an even narrower space. The global observation state and the joint action space become more complex, the task becomes harder to complete, and the performance requirements on the algorithm are higher. This customizability of the simulation platform can satisfy more of the users' needs.
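A sketch of such a customization, building on the scene-configuration example above; the helper name customize and the added coordinates are assumptions for illustration:

```python
# Hypothetical use of the user-defined interface: raise the difficulty of the preset
# scene by adding obstacles and increasing the number of agents from two to four.
import copy

def customize(scene, extra_obstacles=(), extra_agents=None):
    cfg = copy.deepcopy(scene)
    cfg["obstacles"].extend(extra_obstacles)         # (1) change the obstacle layout
    cfg["agents"].update(extra_agents or {})         # (2) change the number of agents
    return cfg

four_agent_scene = customize(
    exchange_positions_scene,
    extra_obstacles=[(5, 4), (5, 5)],
    extra_agents={"a2": {"start": (0, 4), "goal": (9, 4), "color": "green"},
                  "a3": {"start": (9, 4), "goal": (0, 4), "color": "yellow"}},
)
env = build_env(four_agent_scene)
```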
Embodiment 3:

The following table shows information such as the agent behavior states and the associated rewards for the three preset scenes applied in the search space.
[Table provided as an image in the original publication; it lists the behavior states and the corresponding rewards for the three preset scenes.]
The multi-agent simulation platform provided by the embodiments of the present invention is confined to a limited space (10*10 squares), which reduces the waiting time of the algorithm training process. The pixel and unit-length attributes are merged into the smallest unit of the simulation environment, a single square, which overcomes the limitation of existing methods that can only be based on one kind of state information (image pixels or physical quantities). In addition, a user-defined interface is provided, through which users can modify the scene to a certain extent according to their own needs so as to meet the testing requirements of the algorithm.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. A multi-agent simulation method, characterized in that the method comprises the following steps:
    S1. constructing a search space 10 squares long and 10 squares wide;
    S2. setting a predetermined number of agents and obstacles in the search space;
    S3. the agents uploading the observation information formed by the search space, the obstacles and the other agents to an algorithm to be tested;
    S4. the algorithm to be tested guiding the agents to move, step by step, according to task requirements and the observation information; a scoring system scoring each movement of an agent according to the task requirements of the algorithm to be tested, until the agent completes the task;
    S5. feeding the total task score back to the algorithm to be tested; the agents executing the task many times, and the algorithm to be tested obtaining the optimal strategy after continuous trial and error;
    wherein, in the search space, one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and an agent cannot enter the square where an obstacle is located.
  2. The method according to claim 1, characterized in that, in step S4, the target of each movement of an agent is one of the squares above, below, to the left of or to the right of its current square.
  3. The method according to claim 1, characterized in that, in step S4, the scoring comprises a reward that adds points or a penalty that deducts points.
  4. The method according to claim 1, characterized in that, in step S3, the observation range of an agent is the 3*3 area of squares centered on the square it occupies.
  5. The method according to claim 1, characterized in that, in step S2, the number and positions of the agents and obstacles are determined according to the requirements of the algorithm to be tested or according to a preset scene.
  6. A multi-agent simulation platform, characterized in that the platform comprises:
    an interactive environment unit, which constructs a search space 10 squares long and 10 squares wide and sets a predetermined number of agents and obstacles in the search space;
    a scene unit, electrically connected to the interactive environment unit, for providing the number and positions of the agents and obstacles in the search space for a preset scene;
    an algorithm unit, electrically connected to the interactive environment unit, for loading an algorithm to be tested, receiving the observation information that the agents feed back to that algorithm, and outputting the algorithm's decision information on the agents' movement directions;
    a scoring unit, electrically connected to the algorithm unit and the interactive environment unit respectively, for scoring each movement of an agent according to the task requirements of the algorithm to be tested until the agent completes the task, and for feeding the total task score back to the algorithm to be tested; the agents execute the task many times, and the algorithm to be tested obtains the optimal strategy after continuous trial and error;
    wherein, in the search space, one square corresponds to one pixel and to one unit length of a physical quantity; one agent occupies one square, and the unit time step of its movement is one square; one obstacle occupies one square, and an agent cannot enter the square where an obstacle is located.
  7. The platform according to claim 6, characterized in that, in the interactive environment unit, the target of each movement of an agent is one of the squares above, below, to the left of or to the right of its current square.
  8. The platform according to claim 6, characterized in that the scoring unit's scoring of an agent's movement result comprises a reward that adds points or a penalty that deducts points.
  9. The platform according to claim 6, characterized in that, in the interactive environment unit, the observation range of an agent is the 3*3 area of squares centered on the square it occupies.
  10. The platform according to claim 6, characterized in that the scene unit's configuration of the number and positions of the agents and obstacles is determined according to the requirements of the algorithm to be tested or according to a preset scene.
PCT/CN2020/138782 2020-12-11 2020-12-24 Multi-agent simulation method and platform using method WO2022120955A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011442726.2 2020-12-11
CN202011442726.2A CN114626175A (en) 2020-12-11 2020-12-11 Multi-agent simulation method and platform adopting same

Publications (1)

Publication Number Publication Date
WO2022120955A1 true WO2022120955A1 (en) 2022-06-16

Family

ID=81895881

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138782 WO2022120955A1 (en) 2020-12-11 2020-12-24 Multi-agent simulation method and platform using method

Country Status (2)

Country Link
CN (1) CN114626175A (en)
WO (1) WO2022120955A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912356A (en) * 2023-09-13 2023-10-20 深圳大学 Hexagonal set visualization method and related device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010042256A9 (en) * 2008-05-30 2010-06-03 The University Of Memphis Research Foundation Methods of improved learning in simultaneous recurrent neural networks
US20180165602A1 (en) * 2016-12-14 2018-06-14 Microsoft Technology Licensing, Llc Scalability of reinforcement learning by separation of concerns
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling
CN110135644A (en) * 2019-05-17 2019-08-16 北京洛必德科技有限公司 A kind of robot path planning method for target search
CN110977967A (en) * 2019-11-29 2020-04-10 天津博诺智创机器人技术有限公司 Robot path planning method based on deep reinforcement learning
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010042256A9 (en) * 2008-05-30 2010-06-03 The University Of Memphis Research Foundation Methods of improved learning in simultaneous recurrent neural networks
US20180165602A1 (en) * 2016-12-14 2018-06-14 Microsoft Technology Licensing, Llc Scalability of reinforcement learning by separation of concerns
CN108563112A (en) * 2018-03-30 2018-09-21 南京邮电大学 Control method for emulating Soccer robot ball-handling
CN110135644A (en) * 2019-05-17 2019-08-16 北京洛必德科技有限公司 A kind of robot path planning method for target search
CN110977967A (en) * 2019-11-29 2020-04-10 天津博诺智创机器人技术有限公司 Robot path planning method based on deep reinforcement learning
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN, FENGTING ET AL.: "DP-Q(λ):Real-time Path Planning for Multi-agent in Large-scale Web3D Scene", JOURNAL OF SYSTEM SIMULATION, vol. 31, no. 1, 31 January 2019 (2019-01-31), pages 19 - 26, XP055941909, ISSN: 1004-731X *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912356A (en) * 2023-09-13 2023-10-20 深圳大学 Hexagonal set visualization method and related device
CN116912356B (en) * 2023-09-13 2024-01-09 深圳大学 Hexagonal set visualization method and related device

Also Published As

Publication number Publication date
CN114626175A (en) 2022-06-14

Similar Documents

Publication Publication Date Title
Han et al. A dynamic resource allocation framework for synchronizing metaverse with iot service and data
CN113495578B (en) Digital twin training-based cluster track planning reinforcement learning method
CN110865653B (en) Distributed cluster unmanned aerial vehicle formation transformation method
US7537523B2 (en) Dynamic player groups for interest management in multi-character virtual environments
CN106949893A (en) The Indoor Robot air navigation aid and system of a kind of three-dimensional avoidance
CN106991713A (en) Method and apparatus, medium, processor and the terminal of scene in more new game
Kumar et al. Federated control with hierarchical multi-agent deep reinforcement learning
CN110308740A (en) A kind of unmanned aerial vehicle group dynamic task allocation method towards mobile target tracking
CN107491086A (en) Unmanned plane formation obstacle avoidance and system under time-varying network topology
CN107918403A (en) A kind of implementation method of multiple no-manned plane flight path collaborative planning
Ding et al. Hierarchical reinforcement learning framework towards multi-agent navigation
CN111208842B (en) Virtual unmanned aerial vehicle and entity unmanned aerial vehicle mixed cluster task control system
WO2022121207A1 (en) Trajectory planning method and apparatus, device, storage medium, and program product
WO2022120955A1 (en) Multi-agent simulation method and platform using method
CN110162097A (en) Unmanned plane distribution formation control method based on energy consumption
Sui et al. Path planning of multiagent constrained formation through deep reinforcement learning
CN112114592B (en) Method for realizing autonomous crossing of movable frame-shaped barrier by unmanned aerial vehicle
CN108628294A (en) A kind of autonomous cooperative control system of multirobot target and its control method
CN106557611A (en) The Dynamic Load-balancing Algorithm research of distributed traffic network simulation platform and application
Wu et al. Digital twin-enabled reinforcement learning for end-to-end autonomous driving
CN115327499A (en) Radar target track simulation method based on load unmanned aerial vehicle
CN107247253A (en) A kind of phased-array radar beam dispath information visuallization system and method
CN117170410B (en) Control method for unmanned aerial vehicle formation flight and related products
CN117516562A (en) Road network processing method and related device
CN107543549A (en) Route planning method under the unilateral imaging constraints of unmanned plane

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20964884

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20964884

Country of ref document: EP

Kind code of ref document: A1