CN112819144B - Method for improving convergence and training speed of neural network with multiple agents
- Publication number
- CN112819144B (application CN202110192255.2A)
- Authority
- CN
- China
- Prior art keywords
- agent
- intelligent
- agents
- neural network
- rewards
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/6027—Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
Abstract
The invention relates to a method, a device and a storage medium for improving the convergence and training speed of a neural network with multiple agents, in which directional rewards/penalties are applied to the rewards of the multiple agents: for each single agent within the multi-agent task, an agent that has currently made the optimal decision is encouraged and its behavior is retained, while an agent that has made a wrong decision receives a directional penalty, without affecting the neural-network optimization of the other agents. On this basis, the multi-agent AI of the invention knows exactly which agent was at fault during back propagation, so that only that agent is penalized when the gradient is computed, which accelerates the convergence and training speed of the neural network and further improves the performance of the multi-agent AI.
Description
Technical Field
The invention relates to the technical field of artificial-intelligence reinforcement learning, and in particular to a method for improving the convergence and training speed of a neural network with multiple agents.
Background
As shown in fig. 1, reinforcement learning is learning by an agent (Agent) in a "trial and error" manner: the agent obtains rewards by interacting with the environment, and its aim is to obtain the largest reward. Reinforcement learning differs from supervised learning in connectionist learning mainly in the reinforcement signal: the reinforcement signal provided by the environment evaluates the quality of the action taken, rather than telling the reinforcement learning system (RLS) how to generate the correct action. Since little information is provided by the external environment, the RLS must learn from its own experience. In this way, the RLS acquires knowledge in an action-evaluation environment and improves its action plan to suit the environment.
If a certain behavior strategy of an agent leads to a positive reward (reinforcement signal) from the environment, the agent's tendency to adopt this behavior strategy later is strengthened. The goal of the agent is to find, in each discrete state, the optimal strategy that maximizes the expected discounted sum of rewards. Reinforcement learning regards learning as a trial-and-evaluation process: the agent selects an action and applies it to the environment; the state of the environment changes after receiving the action, and a reinforcement signal (reward or penalty) is generated and fed back to the agent; the agent then selects the next action according to the reinforcement signal and the current state of the environment, with the principle of increasing the probability of receiving positive reinforcement (reward). The selected action affects not only the immediate reinforcement signal but also the state of the environment at the next moment and the final reinforcement signal. The learning goal of the reinforcement learning system is to dynamically adjust its parameters to maximize the reinforcement signal. For example, in artificial-intelligence training for weiqi (Go), if the AI places a stone on a position where a piece already exists, the action policy must be penalized in order to guide the AI toward optimization. (In the present invention, positive scores are called rewards and negative deductions are called penalties.)
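The interaction loop described above can be summarized in code. The following is a minimal sketch under assumed interfaces; the toy environment, the random agent, and the method names are illustrative assumptions, not taken from the patent:

```python
import random

class ToyEnv:
    """Toy environment: start at 0, move +/-1 per step; +1 reward at +3, -1 at -3."""
    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        self.position += action
        done = abs(self.position) == 3
        reward = 0.0
        if done:
            reward = 1.0 if self.position == 3 else -1.0   # reinforcement signal
        return self.position, reward, done

class ToyAgent:
    """Toy agent with a random policy; a real agent would update a neural network in update()."""
    def select_action(self, state):
        return random.choice([-1, 1])

    def update(self, state, action, reward, next_state):
        pass  # placeholder: adjust parameters toward a larger expected reward

env, agent = ToyEnv(), ToyAgent()
state, done = env.reset(), False
while not done:
    action = agent.select_action(state)              # agent acts on the current state
    state_next, reward, done = env.step(action)      # environment returns the new state and signal
    agent.update(state, action, reward, state_next)  # agent learns from the evaluation
    state = state_next
```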
In reinforcement-learning artificial-intelligence training there is the problem of setting the reward (Reward) for multiple agents. As shown in fig. 2, in the existing technical solution for handling multiple agents, the reward is computed for the multi-agent AI as a whole, and the neural network is then optimized by back propagation according to this shared reward or penalty. The disadvantage of computing a single reward under the multi-agent problem is that, during optimization, it is not actually known which agent performed better and which performed worse, so the multi-agent AI cannot be guided toward more effective optimization. Because of this disadvantage, the multi-agent AI will not allow a single agent, during optimization, to issue instructions that stand out from the team's shared benefit, which limits the possibility of the multi-agent AI exploring the best strategy and loses many opportunities for curiosity-driven training.
For example, in a turn-based game, the multi-agent AI operates the characters of an entire team. In turn-based games with fog of war, the multi-agent AI plays a team of characters; each character observes a different fog-of-war state because of its different viewing position, and the multi-agent AI pieces these states together into global information and then makes further decisions so that each character executes its own instruction. If several characters in the team are being operated, the wrong instructions they issue must be passed back to the multi-agent AI so that a penalty can be applied. The problem is that, in the existing technology, the Reward setting of the AI shares a single reward report: the operated characters rise and fall together, so one character's error results in a penalty for the whole team, and even if only one character in the team makes a mistake, the reward is still penalized (though less than when the whole team errs).
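The prior-art scheme criticized above amounts to broadcasting one shared team reward to every agent. The following minimal sketch illustrates this; the agent objects and the update_network method are hypothetical placeholders, not part of the patent:

```python
def shared_reward_update(agents, team_reward):
    # Prior-art scheme: a single team-level reward is broadcast to every agent's network,
    # so one agent's error penalizes all agents, and an individually excellent decision
    # cannot be singled out for encouragement.
    for agent in agents:
        agent.update_network(team_reward)
```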
In view of the above, the present inventors have conducted intensive research into the above problems and thereby made the present invention.
Disclosure of Invention
The invention aims to provide a method for improving the convergence and training speed of a neural network with multiple agents, which improves the convergence and training speed of the neural network through directional rewards/penalties.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the method is realized based on a multi-agent system, wherein the multi-agent system comprises a multi-agent master control and N agents, and a buried point is arranged in feedback of each agent and is used for judging whether instructions of the agents are wrong or not and making an excellent decision; the method comprises the following steps:
inputting state information, and transmitting the current state information to N intelligent agents;
the intelligent agent outputs respective instructions according to the respective neural networks and by combining the current state information;
the intelligent agent gives rewards and punishments feedback to the intelligent agent according to the instruction result and by combining with the buried point judgment in the feedback;
transmitting the rewards and punishments list of the N intelligent agents to a multi-intelligent agent master control;
and the master control of multiple agents reversely updates the neural network of each agent according to the rewarding and punishing list.
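The sketch below walks through these steps, assuming each agent holds its own network and a buried point that scores its instruction; the class and attribute names (Agent, MasterControl, buried_point and so on) are illustrative assumptions rather than the patent's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    name: str
    network: dict = field(default_factory=dict)                # stand-in for the agent's neural network
    buried_point: Callable[[str], float] = lambda instr: 0.0   # scores this agent's own instruction

    def act(self, state) -> str:
        # Step 2: output an instruction from this agent's own network and the shared state.
        return f"{self.name}:instruction({state})"

    def feedback(self, instruction: str) -> float:
        # Step 3: the buried point judges whether the instruction is wrong or excellent
        # and returns this agent's individual reward (positive) or penalty (negative).
        return self.buried_point(instruction)

class MasterControl:
    def __init__(self, agents: List[Agent]):
        self.agents = agents

    def train_step(self, state) -> List[float]:
        instructions = [a.act(state) for a in self.agents]        # steps 1-2: dispatch state, act
        reward_list = [a.feedback(instr)                          # step 3: per-agent feedback M_i
                       for a, instr in zip(self.agents, instructions)]
        for a, m_i in zip(self.agents, reward_list):              # steps 4-5: the master control applies
            a.network["last_update"] = m_i                        # each M_i only to its own agent
        return reward_list
```

In use, each agent's buried point would be wired to the actual game or task result for that agent's instruction, producing the per-agent list {M1, …, MN} that the master control applies one-to-one during back propagation.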
An apparatus for improving convergence and training speed for a neural network having multiple agents, the apparatus comprising a processor and a memory;
the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method as described above.
A computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform a method as described above.
A computer software program product which, when run on a terminal device, causes the terminal device to perform a method as described above.
After adopting the above scheme, the invention applies directional rewards/penalties to the rewards of the multiple agents: for each single agent within the multi-agent task, an agent that has currently made the optimal decision is encouraged and its behavior is retained, while an agent that has made a wrong decision receives a directional penalty, and the neural-network optimization of the other agents is not affected. On this basis, the multi-agent AI of the invention knows exactly which agent was at fault during back propagation, so that only that agent is penalized when the gradient is computed, which accelerates the convergence and training speed of the neural network and further improves the performance of the multi-agent AI. By splitting the multi-agent task from the team into independent individuals, the invention makes the multi-agent AI richer in global strategy, and the effect is especially notable in game-AI training.
Drawings
FIG. 1 is a schematic diagram of reinforcement learning;
FIG. 2 is a schematic diagram of a multi-agent reinforcement learning method in the prior art;
FIG. 3 is a schematic diagram of a multi-agent reinforcement learning method according to the present invention;
FIG. 4 is a schematic diagram of a learning method according to an embodiment of the invention.
Detailed Description
As shown in fig. 3, the present invention discloses a method for improving the convergence and training speed of a neural network with multiple agents. The method is implemented on the basis of a multi-agent system, which comprises a multi-agent master control and N agents; a buried point (an instrumentation point) is arranged in the feedback of each agent to judge whether the agent's instruction is wrong or whether an excellent decision has been made. The method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network, combined with the current state information;
according to the instruction result, combined with the buried-point judgment in its feedback, each agent is given its own reward-and-penalty feedback;
the reward-and-penalty list of the N agents is transmitted to the multi-agent master control;
the multi-agent master control updates the neural network of each agent by back propagation according to the reward-and-penalty list.
Fig. 4 shows an embodiment of the present invention with three agents in total: agent A, agent B and agent C. After the state information is input, agent A, agent B and agent C output instructions according to their respective neural networks, combined with the input state information; then, according to the instruction results and the buried-point judgment, each agent is given its own reward or penalty value. In this embodiment, the reward value of agent A is 1, the reward value of agent B is 50, and the penalty value of agent C is 100. Agent A, agent B and agent C collect their respective rewards and penalties into a reward-and-penalty list {+1, +50, -100} and transmit it to the multi-agent master control, which updates the agents' networks by back propagation according to the list: agent A updates its own neural network according to the reward value 1, agent B updates its own neural network according to the reward value 50, and agent C updates its own neural network according to the penalty value 100.
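The per-agent update of this embodiment can be sketched as follows. The reward-and-penalty list {+1, +50, -100} is taken from the embodiment above; the toy parameter vectors, gradients and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
networks = {name: rng.normal(size=4) for name in "ABC"}   # toy parameters for agents A, B, C
grads    = {name: rng.normal(size=4) for name in "ABC"}   # toy policy gradients for the last step
reward_list = {"A": +1.0, "B": +50.0, "C": -100.0}        # directional rewards/penalties

learning_rate = 1e-3
for name, m in reward_list.items():
    # Each agent's update is scaled only by its own reward/penalty, so C's error is
    # penalized without disturbing the optimization of A and B.
    networks[name] += learning_rate * m * grads[name]
```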
The invention applies directional rewards/penalties to the rewards of the multiple agents: for each single agent within the multi-agent task, an agent that has currently made the optimal decision is encouraged and its behavior is retained, while an agent that has made a wrong decision is penalized directionally, so the neural-network optimization of the other agents is not affected. On this basis, the multi-agent AI of the invention knows exactly which agent was at fault during back propagation, so that only that agent is penalized when the gradient is computed, which accelerates the convergence and training speed of the neural network and further improves the performance of the multi-agent AI. By splitting the multi-agent task from the team into independent individuals, the invention makes the multi-agent AI richer in global strategy, and the effect is especially notable in game-AI training.
The invention is well suited to decision tasks under the multi-agent AI problem that depend heavily on individual agents acting in distinct roles. For example, in a cooperative team game, a single agent may need to make a sacrifice in order to preserve the revenue of the entire team. Under this premise, the old solution does not allow the single agent to take an extreme strategy that departs from the team, because the team shares the reward and the neural network will not regard the sacrifice as an excellent decision made by this single agent. The directional reward-and-penalty scheme of the invention gives each agent an independent reward value, regards the self-sacrificing agent as having made a very intelligent decision, gives the correct reward guidance, and encourages the multi-agent AI to continue to take such decisions in the same state in the future, thereby correctly optimizing the multi-agent AI.
Based on the same inventive concept, the invention also discloses a device for improving convergence and training speed of the neural network with multiple intelligent agents, wherein the device comprises a processor and a memory; the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method as described above.
The invention also discloses a computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform the method as described above.
The invention also discloses a computer software program product which, when run on a terminal device, causes the terminal device to perform the method as described above.
The foregoing embodiments are not intended to limit the technical scope of the present invention; therefore, any minor modifications, equivalent variations and amendments made to the above embodiments according to the technical principles of the present invention still fall within the scope of the technical solution of the present invention.
Claims (3)
1. A method for improving the convergence and training speed of a neural network with multiple agents, characterized in that: the method is implemented on the basis of a multi-agent system, the multi-agent system comprises a multi-agent master control and N agents, and a buried point is arranged in the feedback of each agent to judge whether the agent's instruction is wrong or whether an excellent decision has been made; the method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network, combined with the current state information;
according to the instruction result, combined with the buried-point judgment in its feedback, each agent i is given a reward-and-penalty feedback Mi, where i = 1, 2, …, N;
the rewards and penalties of the N agents are assembled into a reward-and-penalty list {M1, M2, …, Mi, …, MN} and transmitted to the multi-agent master control;
the multi-agent master control updates the neural network of each agent by back propagation according to the reward-and-penalty list {M1, M2, …, Mi, …, MN}, i.e. each agent i updates its own neural network according to its reward-and-penalty value Mi.
2. An apparatus for improving the convergence and training speed of a neural network with multiple agents, characterized in that: the apparatus comprises a processor and a memory;
the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of claim 1.
3. A computer-readable storage medium, characterized by: the computer readable storage medium has stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110192255.2A CN112819144B (en) | 2021-02-20 | 2021-02-20 | Method for improving convergence and training speed of neural network with multiple agents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112819144A CN112819144A (en) | 2021-05-18 |
CN112819144B true CN112819144B (en) | 2024-02-13 |
Family
ID=75864251
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110192255.2A Active CN112819144B (en) | 2021-02-20 | 2021-02-20 | Method for improving convergence and training speed of neural network with multiple agents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112819144B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020000399A1 (en) * | 2018-06-29 | 2020-01-02 | 东莞理工学院 | Multi-agent deep reinforcement learning proxy method based on intelligent grid |
CN111079717A (en) * | 2020-01-09 | 2020-04-28 | 西安理工大学 | Face recognition method based on reinforcement learning |
CN112286203A (en) * | 2020-11-11 | 2021-01-29 | 大连理工大学 | Multi-agent reinforcement learning path planning method based on ant colony algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN112819144A (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110991545B (en) | Multi-agent confrontation oriented reinforcement learning training optimization method and device | |
Shao et al. | Starcraft micromanagement with reinforcement learning and curriculum transfer learning | |
CN110882544B (en) | Multi-agent training method and device and electronic equipment | |
CN108211362B (en) | Non-player character combat strategy learning method based on deep Q learning network | |
Loiacono et al. | The 2009 simulated car racing championship | |
Hwang et al. | Cooperative strategy based on adaptive Q-learning for robot soccer systems | |
CN109794937B (en) | Football robot cooperation method based on reinforcement learning | |
CN113952733A (en) | Multi-agent self-adaptive sampling strategy generation method | |
CN112488320A (en) | Training method and system for multiple intelligent agents under complex conditions | |
CN112044076B (en) | Object control method and device and computer readable storage medium | |
Andou | Refinement of soccer agents' positions using reinforcement learning | |
Kose et al. | Q-learning based market-driven multi-agent collaboration in robot soccer | |
CN112149344A (en) | Football robot with ball strategy selection method based on reinforcement learning | |
CN116187777A (en) | Unmanned aerial vehicle air combat autonomous decision-making method based on SAC algorithm and alliance training | |
CN116991067A (en) | Pulse type track-chasing-escaping-blocking cooperative game intelligent decision control method | |
CN115409158A (en) | Robot behavior decision method and device based on layered deep reinforcement learning model | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN112819144B (en) | Method for improving convergence and training speed of neural network with multiple agents | |
Wang et al. | Experience sharing based memetic transfer learning for multiagent reinforcement learning | |
CN111814988B (en) | Testing method of multi-agent cooperative environment reinforcement learning algorithm | |
CN116227622A (en) | Multi-agent landmark coverage method and system based on deep reinforcement learning | |
Tan et al. | Automated evaluation for AI controllers in tower defense game using genetic algorithm | |
Stein et al. | Combining NEAT and PSO for learning tactical human behavior | |
Packard et al. | Learning behavior from limited demonstrations in the context of games | |
Ali et al. | Evolving emergent team strategies in robotic soccer using enhanced cultural algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||