CN112819144B - Method for improving convergence and training speed of neural network with multiple agents
- Publication number
- CN112819144B (application CN202110192255.2A)
- Authority
- CN
- China
- Prior art keywords
- agent
- intelligent
- agents
- neural network
- rewards
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/6027—Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
Abstract
The invention relates to a method, a device and a storage medium for improving the convergence and training speed of a neural network with multiple agents, in which directional rewards/penalties are applied to the rewards of the multiple agents: for each single agent within the multi-agent task, an agent that has currently made the optimal decision is encouraged and its behavior is retained, while an agent that has made a wrong decision receives a directional penalty, without affecting the neural-network optimization of the other agents. On this basis, the multi-agent AI of the invention knows exactly which agent was at fault during back propagation, so that only that agent is penalized when the gradient is computed, which accelerates the convergence and training speed of the neural network and further improves the performance of the multi-agent AI.
Description
Technical Field
The invention relates to the technical field of artificial-intelligence reinforcement learning, and in particular to a method for improving the convergence and training speed of a neural network with multiple agents.
Background
As shown in fig. 1, reinforcement learning is learning by an agent (Agent) in a "trial and error" manner: the agent obtains rewards by interacting with the environment, and its aim is to obtain the largest reward. Reinforcement learning differs from supervised learning in connectionist learning mainly in the reinforcement signal: the reinforcement signal provided by the environment evaluates the quality of the action taken, rather than telling the reinforcement learning system (RLS) how to generate the correct action. Since little information is provided by the external environment, the RLS must learn from its own experience. In this way, the RLS acquires knowledge in an action-evaluation environment and improves its action plan to suit the environment.
If a certain behavior strategy of an agent leads to a positive reward (reinforcement signal) from the environment, the agent's tendency to adopt this behavior strategy later is strengthened. The goal of the agent is to find, in each discrete state, the optimal strategy that maximizes the expected discounted sum of rewards. Reinforcement learning regards learning as a trial-and-evaluation process: the agent selects an action and applies it to the environment; the state of the environment changes after receiving the action, and a reinforcement signal (reward or penalty) is generated and fed back to the agent; the agent then selects the next action according to the reinforcement signal and the current state of the environment, with the principle of increasing the probability of receiving positive reinforcement (reward). The selected action affects not only the immediate reinforcement signal but also the state of the environment at the next moment and the final reinforcement signal. The learning goal of the reinforcement learning system is to dynamically adjust its parameters to maximize the reinforcement signal. For example, in artificial-intelligence training for weiqi (Go), if the AI places a stone on a position where a piece already exists, the action policy must be penalized in order to guide the AI toward optimization. (In the present invention, positive scores are called rewards and negative deductions are called penalties.)
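The interaction loop described above can be summarized in code. The following is a minimal sketch under assumed interfaces; the toy environment, the random agent, and the method names are illustrative assumptions, not taken from the patent:

```python
import random

class ToyEnv:
    """Toy environment: start at 0, move +/-1 per step; +1 reward at +3, -1 at -3."""
    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        self.position += action
        done = abs(self.position) == 3
        reward = 0.0
        if done:
            reward = 1.0 if self.position == 3 else -1.0   # reinforcement signal
        return self.position, reward, done

class ToyAgent:
    """Toy agent with a random policy; a real agent would update a neural network in update()."""
    def select_action(self, state):
        return random.choice([-1, 1])

    def update(self, state, action, reward, next_state):
        pass  # placeholder: adjust parameters toward a larger expected reward

env, agent = ToyEnv(), ToyAgent()
state, done = env.reset(), False
while not done:
    action = agent.select_action(state)              # agent acts on the current state
    state_next, reward, done = env.step(action)      # environment returns the new state and signal
    agent.update(state, action, reward, state_next)  # agent learns from the evaluation
    state = state_next
```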
In reinforcement-learning artificial-intelligence training there is the problem of setting the reward (Reward) for multiple agents. As shown in fig. 2, in the existing technical solution for handling multiple agents, the reward is computed for the multi-agent AI as a whole, and the neural network is then optimized by back propagation according to this shared reward or penalty. The disadvantage of computing a single reward under the multi-agent problem is that, during optimization, it is not actually known which agent performed better and which performed worse, so the multi-agent AI cannot be guided toward more effective optimization. Because of this disadvantage, the multi-agent AI will not allow a single agent, during optimization, to issue instructions that stand out from the team's shared benefit, which limits the possibility of the multi-agent AI exploring the best strategy and loses many opportunities for curiosity-driven training.
For example, in a turn-based game, the multi-agent AI operates the characters of an entire team. In turn-based games with fog of war, the multi-agent AI plays a team of characters; each character observes a different fog-of-war state because of its different viewing position, and the multi-agent AI pieces these states together into global information and then makes further decisions so that each character executes its own instruction. If several characters in the team are being operated, the wrong instructions they issue must be passed back to the multi-agent AI so that a penalty can be applied. The problem is that, in the existing technology, the Reward setting of the AI shares a single reward report: the operated characters rise and fall together, so one character's error results in a penalty for the whole team, and even if only one character in the team makes a mistake, the reward is still penalized (though less than when the whole team errs).
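The prior-art scheme criticized above amounts to broadcasting one shared team reward to every agent. The following minimal sketch illustrates this; the agent objects and the update_network method are hypothetical placeholders, not part of the patent:

```python
def shared_reward_update(agents, team_reward):
    # Prior-art scheme: a single team-level reward is broadcast to every agent's network,
    # so one agent's error penalizes all agents, and an individually excellent decision
    # cannot be singled out for encouragement.
    for agent in agents:
        agent.update_network(team_reward)
```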
In view of the above, the present inventors have conducted intensive research into the above problems and thereby made the present invention.
Disclosure of Invention
The invention aims to provide a method for improving the convergence and training speed of a neural network with multiple agents, which improves the convergence and training speed of the neural network through directional rewards/penalties.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the method is realized based on a multi-agent system, wherein the multi-agent system comprises a multi-agent master control and N agents, and a buried point is arranged in feedback of each agent and is used for judging whether instructions of the agents are wrong or not and making an excellent decision; the method comprises the following steps:
inputting state information, and transmitting the current state information to N intelligent agents;
the intelligent agent outputs respective instructions according to the respective neural networks and by combining the current state information;
the intelligent agent gives rewards and punishments feedback to the intelligent agent according to the instruction result and by combining with the buried point judgment in the feedback;
transmitting the rewards and punishments list of the N intelligent agents to a multi-intelligent agent master control;
and the master control of multiple agents reversely updates the neural network of each agent according to the rewarding and punishing list.
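The sketch below walks through these steps, assuming each agent holds its own network and a buried point that scores its instruction; the class and attribute names (Agent, MasterControl, buried_point and so on) are illustrative assumptions rather than the patent's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Agent:
    name: str
    network: dict = field(default_factory=dict)                # stand-in for the agent's neural network
    buried_point: Callable[[str], float] = lambda instr: 0.0   # scores this agent's own instruction

    def act(self, state) -> str:
        # Step 2: output an instruction from this agent's own network and the shared state.
        return f"{self.name}:instruction({state})"

    def feedback(self, instruction: str) -> float:
        # Step 3: the buried point judges whether the instruction is wrong or excellent
        # and returns this agent's individual reward (positive) or penalty (negative).
        return self.buried_point(instruction)

class MasterControl:
    def __init__(self, agents: List[Agent]):
        self.agents = agents

    def train_step(self, state) -> List[float]:
        instructions = [a.act(state) for a in self.agents]        # steps 1-2: dispatch state, act
        reward_list = [a.feedback(instr)                          # step 3: per-agent feedback M_i
                       for a, instr in zip(self.agents, instructions)]
        for a, m_i in zip(self.agents, reward_list):              # steps 4-5: the master control applies
            a.network["last_update"] = m_i                        # each M_i only to its own agent
        return reward_list
```

In use, each agent's buried point would be wired to the actual game or task result for that agent's instruction, producing the per-agent list {M1, …, MN} that the master control applies one-to-one during back propagation.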
An apparatus for improving convergence and training speed for a neural network having multiple agents, the apparatus comprising a processor and a memory;
the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method as described above.
A computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform a method as described above.
A computer software program product which, when run on a terminal device, causes the terminal device to perform a method as described above.
After adopting the above scheme, the invention applies directional rewards/penalties to the rewards of the multiple agents: for each single agent within the multi-agent task, an agent that has currently made the optimal decision is encouraged and its behavior is retained, while an agent that has made a wrong decision receives a directional penalty, and the neural-network optimization of the other agents is not affected. On this basis, the multi-agent AI of the invention knows exactly which agent was at fault during back propagation, so that only that agent is penalized when the gradient is computed, which accelerates the convergence and training speed of the neural network and further improves the performance of the multi-agent AI. By splitting the multi-agent task from the team into independent individuals, the invention makes the multi-agent AI richer in global strategy, and the effect is especially notable in game-AI training.
Drawings
FIG. 1 is a schematic diagram of reinforcement learning;
FIG. 2 is a schematic diagram of a multi-agent reinforcement learning method in the prior art;
FIG. 3 is a schematic diagram of a multi-agent reinforcement learning method according to the present invention;
FIG. 4 is a schematic diagram of a learning method according to an embodiment of the invention.
Detailed Description
As shown in fig. 3, the present invention discloses a method for improving the convergence and training speed of a neural network with multiple agents. The method is implemented on the basis of a multi-agent system, which comprises a multi-agent master control and N agents; a buried point (an instrumentation point) is arranged in the feedback of each agent to judge whether the agent's instruction is wrong or whether an excellent decision has been made. The method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network, combined with the current state information;
according to the instruction result, combined with the buried-point judgment in its feedback, each agent is given its own reward-and-penalty feedback;
the reward-and-penalty list of the N agents is transmitted to the multi-agent master control;
the multi-agent master control updates the neural network of each agent by back propagation according to the reward-and-penalty list.
Fig. 4 shows an embodiment of the present invention with three agents in total: agent A, agent B and agent C. After the state information is input, agent A, agent B and agent C output instructions according to their respective neural networks, combined with the input state information; then, according to the instruction results and the buried-point judgment, each agent is given its own reward or penalty value. In this embodiment, the reward value of agent A is 1, the reward value of agent B is 50, and the penalty value of agent C is 100. Agent A, agent B and agent C collect their respective rewards and penalties into a reward-and-penalty list {+1, +50, -100} and transmit it to the multi-agent master control, which updates the agents' networks by back propagation according to the list: agent A updates its own neural network according to the reward value 1, agent B updates its own neural network according to the reward value 50, and agent C updates its own neural network according to the penalty value 100.
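The per-agent update of this embodiment can be sketched as follows. The reward-and-penalty list {+1, +50, -100} is taken from the embodiment above; the toy parameter vectors, gradients and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
networks = {name: rng.normal(size=4) for name in "ABC"}   # toy parameters for agents A, B, C
grads    = {name: rng.normal(size=4) for name in "ABC"}   # toy policy gradients for the last step
reward_list = {"A": +1.0, "B": +50.0, "C": -100.0}        # directional rewards/penalties

learning_rate = 1e-3
for name, m in reward_list.items():
    # Each agent's update is scaled only by its own reward/penalty, so C's error is
    # penalized without disturbing the optimization of A and B.
    networks[name] += learning_rate * m * grads[name]
```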
The invention applies directional rewards/penalties to the rewards of the multiple agents: for each single agent within the multi-agent task, an agent that has currently made the optimal decision is encouraged and its behavior is retained, while an agent that has made a wrong decision is penalized directionally, so the neural-network optimization of the other agents is not affected. On this basis, the multi-agent AI of the invention knows exactly which agent was at fault during back propagation, so that only that agent is penalized when the gradient is computed, which accelerates the convergence and training speed of the neural network and further improves the performance of the multi-agent AI. By splitting the multi-agent task from the team into independent individuals, the invention makes the multi-agent AI richer in global strategy, and the effect is especially notable in game-AI training.
The invention is well suited to decision tasks under the multi-agent AI problem that depend heavily on individual agents acting in distinct roles. For example, in a cooperative team game, a single agent may need to make a sacrifice in order to preserve the revenue of the entire team. Under this premise, the old solution does not allow the single agent to take an extreme strategy that departs from the team, because the team shares the reward and the neural network will not regard the sacrifice as an excellent decision made by this single agent. The directional reward-and-penalty scheme of the invention gives each agent an independent reward value, regards the self-sacrificing agent as having made a very intelligent decision, gives the correct reward guidance, and encourages the multi-agent AI to continue to take such decisions in the same state in the future, thereby correctly optimizing the multi-agent AI.
Based on the same inventive concept, the invention also discloses a device for improving convergence and training speed of the neural network with multiple intelligent agents, wherein the device comprises a processor and a memory; the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method as described above.
The invention also discloses a computer readable storage medium having instructions stored therein which, when run on a terminal device, cause the terminal device to perform the method as described above.
The invention also discloses a computer software program product which, when run on a terminal device, causes the terminal device to perform the method as described above.
The foregoing embodiments are not intended to limit the technical scope of the present invention; therefore, any minor modifications, equivalent variations and amendments made to the above embodiments according to the technical principles of the present invention still fall within the scope of the technical solution of the present invention.
Claims (3)
1. A method for improving the convergence and training speed of a neural network with multiple agents, characterized in that: the method is implemented on the basis of a multi-agent system, the multi-agent system comprises a multi-agent master control and N agents, and a buried point is arranged in the feedback of each agent to judge whether the agent's instruction is wrong or whether an excellent decision has been made; the method comprises the following steps:
inputting state information and transmitting the current state information to the N agents;
each agent outputs its own instruction according to its own neural network, combined with the current state information;
according to the instruction result, combined with the buried-point judgment in its feedback, each agent i is given a reward-and-penalty feedback Mi, where i = 1, 2, …, N;
the rewards and penalties of the N agents are assembled into a reward-and-penalty list {M1, M2, …, Mi, …, MN} and transmitted to the multi-agent master control;
the multi-agent master control updates the neural network of each agent by back propagation according to the reward-and-penalty list {M1, M2, …, Mi, …, MN}, i.e. each agent i updates its own neural network according to its reward-and-penalty value Mi.
2. An apparatus for improving the convergence and training speed of a neural network with multiple agents, characterized in that: the apparatus comprises a processor and a memory;
the memory is for storing one or more software programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of claim 1.
3. A computer-readable storage medium, characterized by: the computer readable storage medium has stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110192255.2A CN112819144B (en) | 2021-02-20 | 2021-02-20 | Method for improving convergence and training speed of neural network with multiple agents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112819144A CN112819144A (en) | 2021-05-18 |
CN112819144B true CN112819144B (en) | 2024-02-13 |
Family
ID=75864251
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110192255.2A Active CN112819144B (en) | 2021-02-20 | 2021-02-20 | Method for improving convergence and training speed of neural network with multiple agents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112819144B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020000399A1 (en) * | 2018-06-29 | 2020-01-02 | 东莞理工学院 | Multi-agent deep reinforcement learning proxy method based on intelligent grid |
CN111079717A (en) * | 2020-01-09 | 2020-04-28 | 西安理工大学 | Face recognition method based on reinforcement learning |
CN112286203A (en) * | 2020-11-11 | 2021-01-29 | 大连理工大学 | Multi-agent reinforcement learning path planning method based on ant colony algorithm |
Also Published As
Publication number | Publication date |
---|---|
CN112819144A (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110991545B (en) | Multi-agent confrontation oriented reinforcement learning training optimization method and device | |
Shao et al. | Starcraft micromanagement with reinforcement learning and curriculum transfer learning | |
CN110882544B (en) | Multi-agent training method and device and electronic equipment | |
CN108211362B (en) | Non-player character combat strategy learning method based on deep Q learning network | |
Loiacono et al. | The 2009 simulated car racing championship | |
Hwang et al. | Cooperative strategy based on adaptive Q-learning for robot soccer systems | |
CN109794937B (en) | Football robot cooperation method based on reinforcement learning | |
CN113952733A (en) | Multi-agent self-adaptive sampling strategy generation method | |
CN112488320A (en) | Training method and system for multiple intelligent agents under complex conditions | |
CN112044076B (en) | Object control method and device and computer readable storage medium | |
Andou | Refinement of soccer agents' positions using reinforcement learning | |
Kose et al. | Q-learning based market-driven multi-agent collaboration in robot soccer | |
CN112149344A (en) | Football robot with ball strategy selection method based on reinforcement learning | |
CN116187777A (en) | Unmanned aerial vehicle air combat autonomous decision-making method based on SAC algorithm and alliance training | |
CN116991067A (en) | Pulse type track-chasing-escaping-blocking cooperative game intelligent decision control method | |
CN115409158A (en) | Robot behavior decision method and device based on layered deep reinforcement learning model | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN112819144B (en) | Method for improving convergence and training speed of neural network with multiple agents | |
Wang et al. | Experience sharing based memetic transfer learning for multiagent reinforcement learning | |
CN111814988B (en) | Testing method of multi-agent cooperative environment reinforcement learning algorithm | |
CN116227622A (en) | Multi-agent landmark coverage method and system based on deep reinforcement learning | |
Tan et al. | Automated evaluation for AI controllers in tower defense game using genetic algorithm | |
Stein et al. | Combining NEAT and PSO for learning tactical human behavior | |
Packard et al. | Learning behavior from limited demonstrations in the context of games | |
Ali et al. | Evolving emergent team strategies in robotic soccer using enhanced cultural algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||