CN111783944A - Rule embedded multi-agent reinforcement learning method and device based on combination training - Google Patents

Rule embedded multi-agent reinforcement learning method and device based on combination training

Info

Publication number
CN111783944A
CN111783944A (application CN202010568287.3A)
Authority
CN
China
Prior art keywords: agent, reinforcement learning, rule, learning model, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010568287.3A
Other languages
Chinese (zh)
Inventor
李渊
张帅
徐新海
刘逊韵
张峰
李豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Research Institute of War of PLA Academy of Military Science
Original Assignee
Research Institute of War of PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Research Institute of War of PLA Academy of Military Science filed Critical Research Institute of War of PLA Academy of Military Science
Priority to CN202010568287.3A
Publication of CN111783944A
Legal status: Pending

Classifications

    • G06N3/045 Combinations of networks (G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (same hierarchy as above)
    • G06N3/08 Learning methods (under G06N3/02 Neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention discloses a rule-embedded multi-agent reinforcement learning method and device based on combined training, in which a rule base and reinforcement learning are effectively combined so that game-confrontation problems can be modeled and solved. By introducing an indirect action type, the agent gains a decision of whether to use a rule while exploring its own solution space, which avoids the drawback of always using rules with priority and improves the effectiveness of combining rules with learning. In addition, after an indirect action generated by the multi-agent reinforcement learning model specifies the rule base to be used by the agent, a rule selection model selects the most appropriate rule from the specified rule base. Through this two-stage rule selection mechanism, the influence of invalid rules on the reinforcement learning effect can be effectively reduced. For the problem of training two heterogeneous models, a combined training method is provided: the two models are obtained through repeated iterative training, realizing fused training of the heterogeneous models.

Description

Rule embedded multi-agent reinforcement learning method and device based on combination training
Technical Field
The invention relates to the technical field of multi-agent reinforcement learning, in particular to a rule embedded multi-agent reinforcement learning method and device based on combination training.
Background
Since AlphaGo defeated top human Go players in 2016, reinforcement learning techniques have attracted considerable attention and made outstanding advances in many areas such as simulation, game playing, and recommendation systems. The complexity of real-world problems has pushed reinforcement learning from the single-agent domain to the multi-agent domain. A multi-agent system is not a simple superposition of multiple single agents; it forms an integrated capability through competition and cooperation among the agents. On the one hand, increasing the number of agents greatly expands the multi-agent state-action space, which makes multi-agent problems much harder to solve. On the other hand, multi-agent reinforcement learning requires not only a large number of trial-and-error interactions between each agent and the environment but also competitive cooperation among the agents, which makes it very difficult. The traditional learn-from-scratch approach leads agents to explore a huge policy space randomly, producing a large amount of invalid exploration with low efficiency; the agents have no initial experience and the training period is long. Training also requires a large amount of data, and the desired training effect is hard to achieve. As a result, reinforcement learning methods fall into local solutions on many practical problems, and the results are not ideal.
For most multi-agent learning problems, there is usually some previously accumulated experience or knowledge. If this experience is integrated into the learning process in some way to guide the agents' exploration, many invalid explorations can be avoided, so that the agents train faster and achieve better results. A typical example is the 2018 StarCraft agent competition, in which a rule-based multi-agent system won the championship thanks to the guidance of Korean professional e-sports players. Traditional approaches based on human knowledge and experience, such as expert systems, effectively organize a large amount of experience and knowledge for analyzing and solving practical problems. However, human knowledge and experience are vast, and it is difficult to build a complete, highly adaptive, and highly intelligent expert system.
Approaches based on human experience and knowledge and approaches based on data-driven learning each have advantages and disadvantages, and combining the two is an effective means of solving multi-agent problems efficiently. Current research on combining knowledge with learning is still at an early stage. The most common approach expresses experience and knowledge in the form of rules and couples them directly with learning: whenever a rule matches, the rule is used with priority; when no rule matches, reinforcement learning is used for exploration. The disadvantage of this approach is that rules are always used preferentially and invalid rules cannot be filtered out.
Multi-agent reinforcement learning therefore suffers from a huge exploration space, long training periods, and limited training effect. Embedding human knowledge and experience into the learning process in the form of rules is an effective means of improving multi-agent learning ability. However, existing generic rule-coupled learning methods are only suitable for scenarios where the rules are known to be effective and cannot discriminate the effectiveness of rules. Invalid rules do not help the multi-agent learning effect and may even weaken it.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a rule embedded multi-agent reinforcement learning method and device based on combination training.
Specifically, the embodiment of the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a rule-embedded multi-agent reinforcement learning method based on combination training, including:
establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model;
establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a deep reinforcement learning model and a rule analysis module; the rule base module is used for selecting, according to the indirect action, the rule base corresponding to the indirect action from the multiple classes of rule bases of the corresponding agent; the deep reinforcement learning model is used for determining a matched rule from the selected rule base; and the rule analysis module is used for analyzing the rule;
and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fixing the rule selection reinforcement learning model during the training of the multi-agent reinforcement learning model, fixing the multi-agent reinforcement learning model during the training of the rule selection reinforcement learning model, and completing the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
Further, performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fixing the rule selection reinforcement learning model during the training of the multi-agent reinforcement learning model, fixing the multi-agent reinforcement learning model during the training of the rule selection reinforcement learning model, and completing the combined training through repeated iteration specifically comprises:
randomly initializing a rule selection reinforcement learning model, fixing the rule selection reinforcement learning model to train a multi-agent reinforcement learning model, after the multi-agent reinforcement learning model is trained, fixing the multi-agent reinforcement learning model to train the rule selection reinforcement learning model, and repeatedly performing iterative training for K times; for the training of the multi-agent reinforcement learning model, the multi-agent tasks are simulated and operated for L times, each task operates T steps at the maximum, and each step generates a corresponding sample through interaction with the environment and is placed in a sample library; when the multi-agent reinforcement learning model is trained, a preset number of samples are required to be taken out from a sample library; and for the training of the rule selection reinforcement learning model, the multi-agent tasks are simulated and operated for L 'times, and T' rounds are executed at most for each task.
Further, the multi-agent reinforcement learning model adopts a centralized training and distributed execution architecture;
the behavior decision of each agent is determined by a neural network, and the hybrid network is used for coordinating the behaviors among the agents; during training, each agent uses respective observation information, the hybrid network uses global state information to realize central training, and each agent only makes a decision according to the own observation state during execution to realize distributed execution.
Further, the method further comprises:
expressing the multi-agent problem by a quintuple (A, S, U, R, O), wherein A represents the set of agents, S represents the environment state of the multi-agent system, U represents the action set, R represents the return value set, and O represents the observation states of the agents; at each time step, each agent a ∈ A selects an action u_a ∈ U; after the actions of all agents are executed, the environment of the agents jumps from one state s to another state s'; after executing its action, each agent obtains the return value r_a ∈ R of the current action from the environment.
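The five-tuple above can be mirrored by a minimal multi-agent environment interface. The following Python sketch is illustrative only; the class and method names are assumptions, not part of the invention.

```python
from dataclasses import dataclass
from typing import Dict, List, Any

@dataclass
class MultiAgentStep:
    """One transition under the quintuple (A, S, U, R, O)."""
    observations: Dict[str, List[float]]  # next observation o_a for each agent a
    rewards: Dict[str, float]             # return value r_a for each agent a
    done: bool                            # whether the environment reached a terminal state

class MultiAgentEnv:
    """Hypothetical environment wrapper used in the later sketches."""

    def reset(self) -> Dict[str, List[float]]:
        """Start a new task and return the initial observation o_a of every agent."""
        raise NotImplementedError

    def step(self, actions: Dict[str, Any]) -> MultiAgentStep:
        """Apply the joint action {a: u_a}, move the state s to s',
        and return per-agent observations and return values."""
        raise NotImplementedError
```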
Further, the deep reinforcement learning model uses a DQN algorithm to complete the rule selection process. The model calculates an action according to the current state, and the action specifies a specific rule in a rule base. Through interaction with the multi-agent environment, the model generates a preset number of samples (o, v, o', r) and stores them in the sample library, where o is the current state, v is the action calculated from the current state, o' is the next state after the action is executed, and r is the return value after the action is executed. During training, a specified number of samples are taken from the sample library; the current state-action evaluation value is calculated with the current value network, the maximum possible evaluation value of the next state-action is calculated with the target value network, the gradient of the DQN error function is calculated from these two evaluation values, and the current value network is trained and updated.
Further, the method further comprises:
determining a neural network structure, a hybrid network and a network structure of a rule selection reinforcement learning model of each agent according to the multi-agent task, and setting a rule base set of each agent;
starting a combined training algorithm, and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model;
in each time step, the intelligent agent obtains respective observation states from a multi-agent environment, and the next action is obtained through calculation based on an intelligent agent neural network structure model;
if the action is a direct action, the multi-agent environment is directly acted on; if the action is an indirect action, the indirect action is transmitted to the rule base module;
the rule base module determines a rule base which should be used by the intelligent agent currently according to the indirect action, and transmits relevant parameters of the rule base to the deep reinforcement learning model, so that the deep reinforcement learning model selects a matched rule from the rule base according to the current observation situation;
the rule analysis module analyzes the matched rule to obtain a direct action and acts on the environment according to the matched rule and a rule base which should be used currently;
if the multi-agent environment reaches a termination state, the task is ended; and if not, generating the next state by the multi-agent environment, returning to the step of combined training, and repeatedly executing.
Furthermore, the agent neural network structure model adopts a recurrent neural network structure, the input and output layers are fully connected networks, and the middle hidden layer is a recurrent neural network layer; the hybrid neural network structure model adopts a multilayer perceptron neural network structure, which realizes nonlinear integration of the multi-agent outputs and thereby the cooperation of the multi-agents.
In a second aspect, an embodiment of the present invention further provides a rule-embedded multi-agent reinforcement learning device based on combination training, including:
the first establishing module is used for establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model;
the second establishing module is used for establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a deep reinforcement learning model and a rule analysis module; the rule base module is used for selecting, according to the indirect action, the rule base corresponding to the indirect action from the multiple classes of rule bases of the corresponding agent; the deep reinforcement learning model is used for determining a matched rule from the selected rule base; and the rule analysis module is used for analyzing the rule;
and a third establishing module, configured to perform combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fix the rule selection reinforcement learning model during training the multi-agent reinforcement learning model, fix the multi-agent reinforcement learning model during training the rule selection reinforcement learning model, and complete combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the rule-embedded multi-agent reinforcement learning method based on combination training according to the first aspect.
In a fourth aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the rule-embedded multi-agent reinforcement learning method based on combination training according to the first aspect.
According to the technical scheme, the rule-embedded multi-agent reinforcement learning method and device based on combined training provided by the embodiment of the invention establish a rule-fused multi-agent reinforcement learning model and a rule selection reinforcement learning model and train the two models jointly. The rule base and reinforcement learning are thus effectively combined, enabling modeling and solving of game-confrontation problems. By introducing the indirect action type, the agent gains a decision of whether to use a rule while exploring its own solution space, which avoids the drawback of always using rules preferentially and improves the effectiveness of combining rules with learning. In addition, each agent corresponds to multiple types of rule bases; after the indirect action generated by the multi-agent reinforcement learning model specifies the rule base to be used, the rule selection model selects the most appropriate rule from that rule base. This two-stage rule selection mechanism effectively reduces the influence of invalid rules on the reinforcement learning effect. For the problem of training the two heterogeneous models, multi-agent reinforcement learning and rule selection reinforcement learning, the embodiment of the invention provides a combined training method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a rule-embedded multi-agent reinforcement learning method based on combination training according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a multi-agent reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-agent reinforcement learning model neural network architecture according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a rule selection reinforcement learning model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an application scenario provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of an embedded rule multi-agent model application schema provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a rule-embedded multi-agent reinforcement learning device based on combination training according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Approaches based on human experience and knowledge and approaches based on data-driven learning each have advantages and disadvantages, and combining the two is an effective means of solving multi-agent problems efficiently. Current research on combining knowledge with learning is still at an early stage. The most common approach expresses experience and knowledge in the form of rules and couples them directly with learning: whenever a rule matches, the rule is used with priority; when no rule matches, reinforcement learning is used for exploration. The disadvantage of this approach is that rules are always used preferentially and invalid rules cannot be filtered out. In practical problems there is often a large accumulation of rules, and questions arise about which rules are valid and when a valid rule should be used. Therefore, the effective combination of rules and learning is an important issue for multi-agent reinforcement learning applications.
Multi-agent reinforcement learning suffers from a huge exploration space, long training periods, and limited training effect. Embedding human knowledge and experience into the learning process in the form of rules is an effective means of improving multi-agent learning ability. Generic rule-coupled learning methods, however, are only suitable for scenarios where the rules are known to be effective and cannot discriminate rule effectiveness; invalid rules may even weaken the learning effect. To solve this problem, the embodiment of the invention provides a new method for combining rules with multi-agent reinforcement learning, in which appropriate rules are selected as needed during learning. While modeling the problem to be solved with multi-agent reinforcement learning, the embodiment of the invention also establishes a rule selection reinforcement learning model for the corresponding rule system, and fuses the training of the two heterogeneous models through combined training. In actual operation, whether reinforcement learning exploration or a rule is used at each step is determined by the real-time situation, achieving flexible rule embedding. The rule-embedded multi-agent reinforcement learning method based on combined training provided by the invention is explained below through specific embodiments.
Fig. 1 is a flowchart illustrating a rule-embedded multi-agent reinforcement learning method based on combination training according to an embodiment of the present invention, and as shown in fig. 1, the rule-embedded multi-agent reinforcement learning method based on combination training according to the embodiment of the present invention specifically includes the following contents:
step 101: establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model;
step 102: establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a deep reinforcement learning model and a rule analysis module; the rule base module is used for selecting, according to the indirect action, the rule base corresponding to the indirect action from the multiple classes of rule bases of the corresponding agent; the deep reinforcement learning model is used for determining a matched rule from the selected rule base; and the rule analysis module is used for analyzing the rule;
step 103: and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fixing the rule selection reinforcement learning model during the training of the multi-agent reinforcement learning model, fixing the multi-agent reinforcement learning model during the training of the rule selection reinforcement learning model, and completing the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
This embodiment provides a method for combining rules with multi-agent reinforcement learning, which mainly comprises the design of a rule-fused multi-agent reinforcement learning model, the design of a rule selection reinforcement learning model, and the design of a combined training method. In this embodiment, the multi-agent reinforcement learning model is established as follows:
a multi-agent problem may be represented by a quintuple (A, S, U, R, O), a ∈ A representing a set of multi-agent sets, S ∈ S representing the environmental conditions in which the multi-agents are located, at each time step, each agent a ∈ A selects an action Ua∈ U. after all agent actions are executed, the environment jumps from one state s to another state s'. after each agent has executed an action, it obtains the reward value r of the current action from the environmenta∈ R. rule-fused Multi-agent reinforcement learning model As shown in FIG. 2A Multi-agent Environment is used to receive behavior actions of Multi-agents in real-time and feed environmental status back to Multi-agent reinforcement learning model A ∈ A receives respective observed states oaCalculating the next action ua. The action space of each agent contains two parts, one part is the agent action behavior defined according to the environment and can be directly executed in the environment, called direct action ǔ, and the other part indicates which rule base the agent uses, called indirect action
Figure BDA0002548318910000101
For the next action g chosen by the intelligent body, the type of g is first determined, if it is a direct action, it can act directly on the environment. If the ug is indirect, a rule selection module selects a proper rule from a corresponding rule base to analyze the rule to form a direct action.
As shown in FIG. 3, the multi-agent reinforcement learning model involves two types of neural networks. One type is the agent network, which uses a recurrent neural network (RNN) architecture: the input and output layers are fully connected networks and the intermediate hidden layer is a recurrent layer. The number of neurons in each layer and the number of hidden layers are set according to the specific problem. In an RNN, the output of a hidden-layer neuron is also fed back to itself at the next time step, i.e. the input of the i-th layer at time t includes the output of the (i-1)-th layer at time t and the output of the i-th layer at time t-1, which allows time-dependent aspects of the situation to be processed. The hybrid network adopts a multilayer perceptron structure, with the number of hidden layers and neurons set according to the practical problem. Through the multilayer perceptron structure, the hybrid network realizes a nonlinear integration of the multi-agent outputs and thus the cooperation of the agents.
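As a minimal sketch only, the agent network described above could look like the following in PyTorch; the layer sizes, the use of a GRU cell, and all names are assumptions made for illustration, not the patent's specification.

```python
import torch
import torch.nn as nn

class AgentRNN(nn.Module):
    """Per-agent network: fully connected input layer -> recurrent hidden layer
    -> fully connected output layer. The output covers both direct actions and
    indirect actions (one entry per rule base), matching the two-part action space."""

    def __init__(self, obs_dim: int, hidden_dim: int, n_direct: int, n_rule_bases: int):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # recurrent hidden layer
        self.fc_out = nn.Linear(hidden_dim, n_direct + n_rule_bases)

    def forward(self, obs: torch.Tensor, h_prev: torch.Tensor = None):
        x = torch.relu(self.fc_in(obs))
        h = self.rnn(x, h_prev)        # carries hidden state across time steps
        q_values = self.fc_out(h)      # evaluation value for every direct/indirect action
        return q_values, h
```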
In this embodiment, for the rule selection reinforcement learning model, the specific establishment process is as follows:
and the rule selection reinforcement learning model is used for the intelligent agent to select reasonable rules from the corresponding rule base according to the current situation, and the reasonable rules are analyzed to generate direct actions to drive the next action. The rule selection reinforcement learning model comprises a rule base module, a deep reinforcement learning (DQN) model and a rule parsing module, as shown in FIG. 4.
The rule base module is used for storing the rule corresponding to the intelligent agent. In a real-world problem, each agent corresponds to a large number of rules. These rules are organized into different types of rule bases. The indirect action output by the agent neural network indicates which rule base to use. The function of selecting the appropriate rule from the specified rule base is performed by the DQN model, depending on the real-time status.
The action output of the DQN model specifies the rule that should be selected under the current situation. And the rule analysis module analyzes the direct action according to the action output of the DQN and the rules in the rule base and acts on the multi-agent environment.
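A minimal data-structure sketch of the rule base module and the rule parsing module follows; the Rule and RuleBase classes and the parse_rule function are illustrative assumptions rather than the patent's data format.

```python
from typing import Callable, List, Any

class Rule:
    """A rule maps the current observation to a direct action (or no action)."""
    def __init__(self, name: str, condition: Callable[[Any], bool],
                 action_fn: Callable[[Any], Any]):
        self.name = name
        self.condition = condition    # when the rule is applicable
        self.action_fn = action_fn    # how to compute the direct action

class RuleBase:
    """A typed collection of rules; the DQN action is an index into this list."""
    def __init__(self, rules: List[Rule]):
        self.rules = rules

def parse_rule(rule_base: RuleBase, rule_index: int, observation: Any):
    """Rule parsing module: turn the selected rule into a direct action."""
    rule = rule_base.rules[rule_index]
    if rule.condition(observation):
        return rule.action_fn(observation)
    return None   # rule not applicable in the current situation
```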
In this embodiment, the multi-agent reinforcement learning model and the rule selection model employ two different types of reinforcement learning methods. The training of both models involves a hybrid network, a network of agents, and a rule selection network for each agent. When training is performed based on a multi-agent environment, if all networks are trained together, it is difficult to achieve the effect due to the fact that the training scale is too large. The embodiment provides a combined training method for different types of reinforcement learning model fusion, and the combined effect is realized through iterative training of two models.
The method provided by the present embodiment will be described in detail with reference to the example shown in fig. 5.
As shown in fig. 5, a 3v3 red-versus-blue multi-drone air combat is taken as an example, in which the red drones (dark drones in fig. 5) are trained against the built-in blue drones (light drones in fig. 5). Each red drone is an agent that must determine, at every moment, its next heading, whether to launch a missile, its radar frequency point, and its interference frequency point setting. The detection range of a drone is a sector region, within a certain distance along the nose direction, of the circular area of a certain radius centered on the drone. Each drone corresponds to two rule bases.
The rule base 1 includes two rules:
1. When the radar finds a target, the heading of the next step is calculated according to the position of the nearest enemy.
2. If an enemy target enters weapon range and is not locked by any friendly unit's missile, a missile launching action is executed (a missile is selected at random); otherwise, no missile attack is carried out.
The rule base 2 includes two rules:
1. Long-range missiles are used to attack distant enemies, but short-range missiles are used preferentially for missile attacks when the range allows.
2. The drone's radar frequency point is selected at random.
In this example, there are 4 direct action components: heading, whether to launch a missile, radar frequency point, and interference frequency point. The heading is a discrete value from 0 to 359. Whether to launch a missile takes a discrete value from 0 to 20: 0 means no launch, 1-10 mean launching a long-range missile, and 11-20 mean launching a short-range missile. The radar frequency point is a discrete value from 1 to 10, and the interference frequency point is a discrete value from 1 to 11. The closer the radar frequency point is to the enemy's interference frequency point, the smaller the radar detection range. Each drone has 2 indirect actions: selecting rule base 1 or selecting rule base 2. After training with the rule-embedded multi-agent reinforcement learning method, a multi-agent model and a rule selection reinforcement learning model are obtained. Each drone calculates its next action with its own neural network; if the action is direct, the drone can execute it directly. If the action is indirect, the rule base to be used is specified, and the rule selection model selects an appropriate rule from that rule base. After rule parsing, a direct action is obtained that the drone can execute.
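As an illustration of rule base 1 in this example, the two rules could be written as simple functions of a drone's observation. The observation fields (own position, detected enemy positions, weapon range, locked targets) and the dictionary-based action format are assumptions made for the sketch, not the patent's representation.

```python
import math
import random

def _distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def rule1_heading_to_nearest_enemy(obs):
    """Rule 1: if the radar sees a target, head toward the nearest detected enemy."""
    if not obs["detected_enemies"]:
        return None
    own = obs["own_position"]
    nearest = min(obs["detected_enemies"], key=lambda p: _distance(own, p))
    heading = int(math.degrees(math.atan2(nearest[1] - own[1],
                                          nearest[0] - own[0]))) % 360
    return {"heading": heading}

def rule2_fire_if_unlocked(obs):
    """Rule 2: launch a randomly chosen missile at an in-range enemy that no
    friendly missile has locked; otherwise do not attack (action value 0)."""
    own = obs["own_position"]
    for enemy in obs["detected_enemies"]:
        if (_distance(own, enemy) <= obs["weapon_range"]
                and enemy not in obs["locked_targets"]):
            return {"fire_missile": random.randint(1, 20)}
    return {"fire_missile": 0}
```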
According to the technical scheme, the rule embedding multi-agent reinforcement learning method based on combination training provided by the embodiment of the invention establishes a multi-agent reinforcement learning model and a rule selection reinforcement learning model which are combined with rules, and performs combination training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model. Therefore, the rule base and the reinforcement learning are effectively combined, modeling and solving of the game countermeasure problem can be achieved, by introducing the indirect action type, the decision of whether to use the rule is increased while the intelligent agent finds the self solution space, the defect that the rule is used preferentially is overcome, in actual operation, whether the mode of reinforcement learning and searching or the mode of the rule is adopted in each step is determined according to the real-time situation, the purpose of flexible rule embedding is achieved, and the effectiveness of combination of the rule and the learning is improved. In addition, each agent corresponds to a plurality of types of rule bases, and after the indirect action generated by the multi-agent reinforcement learning model specifies the rule base used by the agent, the rule selection model is used for selecting the most appropriate rule from the specified rule base. Through a two-stage rule selection mechanism, the influence of invalid rules on the reinforcement learning effect can be effectively reduced. Aiming at the training problems of two heterogeneous models including multi-agent reinforcement learning and rule selection reinforcement learning, the embodiment of the invention provides a combined training method.
Based on the contents of the foregoing embodiments, in this embodiment, the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model is performed, the rule selection reinforcement learning model is fixed during the training of the multi-agent reinforcement learning model, the multi-agent reinforcement learning model is fixed during the training of the rule selection reinforcement learning model, and the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model is completed through repeated iteration, specifically including:
randomly initializing a rule selection reinforcement learning model, fixing the rule selection reinforcement learning model to train a multi-agent reinforcement learning model, after the multi-agent reinforcement learning model is trained, fixing the multi-agent reinforcement learning model to train the rule selection reinforcement learning model, and repeatedly performing iterative training for K times; for the training of the multi-agent reinforcement learning model, the multi-agent tasks are simulated and operated for L times, each task operates T steps at the maximum, and each step generates a corresponding sample through interaction with the environment and is placed in a sample library; when the multi-agent reinforcement learning model is trained, a preset number of samples are required to be taken out from a sample library; and for the training of the rule selection reinforcement learning model, the multi-agent tasks are simulated and operated for L 'times, and T' rounds are executed at most for each task.
In this embodiment, two different types of reinforcement learning methods are used for the multi-agent reinforcement learning model and the rule selection model. And the training of the two models involves a hybrid network, a network of agents, and a rule selection network corresponding to each agent. When training is performed based on a multi-agent environment, if all networks are trained together, it is difficult to achieve the effect due to the fact that the training scale is too large. The embodiment provides a combined training method for different types of reinforcement learning model fusion, and the combined effect is realized through iterative training of two models.
The embodiment provides a combined training method aiming at the training problems of two heterogeneous models including multi-agent reinforcement learning and rule selection reinforcement learning. The method can effectively reduce the training difficulty and realize the fusion training of the heterogeneous model.
In this embodiment, the combined training algorithm for the multi-agent reinforcement learning model and the rule selection model is shown below. Before the algorithm runs, the rule base of each agent must be specified. First, the rule selection model omega is randomly initialized and fixed in order to train the multi-agent reinforcement learning model phi. After the model phi is trained, phi is fixed in order to train the model omega, and this is iterated K times. For the training of the model phi, the multi-agent task is simulated L times, each run executing at most T steps, and each step generates a corresponding sample through interaction with the environment and puts it into a sample library. During model training, a certain number of samples are taken from the sample library. The model omega is trained in a similar way: L' multi-agent tasks are run, each executing at most T' rounds. Two trained models are produced through this iterative training.
The following is the implementation process of the combined training algorithm:
(The combined training algorithm is shown as an image listing in the original publication.)
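Since the listing is only reproduced as an image in the source, the following is a hedged Python-style sketch of the loop described above (K outer iterations; L episodes of at most T steps for phi; L' episodes of at most T' rounds for omega). The model interfaces (randomly_initialize, train) and the step_fn helper are assumptions, and sampling from the sample libraries uses plain random sampling for brevity.

```python
import random

def combined_training(phi, omega, env, step_fn, K, L, T, L2, T2, batch_size):
    """Sketch of combined training of the multi-agent model phi and the rule
    selection DQN omega; step_fn(env, phi, omega, obs) -> (sample, next_obs, done)
    performs one interaction step with the current (fixed) partner model."""
    omega.randomly_initialize()
    for _ in range(K):                              # K outer iterations
        # --- train phi with omega fixed ---
        phi_samples = []
        for _ in range(L):                          # run L multi-agent tasks
            obs, done = env.reset(), False
            for _ in range(T):                      # at most T steps each
                sample, obs, done = step_fn(env, phi, omega, obs)
                phi_samples.append(sample)          # sample library for phi
                if done:
                    break
        phi.train(random.sample(phi_samples, min(batch_size, len(phi_samples))))

        # --- train omega with phi fixed ---
        omega_samples = []
        for _ in range(L2):                         # run L' multi-agent tasks
            obs, done = env.reset(), False
            for _ in range(T2):                     # at most T' rounds each
                sample, obs, done = step_fn(env, phi, omega, obs)
                omega_samples.append(sample)        # sample library for omega
                if done:
                    break
        omega.train(random.sample(omega_samples, min(batch_size, len(omega_samples))))
    return phi, omega
```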
based on the content of the above embodiment, in this embodiment, the multi-agent reinforcement learning model adopts a centralized training-distributed execution architecture;
the behavior decision of each agent is determined by a neural network, and the hybrid network is used for coordinating the behaviors among the agents; during training, each agent uses respective observation information, the hybrid network uses global state information to realize central training, and each agent only makes a decision according to the own observation state during execution to realize distributed execution.
In this embodiment, the multi-agent reinforcement learning model adopts a centralized training and distributed execution architecture: the behavior decision of each agent is determined by its own neural network, and the hybrid network is used to coordinate the behaviors among the agents. During training, each agent uses its own observation information while the hybrid network uses global state information, realizing centralized training; during execution, each agent makes decisions only from its own observed state, realizing distributed execution. The input of each agent's neural network is its own observed state o, and the output is the state-action evaluation value Q(s, u). The hybrid network takes the evaluation values of all agents as input and computes a global evaluation value Q(u) that is used to guide the training of each agent.
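A minimal sketch of the hybrid (mixing) network implied above, which maps the agents' chosen evaluation values to one global value used for training: the plain multilayer-perceptron mixer below and its sizes are illustrative assumptions (the patent only specifies a multilayer perceptron structure).

```python
import torch
import torch.nn as nn

class MixingMLP(nn.Module):
    """Mixing network: nonlinear integration of the agents' chosen Q values
    into a global value that guides the training of every agent network."""
    def __init__(self, n_agents: int, hidden_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, agent_qs: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), each entry is agent a's Q(o_a, u_a)
        return self.net(agent_qs)      # (batch, 1) global evaluation value

# usage sketch: chosen per-agent evaluation values -> global value for the loss
batch, n_agents = 8, 3
mixer = MixingMLP(n_agents)
agent_qs = torch.randn(batch, n_agents)   # gathered Q(o_a, u_a) per agent
q_tot = mixer(agent_qs)                   # combined with the global reward/target
```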
Based on the content of the foregoing embodiment, in this embodiment, the method further includes:
expressing the multi-agent problem by a quintuple (A, S, U, R, O), wherein A represents the set of agents, S represents the environment state of the multi-agent system, U represents the action set, R represents the return value set, and O represents the observation states of the agents; at each time step, each agent a ∈ A selects an action u_a ∈ U; after the actions of all agents are executed, the environment of the agents jumps from one state s to another state s'; after executing its action, each agent obtains the return value r_a ∈ R of the current action from the environment.
A multi-agent problem may be represented by a quintuple (A, S, U, R, O), where a ∈ A denotes an agent in the set of agents and s ∈ S denotes the environment state in which the agents are located. At each time step, each agent a ∈ A selects an action u_a ∈ U; after all agent actions are executed, the environment jumps from one state s to another state s', and each agent obtains the return value r_a ∈ R of its current action from the environment. The rule-fused multi-agent reinforcement learning model is shown in FIG. 2. The multi-agent environment receives the agents' behavior actions in real time and feeds the environment state back to the multi-agent reinforcement learning model. Each agent a ∈ A receives its own observed state o_a and calculates the next action u_a. The action space of each agent contains two parts: one part consists of the agent's action behaviors defined by the environment, which can be executed directly in the environment and are called direct actions; the other part indicates which rule base the agent uses and is called an indirect action.
For the next action u chosen by the agent, the type of u is first determined; if it is a direct action, it can act directly on the environment. If u is an indirect action, the rule selection module selects an appropriate rule from the corresponding rule base and parses the rule to form a direct action.
The rule base and reinforcement learning are thus effectively combined, enabling modeling and solving of game-confrontation problems. By introducing the indirect action type, the agent gains a decision of whether to use a rule while exploring its own solution space, avoiding the drawback of always using rules preferentially and improving the effectiveness of combining rules with learning. Since each agent corresponds to multiple types of rule bases, once the indirect action generated by the multi-agent reinforcement learning model specifies the rule base to be used, the rule selection reinforcement learning model selects the most appropriate rule from that rule base. This two-stage rule selection mechanism effectively reduces the influence of invalid rules on the reinforcement learning effect.
Based on the content of the above embodiment, in this embodiment, the deep reinforcement learning model completes the rule selection process by using a DQN algorithm, calculates an action according to the current state, the action is used to specify a specific rule in a rule base, and generates a preset number of samples (o, v, o', r) through interaction with a multi-agent environment and stores the samples in the sample base; wherein o is the current state, v is the action calculated according to the current state, o' is the next state after the action is executed, r represents the return value after the action is executed, a specified number of samples are taken from a sample library during the deep reinforcement learning model training, the current state-action evaluation value is calculated according to the current value network, the maximum possible evaluation value of the next state-action is calculated according to the target value network, the gradient of the error function is calculated according to the two evaluation values and the DQN error function, and the current value network is trained and updated.
In this embodiment, as shown in FIG. 4, the DQN model uses the DQN algorithm to perform rule selection. The model calculates an action v ∈ V from the current state o ∈ O, and the action v specifies a specific rule in a rule base. Through a large number of interactions with the multi-agent environment, the DQN model generates a large number of samples (o, v, o', r) and stores them in a sample library, where o' is the next state after the action is executed and r is the return value after the action is executed. During training, the DQN model takes a certain number of samples from the sample library, calculates the current state-action evaluation value Q(o, v) with the current value network, and calculates the maximum possible next state-action evaluation value max_{v'} Q(o', v') with the target value network. The gradient of the DQN error function is calculated from these two evaluation values, and the current value network is trained and updated. The current value network and the target value network have the same neural network structure, a multilayer perceptron, with the specific number of layers and neurons per layer set according to the specific problem. The parameters of the target value network are periodically copied from the current value network and are used together with the current value network to compute the gradient of the DQN error function.
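A hedged sketch of the DQN update described above follows; the discount factor, termination masking, mean-squared-error form of the error function, and all variable names are standard DQN details assumed for the sketch rather than taken from the patent.

```python
import torch
import torch.nn as nn

def dqn_update(current_net: nn.Module, target_net: nn.Module,
               optimizer: torch.optim.Optimizer, batch, gamma: float = 0.99):
    """One training step on a batch of samples (o, v, o', r) from the sample library."""
    o, v, o_next, r, done = batch   # v: long tensor of selected rule indices; done: float mask
    q_current = current_net(o).gather(1, v.unsqueeze(1)).squeeze(1)    # Q(o, v)
    with torch.no_grad():
        q_next_max = target_net(o_next).max(dim=1).values              # max_{v'} Q(o', v')
        target = r + gamma * (1.0 - done) * q_next_max
    loss = nn.functional.mse_loss(q_current, target)   # DQN error function
    optimizer.zero_grad()
    loss.backward()                                    # gradient of the error function
    optimizer.step()                                   # update the current value network
    return loss.item()

# periodically copy parameters to the target value network:
# target_net.load_state_dict(current_net.state_dict())
```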
Based on the content of the foregoing embodiment, in this embodiment, the method further includes:
determining a neural network structure, a hybrid network and a network structure of a rule selection reinforcement learning model of each agent according to the multi-agent task, and setting a rule base set of each agent;
starting a combined training algorithm, and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model;
in each time step, the intelligent agent obtains respective observation states from a multi-agent environment, and the next action is obtained through calculation based on an intelligent agent neural network structure model;
if the action is a direct action, the multi-agent environment is directly acted on; if the action is an indirect action, the indirect action is transmitted to the rule base module;
the rule base module determines a rule base which should be used by the intelligent agent currently according to the indirect action, and transmits relevant parameters of the rule base to the deep reinforcement learning model, so that the deep reinforcement learning model selects a matched rule from the rule base according to the current observation situation;
the rule analysis module analyzes the matched rule to obtain a direct action and acts on the environment according to the matched rule and a rule base which should be used currently;
if the multi-agent environment reaches a termination state, the task is ended; and if not, generating the next state by the multi-agent environment, returning to the step of combined training, and repeatedly executing.
In the present embodiment, as shown in fig. 6:
firstly, determining a neural network structure, a hybrid network and a network structure of a rule selection model of each agent according to a multi-agent task, and setting a rule base set of each agent;
secondly, setting parameters such as K, L, T, L', and T', starting the combined training algorithm, and performing combined training on the multi-agent reinforcement learning model phi and the model omega;
thirdly, using the trained model phi and the trained model omega in the manner shown, which specifically comprises the following steps (a code sketch of this loop is given after the steps):
(1) loading a multi-Agent model phi into Agent 1, Agent 2, … and Agent N; loading the model omega to the DQN model.
(2) At each time step, each agent obtains its own observed state o_a from the multi-agent environment and calculates the next action u'_a with its agent neural network.
(3) If u'_a is a direct action, it acts directly on the multi-agent environment; if u'_a is an indirect action, u'_a is passed to the rule base module.
(4) The rule base module determines, according to the indirect action u'_a, the rule base i that the agent should currently use, and passes the relevant parameters of rule base i to the DQN model. Based on the current observed situation, the DQN model decides which rule in rule base i should be selected, i.e. the action v.
(5) And the rule analysis module analyzes the corresponding rule according to the action v and the rule base i which needs to be used currently to obtain a direct action and acts on the environment.
(6) If the multi-agent environment reaches a termination state, the task is ended; if not, the multi-agent environment generates the next state and returns to the step (2) to repeat the operation.
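Steps (1) through (6) can be condensed into a per-step loop like the following; the interfaces (net.act, dqn.select_rule, the parse_rule helper, and the environment step format) are assumptions carried over from the earlier sketches, not the patent's API.

```python
def run_episode(env, agent_nets, dqn, rule_bases, n_direct):
    """Distributed execution with the trained models phi (agent_nets) and omega (dqn).

    rule_bases[a][i] is rule base i of agent a; agent-network outputs below
    n_direct are direct actions, the rest index a rule base (indirect actions).
    """
    obs, done = env.reset(), False
    hidden = {a: None for a in agent_nets}                    # step (1): models loaded
    while not done:
        joint_action = {}
        for a, net in agent_nets.items():
            action, hidden[a] = net.act(obs[a], hidden[a])    # step (2): next action u'_a
            if action < n_direct:
                joint_action[a] = action                      # step (3): direct action
            else:
                base = rule_bases[a][action - n_direct]       # step (4): rule base i
                rule_index = dqn.select_rule(obs[a], base)    # step (4): DQN picks rule v
                joint_action[a] = parse_rule(base, rule_index, obs[a])  # step (5): parse
        result = env.step(joint_action)                       # step (6): advance or terminate
        obs, done = result.observations, result.done
```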
Based on the content of the above embodiment, in this embodiment, the agent neural network structure model adopts a recurrent neural network structure, the input and output layers are fully connected networks, and the middle hidden layer is a recurrent neural network layer; the hybrid neural network structure model adopts a multilayer perceptron neural network structure, which realizes nonlinear integration of the multi-agent outputs and thereby the cooperation of the multi-agents.
As shown in FIG. 3, the multi-agent reinforcement learning model involves two types of neural networks. One type is the agent network, which uses a recurrent neural network (RNN) architecture: the input and output layers are fully connected networks and the intermediate hidden layer is a recurrent layer. The number of neurons in each layer and the number of hidden layers are set according to the specific problem. In an RNN, the output of a hidden-layer neuron is also fed back to itself at the next time step, i.e. the input of the i-th layer at time t includes the output of the (i-1)-th layer at time t and the output of the i-th layer at time t-1, which allows time-dependent aspects of the situation to be processed. The hybrid network adopts a multilayer perceptron structure, with the number of hidden layers and neurons set according to the practical problem. Through the multilayer perceptron structure, the hybrid network realizes a nonlinear integration of the multi-agent outputs and thus the cooperation of the agents.
Fig. 7 is a schematic structural diagram of a rule-embedded multi-agent reinforcement learning device based on combination training according to another embodiment of the present invention, and as shown in fig. 7, the rule-embedded multi-agent reinforcement learning device based on combination training includes: a first building module 21, a second building module 22 and a third building module 23, wherein:
the first establishing module 21 is used for establishing a multi-agent reinforcement learning model of a fusion rule; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model;
the second establishing module 22 is used for establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a deep reinforcement learning model and a rule analysis module; the rule base module is used for selecting, according to the indirect action, the rule base corresponding to the indirect action from the multiple classes of rule bases of the corresponding agent; the deep reinforcement learning model is used for determining a matched rule from the selected rule base; and the rule analysis module is used for analyzing the rule;
a third establishing module 23, configured to perform combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fix the rule selection reinforcement learning model during training the multi-agent reinforcement learning model, fix the multi-agent reinforcement learning model during training the rule selection reinforcement learning model, and complete combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
Since the rule-embedded multi-agent reinforcement learning device based on combination training provided by the embodiment can be used for executing the rule-embedded multi-agent reinforcement learning method based on combination training provided by the above embodiment, the working principle and the beneficial effects are similar, and the detailed description is omitted here.
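As an illustration of the alternating scheme carried out by the third establishing module 23, the loop below fixes one model while the other is trained and repeats the two phases for K iterations. The trainer objects and their method names (train, freeze, unfreeze, initialize_randomly) are hypothetical and introduced only for this sketch.

```python
def combined_training(marl_trainer, rule_trainer, K):
    """Alternating (combined) training of the two heterogeneous models.

    marl_trainer trains the multi-agent reinforcement learning model fusing rules;
    rule_trainer trains the rule selection reinforcement learning model.
    The method names used here are illustrative assumptions, not part of the embodiment.
    """
    rule_trainer.initialize_randomly()    # randomly initialize the rule selection model
    for _ in range(K):                    # repeated iterative training, K times
        rule_trainer.freeze()             # fix the rule selection reinforcement learning model ...
        marl_trainer.train()              # ... while the multi-agent reinforcement learning model is trained
        rule_trainer.unfreeze()

        marl_trainer.freeze()             # then fix the multi-agent reinforcement learning model ...
        rule_trainer.train()              # ... while the rule selection reinforcement learning model is trained
        marl_trainer.unfreeze()
```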
Based on the same inventive concept, another embodiment of the present invention provides an electronic device, which specifically includes the following components, with reference to fig. 8: a processor 301, a memory 302, a communication interface 303, and a communication bus 304;
the processor 301, the memory 302 and the communication interface 303 complete mutual communication through the communication bus 304; the communication interface 303 is used for realizing information transmission between the devices;
the processor 301 is used to call the computer program in the memory 302, and the processor executes the computer program to implement all the steps of the above-mentioned rule-embedded multi-agent reinforcement learning method based on combination training, for example, the processor executes the computer program to implement the following steps: establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model; establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a depth reinforcement learning model and a rule analysis module; the rule base module is used for selecting a rule base corresponding to the indirect action from multi-class rule bases corresponding to the corresponding intelligent agent according to the indirect action; the deep reinforcement learning model is used for determining matched rules from a selected rule base; the rule analysis module is used for analyzing the rule; and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fixing the rule selection reinforcement learning model during the training of the multi-agent reinforcement learning model, fixing the multi-agent reinforcement learning model during the training of the rule selection reinforcement learning model, and completing the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
Based on the same inventive concept, yet another embodiment of the present invention provides a non-transitory computer-readable storage medium, having stored thereon a computer program, which when executed by a processor implements all the steps of the above-mentioned combined training-based rule-embedded multi-agent reinforcement learning method, for example, the processor implements the following steps when executing the computer program: establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model; establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a depth reinforcement learning model and a rule analysis module; the rule base module is used for selecting a rule base corresponding to the indirect action from multi-class rule bases corresponding to the corresponding intelligent agent according to the indirect action; the deep reinforcement learning model is used for determining matched rules from a selected rule base; the rule analysis module is used for analyzing the rule; and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fixing the rule selection reinforcement learning model during the training of the multi-agent reinforcement learning model, fixing the multi-agent reinforcement learning model during the training of the rule selection reinforcement learning model, and completing the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the rule-embedded multi-agent reinforcement learning method based on combination training according to various embodiments or some parts of embodiments.
In addition, in the present invention, terms such as "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Moreover, in the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises that element.
Furthermore, in the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A rule-embedded multi-agent reinforcement learning method based on combination training is characterized by comprising the following steps:
establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model;
establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a depth reinforcement learning model and a rule analysis module; the rule base module is used for selecting a rule base corresponding to the indirect action from multi-class rule bases corresponding to the corresponding intelligent agent according to the indirect action; the deep reinforcement learning model is used for determining matched rules from a selected rule base; the rule analysis module is used for analyzing the rule;
and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fixing the rule selection reinforcement learning model during the training of the multi-agent reinforcement learning model, fixing the multi-agent reinforcement learning model during the training of the rule selection reinforcement learning model, and completing the combined training of the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
2. The combination-training-based rule-embedded multi-agent reinforcement learning method according to claim 1, wherein the multi-agent reinforcement learning model and the rule-selection reinforcement learning model are subjected to combination training, the rule-selection reinforcement learning model is fixed during training of the multi-agent reinforcement learning model, the multi-agent reinforcement learning model is fixed during training of the rule-selection reinforcement learning model, and the combination training of the multi-agent reinforcement learning model and the rule-selection reinforcement learning model is completed by repeated iteration, specifically comprising:
randomly initializing a rule selection reinforcement learning model, fixing the rule selection reinforcement learning model to train a multi-agent reinforcement learning model, after the multi-agent reinforcement learning model is trained, fixing the multi-agent reinforcement learning model to train the rule selection reinforcement learning model, and repeatedly performing iterative training for K times; for the training of the multi-agent reinforcement learning model, the multi-agent tasks are simulated and operated for L times, each task operates T steps at the maximum, and each step generates a corresponding sample through interaction with the environment and is placed in a sample library; when the multi-agent reinforcement learning model is trained, a preset number of samples are required to be taken out from a sample library; and for the training of the rule selection reinforcement learning model, the multi-agent tasks are simulated and operated for L' times, and at most T' rounds are executed for each task.
3. The combination training-based rule-embedded multi-agent reinforcement learning method of claim 1, wherein the multi-agent reinforcement learning model employs a centralized training-distributed execution architecture;
the behavior decision of each agent is determined by a neural network, and the hybrid network is used for coordinating the behaviors among the agents; during training, each agent uses respective observation information, the hybrid network uses global state information to realize central training, and each agent only makes a decision according to the own observation state during execution to realize distributed execution.
4. The combination training-based rule-embedded multi-agent reinforcement learning method of claim 1, further comprising:
expressing a multi-agent problem by a quintuple (A, S, U, R, O), wherein A represents the set of agents, S represents the environment state of the multi-agent system, U represents the action set, R represents the return value set, and O represents the observation state of each agent; at each time step, each agent a ∈ A selects an action u_a ∈ U; after the actions of all agents are executed, the environment of the agents jumps from one state s to another state s', and after executing its action each agent obtains from the environment a return value r_a ∈ R for the current action.
5. The rule-embedded multi-agent reinforcement learning method based on combination training as claimed in claim 1, wherein the deep reinforcement learning model adopts the DQN algorithm to complete the rule selection process; the deep reinforcement learning model calculates an action according to the current state, and the action is used to specify a specific rule in the rule base; the deep reinforcement learning model generates a preset number of samples (o, v, o', r) through interaction with the multi-agent environment and stores the samples in the sample library, wherein o is the current state, v is the action calculated from the current state, o' is the next state after the action is executed, and r represents the return value obtained after the action is executed; during training of the deep reinforcement learning model, a specified number of samples are taken from the sample library, the current state-action evaluation value is calculated with the current value network, the maximum possible evaluation value of the next state-action pair is calculated with the target value network, the gradient of the DQN error function is calculated from these two evaluation values, and the current value network is trained and updated accordingly.
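As a concrete illustration of the rule-selection update described in this claim, the following is a minimal sketch in PyTorch of one DQN training step over samples (o, v, o', r) drawn from the sample library, using a current value network and a target value network. The network class, batch layout and discount factor gamma are assumptions made for this example rather than details fixed by the claim.

```python
import torch
import torch.nn as nn

class RuleQNet(nn.Module):
    """Value network mapping the observed state to one evaluation value per rule in the selected rule base."""
    def __init__(self, obs_dim, n_rules, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_rules),
        )

    def forward(self, obs):
        return self.net(obs)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN update on a batch of (o, v, o', r) samples taken from the sample library."""
    o, v, o_next, r = batch                                     # states, chosen rule indices, next states, returns
    q_current = q_net(o).gather(1, v.unsqueeze(1)).squeeze(1)   # evaluation value of the current state-action
    with torch.no_grad():
        q_next_max = target_net(o_next).max(dim=1).values       # max evaluation value of the next state-action
    target = r + gamma * q_next_max                             # DQN target value
    loss = nn.functional.mse_loss(q_current, target)            # DQN error function
    optimizer.zero_grad()
    loss.backward()                                             # gradient of the error function
    optimizer.step()                                            # train and update the current value network
    return loss.item()
```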
6. The combination training-based rule-embedded multi-agent reinforcement learning method of claim 2, further comprising:
determining a neural network structure, a hybrid network and a network structure of a rule selection reinforcement learning model of each agent according to the multi-agent task, and setting a rule base set of each agent;
starting a combined training algorithm, and performing combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model;
in each time step, the intelligent agent obtains respective observation states from a multi-agent environment, and the next action is obtained through calculation based on an intelligent agent neural network structure model;
if the action is a direct action, the multi-agent environment is directly acted on; if the action is an indirect action, the indirect action is transmitted to the rule base module;
the rule base module determines a rule base which should be used by the intelligent agent currently according to the indirect action, and transmits relevant parameters of the rule base to the deep reinforcement learning model, so that the deep reinforcement learning model selects a matched rule from the rule base according to the current observation situation;
the rule analysis module, according to the rule base that should currently be used and the matched rule, parses the matched rule to obtain a direct action and applies it to the environment;
if the multi-agent environment reaches a termination state, the task is ended; and if not, generating the next state by the multi-agent environment, returning to the step of combined training, and repeatedly executing.
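To illustrate the per-time-step flow of this claim, the sketch below walks through the direct/indirect action branch: a direct action is applied to the environment as-is, while an indirect action selects a rule base, the rule-selection model picks a matched rule, and the rule is parsed into a direct action. The environment interface, the act and select methods and the parse_rule helper are hypothetical names introduced only for illustration.

```python
def run_episode(env, agent_nets, rule_bases, rule_selector, parse_rule, max_steps, n_direct_actions):
    """One simulated task run of the combined model: agents act directly or delegate to rule selection."""
    observations = env.reset()
    hidden = [None] * len(agent_nets)
    for _ in range(max_steps):
        joint_action = []
        for i, net in enumerate(agent_nets):
            action, hidden[i] = net.act(observations[i], hidden[i])       # next action from the agent network
            if action < n_direct_actions:
                direct = action                                           # direct action: applied to the environment as-is
            else:
                rule_base = rule_bases[i][action - n_direct_actions]      # indirect action selects a rule base
                rule = rule_selector.select(observations[i], rule_base)   # rule-selection model picks the matched rule
                direct = parse_rule(rule, observations[i])                # rule analysis yields a direct action
            joint_action.append(direct)
        observations, rewards, done = env.step(joint_action)
        if done:                                                          # termination state reached: the task ends
            break
```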
7. The combination training-based rule-embedded multi-agent reinforcement learning method of claim 1, wherein the agent neural network structure model adopts a recurrent neural network structure, the input and output layers are fully-connected networks, and the intermediate hidden layer is a recurrent neural network layer; the hybrid neural network structure model adopts a multilayer perceptron neural network structure, realizes nonlinear integration of multi-agent output through the multilayer perceptron neural network structure, and realizes cooperation of the multi-agents.
8. A rule-embedded multi-agent reinforcement learning device based on combination training, comprising:
the first establishing module is used for establishing a multi-agent reinforcement learning model fusing rules; the multi-agent reinforcement learning model comprises an agent neural network structure model corresponding to each agent and a mixed neural network structure model corresponding to all agents; each intelligent agent neural network structure model is used for receiving respective observation states and outputting next actions of the intelligent agent according to the observation states, wherein the actions comprise direct actions or indirect actions; wherein a direct action represents an action that can be directly performed in a multi-agent environment; the indirect action represents that matched rules need to be selected from a rule base corresponding to the indirect action, and the rules are analyzed to obtain direct actions; the hybrid neural network structure model is used for receiving the actions output by all the intelligent neural network structure models and outputting a global action for guiding the training of each intelligent neural network structure model;
the second establishing module is used for establishing a rule selection reinforcement learning model; the rule selection reinforcement learning model comprises a rule base module, a depth reinforcement learning model and a rule analysis module; the rule base module is used for selecting a rule base corresponding to the indirect action from multi-class rule bases corresponding to the corresponding intelligent agent according to the indirect action; the deep reinforcement learning model is used for determining matched rules from a selected rule base; the rule analysis module is used for analyzing the rule;
and a third establishing module, configured to perform combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model, fix the rule selection reinforcement learning model during training the multi-agent reinforcement learning model, fix the multi-agent reinforcement learning model during training the rule selection reinforcement learning model, and complete combined training on the multi-agent reinforcement learning model and the rule selection reinforcement learning model through repeated iteration.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the computer program implements the combination training-based rule-embedded multi-agent reinforcement learning method of any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the combination training based rule embedded multi-agent reinforcement learning method of any of claims 1 to 7.
CN202010568287.3A 2020-06-19 2020-06-19 Rule embedded multi-agent reinforcement learning method and device based on combination training Pending CN111783944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568287.3A CN111783944A (en) 2020-06-19 2020-06-19 Rule embedded multi-agent reinforcement learning method and device based on combination training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568287.3A CN111783944A (en) 2020-06-19 2020-06-19 Rule embedded multi-agent reinforcement learning method and device based on combination training

Publications (1)

Publication Number Publication Date
CN111783944A true CN111783944A (en) 2020-10-16

Family

ID=72757601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568287.3A Pending CN111783944A (en) 2020-06-19 2020-06-19 Rule embedded multi-agent reinforcement learning method and device based on combination training

Country Status (1)

Country Link
CN (1) CN111783944A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112295237A (en) * 2020-10-19 2021-02-02 深圳大学 Deep reinforcement learning-based decision-making method
CN112668668A (en) * 2021-01-25 2021-04-16 四川可示见科技有限公司 Postoperative medical image evaluation method and device, computer equipment and storage medium
CN112884129A (en) * 2021-03-10 2021-06-01 中国人民解放军军事科学院国防科技创新研究院 Multi-step rule extraction method and device based on teaching data and storage medium
CN113298260A (en) * 2021-06-11 2021-08-24 中国人民解放军国防科技大学 Confrontation simulation deduction method based on deep reinforcement learning
CN113792857A (en) * 2021-09-10 2021-12-14 中国人民解放军军事科学院战争研究院 Impulse neural network training method based on membrane potential self-increment mechanism
CN114021737A (en) * 2021-11-04 2022-02-08 中国电子科技集团公司信息科学研究院 Game-based reinforcement learning method, system, terminal and storage medium
CN114048834A (en) * 2021-11-05 2022-02-15 哈尔滨工业大学(深圳) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112295237A (en) * 2020-10-19 2021-02-02 深圳大学 Deep reinforcement learning-based decision-making method
CN112668668A (en) * 2021-01-25 2021-04-16 四川可示见科技有限公司 Postoperative medical image evaluation method and device, computer equipment and storage medium
CN112884129A (en) * 2021-03-10 2021-06-01 中国人民解放军军事科学院国防科技创新研究院 Multi-step rule extraction method and device based on teaching data and storage medium
CN113298260A (en) * 2021-06-11 2021-08-24 中国人民解放军国防科技大学 Confrontation simulation deduction method based on deep reinforcement learning
CN113792857A (en) * 2021-09-10 2021-12-14 中国人民解放军军事科学院战争研究院 Impulse neural network training method based on membrane potential self-increment mechanism
CN113792857B (en) * 2021-09-10 2023-10-20 中国人民解放军军事科学院战争研究院 Pulse neural network training method based on membrane potential self-increasing mechanism
CN114021737A (en) * 2021-11-04 2022-02-08 中国电子科技集团公司信息科学研究院 Game-based reinforcement learning method, system, terminal and storage medium
CN114021737B (en) * 2021-11-04 2023-08-22 中国电子科技集团公司信息科学研究院 Reinforced learning method, system, terminal and storage medium based on game
CN114048834A (en) * 2021-11-05 2022-02-15 哈尔滨工业大学(深圳) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
CN114048834B (en) * 2021-11-05 2023-01-17 哈尔滨工业大学(深圳) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion

Similar Documents

Publication Publication Date Title
CN111783944A (en) Rule embedded multi-agent reinforcement learning method and device based on combination training
Eysenbach et al. Diversity is all you need: Learning skills without a reward function
Hou et al. An evolutionary transfer reinforcement learning framework for multiagent systems
CN113396428B (en) Learning system, computer program product and method for multi-agent application
CN110587606B (en) Open scene-oriented multi-robot autonomous collaborative search and rescue method
US8924069B1 (en) Artificial immune system approach for airborne vehicle maneuvering
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
Sher Handbook of neuroevolution through Erlang
Miikkulainen et al. Computational intelligence in games
Schultz et al. Improving tactical plans with genetic algorithms
Kersandt Deep reinforcement learning as control method for autonomous uavs
Zhou et al. An air combat decision learning system based on a brain-like cognitive mechanism
Weitkamp et al. Visual rationalizations in deep reinforcement learning for atari games
CN114741886A (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
Tang et al. A review of computational intelligence for StarCraft AI
Zhou et al. Learning system for air combat decision inspired by cognitive mechanisms of the brain
CN115300910A (en) Confusion-removing game strategy model generation method based on multi-agent reinforcement learning
Masek et al. Discovering emergent agent behaviour with evolutionary finite state machines
Kong et al. Hierarchical multi‐agent reinforcement learning for multi‐aircraft close‐range air combat
Jaafra et al. Context-aware autonomous driving using meta-reinforcement learning
Ben-Iwhiwhu et al. Evolving inborn knowledge for fast adaptation in dynamic pomdp problems
Khan et al. Playing first-person perspective games with deep reinforcement learning using the state-of-the-art game-AI research platforms
Dong et al. Accelerating wargaming reinforcement learning by dynamic multi-demonstrator ensemble
CN115906673A (en) Integrated modeling method and system for combat entity behavior model
Ansari et al. Language expansion in text-based games

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201016

RJ01 Rejection of invention patent application after publication