CN112417760B - Warship control method based on competitive hybrid network - Google Patents

Warship control method based on competitive hybrid network

Info

Publication number
CN112417760B
Authority
CN
China
Prior art keywords
value
network
individual
layer
function
Prior art date
Legal status
Active
Application number
CN202011309350.8A
Other languages
Chinese (zh)
Other versions
CN112417760A (en)
Inventor
王红滨
谢晓东
何鸣
王念滨
周连科
崔琎
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202011309350.8A
Publication of CN112417760A
Application granted
Publication of CN112417760B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a ship control method based on a competitive hybrid network, and relates to the field of ship control. The invention aims to solve the problem of low control precision of existing ships in complex environments. The process is as follows: 1. establishing an individual intelligent agent network model; 2. establishing an advantage hybrid network model; 3. establishing a state value hybrid network model; 4. inputting the individual observation history into the individual intelligent agent network model to obtain an individual advantage function and an individual state value function; transmitting the individual advantage function to the advantage hybrid network model, which outputs a joint advantage function value; transmitting the individual state value function to the state value hybrid network model, which outputs a joint state value; and adding the joint advantage function value and the joint state value to obtain a joint action value function. The invention is used in the field of ship control.

Description

Warship control method based on competitive hybrid network
Technical Field
The invention relates to a ship control method.
Background
Many problems in real life can be formulated as cooperative multi-agent problems, such as traffic planning, robot control and autonomous driving. However, applying multi-agent algorithms directly to such real-world situations raises many problems: real-world agents only have a local field of view; as the number of agents increases, the action space of the model can grow explosively; and it is hard to determine how much each agent contributes to each task. These problems make multi-agent algorithms perform poorly in real-world tasks.
The initial solution to these problems was to give each agent its own network that learns its own policy, i.e. Independent Q-Learning (IQL). However, because this method provides no communication among the agents, each agent regards the other agents as part of the environment; the resulting non-stationarity has a serious negative effect on policy updates, and the algorithm ultimately has no convergence guarantee. All of this reduces the effectiveness of the algorithm. IQL does, however, help alleviate the problem of the explosive growth of the action space.
The other direction, exactly opposite to IQL, is centralized learning, which treats all agents as a whole and handles that whole from the perspective of a single agent. Obviously, this approach cannot cope with the explosive growth of the action space, and the centralized method makes it difficult to distinguish the roles of the individual agents, so it does not produce good practical results.
Based on these two completely different methods, an intermediate approach called Centralized Training Decentralized Execution (CTDE) was created. As the name suggests, all agents are trained jointly during training, while during execution each agent executes its policy according to its own local observations, so the approach has both the centralized advantage and the distributed advantage. The difficulty of the CTDE paradigm lies in the centralized training part. One policy-based approach is to use the actor-critic method to construct a globally shared critic for centralized training, such as COMA. Another approach is to learn a global but decomposable action value function. The earliest such method, the Value Decomposition Network (VDN), simply decomposes the total joint action value Q_tot into the sum of the individual action values Q_i. However, since the additive form is too simple to satisfy the requirements of complex tasks, the monotonic function decomposition network QMIX was proposed, which extends the decomposition of the total action value function Q_tot from a sum of individual action values Q_i to a monotonic relationship, so that the model can perform better on some complex tasks. Qtran, which followed QMIX, proposed a linear-constraint decomposition method based on the Individual-Global-Max (IGM) property, but the theoretical part is very difficult to implement exactly, so approximate versions are adopted; as a result, the model performs poorly in complex environments.
Methods such as VDN, QMIX and Qtran, which use only the individual action values Q_i to directly fit the joint action value Q_tot, limit the complexity of the joint action value function to some extent.
CTDE has gradually become the mainstream paradigm of multi-agent reinforcement learning. Lowe et al. proposed MADDPG, a multi-agent actor-critic method in the spirit of CTDE, which learns a shareable critic for each agent so that the critic can use global information during training to ease learning; the algorithm can be used in both cooperative and competitive environments. Also based on the CTDE paradigm, Foerster et al. proposed COMA, which designs a fully centralized critic for all agents and uses counterfactual baselines in a fully cooperative environment, aiming to solve the multi-agent credit assignment problem, while each agent executes its own policy in a decentralized manner based on its own local observations. Unlike the policy-gradient-based multi-agent approaches, VDN, which also follows the CTDE pattern but is based on a value function, was proposed by Sunehag et al. VDN decomposes the total joint action value into a simple sum of individual action values, and the joint action value function is trained in the same way as DQN. The hybrid (mixing) network structure of VDN is too simple and does not use information such as the global state during training, so VDN does not perform well; such mixing network architectures still have great room for improvement. Thus, on the basis of VDN, Rashid et al. proposed QMIX. QMIX expands the simple additive relationship into a monotonic relationship and adds global state information into the mixing network, so the range of joint action value functions that QMIX can express is much larger than that of VDN, and experiments show that the method achieves good results. However, even though QMIX works well, monotonicity still constrains its representational ability, so Mahajan et al. designed an unconstrained joint action value function in Qtran; but since the theory mentioned in the algorithm is difficult to implement, some theoretical aspects have to be approximated, which affects the effect of Qtran to some extent. Qtran works well in simple environments but becomes less effective in complex environments.
In addition, there are other multi-agent works based on the CTDE paradigm. For example, Du et al. also use the CTDE paradigm (LIIR) and design a reward function for each agent, forming intrinsic and extrinsic rewards, and obtain good experimental results by training with a policy gradient method. SMIX was also proposed by Wen et al., who changed the single-step Q-learning update scheme in QMIX to multi-step Q-learning; however, they wrongly believe that the QMIX update scheme affects the representational limitations of the model. Wang et al. proposed the Action Semantic Network (ASN), which separates actions into those that can affect other agents and those that cannot, but such an approach requires the introduction of accurate prior knowledge.
SMAC is a multi-agent reinforcement learning experiment platform introduced in recent years, which remedies the long-standing lack of a unified experimental platform in the field of multi-agent reinforcement learning. Since its introduction, SMAC has been used by many researchers for algorithm evaluation and comparison. SMAC is a project built on the StarCraft II Learning Environment (SC2LE). The existing SMAC is formed by adding key functions such as decentralized policy execution and local observation to the original SC2LE learning environment. SMAC includes many micromanagement maps; different types of maps have different difficulties, similar to different data sets, providing a stable baseline environment for researchers.
Disclosure of Invention
The invention aims to solve the problem of low control precision of existing ships in complex environments, and provides a ship control method based on a competitive hybrid network.
The ship control method based on the competitive hybrid network comprises the following specific processes:
step one, establishing an individual intelligent agent network model;
step two, establishing an advantage hybrid network model;
step three, establishing a state value hybrid network model;
step four, inputting the individual observation history into the individual intelligent agent network model to obtain an individual advantage function A_i(τ_i, a_i) and an individual state value function V_i(τ_i);
transmitting the individual advantage function A_i(τ_i, a_i) to the advantage hybrid network model, which outputs a joint advantage function value A_tot(τ, a);
transmitting the individual state value function V_i(τ_i) to the state value hybrid network model, which outputs a joint state value V_tot(τ);
adding the joint advantage function value A_tot(τ, a) and the joint state value V_tot(τ) to obtain a joint action value function Q_tot(τ, a);
The individual observation history is a local observation value, an action and a role code;
in the sea warfare, the local observation value represents the warship situation observed in the visual field range of the warship;
in the sea warfare, the action means that, at a certain time t, a ship chooses to shoot at an enemy ship or to move;
the character code represents a vessel number.
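For orientation, the sketch below (Python with PyTorch-style tensors) illustrates how step four combines the three models; the function name joint_action_value and the module handles agent_net, adv_mixer and value_mixer are hypothetical names used only for illustration, not the literal modules of the invention.

```python
import torch

def joint_action_value(obs_histories: torch.Tensor, actions: torch.Tensor,
                       global_state: torch.Tensor,
                       agent_net, adv_mixer, value_mixer) -> torch.Tensor:
    # Step 4a: each agent maps its observation history to A_i(tau_i, a_i) and V_i(tau_i)
    adv_i, val_i = agent_net(obs_histories, actions)            # each: [batch, n_agents]
    # Step 4b: advantage hybrid network -> joint advantage function value A_tot(tau, a)
    a_tot = adv_mixer(adv_i, global_state)                      # [batch, 1]
    # Step 4c: state value hybrid network -> joint state value V_tot(tau)
    v_tot = value_mixer(val_i, global_state, obs_histories)     # [batch, 1]
    # Step 4d: Q_tot(tau, a) = A_tot(tau, a) + V_tot(tau)
    return a_tot + v_tot
```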
The beneficial effects of the invention are as follows:
Inspired by the single-agent Dueling structure, the invention provides a new multi-agent reinforcement learning model called Dueling Mixing Networks; because an attention mechanism is used in the new model, it is abbreviated as AD_MIX. The new structure decomposes the joint action value Q_tot(τ, a) into the sum of a joint advantage function A_tot(τ, a) and a joint state value function V_tot(τ). The joint advantage function A_tot is obtained by mixing the individual advantage functions A_i(τ_i, a_i) with a structure identical to that of the hybrid network in QMIX; the joint state value function V_tot is obtained by weighting and summing the individual state value functions V_i(τ_i) with attention weights computed between the global state s and the individual local observations o_i. Since the state value function is independent of the choice of action, the whole process fully complies with the IGM property.
The control accuracy of the ship is tested, by way of substitution, using a simulation environment of the StarCraft Multi-Agent Challenge (SMAC) and is compared with the current best baselines. Finally, comparative experiments show that the structure of the invention brings an improvement over using the Dueling structure only in the agent network.
The invention provides a novel multi-agent reinforcement learning value decomposition model, which decomposes the global joint action value function into the sum of a joint advantage function and a joint state value function, where the value of the joint advantage function and the value of the joint state value function are obtained from the individual advantage functions and the individual state value functions respectively. The whole model can increase the complexity of the joint action value function.
The attention relation between the global state and the individual local observations is used to obtain the attention weights of the individual state value functions, and the value of the joint state value function is obtained through a weighted sum. This avoids the overly simple structure caused by plain summation.
The experimental result shows that the effect of the model of the invention is better than that of the most advanced multi-agent baseline algorithm at present.
In the invention, a novel value-decomposition-based multi-agent reinforcement learning algorithm, AD_MIX, is provided. The traditional value-decomposition multi-agent model, which simply mixes the individual state-action values into the total state-action value, is extended: the state-action value is decomposed into two parts, an advantage function and a state value function, and the two parts are mixed separately from these two directions to obtain the total state-action value function. In addition, an attention mechanism is used: the global state and the individual local states are used to derive the attention weights of the state value functions, and the total state value function is obtained by a weighted sum. The total state value function network plays the role of making up the difference between the current-network state-action value and the target-network state-action value. The two parts together form the global state-action value function, and at the same time the agent network is jointly optimized, so that the control precision of the ship in a complex environment is improved;
five representative maps are selected on SMAC for the experiments, and the method is compared with QMIX, Qtran and COMA; in addition, an ablation study is carried out. The experimental results prove that the effect of the model of the invention is clearly superior to the current baselines.
Drawings
FIG. 1 is a diagram of a model architecture proposed by the present invention;
τ_i represents the observation history of agent i; Agent1...AgentN are the agent networks, each consisting of two MLPs and one GRU, i.e. two linear neural network layers and a gated recurrent unit; A_i(τ_i, a_i) is the value of the individual advantage function of agent i; V_i(τ_i) is the value of the individual state value function of agent i; Advantage Mixing represents the advantage hybrid network and consists of two weight matrices W1 and W2; s_t is the global observation; |·| represents the absolute value function; Value Mixing represents the state value hybrid network, which is composed of an MLP layer, a Scaled dot product, a Softmax and a dot product, i.e. a linear neural network, a scaled dot-product operation, a Softmax operation and a dot-product operation; A_tot(τ, a) and V_tot(τ) represent the joint advantage function and the joint state value function respectively, and the + sign represents the addition of the two values;
FIG. 2 is a flow chart of the attention mechanism, where MatMul + Scale is the scaled dot product, Mask is an operation that shields unnecessary data, Softmax converts raw scores into normalized weights, MatMul is a matrix multiplication, and Q, K and V respectively represent the Query vector, Key vector and Value vector;
FIG. 3 is a control test win-rate graph of a ship on the map 2s3z, with the abscissa being the number of training sessions and the ordinate being the test win rate;
FIG. 4 is a control test win-rate graph of a ship on the map 2s_vs_1sc;
FIG. 5 is a control test win-rate graph of a ship on the map 3s_vs_5z;
FIG. 6 is a control test win-rate graph of a ship on the map 5m_vs_6m;
FIG. 7 is a control test win-rate graph of a ship on the map 6h_vs_8z;
FIG. 8 is a win-rate graph of the comparative (ablation) experiment of a ship on the map 5m_vs_6m.
Detailed Description
The first specific implementation way is as follows: the ship control method based on the competitive hybrid network in the embodiment specifically comprises the following processes:
step one, establishing an individual intelligent agent network model;
step two, establishing an advantage hybrid network model;
step three, establishing a state value hybrid network model;
step four, inputting the individual observation history into the individual intelligent agent network model to obtain an individual advantage function A_i(τ_i, a_i) and an individual state value function V_i(τ_i);
transmitting the individual advantage function A_i(τ_i, a_i) to the advantage hybrid network model, which outputs a joint advantage function value A_tot(τ, a);
transmitting the individual state value function V_i(τ_i) to the state value hybrid network model, which outputs a joint state value V_tot(τ);
adding the joint advantage function value A_tot(τ, a) and the joint state value V_tot(τ) to obtain a joint action value function Q_tot(τ, a);
The individual observation history is a local observation value, an action and a role code;
in the sea warfare, the local observation value represents the situation of a certain ship observed in the visual field range of the ship, for example, how many own ships and how many enemy ships can be seen around the ship, the damage degree of the own ships and the damage degree of the enemy ships;
in the sea warfare, the action means that, at a certain time t, a certain ship can choose to shoot at a certain enemy ship or to move;
the role code represents the number of a ship, namely 001 is Xiaoming, 002 is Xiaowang and 003 is Xiaoli, so as to distinguish different agents;
an agent is a ship.
The second embodiment is as follows: the first difference between the present embodiment and the specific embodiment is: establishing an individual intelligent agent network model in the first step; the specific process is as follows:
the individual agent network model comprises n agent network structures (the number of n depends on the specific application scenario);
each intelligent agent network has the same structure and consists of a Group1 and a Group 2;
the Group1 comprises an input layer, a ReLU activation layer, a GRU unit and a first linear network layer;
the output end of the input layer is connected with the input end of the ReLU active layer, the output end of the ReLU active layer is connected with the input end of the GRU unit, and the output end of the GRU unit is connected with the first linear network layer;
these structures are connected in series.
The input layer receives the observation history of the agent. The observation history is formed by splicing the local observation value of the agent (in the sea warfare, the local observation value represents the situation a certain ship observes within its field of view, for example how many own ships and how many enemy ships it can see around it, the damage degree of the own ships and the damage degree of the enemy ships) and the action (in the sea warfare, the action means that, at a certain time t, a certain ship can choose to shoot at a certain enemy ship or to move). The local observation value of the agent represents the surrounding environment the agent can see at time t, and the action represents the operation the agent can carry out at time t; for example, shooting and mining in the SMAC environment belong to the agent's actions. The input layer outputs its data to the ReLU activation layer, the ReLU activation layer passes its output to the GRU unit, and the GRU unit processes the data and outputs it to the first linear network layer.
The Group2 comprises a second linear network layer and a third linear network layer which are connected in parallel; the output data in the Group1 are respectively sent to a second linear network layer and a third linear network layer to be used as input data of the two linear layers;
the second linear network layer processes its input data to obtain the output of the second linear network layer, which is called the value of the individual advantage function;
the third linear network layer processes its input data to obtain the output of the third linear network layer, which is called the value of the individual state value function;
finally, a weighted average of the output of the first linear network is taken over the action dimension, the maximum value over the action dimension after averaging is selected, and this value is added to the output of the second linear network to obtain the value Q_i(τ_i, a_i) of the individual action value function. The individual intelligent agent network model comprises n agent network structures; each network structure mainly comprises GRU units and is responsible for extracting information from the input state information and making a decision. The decision result comprises two parts, namely the individual advantage function A_i(τ_i, a_i) and the individual state value function V_i(τ_i), where the individual advantage function is related to both the action (the action represents the operation the agent can perform at time t; for example, shooting and mining in the SMAC environment belong to the agent's actions) and the observation history (the observation history is formed by splicing the agent's local observation values and actions; the local observation value represents the surrounding environment the agent can see at time t), whereas the individual state value function is related only to the observation history.
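A minimal PyTorch sketch of one such dueling agent network (Group1: input layer, ReLU, GRU cell, first linear layer; Group2: parallel advantage and value heads) is given below for illustration; the class name, layer sizes and tensor layouts are assumptions rather than part of the invention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingAgentNet(nn.Module):
    """Sketch of an individual agent network: Group1 (MLP + GRU) and Group2 (two heads)."""
    def __init__(self, input_dim, n_actions, hidden_dim=64):
        super().__init__()
        # Group1: input layer -> ReLU -> GRU cell -> first linear network layer
        self.fc_in = nn.Linear(input_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)
        # Group2: parallel advantage head and state value head
        self.adv_head = nn.Linear(hidden_dim, n_actions)   # A_i(tau_i, .), one value per action
        self.val_head = nn.Linear(hidden_dim, 1)           # V_i(tau_i), a single scalar

    def forward(self, obs_action_role, hidden):
        x = F.relu(self.fc_in(obs_action_role))
        h = self.gru(x, hidden)          # hidden state carries information from the previous moment
        y = self.fc1(h)
        adv = self.adv_head(y)
        val = self.val_head(y)
        return adv, val, h
```

In this sketch the advantage head produces one value per action and the value head a single scalar, matching the two outputs A_i(τ_i, a_i) and V_i(τ_i) described above.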
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the present embodiment differs from the first or second embodiment in that: an advantage hybrid network model is established in step two; the method specifically comprises the following steps:
the advantage hybrid network model is composed of a parameter network structure and a super network structure;
the parameter network structure comprises a fourth linear network layer and a fifth linear network layer, which are connected in parallel; the input of both the fourth and the fifth linear network layer is the global observation value;
the fourth linear network layer processes the input global observation value, and its output value is used as the parameter matrix of the first-layer network in the super network structure;
the fifth linear network layer processes the input global observation value, and its output value is used as the parameter matrix of the second-layer network in the super network structure;
the super network structure receives the values of the individual advantage functions; the absolute value of the parameter matrix of the first-layer network is taken, and this matrix is matrix-multiplied with the received individual advantage values to obtain the output of the first layer; the second layer receives the output of the first layer, the absolute value of the parameter matrix of the second layer is taken, and the parameter matrix of the second layer is matrix-multiplied with the output of the first layer to obtain the output of the advantage mixing network;
the global observation value is the ship loss degree of both parties in the whole battlefield.
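A hedged PyTorch sketch consistent with this description is shown below; the embedding dimension, class name and variable names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AdvantageMixer(nn.Module):
    """Sketch of the advantage mixing hyper-network; sizes are illustrative."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Parameter network: the global observation generates the two parameter matrices
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)  # fourth linear layer
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)             # fifth linear layer

    def forward(self, adv_i, state):
        # adv_i: [batch, n_agents] individual advantage values A_i(tau_i, a_i)
        batch = adv_i.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(batch, adv_i.size(1), -1)  # |W1| enforces monotonicity
        w2 = torch.abs(self.hyper_w2(state)).view(batch, -1, 1)              # |W2|
        hidden = torch.bmm(adv_i.unsqueeze(1), w1)   # first mixing layer
        a_tot = torch.bmm(hidden, w2)                # second mixing layer -> A_tot(tau, a)
        return a_tot.view(batch, 1)
```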
Other steps and parameters are the same as those in the first or second embodiment.
The fourth concrete implementation mode is as follows: the difference between this embodiment and the first to third embodiments is that, in the third step, a state value hybrid network model is established; the method specifically comprises the following steps:
the state value hybrid network model is composed of an attention network model, and the attention network model is composed of two serialization models;
the first serialization model comprises a sixth linear network layer, a Dropout layer, a LeakyReLU activation layer and a seventh linear network layer;
these layers are connected in series.
The sixth linear network layer receives the global observation value and processes it; the Dropout layer receives the output of the sixth linear network layer as input, randomly discards 10% of the input neurons and passes the result to the LeakyReLU layer for activation; the seventh linear network layer receives the output of the LeakyReLU layer as input and processes it to finally obtain the output of the first serialization model;
the second serialization model comprises an eighth linear network layer, which processes the input individual local observation values to obtain an output that is used as the output of the second serialization model;
then a scaled dot product of the outputs of the two serialization models is computed, i.e. the output matrices of the two serialization models are matrix-multiplied and, to prevent the result from becoming too large, the product is divided by a number equal to the square root of the output dimension of the eighth linear network layer;
a Softmax operation is applied to the scaled dot-product result to convert it into percentage weights, and finally a weighted sum of the obtained weights and the individual state value functions is computed.
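The following PyTorch sketch mirrors this description (two serialization models, scaled dot product, Softmax, weighted sum); the dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateValueMixer(nn.Module):
    """Sketch of the attention-based state value mixing network; dimensions are assumptions."""
    def __init__(self, state_dim, obs_dim, attn_dim=32):
        super().__init__()
        # First serialization model: processes the global observation s
        self.query_net = nn.Sequential(
            nn.Linear(state_dim, attn_dim),   # sixth linear layer
            nn.Dropout(0.1),                  # randomly discards 10% of the inputs
            nn.LeakyReLU(),
            nn.Linear(attn_dim, attn_dim),    # seventh linear layer
        )
        # Second serialization model: processes each individual local observation o_i
        self.key_net = nn.Linear(obs_dim, attn_dim)  # eighth linear layer
        self.attn_dim = attn_dim

    def forward(self, v_i, state, obs):
        # v_i: [batch, n_agents], state: [batch, state_dim], obs: [batch, n_agents, obs_dim]
        q = self.query_net(state).unsqueeze(1)                            # [batch, 1, attn_dim]
        k = self.key_net(obs)                                             # [batch, n_agents, attn_dim]
        scores = torch.bmm(q, k.transpose(1, 2)) / self.attn_dim ** 0.5   # scaled dot product
        w = F.softmax(scores, dim=-1).squeeze(1)                          # attention weights summing to 1
        v_tot = (w * v_i).sum(dim=-1, keepdim=True)                       # weighted sum -> V_tot(tau)
        return v_tot
```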
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth concrete implementation mode: this embodiment differs from the first to fourth embodiments in that, in step four, the individual observation history (local observation value, action and role code) is input into the individual intelligent agent network model to obtain the individual advantage function A_i(τ_i, a_i) and the individual state value function V_i(τ_i); the specific process is as follows:
All agents share one agent network. The local observation value and the action of the previous moment (in a war, the local observation value represents the battle situation around itself that a soldier can see with the naked eye, which does not represent the situation of the whole battlefield; the action means that at a given moment a soldier can choose to move forward, shoot or move backward, and only one action can be executed at each moment) are used as the input of the individual intelligent agent network model at time t. A role code (in a war, different arms have different numbers; the role code is used to distinguish the roles of different agents; agents with different role codes can complete different tasks) is added to the input to distinguish the roles of different agents. Finally, the individual advantage function A_i(τ_i, a_i) and the individual state value function V_i(τ_i) are output.
The intelligent agent is a ship;
in the sea warfare, the local observation value represents the situation of a certain ship observed in the visual field range of the ship, for example, how many own ships and how many enemy ships can be seen around the ship, the damage degree of the own ships and the damage degree of the enemy ships;
in the sea warfare, the action means that, at a certain time t, a certain ship can choose to shoot at a certain enemy ship or to move.
The hidden state is an input in the GRU unit that is used to add information from a previous time instant to the time instant t, an input that is necessary to use the GRU unit. A global observation represents the situation of the entire battlefield at a certain moment.
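As a small illustration of how the individual observation history can be assembled, the sketch below concatenates the local observation, the previous action (one-hot) and a one-hot role code; the function name and the dimensions in the example are assumptions.

```python
import torch

def build_agent_input(obs, last_action_onehot, agent_id, n_agents):
    """Concatenate local observation, previous action (one-hot) and role code (sketch)."""
    role_code = torch.zeros(n_agents)
    role_code[agent_id] = 1.0                 # role code distinguishes agents (e.g. ship number)
    return torch.cat([obs, last_action_onehot, role_code], dim=-1)

# Example: an agent with a 30-dim local observation, 9 possible actions, in a 5-agent team
obs = torch.rand(30)
last_action = torch.zeros(9)
last_action[3] = 1.0
x = build_agent_input(obs, last_action, agent_id=2, n_agents=5)   # 30 + 9 + 5 = 44-dim input
```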
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth specific implementation mode: the difference between this embodiment and the first to fifth embodiments is: in step four, the individual advantage function A_i(τ_i, a_i) is transmitted to the advantage hybrid network model, which outputs the joint advantage function value A_tot(τ, a); the specific process is as follows:
Both the advantage function and the action value function are related to the selected action as well as to the state, so the monotonic hyper-network of QMIX can be used directly as the mixing network of the advantage function to mix A_i(τ_i, a_i) into A_tot(τ, a); the formula is as follows:
A_tot(τ, a) = Monotone_hypernetworks(A_1(τ_1, a_1), …, A_i(τ_i, a_i), …, A_N(τ_N, a_N))
where Monotone_hypernetworks is the advantage mixing network; the joint action-observation history of all agents is the set of the action-observation histories of the individual agents, and the joint action of all agents is the set of the actions of all agents; τ is the joint action-observation history of all agents and a is the joint action of all agents.
The structure of the advantage mixing network is essentially the same as that of QMIX: it mainly consists of a super network whose parameters are generated from the global observation s through a neural network, and monotonicity is obtained by applying an absolute value function to these parameters. The advantage mixing function takes the individual advantage functions A_i(τ_i, a_i) and mixes them through the monotonic hyper-network to obtain the joint advantage function A_tot(τ, a).
Other steps and parameters are the same as in one of the first to fifth embodiments.
The seventh embodiment: this embodiment differs from the first to sixth embodiments in that, in step four, the individual state value function V_i(τ_i) is transmitted to the state value hybrid network model, which outputs the joint state value V_tot(τ); the specific process is as follows:
The structure of the state value mixing network is mainly realized by an attention mechanism: an attention weight w_i is obtained from the global observation s and the individual local observation o_i, and the joint state value function V_tot(τ) is obtained as the weighted sum of the individual state value functions V_i(τ_i) with the attention weights w_i.
The state value function is independent of the action and only related to the state, so the attention weight of the state value function can be calculated according to the attention relation between the individual local observation of the agent and the global state:
w_i = Softmax( (W_s e(s)) · (W_i e(o_i))^T / √d_k )
where s is the global observation (in a war, the global observation represents the situation of the entire battlefield at a certain moment), o_i is the local observation (in a war, a local observation indicates the surroundings that a soldier can see with the naked eye, which does not represent the situation of the whole battlefield), e(s) and e(o_i) are the embedding vectors of the global observation value and the local observation value respectively (the embedding vector of the global observation value is a fixed-length vector obtained by passing the global observation value through a neural network layer; in the present invention, however, the embedding vector is the same as the original value, i.e. s = e(s) and o_i = e(o_i), because s and o_i have already been transformed into e(s) and e(o_i) inside SMAC), W_s and W_i are the transformation matrices that convert e(s) and e(o_i) into the Q and K vectors respectively (W_s and W_i are two matrices used to convert e(s) and e(o_i) into the matrix dimensions required by the scaled dot product); W_s is matrix-multiplied with e(s) to obtain Q, and W_i is matrix-multiplied with e(o_i) to obtain K; the Q vector is a global vector shared by all agents, and each agent has its own K vector; d_k is the output dimension of the linear model in the second serialization model of the state value mixing network structure; Softmax(·) converts raw weights into decimal weights whose sum is 1; T denotes transposition;
the total joint state value function is obtained by the weighted addition of the attention weights and the individual state value functions V_i(τ_i):
V_tot(τ) = Σ_i w_i V_i(τ_i)
where w_i is the attention weight and V_i(τ_i) is the individual agent state value function.
Other steps and parameters are the same as those in one of the first to sixth embodiments.
The eighth specific implementation mode: this embodiment differs from the first to seventh embodiments in that, in step four, the joint advantage function value A_tot(τ, a) and the joint state value V_tot(τ) are added to obtain the joint action value function Q_tot(τ, a); the specific process is as follows:
Q_tot(τ, a) = A_tot(τ, a) + V_tot(τ) = Monotone_hypernetworks(A_1(τ_1, a_1), …, A_N(τ_N, a_N)) + Σ_i w_i V_i(τ_i)
where Monotone_hypernetworks is the advantage mixing network;
when the individual intelligent agent network model, the advantage hybrid network model and the state value hybrid network model are trained, the following loss function is minimized:
L(θ) = Σ_{i=1}^{b} ( y_i^tot − Q_tot(τ, u; θ) )²
where b is the number of samples randomly chosen from the experience pool for training, and y^tot is an intermediate variable obtained from the target network:
y^tot = r + γ max_{u′} Q_tot(τ′, u′; θ⁻)
θ⁻ and θ are the parameters of the target network and the value network, respectively; r is the team reward value shared by all agents, and γ is the discount coefficient used when accumulating rewards; τ′ is the joint action-observation history of all agents (at the next moment), u′ is the joint action used in the target network, and u is the joint action used in the value network;
the target network comprises an individual intelligent agent network, an advantage hybrid network and a state value hybrid network;
the value network comprises an individual agent network, an advantage hybrid network and a state value hybrid network;
r is the environmental feedback and is obtained automatically; max_{u′} Q_tot(τ′, u′; θ⁻) is the final output value of the target network.
(The whole network structure (advantage hybrid network, value hybrid network and individual agent network structure) exists in two copies: one whose parameters are updated at every training step, called the value network, and a target network, which is not updated at every training step but copies its parameters from the value network at regular intervals. The value network has the same structure as the target network, and the two are only used to calculate L(θ). The target network is generally not mentioned when describing a reinforcement learning algorithm; unless otherwise specified, this document refers to the value network. A superscript ⁻ or ′ refers to the target network; no superscript refers to the value network.)
The team reward value r shared by all agents is optimized by minimizing the loss function.
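A hedged sketch of this loss computation is shown below; value_net and target_net are assumed to be wrappers around the agent, advantage mixing and state value mixing networks that return Q_tot values, and the method name max_over_actions is hypothetical.

```python
import torch

def ad_mix_loss(batch, value_net, target_net, gamma=0.99):
    # batch fields: joint observation histories tau / tau_next, joint actions u, team reward r
    q_tot = value_net(batch["tau"], batch["u"])                  # Q_tot(tau, u; theta), shape [b]
    with torch.no_grad():
        # max over u' of Q_tot(tau', u'; theta-) -- the final output of the target network
        max_q_next = target_net.max_over_actions(batch["tau_next"])
        y_tot = batch["r"] + gamma * max_q_next                  # TD target y^tot
    return ((y_tot - q_tot) ** 2).sum()                          # L(theta) over the b sampled transitions
```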
Other steps and parameters are the same as those in one of the first to seventh embodiments.
Cooperative multi-agent reinforcement learning tasks are typically modeled as decentralized partially observable Markov decision processes (Dec-POMDP), generally represented by a tuple k = <S, U, P, r, Z, O, N, γ>, where at each moment the global state is s ∈ S and each agent i ∈ N selects an action a_i ∈ U based on the global state s. The joint action a ∈ U^N of all agents acts on the environment. The environment performs a state transition according to the state transition function P(s′ | s, u): S × U × S → [0, 1]. Finally, all agents share a joint team reward r(s, a) and a discount factor γ in the cumulative reward. When the partial observability of the agents is taken into account, each agent obtains a local observation o_i according to the local observation function O(s, a_i). In addition, each agent has an action-observation history τ_i ∈ T ≡ (Z × U)*, and uses this observation history to optimize its strategy π_i(u_i | τ_i) so as to maximize the joint team reward r.
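Purely for illustration, the tuple <S, U, P, r, Z, O, N, γ> can be collected in a small container such as the following sketch; the field names and types are assumptions, not part of the formalism above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    """Illustrative container for the tuple <S, U, P, r, Z, O, N, gamma>."""
    states: Sequence            # S: global states
    actions: Sequence           # U: individual actions (joint action space is U^N)
    transition: Callable        # P(s' | s, u): S x U x S -> [0, 1]
    reward: Callable            # r(s, a): shared team reward
    observations: Sequence      # Z: local observations
    observe: Callable           # O(s, a_i): local observation function
    n_agents: int               # N
    gamma: float                # discount factor
```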
The value-function-based method is a mainstream direction of single-agent reinforcement learning algorithms. Generally, two neural networks are constructed: one is called the value network and the other the target network; the structure of the target network is identical to that of the value network, and the parameters of the target network are copied from the value network at regular intervals. The experience (s, a, r, s′) used for training is stored using a data structure called an experience pool. A single-agent value-function-based model uses the temporal-difference formula
L(θ) = ( r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ) )²
to calculate gradients for policy optimization. Multi-agent value-function-based algorithms also require an experience pool to store the experiences (τ, a, r, τ′) of all agents, where a is the joint action of all agents, τ and τ′ are the joint action-observation histories of all agents, and r is the team reward value shared by all agents. The reason for replacing s of the single-agent case with τ is that the observation scope of agents in most cooperative multi-agent reinforcement learning is local, so a local action-observation history consisting of local observations and actions is used instead of the global observation. Similarly, value-function-based multi-agent algorithms also use a temporal-difference formula to optimize the strategy, in this case:
L(θ) = ( r + γ max_{u′} Q_tot(τ′, u′; θ⁻) − Q_tot(τ, u; θ) )²
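A minimal sketch of such an experience pool, storing (τ, a, r, τ′) tuples for all agents, might look as follows; the capacity and class name are assumptions.

```python
import random
from collections import deque

class EpisodeReplayBuffer:
    """Sketch of the experience pool storing (tau, a, r, tau') for all agents."""
    def __init__(self, capacity=5000):
        self.buffer = deque(maxlen=capacity)

    def push(self, tau, joint_action, team_reward, tau_next):
        # tau / tau_next are joint action-observation histories rather than global states,
        # because each agent only has a local field of view
        self.buffer.append((tau, joint_action, team_reward, tau_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)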
In addition, in value decomposition methods based on the CTDE paradigm, the global state-action value Q_tot and the individual state-action values Q_i need to satisfy the IGM property:
argmax_u Q_tot(τ, u) = ( argmax_{u_1} Q_1(τ_1, u_1), …, argmax_{u_N} Q_N(τ_N, u_N) )
The global-local relationships in the two representative models VDN and QMIX are
Q_tot(τ, a) = Σ_i Q_i(τ_i, a_i)   and   ∂Q_tot / ∂Q_i ≥ 0, ∀i,
respectively. It is clear that both VDN and QMIX satisfy the IGM property, but their representational ability is still limited. Qtran avoids the representational limitation of VDN and QMIX using a linear-constraint decomposition method, but the approximation of the theoretical aspects in practice makes it difficult for the algorithm to achieve good results in complex environments.
The attention mechanism was originally used in natural language processing, but in recent years, due to its excellent performance, it has gradually been applied to other fields, including image processing and speech recognition. The attention mechanism originates from researchers' study of bionics: when a human receives a picture or a segment of audio, unimportant information is rejected and important information is retained for processing, which improves efficiency while letting resources be spent where they are valuable. Because of the excellent performance of attention mechanisms in other areas, more and more researchers believe that attention mechanisms can also play an important role in reinforcement learning. Attention mechanisms come in many forms, including soft attention and hard attention, as well as self-attention and multi-head attention. The calculation of the attention mechanism can be described simply as mapping the Q vector and the K vector to obtain the attention weights, and then multiplying the attention weights by the V vector to obtain the required value:
Attention(Q, K, V) = Softmax( Q K^T / √d_k ) V
The mapping process includes a scaled dot product and a Softmax operation. Referring to FIG. 2, MatMul + Scale refers to the scaled dot product, Mask is generally not used, Softmax converts raw scores into normalized weights, and MatMul refers to matrix multiplication.
The structure of the algorithm of the present invention and the theory involved are described in detail below.
In single-agent value-function-based algorithms, the Dueling architecture decomposes the action value function into two parts: a state value function that is related only to the state, and an advantage function that is related to both the state and the action, i.e. Q = A + V. The reason, starting from the characteristics of human visual perception, is that sometimes the magnitude of the value function is independent of the action selection and depends only on the current state. At present, value-function-based multi-agent algorithms mostly use the local action value functions Q_i directly to obtain the global joint action value function Q_tot. Even in the algorithms that adopt a Dueling-like structure, the joint value function is usually obtained directly from the global state through a hybrid network, so the obtained joint state value function is not linked to the individual agent networks and the value-function network cannot be optimized well. Therefore, the following sections describe in detail how the agent networks are used to generate the joint state value function, so that the overall joint action value function is obtained from both the joint advantage function and the joint state value function, and the agent networks are also optimized from both directions. The specific network structure is shown in FIG. 1;
In QMIX, the individual action values Q_i(τ_i, a_i) are mixed through a monotonic super network to obtain the joint action value Q_tot(τ, a). Here the agent network is changed into an agent network with the Dueling structure, so that each agent obtains an advantage function value A_i(τ_i, a_i) and a state value function value V_i(τ_i). The advantage function and the action value function are both related to the selected action and the state, so the mixing network of the monotonic super network of QMIX can be used as the mixing network of the advantage function to mix A_i(τ_i, a_i) into A_tot(τ, a); the formula is as follows:
A_tot(τ, a) = Monotone_hypernetworks(A_1(τ_1, a_1), …, A_i(τ_i, a_i), …, A_N(τ_N, a_N))
where A_i(τ_i, a_i) is the individual advantage function obtained by inputting the individual observation history into the individual intelligent agent network model (which also outputs the individual state value function V_i(τ_i));
The state value function is independent of the action and only related to the state, so the attention weight of the state value function can be calculated according to the attention relation between the individual local observation of the agent and the global state:
w_i = Softmax( (W_s e(s)) · (W_i e(o_i))^T / √d_k )
where e(s) and e(o_i) are the embedding vectors of the global state and the individual local observations respectively, and W_s and W_i convert e(s) and e(o_i) into the Q vector and the K vectors; the Q vector is a global vector shared by all agents, and each agent has its own K vector;
the total joint state value function is obtained by the weighted addition of the attention weights and the state value functions:
V_tot(τ) = Σ_i w_i V_i(τ_i)
where V_i(τ_i) is the individual state value function obtained by inputting the individual observation history into the individual intelligent agent network model and w_i is the attention weight;
and finally, adding the global advantage function and the global state value function to obtain a global state action value function:
Q_tot(τ, a) = A_tot(τ, a) + V_tot(τ) = Monotone_hypernetworks(A_1(τ_1, a_1), …, A_N(τ_N, a_N)) + Σ_i w_i V_i(τ_i)
The following loss function is minimized during training:
L(θ) = Σ_{i=1}^{b} ( y_i^tot − Q_tot(τ, u; θ) )²
where b is the number of samples randomly selected from the experience pool for training, and y^tot = r + γ max_{u′} Q_tot(τ′, u′; θ⁻) is obtained from the target network; θ⁻ and θ are the parameters of the target network and the value network, respectively. Policy optimization is performed by minimizing this loss function.
The whole network structure consists of three parts:
indvidual agent network: the individual agent network is composed of a Recurrent Neural Network (RNN), all agents share one agent network, and local observed values at more than one moment
Figure BDA0002789246410000143
Movement of
Figure BDA0002789246410000144
And a hidden state
Figure BDA0002789246410000145
As an input at time t, a role code is added to the input to distinguish the roles of different agents. Finally, outputting individual advantage function A ii ,a i ) And an individual state value function V ii )。
Advantage differentiation network: input the dominance function A of all agents, following the monotonic hyper-network in QMIX ii ,a i ) To obtain the overall merit function A tot (τ, a). The IGM properties are guaranteed.
State-value mixng network: encoding e(s) global state and e (o) local state using attention mechanism i ) Using two transformation matrices W s And W i Respectively obtaining a Q vector and a K vector, and then obtaining the attention weight w of each agent by using a scaling dot product i Finally, a function V of the state value i Weighted addition is carried out to obtain V tot (τ)。
And finally, adding the obtained joint advantage function and the joint action value function to obtain the required joint action value function. The algorithm pseudocode is as follows.
(Pseudocode figure: the AD_MIX training procedure.)
The target network parameters are initialized and copied to the value network parameters. The experience pool, the learning rate and the reward discount rate are initialized. The initial observation at time t = 0 is obtained and sent to the individual intelligent agent network to obtain the agents' actions; the actions are applied to the environment through an action function to obtain the shared reward value fed back by the environment and the observation at the next moment, and the observation, reward, action and next observation are stored in the experience pool. These operations are repeated until a task is finished. When the number of executed tasks meets the training requirement, one training step is performed: experiences of size batch_size are randomly drawn from the experience pool; the data at time t in the experience are sent to the individual intelligent agent network of the value network to obtain the individual advantage function and the individual state value function; the values of the individual advantage function and the individual state value function are sent to the advantage mixing network and the state value mixing network respectively to obtain the values of the joint advantage function and the joint state value function, and finally the two are added to obtain the value of the joint action value function at time t. The data at the next moment t′ are sent to the target network to obtain the value of the joint action value function at time t′. The value of the loss function is then calculated to update the network. After the network has been trained a certain number of times, the parameters of the value network are copied to the target network.
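A hedged Python sketch of this training procedure is given below; the environment and network interfaces (env.reset, env.step, value_net.act, loss_fn) and all hyper-parameter values are illustrative assumptions, not values taken from the patent.

```python
import torch

def train_ad_mix(env, value_net, target_net, buffer, loss_fn, n_episodes,
                 batch_size=32, train_every=8, sync_every=200, gamma=0.99, lr=5e-4):
    optimizer = torch.optim.RMSprop(value_net.parameters(), lr=lr)
    target_net.load_state_dict(value_net.state_dict())           # synchronize network parameters
    for episode in range(n_episodes):
        obs, done = env.reset(), False                            # observations at t = 0
        while not done:
            actions = value_net.act(obs)                          # individual agent networks choose actions
            next_obs, team_reward, done = env.step(actions)       # shared team reward from the environment
            buffer.push(obs, actions, team_reward, next_obs)      # store (tau, a, r, tau') in the experience pool
            obs = next_obs
        if episode > 0 and episode % train_every == 0:            # train once enough episodes are collected
            batch = buffer.sample(batch_size)
            loss = loss_fn(batch, value_net, target_net, gamma)   # e.g. the L(theta) sketched earlier
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if episode % sync_every == 0:                             # periodically copy value-network parameters
            target_net.load_state_dict(value_net.state_dict())    # to the target network
    return value_net
```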
The following examples were used to demonstrate the beneficial effects of the present invention:
the first embodiment is as follows:
SMAC was used as the experimental platform. SC2LE is a reinforcement learning environment built on the popular real-time strategy game StarCraft II, and SMAC is a multi-agent learning environment built on SC2LE with cooperative multi-agent characteristics added. SMAC itself has 14 different types of maps for testing multi-agent algorithms; different maps have different difficulties, equivalent to 14 different types of test data, and players can also construct other maps themselves. In order to verify that the method works well on maps of different difficulties, 5 maps of different difficulties are selected for testing (Table 1). On every map two sides fight each other: one side is controlled by the multi-agent reinforcement learning algorithm, and the other side is controlled by the built-in heuristic, non-learning algorithm carefully designed by SMAC. The goal of the invention is to defeat the built-in heuristic or to obtain as high a score as possible; the score is derived from the injuries and deaths inflicted on the enemy. At each time step each agent has its own local view and selects an action according to the current strategy based on this local view; agents in all map types have 4 kinds of legal actions, namely move, attack/heal, stop and no-op, and depending on the map the number of actions each agent can perform ranges from 7 to 70. After each fight is finished, the scenario feeds back a team reward value; SMAC provides two reward schemes, a sparse reward and a dense reward, for research in different scenarios. To facilitate comparison with the other baselines, the dense reward is selected for testing, and the hyper-parameters used in the experiments are the same as those in the baselines.
Table 1 Maps selected for the experiments
(Table figure: the five selected maps are 2s3z, 2s_vs_1sc, 3s_vs_5z, 5m_vs_6m and 6h_vs_8z.)
Three multi-agent baseline algorithms were compared: QMIX, COMA and Qtran. The hyper-parameter settings of QMIX, Qtran and COMA are all consistent with the original papers; for Qtran, the Qtran-alt model, which performs best in its paper, is selected. All algorithms were run for 20000 epochs in the experiments.
1)2s3z
First, the model is tested on the simpler map 2s3z and the effects are compared; the test results are shown in FIG. 3;
It can be seen from the figure that the model of the present invention (asterisk) performs better than the other three baseline algorithms during almost the whole training process. From this it can be concluded that adding the attention-weighted joint state value function V_tot to the model brings an all-round improvement on a simple map. However, the model may be unstable in the initial training period; this phenomenon occurs in part because the introduction of the value mixing network leads to a larger number of training parameters, which adds uncertainty to the training process, but it does not affect the final convergence of the algorithm.
2)2s_vs_1sc
On the map 2s_vs_1sc the two sides no longer have the same number of agents: two agents with lower individual combat power need to cooperate to defeat one agent with higher combat power. Although the numbers compensate for the gap in combat power, this map places higher requirements on the model's cooperative strategy.
As can be seen from FIG. 4, learning instability still occurs in the early stage of the algorithm, but the final convergence of the model is still not affected. The stability and convergence speed of the model of the invention in the middle and later stages are better than the three baselines, and its average win rate after 7000 epochs is always higher than the other three baselines.
3)3s_VS_5z
FIG. 5 shows the performance of the model of the present invention on the 3s_vs_5z map and the comparison with the other three algorithms.
The map 3s_vs_5z is relatively complex: the model needs to control 3 agents to defeat 5 agents, which is much more difficult than a battle between equal numbers of agents and leaves fewer feasible tactics, so the map difficulty is higher. As can be seen from the figure, because of the high difficulty of the map, COMA and Qtran-alt cannot win a single game during the whole training period, so their training curves always lie on the epoch axis. Although the algorithm of the invention learns its strategy slowly in the early stage of training, it surpasses the QMIX algorithm in the middle and later stages; because of the higher map difficulty, the effect of the algorithm of the invention is only slightly better than, or comparable to, that of QMIX, without an obvious improvement.
4)5m_VS_6m
Compared to the less aggressive maps that can be used in quantities to compensate for the lack of fighting power, the 5m _vs _6mrequires the completion of less aggressive missions with identical individual fighting powers. This map is more difficult than the previous map. FIG. 6 shows the test and comparison of the model of the invention at 5m _vs _6m.
As can be seen in FIG. 6, COMA and Qtran-alt still fail to achieve a win on 5m_vs_6m. QMIX and the model of the present invention, however, achieve some success, and the model of the present invention performs better than QMIX throughout the training period. The algorithm of the present invention thus outperforms all three baselines on 5m_vs_6m.
5) 6h_vs_8z
The map 6h_vs_8z is more difficult than all the other maps and requires winning while at a disadvantage in both numbers and combat power. FIG. 7 illustrates the effect of the model of the present invention; the win-rate axis has been enlarged for ease of comparison.
It can be seen from FIG. 7 that neither the model of the present invention nor the other three models achieve good results, but QMIX and the model of the present invention still perform better than the other two. Between the two, the model of the present invention works better than QMIX; such a small improvement is particularly important in a setting where overall performance is poor.
For the ablation study, the original Nature agent structure was modified into a Dueling agent structure so that the state action value and the state value are mixed separately. The aim is to explore whether the improvement of the model comes from changing the hybrid network or from changing the agent network. The ablation study was performed on 5m_vs_6m, and the results are shown in FIG. 8.
The experimental results show that changing the agent network does not bring a substantial improvement to the model: compared with the effect of merely changing the QMIX agent network to a dueling structure, the model of the invention obtains an obvious improvement. This is sufficient to prove that the method of the invention is effective.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (5)

1. A ship control method based on a competitive hybrid network, characterized in that the method comprises the following specific processes:
step one, establishing an individual intelligent agent network model;
step two, establishing an advantageous hybrid network model;
step three, establishing a state value hybrid network model;
step four, inputting the individual observation history into the individual intelligent agent network model to obtain an individual advantage value function A_i(τ_i, a_i) and an individual state value function V_i(τ_i);
transmitting the individual advantage value function A_i(τ_i, a_i) to the advantage hybrid network model, which outputs a joint advantage function value A_tot(τ, a);
transmitting the individual state value function V_i(τ_i) to the state value hybrid network model, which outputs a joint state value mixing value V_tot(τ);
adding the joint advantage function value A_tot(τ, a) and the joint state value mixing value V_tot(τ) to obtain a joint action value function Q_tot(τ, a);
The individual observation history is a local observation value, an action and a role code;
in the sea warfare, the local observation value represents the warship situation observed in the visual field range of the warship;
in the sea warfare, the action means that at a certain time t a ship chooses to shoot at an enemy ship or to move;
the character code represents a ship number;
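As an illustration of how an individual observation history entry (local observation value, action, and role code) might be assembled into one network input, the following sketch simply concatenates the three parts; the tensor sizes and the one-hot encodings are assumptions made for illustration only.

```python
# Sketch: assembling an agent's input from local observation, last action and role code.
# Shapes and one-hot encodings are illustrative assumptions.
import torch
import torch.nn.functional as F

def build_agent_input(obs, last_action, agent_id, n_actions, n_agents):
    """obs: (obs_dim,) float tensor; last_action, agent_id: ints."""
    action_onehot = F.one_hot(torch.tensor(last_action), n_actions).float()
    role_code = F.one_hot(torch.tensor(agent_id), n_agents).float()  # ship number
    return torch.cat([obs, action_onehot, role_code], dim=-1)

x = build_agent_input(torch.randn(30), last_action=2, agent_id=0,
                      n_actions=10, n_agents=5)
print(x.shape)  # torch.Size([45]) -> 30 + 10 + 5
```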
establishing an individual intelligent agent network model in the first step; the specific process is as follows:
the individual agent network model comprises n agent network structures;
each intelligent agent network has the same structure and consists of a Group1 and a Group 2;
the Group1 comprises an input layer, a ReLU activation layer, a GRU unit and a first linear network layer;
the output end of the input layer is connected with the input end of the ReLU activation layer, the output end of the ReLU activation layer is connected with the input end of the GRU unit, and the output end of the GRU unit is connected with the first linear network layer;
the Group2 comprises a second linear network layer and a third linear network layer which are connected in parallel;
the output data in the Group1 are respectively sent to a second linear network layer and a third linear network layer to be used as input data of the two linear layers;
the second linear network layer processes the input data to obtain the output of the second linear network layer, which is called the value of the individual advantage value function;
the third linear network layer processes the input data to obtain the output of the third linear network layer, which is called the value of the individual state value function;
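A minimal sketch of the individual intelligent agent network described above, assuming illustrative layer widths; Group1 is the input linear layer, ReLU activation, GRU unit and first linear layer, and Group2 is the two parallel linear heads producing the individual advantage values and the individual state value.

```python
# Sketch of the individual (dueling) agent network from step one; layer widths are
# illustrative assumptions, the wiring follows the text.
import torch
import torch.nn as nn

class DuelingAgentNet(nn.Module):
    def __init__(self, input_dim, n_actions, hidden_dim=64):
        super().__init__()
        # Group1
        self.fc_in = nn.Linear(input_dim, hidden_dim)   # input layer
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # GRU unit carries the observation history
        self.fc1 = nn.Linear(hidden_dim, hidden_dim)    # first linear network layer
        # Group2 (two parallel heads)
        self.fc_adv = nn.Linear(hidden_dim, n_actions)  # second linear layer: A_i(tau_i, a_i)
        self.fc_val = nn.Linear(hidden_dim, 1)          # third linear layer: V_i(tau_i)

    def forward(self, x, h):
        x = torch.relu(self.fc_in(x))
        h = self.gru(x, h)
        g = self.fc1(h)
        return self.fc_adv(g), self.fc_val(g), h

net = DuelingAgentNet(input_dim=45, n_actions=10)
adv, val, h1 = net(torch.randn(1, 45), torch.zeros(1, 64))
print(adv.shape, val.shape)  # torch.Size([1, 10]) torch.Size([1, 1])
```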
establishing an advantage hybrid network model in the second step; the method specifically comprises the following steps:
the advantage hybrid network model consists of a parameter network structure and a super network structure;
the parameter network structure comprises a fourth linear network layer and a fifth linear network layer, the fourth linear network layer and the fifth linear network layer are connected in parallel, and the input of the fourth linear network layer and the input of the fifth linear network layer are global observed values;
the fourth linear network layer processes the input global observation value to obtain an output value which is used as a parameter matrix of a first layer network in the super network structure;
the fifth linear network layer processes the input global observation value to obtain an output value which is used as a parameter matrix of a second layer network in the super network structure;
the super network structure receives the values of the individual advantage value functions; the absolute value of the parameter matrix of the first layer of the super network structure is taken and matrix-multiplied with the received values of the individual advantage value functions to obtain the output of the first layer; the second layer receives the output of the first layer, the absolute value of the parameter matrix of the second layer of the super network structure is taken, and the parameter matrix of the second layer is then matrix-multiplied with the output of the first layer to obtain the output of the advantage hybrid network;
the global observation value is the ship loss degree of both parties in the whole battlefield;
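A minimal sketch of the advantage hybrid network described above, assuming an illustrative embedding size; the two hypernetwork linear layers generate the parameter matrices from the global observation value, their absolute values are taken, and the individual advantage values are mixed by two matrix multiplications.

```python
# Sketch of the advantage (monotone) hybrid network from step two; hidden sizes
# are illustrative assumptions.
import torch
import torch.nn as nn

class AdvantageMixer(nn.Module):
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # fourth / fifth linear layers: hypernetworks conditioned on the global observation
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim * 1)

    def forward(self, agent_advs, state):
        # agent_advs: (batch, n_agents) chosen-action advantages; state: (batch, state_dim)
        b = agent_advs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        hidden = torch.bmm(agent_advs.view(b, 1, self.n_agents), w1)  # first super-network layer
        a_tot = torch.bmm(hidden, w2)                                 # second super-network layer
        return a_tot.view(b, 1)

mixer = AdvantageMixer(n_agents=5, state_dim=48)
a_tot = mixer(torch.randn(4, 5), torch.randn(4, 48))
print(a_tot.shape)  # torch.Size([4, 1])
```

Taking the absolute value of the hypernetwork weights keeps the mixing monotone in each individual advantage value, which is the property the monotone hypernetwork relies on.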
establishing a state value hybrid network model in the third step; the method specifically comprises the following steps:
the state value hybrid network model is composed of an attention network model, and the attention network model is composed of two serialization models;
the first serialization model comprises a sixth linear network layer, a Dropout layer, a LeakyReLU activation layer and a seventh linear network layer;
the sixth linear network layer receives the global observation value, processes it and outputs the result; the Dropout layer receives the output value of the sixth linear network layer as input, randomly discards 10% of the input neurons and passes the result to the LeakyReLU layer for activation; the seventh linear network layer receives the output of the LeakyReLU layer as input and processes it to obtain the output of the first serialization model;
the second serialization model comprises an eighth linear network layer, the eighth linear network layer processes the input individual local observation value to obtain output, and the output is used as the output of the second serialization model;
then the outputs of the two serialization models are combined by a scaled dot product, namely the output matrices of the two serialization models are matrix-multiplied and the result is divided by a scaling factor;
a Softmax operation is performed on the scaled dot product to convert it into percentage weights, and finally a weighted sum of the obtained weights and the individual state value functions is computed.
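A minimal sketch of the state value hybrid network described above, assuming illustrative embedding sizes; the first serialization model embeds the global observation value, the second embeds each local observation value, and the scaled dot product followed by Softmax yields the weights used to mix the individual state values.

```python
# Sketch of the attention-based state-value hybrid network from step three;
# embedding sizes are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateValueMixer(nn.Module):
    def __init__(self, state_dim, obs_dim, d_k=32):
        super().__init__()
        self.d_k = d_k
        # first serialization model: query from the global observation value
        self.q_net = nn.Sequential(
            nn.Linear(state_dim, d_k), nn.Dropout(0.1), nn.LeakyReLU(), nn.Linear(d_k, d_k)
        )
        # second serialization model: key from each local observation value
        self.k_net = nn.Linear(obs_dim, d_k)

    def forward(self, state, agent_obs, agent_vals):
        # state: (batch, state_dim); agent_obs: (batch, n_agents, obs_dim);
        # agent_vals: (batch, n_agents) individual state values V_i
        q = self.q_net(state).unsqueeze(1)                 # (batch, 1, d_k), shared by all agents
        k = self.k_net(agent_obs)                          # (batch, n_agents, d_k), one K per agent
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(self.d_k)
        w = F.softmax(scores, dim=-1).squeeze(1)           # attention weights w_i
        return (w * agent_vals).sum(dim=-1, keepdim=True)  # V_tot(tau)

mixer = StateValueMixer(state_dim=48, obs_dim=30)
v_tot = mixer(torch.randn(4, 48), torch.randn(4, 5, 30), torch.randn(4, 5))
print(v_tot.shape)  # torch.Size([4, 1])
```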
2. The ship control method based on the competitive hybrid network as set forth in claim 1, wherein: in the fourth step, the individual observation history is input into the individual intelligent agent network model to obtain the individual advantage value function A_i(τ_i, a_i) and the individual state value function V_i(τ_i); the specific process is as follows:
all agents share one agent network, and the local observation value and the action of the previous moment are used as the input of the individual intelligent agent network model at time t; a role code is added to the input to distinguish the roles of different agents, and the network finally outputs the individual advantage value function A_i(τ_i, a_i) and the individual state value function V_i(τ_i);
The intelligent agent is a ship.
3. The ship control method based on the competitive hybrid network as set forth in claim 2, wherein: in the fourth step, the individual advantage value function A_i(τ_i, a_i) is transmitted to the advantage hybrid network model, and the advantage hybrid network model outputs the joint advantage function value A_tot(τ, a); the specific process is as follows:
the formula is as follows:
A_tot(τ, a) = Monotone_hypernetworks(A_1(τ_1, a_1), …, A_i(τ_i, a_i), …, A_N(τ_N, a_N))
in the formula, Monotone_hypernetworks is the advantage hybrid network; τ is the joint action observation history of all agents, and a is the joint action of all agents.
4. The ship control method based on the competitive hybrid network as set forth in claim 3, wherein: in the fourth step, the individual state value function V_i(τ_i) is transmitted to the state value hybrid network model, and the state value hybrid network model outputs the joint state value mixing value V_tot(τ); the specific process is as follows:
calculating the attention weight of the state value function:
w_i = softmax( (W_s e(s)) (W_i e(o_i))^T / √d_k )
where s is the global observation and o_i is the local observation; e(s) and e(o_i) are the embedded vectors of the global and local observations, respectively; W_s and W_i are conversion matrices that transform e(s) and e(o_i) into the Q and K vectors: W_s is matrix-multiplied with e(s) to obtain Q, and W_i is matrix-multiplied with e(o_i) to obtain K; the Q vector is a global vector shared by all agents, and each agent has its own K vector; d_k is the output dimension of the linear model in the second serialization model of the state value hybrid network structure; softmax(·) converts the raw weights into normalized decimal weights; T denotes transposition;
the attention weights w_i of the state value function and the individual state value functions V_i(τ_i) are weighted and summed to obtain the total joint state value function:
V_tot(τ) = Σ_i w_i V_i(τ_i)
where w_i is the attention weight and V_i(τ_i) is the individual agent state value function.
5. The ship control method based on the competitive hybrid network as claimed in claim 4, wherein: in the fourth step, the joint advantage function value A_tot(τ, a) and the joint state value mixing value V_tot(τ) are added to obtain the joint action value function Q_tot(τ, a); the specific process is as follows:
Q_tot(τ, a) = A_tot(τ, a) + V_tot(τ) = Monotone_hypernetworks(A_1(τ_1, a_1), …, A_N(τ_N, a_N)) + Σ_i w_i V_i(τ_i)
in the formula, Monotone_hypernetworks is the advantage hybrid network;
when the individual intelligent agent network model, the advantage hybrid network model and the state value hybrid network model are trained, the following loss function is minimized:

L(θ) = Σ_{i=1}^{b} ( y_i^tot − Q_tot(τ, u; θ) )²

where b is the number of samples used for training and y^tot is an intermediate variable obtained from the target network as

y^tot = r + γ max_{u'} Q_tot(τ', u'; θ⁻)

θ⁻ and θ are the parameters of the target network and the value network, respectively; r is the team reward value shared by all agents, and γ is the discount coefficient of the cumulative reward; τ' is the joint action observation history of all agents, u' is the joint action used by the target network, and u is the joint action used by the value network;
the target network comprises an individual intelligent agent network, an advantage hybrid network and a state value hybrid network;
the value network comprises an individual agent network, an advantage hybrid network and a state value hybrid network;
and the optimized team reward value r shared by all the agents is obtained by minimizing the loss function.
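A minimal sketch of the loss computation described above; the Q_tot tensors stand in for the full forward pass through the three networks, and the terminal-state masking, batch shapes and enumeration of joint actions are assumptions added for illustration.

```python
# Sketch of the TD loss in the claim (assumed shapes). q_tot_chosen is
# Q_tot(tau, u; theta) from the value network for the executed joint action;
# target_q_tot_next enumerates Q_tot(tau', u'; theta-) from the target network
# (in practice the maximisation over u' is usually decomposed per agent).
import torch

def td_loss(q_tot_chosen, target_q_tot_next, rewards, terminated, gamma=0.99):
    """q_tot_chosen: (b, 1); target_q_tot_next: (b, n_joint_actions);
    rewards, terminated: (b, 1)."""
    with torch.no_grad():
        max_next = target_q_tot_next.max(dim=-1, keepdim=True).values
        # terminal-state masking is an added convention, not stated in the claim
        y_tot = rewards + gamma * (1.0 - terminated) * max_next
    # summed over the b samples as written in the claim (a mean is also common)
    return ((y_tot - q_tot_chosen) ** 2).sum()

loss = td_loss(torch.randn(32, 1), torch.randn(32, 8),
               torch.rand(32, 1), torch.zeros(32, 1))
print(float(loss))
```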
CN202011309350.8A 2020-11-20 2020-11-20 Warship control method based on competitive hybrid network Active CN112417760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011309350.8A CN112417760B (en) 2020-11-20 2020-11-20 Warship control method based on competitive hybrid network


Publications (2)

Publication Number Publication Date
CN112417760A CN112417760A (en) 2021-02-26
CN112417760B true CN112417760B (en) 2023-01-17

Family

ID=74774310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011309350.8A Active CN112417760B (en) 2020-11-20 2020-11-20 Warship control method based on competitive hybrid network

Country Status (1)

Country Link
CN (1) CN112417760B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486917B (en) * 2021-05-17 2023-06-02 西安电子科技大学 Radar HRRP small sample target recognition method based on metric learning
CN113156979B (en) * 2021-05-27 2022-09-06 浙江农林大学 Forest guard patrol path planning method and device based on improved MADDPG algorithm
CN113313267B (en) * 2021-06-28 2023-12-08 浙江大学 Multi-agent reinforcement learning method based on value decomposition and attention mechanism
CN116074092B (en) * 2023-02-07 2024-02-20 电子科技大学 Attack scene reconstruction system based on heterogeneous graph attention network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101247197A (en) * 2008-02-29 2008-08-20 华中科技大学 Wireless multicast proxy method in hybrid network living broadcast system
CN102546323A (en) * 2010-12-14 2012-07-04 中国科学院声学研究所 Peer-to-peer network based on underwater sound and radio mixed channel
CN104729509A (en) * 2015-03-24 2015-06-24 张韧 Route planning method based on non-dominated sorting genetic algorithm II
CN111553759A (en) * 2020-03-25 2020-08-18 平安科技(深圳)有限公司 Product information pushing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN112417760A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112417760B (en) Warship control method based on competitive hybrid network
CN108629422B (en) Intelligent learning method based on knowledge guidance-tactical perception
CN112132263B (en) Multi-agent autonomous navigation method based on reinforcement learning
CN110084375B (en) Multi-agent collaboration framework based on deep reinforcement learning
Ma et al. Multi-robot target encirclement control with collision avoidance via deep reinforcement learning
CN109740741B (en) Reinforced learning method combined with knowledge transfer and learning method applied to autonomous skills of unmanned vehicles
CN109966743A (en) A kind of prediction technique, model generating method and the device of game winning rate
CN116187787B (en) Intelligent planning method for cross-domain allocation problem of combat resources
Shao et al. Cooperative reinforcement learning for multiple units combat in StarCraft
CN111723931B (en) Multi-agent confrontation action prediction method and device
CN116430888A (en) Multi-unmanned aerial vehicle air combat strategy generation method, device and computer equipment
Ge et al. Enhancing cooperation by cognition differences and consistent representation in multi-agent reinforcement learning
KR100850914B1 (en) method for controlling game character
CN116167415A (en) Policy decision method in multi-agent cooperation and antagonism
Dong et al. Accelerating wargaming reinforcement learning by dynamic multi-demonstrator ensemble
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Madeira et al. Designing a reinforcement learning-based adaptive AI for large-scale strategy games
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
Liu et al. Soft-actor-attention-critic based on unknown agent action prediction for multi-agent collaborative confrontation
Fang et al. Multiple agents cooperative control based on QMIX algorithm in SC2LE environment
Yu et al. Inducing cooperation via team regret minimization based multi-agent deep reinforcement learning
Zhao et al. Convolutional fitted Q iteration for vision-based control problems
Chen et al. Research on turn-based war chess game based on reinforcement learning
CN117575016A (en) Multi-agent system training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant