CN113222106A - Intelligent military chess deduction method based on distributed reinforcement learning - Google Patents

Intelligent military chess deduction method based on distributed reinforcement learning

Info

Publication number
CN113222106A
Authority
CN
China
Prior art keywords
operator
state
variable
network
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110185566.6A
Other languages
Chinese (zh)
Other versions
CN113222106B (en
Inventor
彭星光
李亚男
宋保维
潘光
张福斌
高剑
李乐
张立川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202110185566.6A priority Critical patent/CN113222106B/en
Publication of CN113222106A publication Critical patent/CN113222106A/en
Application granted granted Critical
Publication of CN113222106B publication Critical patent/CN113222106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/80Special adaptations for executing a specific game genre or game mode
    • A63F13/822Strategy games; Role-playing games
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60Methods for processing data by generating or executing the game program
    • A63F2300/6027Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/80Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game specially adapted for executing a specific type of game
    • A63F2300/807Role playing or strategy games
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an intelligent wargame deduction method based on distributed reinforcement learning. The method first determines the state variables and action variables of the wargame operator decision network; it then determines a Markov decision process from the state input and output variables and constructs an experience pool for reinforcement-learning training of the neural networks. An Actor-Critic neural network is established for each operator, and the parameters of each operator's network are trained step by step, taking minimization of an evaluation function as the training target and combining the information in the experience pool. Finally, the trained Actor networks are connected to the wargame deduction system, and each operator makes the corresponding decision according to the battlefield situation. The method establishes a decision neural network for each operator, and in the battlefield situation description, in addition to conventional labels, image data are used to represent environment states and individual attributes that are difficult to quantify, so that effective decision results can be obtained more accurately for a variety of battlefield situations.

Description

Intelligent military chess deduction method based on distributed reinforcement learning
Technical Field
The invention relates to the technical field of intelligent wargame deduction, and in particular to an intelligent wargame deduction method based on distributed reinforcement learning.
Background
Wargame deduction is a military-science tool that simulates the parties to a military confrontation: using a board that represents the battlefield and pieces that represent the forces on it, the course of a war is logically deduced, studied, and evaluated according to rules distilled from combat experience; it is a turn-based game problem. Moving war onto the sand table and into the computer builds a virtual battlefield, and enabling the military to achieve greater victory in future wars through simulation as close to actual combat as possible is the significance of "wargame deduction". A wargame usually consists of three parts: the map (board), the deduction pieces (operators), and the adjudication rules (deduction rules). Modern wargaming increasingly uses electronic wargame systems hosted on computers.
As shown in fig. 1, a general wargame deduction uses a square- or hexagonal-grid map, and the terrain, landform, or elevation of each cell is marked with different symbols or colors. A wargame is played by at least two sides, each equipped with pieces of approximately the same overall capability but differing in mobility, firing mode, and so on. A wargame generally involves roughly two kinds of tasks: destroying enemy units and seizing ground. On a large wargame map, valuable cells are sparsely distributed: of thousands of cells, usually only a few dozen have important reference significance for strategy formulation at the current moment, and only the cells around each piece's feasible range of movement matter.
Wargame deduction is an effective tool for understanding and mastering future war. Using wargames to deduce future combat actions in a virtual combat environment helps to seek advantage and avoid harm, and turns various operational concepts into concrete action plans. In the future, the concept of multi-domain operations will be further developed and refined; wargame deduction will continue to play an important role in that development, and joint deduction will improve cooperation among the different services.
With the transformation of the new military revolution from "informatization" to "intelligentization", "artificial intelligence + wargaming" will see wider application in the military field. It is therefore necessary to study wargame deduction strategies using artificial-intelligence techniques.
Researchers at the National Defense University have proposed a reinforcement-learning-based framework for wargame deduction decision-making that adopts a hierarchical reinforcement-learning scheme and describes the battlefield situation with hand-designed labels and vectorization. In practice, however, some battlefield situations cannot be accurately quantified, so the hierarchical reinforcement-learning scheme cannot produce effective decisions for all kinds of battlefield situations. In addition, existing reinforcement-learning-based wargame deduction decision methods use a single unified reinforcement-learning model, whose network scale is large and which requires high-performance computing support.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an intelligent wargame deduction method based on distributed reinforcement learning. A decision neural network is established for each operator, and in the battlefield situation description, in addition to conventional labels, image data are used to represent environment states and individual attributes that are difficult to quantify, so that effective decision results can be obtained more accurately for a variety of battlefield situations.
The technical scheme of the invention is as follows:
the intelligent war game deduction method based on distributed reinforcement learning comprises the following steps:
step 1: determining state variables and action variables of a chess operator decision network;
for operator i, the state variable S_i comprises the state information of the own-side operators, the enemy operator information, the position information of the control points, and the line-of-sight state of the current position of operator i; the action variable A_i is the action to be taken by operator i at the current moment;
step 2: in the algorithm training stage, the score of each operator action, obtained by interacting with the wargame deduction platform during the deduction, is used as the reward function R;
step 3: determining the Markov decision process <S, A, R, γ> from the state input and output variables, wherein S is the state-variable input of the operator, A is the action-variable output of the operator, R is the reward function, and γ is the discount factor; the experience pool for reinforcement learning is constructed as

{ (s^i_t, a^i_t, r^i_t, s^i_{t+1}) }

wherein s^i_t is the state-variable input of operator i at time t, a^i_t is the action variable output by operator i at time t, r^i_t is the reward value obtained by operator i through the reward function at time t, and s^i_{t+1} is the updated state variable after operator i takes action a^i_t in state s^i_t; in the algorithm training stage, the information generated by each operator's interaction with the platform is stored in that operator's experience pool and used for neural network training;
step 4: for each operator, establishing a neural network using the Actor-Critic algorithm, wherein the input of the Actor network is the operator's state-variable observation and its output is the operator's action variable; the input of the Critic network is the operator's state-variable observation together with the operator's action variable, and its output is an evaluation function, namely the difference between the operator's real reward value and the estimated reward value;
training the parameters of each operator's neural network step by step, taking minimization of the evaluation function as the training target and combining the information in the experience pool, until the network converges;
and then connecting the trained, converged Actor networks to the wargame deduction system, where each operator makes the corresponding decision according to the battlefield situation.
Further, the line-of-sight state of the current position of operator i is represented by a line-of-sight view, which is an image formed by the range visible from the operator's current position; the line-of-sight view is processed by a convolutional neural network to obtain the line-of-sight state of the current position of operator i.
Further, the state information of the own-side operators comprises position, mobility, and remaining hit points; the enemy operator information comprises position, operator type, and remaining hit points.
Further, the action variable is a 4-dimensional vector formed by a maneuver, a maneuver position, an attack and an attack target.
Further, the reward function R comprises a control-point capture score R_con, an enemy-destruction score R_des, and a remaining-force score R_rem:
R = R_con + R_des + R_rem
Further, the Actor network is a three-layer fully-connected neural network: the number of neurons in the first layer is determined by the dimension of the input operator state variable, the second layer comprises 256 neurons, and the number of neurons in the third layer is determined by the dimension of the operator action variable.
Further, the Critic network is a three-layer fully-connected neural network: the number of neurons in the first layer is determined by the dimension of the state variable input to the Actor network plus the dimension of the action variable output by the Actor network, the second layer comprises 128 neurons, and the third layer has 1 neuron.
Further, the parameters of each operator's neural network are trained step by step using gradient descent and backpropagation.
Advantageous effects
Because a distributed reinforcement learning method is adopted and each operator corresponds to its own decision network, the network scale is small, the search space is small, and the model is easy to migrate.
By also adding image information to the operator's state variable, the invention can describe complex battlefield situations that cannot be simply quantified, so that effective decision results can be obtained more accurately for a variety of battlefield situations.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1: a military chess map;
FIG. 2: schematic of the line-of-sight view;
FIG. 3: an Actor-Critic algorithm operation flow chart;
FIG. 4: a decision flow chart of a reinforcement learning network in a war game deduction environment.
Detailed Description
Wargame deduction is a typical incomplete-information game problem. At present, different confrontation strategies are explored mainly through human-versus-human play, which has clear limitations. Using an artificial-intelligence method based on distributed reinforcement learning, the invention designs an intelligent wargame deduction scheme that requires little human intervention, thereby realizing intelligent wargame confrontation.
The method comprises the following specific steps:
step 1: and determining the state variable and the action variable of the chess operator decision network.
For operator i, the state variable S_i mainly comprises two parts. One part is obtained by interacting with the wargame deduction and confrontation platform and mainly consists of a vector formed from the state information (position, mobility, remaining hit points) of the own-side operators (including operator i itself), the enemy operator information (position, operator type, remaining hit points), the position information of the control points, and so on. The other part is the line-of-sight state of the operator's current position, represented by a line-of-sight view: because the terrain of each cell of the wargame map differs, whether operators in different cells can observe each other is called the line-of-sight relation, and the image formed by the range visible from each cell is the line-of-sight view. As shown in fig. 2, the red cell is the operator's current position and the blue area is the line-of-sight range of the operator at that position. The line-of-sight view is processed by a convolutional neural network to obtain the line-of-sight state of the operator's current position. These two parts together form the state variable S_i of operator i. The state variable is the input of the policy network, and the output action variable A_i is the action that operator i should take at the current moment, mainly including maneuvering, shooting, and so on.
In this embodiment, there are 3 own-side operators, 3 enemy operators, and 1 observed enemy operator. For own-side operator i, the state variable S_i is therefore a 42-dimensional vector composed of the state information (position, mobility, remaining hit points) of the own-side operators (including operator i itself), the enemy operator information (position, operator type, remaining hit points), the position information of the control point, and the line-of-sight information (a 24-dimensional vector obtained by dimensionality reduction through the convolutional neural network); the states of unobserved enemy operators are all set to 0. This vector is used as the input of the decision network. The action variable A_i output by the decision network is the action that operator i should take at the current moment: a 4-dimensional vector formed by the maneuver, the maneuver position, the attack, and the attack target.
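For illustration only, a minimal sketch of how such a 42-dimensional state could be assembled is given below, assuming a small convolutional encoder that reduces the line-of-sight image to the 24-dimensional feature and an 18-dimensional hand-crafted scalar part (42 − 24); the class names, network shapes, and helper functions are assumptions, not the patent's implementation.

```python
# Illustrative sketch only (not the patent's code): assemble the 42-dim state of operator i.
# Assumption: 18 scalar situation features + 24-dim line-of-sight feature = 42.
import torch
import torch.nn as nn

class LineOfSightEncoder(nn.Module):
    """Small CNN that reduces a single-channel line-of-sight image to a 24-dim feature."""
    def __init__(self, feat_dim: int = 24):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, los_image: torch.Tensor) -> torch.Tensor:
        # los_image: (batch, 1, H, W) mask of the range visible from the operator's cell
        x = self.conv(los_image).flatten(1)          # (batch, 16)
        return self.fc(x)                            # (batch, 24)

def build_state(scalar_features: torch.Tensor, los_image: torch.Tensor,
                encoder: LineOfSightEncoder) -> torch.Tensor:
    """Concatenate own/enemy/control-point scalar features (unobserved enemies zeroed)
    with the CNN line-of-sight feature to form the decision-network input."""
    return torch.cat([scalar_features, encoder(los_image)], dim=-1)  # (batch, 18 + 24 = 42)
```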
Step 2: in the algorithm training stage, the score of each action taken by an operator is obtained by interacting with the wargame deduction platform during the deduction and used as the reward function R, which mainly comprises a control-point capture score R_con, an enemy-destruction score R_des, and a remaining-force score R_rem:
R = R_con + R_des + R_rem
The higher the value of R, the better the side's performance in the wargame.
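As a trivial illustration of the reward defined in step 2 (the three score terms are assumed to come from the deduction platform's adjudication; the function and argument names are hypothetical placeholders):

```python
# Illustrative only: the three score terms are assumed to be returned by the
# deduction platform's adjudication; names are hypothetical placeholders.
def compute_reward(control_score: float, destruction_score: float,
                   remaining_force_score: float) -> float:
    """R = R_con + R_des + R_rem, as defined in step 2."""
    return control_score + destruction_score + remaining_force_score
```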
And step 3: determining a Markov decision process according to the state input variables and the state output variables, wherein the Markov decision process is expressed as follows:
<S,A,R,γ>
wherein S is the state-variable input of the operator from step 1, A is the action-variable output of the operator from step 1, R is the reward function from step 2, and γ is the discount factor, with γ ∈ [0, 1].
Based on this, the experience pool for reinforcement learning is constructed as

{ (s^i_t, a^i_t, r^i_t, s^i_{t+1}) }

wherein s^i_t is the state-variable input of operator i at time t, a^i_t is the action variable output by operator i at time t, r^i_t is the reward value obtained by operator i through the reward function at time t, and s^i_{t+1} is the updated state variable after operator i takes action a^i_t in state s^i_t.
In the algorithm training stage, the information generated by each operator's interaction with the platform is stored in that operator's experience pool and used for neural network training.
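A minimal sketch of such a per-operator experience pool is given below; the capacity and uniform random sampling are assumptions, since the patent does not specify them.

```python
# Minimal per-operator experience pool sketch; capacity and uniform random
# sampling are assumptions not specified in the patent.
import random
from collections import deque
from typing import Deque, List, Tuple

Transition = Tuple[list, list, float, list]  # (s_t, a_t, r_t, s_{t+1})

class ExperiencePool:
    def __init__(self, capacity: int = 100_000):
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state) -> None:
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int) -> List[Transition]:
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

# One pool per operator, matching the distributed design (3 own-side operators here).
pools = {operator_id: ExperiencePool() for operator_id in range(3)}
```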
Step 4: a distributed learning method is adopted, so each operator has its own neural network, implemented with the Actor-Critic algorithm commonly used in reinforcement learning. The method mainly comprises two neural networks: an Actor network, whose input is the operator's state-variable observation and whose output is the operator's action; and a Critic network, whose input is the operator's state-variable observation together with the operator's action variables and whose output is an evaluation function. The main flow is shown in FIG. 3.
The input of each operator's Actor network is the operator's state variable determined in step 1 (its initial value is the situation information given by the operator's initial position and initial state), and its output is the operator's action variable determined in step 1. The Actor network is a three-layer fully-connected neural network: the number of neurons in the first layer is determined by the dimension of the input state variable (42 in this embodiment), the second layer comprises 256 neurons, and the number of neurons in the third layer is determined by the dimension of the operator's action variable (4 in this embodiment).
The input of each operator's Critic network is defined as the input of the Actor network together with the output of the Actor network, and the evaluation function output by the Critic network is the difference between the real reward value computed by the operator through the reward function in step 2 and the reward value estimated by the Critic network. The Critic network is a three-layer fully-connected neural network: the number of neurons in the first layer is determined by the dimension of the state variable input to the Actor network plus the dimension of the action variable output by the Actor network (46 in this embodiment), the second layer comprises 128 neurons, and the third layer has 1 neuron.
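A sketch of the two networks with the layer widths stated in this embodiment (42-dimensional state, 4-dimensional action, hidden layers of 256 and 128, a 46-dimensional Critic input) is shown below; the activation functions and unsquashed outputs are assumptions, since the patent only fixes the layer sizes.

```python
# Sketch of the per-operator Actor and Critic with the layer widths of this embodiment;
# ReLU activations and raw (unsquashed) outputs are assumptions.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int = 42, action_dim: int = 4, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # layer 1 (42) -> layer 2 (256)
            nn.Linear(hidden, action_dim),             # layer 2 (256) -> layer 3 (4)
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Raw preferences for (maneuver, maneuver position, attack, attack target).
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim: int = 42, action_dim: int = 4, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),  # layer 1 (46) -> layer 2 (128)
            nn.Linear(hidden, 1),                                  # layer 2 (128) -> layer 3 (1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))  # scalar evaluation
```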
Taking minimization of the evaluation function as the training target, the parameters of each operator's neural network are trained step by step with the information in the experience pool, using gradient descent and backpropagation, until the network converges.
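One possible training step, written in the deterministic (DDPG-style) Actor-Critic form, is sketched below; the TD target, the absence of target networks, and the handling of the discrete action components are assumptions, since the patent only states that the evaluation function is minimized by gradient descent and backpropagation.

```python
# Hedged sketch of one gradient-descent/backpropagation update per operator;
# the TD target and loss forms are standard Actor-Critic assumptions.
import torch
import torch.nn.functional as F

def train_step(actor, critic, actor_opt, critic_opt, batch, gamma: float = 0.99):
    states, actions, rewards, next_states = batch  # tensors sampled from this operator's pool

    # Critic: minimize the squared TD error (the "evaluation function" of step 4).
    with torch.no_grad():
        td_target = rewards + gamma * critic(next_states, actor(next_states)).squeeze(-1)
    critic_loss = F.mse_loss(critic(states, actions).squeeze(-1), td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: move the policy in the direction the critic evaluates more highly.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```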
The trained, converged Actor networks are then connected to the wargame deduction system, and each operator makes the corresponding decision according to the battlefield situation.
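Deployment can then be as simple as the following sketch, in which each operator's trained Actor maps its observed state to an action at every decision step; the dictionary-based interface is a hypothetical placeholder for the deduction platform's API.

```python
# Deployment sketch: each operator's trained Actor maps its observed state to an action.
# The dictionary-based interface to the deduction platform is a hypothetical placeholder.
import torch

def decide_actions(actors: dict, observations: dict) -> dict:
    """actors: {operator_id: trained Actor}; observations: {operator_id: 42-dim state tensor}."""
    actions = {}
    for op_id, actor in actors.items():
        with torch.no_grad():
            actions[op_id] = actor(observations[op_id].unsqueeze(0)).squeeze(0)
    return actions
```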
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (8)

1. An intelligent war game deduction method based on distributed reinforcement learning is characterized in that: the method comprises the following steps:
step 1: determining state variables and action variables of a chess operator decision network;
for operator i, the state variable S_i comprises the state information of the own-side operators, the enemy operator information, the position information of the control points, and the line-of-sight state of the current position of operator i; the action variable A_i is the action to be taken by operator i at the current moment;
step 2: in the algorithm training stage, the score of each operator action, obtained by interacting with the wargame deduction platform during the deduction, is used as the reward function R;
step 3: determining the Markov decision process <S, A, R, γ> from the state input and output variables, wherein S is the state-variable input of the operator, A is the action-variable output of the operator, R is the reward function, and γ is the discount factor; the experience pool for reinforcement learning is constructed as

{ (s^i_t, a^i_t, r^i_t, s^i_{t+1}) }

wherein s^i_t is the state-variable input of operator i at time t, a^i_t is the action variable output by operator i at time t, r^i_t is the reward value obtained by operator i through the reward function at time t, and s^i_{t+1} is the updated state variable after operator i takes action a^i_t in state s^i_t; in the algorithm training stage, the information generated by each operator's interaction with the platform is stored in that operator's experience pool and used for neural network training;
step 4: for each operator, establishing a neural network using the Actor-Critic algorithm, wherein the input of the Actor network is the operator's state-variable observation and its output is the operator's action variable; the input of the Critic network is the operator's state-variable observation together with the operator's action variable, and its output is an evaluation function, namely the difference between the operator's real reward value and the estimated reward value;
training the parameters of each operator's neural network step by step, taking minimization of the evaluation function as the training target and combining the information in the experience pool, until the network converges;
and then connecting the trained, converged Actor networks to the war game deduction system, where each operator makes the corresponding decision according to the battlefield situation.
2. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the line-of-sight state of the current position of operator i is represented by a line-of-sight view, which is an image formed by the range visible from the current position of operator i; and the line-of-sight view is processed by a convolutional neural network to obtain the line-of-sight state of the current position of operator i.
3. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the state information of the own-side operators comprises position, mobility, and remaining hit points; the enemy operator information comprises position, operator type, and remaining hit points.
4. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the action variable is a 4-dimensional vector formed by the maneuver, the maneuver position, the attack, and the attack target.
5. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the reward function R comprises a control-point capture score R_con, an enemy-destruction score R_des, and a remaining-force score R_rem:
R = R_con + R_des + R_rem
6. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the Actor network is a three-layer fully-connected neural network, wherein the number of neurons in the first layer is determined by the dimension of the input operator state variable, the second layer comprises 256 neurons, and the number of neurons in the third layer is determined by the dimension of the operator action variable.
7. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the Critic network is a three-layer fully-connected neural network, wherein the number of neurons in the first layer is determined by the dimension of the state variable input to the Actor network plus the dimension of the action variable output by the Actor network, the second layer comprises 128 neurons, and the third layer has 1 neuron.
8. The intelligent war game deduction method based on distributed reinforcement learning as claimed in claim 1, characterized in that: the parameters of each operator's neural network are trained step by step using gradient descent and backpropagation.
CN202110185566.6A 2021-02-10 2021-02-10 Intelligent soldier chess deduction method based on distributed reinforcement learning Active CN113222106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110185566.6A CN113222106B (en) 2021-02-10 2021-02-10 Intelligent soldier chess deduction method based on distributed reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110185566.6A CN113222106B (en) 2021-02-10 2021-02-10 Intelligent soldier chess deduction method based on distributed reinforcement learning

Publications (2)

Publication Number Publication Date
CN113222106A true CN113222106A (en) 2021-08-06
CN113222106B CN113222106B (en) 2024-04-30

Family

ID=77084912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110185566.6A Active CN113222106B (en) 2021-02-10 2021-02-10 Intelligent soldier chess deduction method based on distributed reinforcement learning

Country Status (1)

Country Link
CN (1) CN113222106B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN114091358A (en) * 2022-01-20 2022-02-25 浙江建木智能系统有限公司 Estimation method, device and system based on deduction simulation target area situation
CN114611669A (en) * 2022-03-14 2022-06-10 三峡大学 Intelligent strategy-chess-deducing decision-making method based on double experience pool DDPG network
CN114662655A (en) * 2022-02-28 2022-06-24 南京邮电大学 Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN114722998A (en) * 2022-03-09 2022-07-08 三峡大学 Method for constructing chess deduction intelligent body based on CNN-PPO
CN114880955A (en) * 2022-07-05 2022-08-09 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647374A (en) * 2018-03-22 2018-10-12 中国科学院自动化研究所 Tank tactics Behavior modeling method and system and equipment in ground force's tactics war game game
CN112131786A (en) * 2020-09-14 2020-12-25 中国人民解放军军事科学院评估论证研究中心 Target detection and distribution method and device based on multi-agent reinforcement learning
US20200410351A1 (en) * 2015-07-24 2020-12-31 Deepmind Technologies Limited Continuous control with deep reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200410351A1 (en) * 2015-07-24 2020-12-31 Deepmind Technologies Limited Continuous control with deep reinforcement learning
CN108647374A (en) * 2018-03-22 2018-10-12 中国科学院自动化研究所 Tank tactics Behavior modeling method and system and equipment in ground force's tactics war game game
CN112131786A (en) * 2020-09-14 2020-12-25 中国人民解放军军事科学院评估论证研究中心 Target detection and distribution method and device based on multi-agent reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Luntong Li et al., "Actor-Critic Learning Control Based on ℓ2-Regularized Temporal-Difference Prediction With Gradient Correction", IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 12 *
Li Chen et al., "Multi-agent decision-making method under the Actor-Critic framework and its application to wargames", Systems Engineering and Electronics, pages 1-4 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction
CN114091358A (en) * 2022-01-20 2022-02-25 浙江建木智能系统有限公司 Estimation method, device and system based on deduction simulation target area situation
CN114662655A (en) * 2022-02-28 2022-06-24 南京邮电大学 Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN114662655B (en) * 2022-02-28 2024-07-16 南京邮电大学 Attention mechanism-based method and device for deriving AI layering decision by soldier chess
CN114722998A (en) * 2022-03-09 2022-07-08 三峡大学 Method for constructing chess deduction intelligent body based on CNN-PPO
CN114722998B (en) * 2022-03-09 2024-02-02 三峡大学 Construction method of soldier chess deduction intelligent body based on CNN-PPO
CN114611669A (en) * 2022-03-14 2022-06-10 三峡大学 Intelligent strategy-chess-deducing decision-making method based on double experience pool DDPG network
CN114611669B (en) * 2022-03-14 2023-10-13 三峡大学 Intelligent decision-making method for chess deduction based on double experience pool DDPG network
CN114880955A (en) * 2022-07-05 2022-08-09 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning
CN114880955B (en) * 2022-07-05 2022-09-20 中国人民解放军国防科技大学 War and chess multi-entity asynchronous collaborative decision-making method and device based on reinforcement learning

Also Published As

Publication number Publication date
CN113222106B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN113222106A (en) Intelligent military chess deduction method based on distributed reinforcement learning
CN110991545B (en) Multi-agent confrontation oriented reinforcement learning training optimization method and device
CN113893539B (en) Cooperative fighting method and device for intelligent agent
CN115291625A (en) Multi-unmanned aerial vehicle air combat decision method based on multi-agent layered reinforcement learning
CN105678030B (en) Divide the air-combat tactics team emulation mode of shape based on expert system and tactics tactics
CN114460959A (en) Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN116596343A (en) Intelligent soldier chess deduction decision method based on deep reinforcement learning
CN112215350A (en) Smart agent control method and device based on reinforcement learning
Shao et al. Cooperative reinforcement learning for multiple units combat in StarCraft
CN111723931B (en) Multi-agent confrontation action prediction method and device
CN114722998B (en) Construction method of soldier chess deduction intelligent body based on CNN-PPO
CN113282100A (en) Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN111437605B (en) Method for determining virtual object behaviors and hosting virtual object behaviors
CN114662655B (en) Attention mechanism-based method and device for deriving AI layering decision by soldier chess
CN113705828B (en) Battlefield game strategy reinforcement learning training method based on cluster influence degree
CN114611664A (en) Multi-agent learning method, device and equipment
CN111723941B (en) Rule generation method and device, electronic equipment and storage medium
Zuo A deep reinforcement learning methods based on deterministic policy gradient for multi-agent cooperative competition
Bian et al. Cooperative strike target assignment algorithm based on deep reinforcement learning
CN118001744A (en) Intelligent decision-making method, device and storage medium for deduction of chess
CN116679742B (en) Multi-six-degree-of-freedom aircraft collaborative combat decision-making method
CN117151224A (en) Strategy evolution training method, device, equipment and medium for strong random game of soldiers
CN114254722B (en) Multi-intelligent-model fusion method for game confrontation
Pan et al. An algorithm to estimate enemy's location in WarGame based on pheromone
CN117647994A (en) Collaborative countermeasure method, device, equipment and storage medium for unmanned aerial vehicle cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant