CN114722998A - Method for constructing chess deduction intelligent body based on CNN-PPO - Google Patents
- Publication number
- CN114722998A CN114722998A CN202210232129.XA CN202210232129A CN114722998A CN 114722998 A CN114722998 A CN 114722998A CN 202210232129 A CN202210232129 A CN 202210232129A CN 114722998 A CN114722998 A CN 114722998A
- Authority
- CN
- China
- Prior art keywords
- network
- actor
- output
- ppo
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a method for constructing a wargame deduction agent based on CNN-PPO, comprising the following steps: acquiring initial situation data from a wargame deduction platform and preprocessing it to obtain target situation data; constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output; and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and iteratively training the model until the objective function is minimized and the network converges, thereby completing the construction of the CNN-PPO agent. The invention increases the agent's comprehension of the situation and, to a certain degree, its fighting strength.
Description
Technical Field
The invention belongs to the technical field of computers, and in particular relates to a method for constructing a wargame deduction agent based on CNN-PPO.
Background
Wargame deduction uses experience and rules summarized from combat practice to analyze the course of an engagement by simulation. With the rapid growth of computing power, new technologies have been applied to wargaming; computer wargame deduction has become a main branch of the field, and countries worldwide regard it as a means of improving military capability.
In concrete wargame deduction, the problem is generally simplified as follows: under the constraints of certain objective rules, achieve a given objective through force deployment, maneuver, attack and similar actions, for example seizing a control point or destroying enemy forces. The goal of constructing a wargame deduction agent is to obtain a commander that can autonomously make action decisions according to the current battlefield situation. Agents are classified as rule-based or learning-based according to whether they have learning ability. A rule-based agent is realized by hard-coded programming: many conditional branches specify which action the agent takes in which situation, a commonly used technique being the behavior tree. A learning-based agent, typified by machine-learning models, has autonomous learning ability; the model can update its network parameters during play, yielding increasingly strong models.
Existing agent construction methods are mainly rule-based models and neural network models. Because the state space in wargame deduction is enormous, rules drawn from expert experience can hardly cover all situations and can only classify states coarsely, so a rule-based agent makes rigid decisions and cannot respond flexibly to the unexpected. The main difficulties faced by neural network models are that the sparse rewards given by the environment make it hard to update network parameters effectively, and dimensionality explosion.
Disclosure of Invention
In order to solve the above problems, the present invention provides the following solution: a CNN-PPO-based method for constructing a wargame deduction agent, comprising the following steps:
acquiring initial situation data of a chess deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output;
and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative training until the objective function is minimized and the network converges, thereby realizing the construction of the CNN-PPO agent.
Preferably, preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprises redundant data, data with missing format, null values and error information.
Preferably, the overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network;
the convolutional neural network is used for mining potential relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network and the Critic network each use a three-layer fully connected neural network.
Preferably, before inputting into the hybrid neural network model for iterative training, the method further includes inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output; and splicing the Actor network's output with the convolutional neural network's output and inputting the result into the Critic network to obtain the Critic network's output.
Preferably, inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output comprises inputting the convolutional neural network's output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values to represent the distribution of actions, wherein μ is the mean of the normal distribution and σ is its variance; and sampling an action from the normal distribution, the action interacting with the environment to obtain the reward value given by the environment and the state at the next moment.
Preferably, splicing the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network's output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_ and calculating the discounted reward value; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error between the discounted reward values R and V_, and updating the Critic network using a back-propagation mechanism. Here V_ is the estimated value, namely the discounted reward obtained by taking action a in state S.
Preferably, inputting into the hybrid neural network model for iterative training includes performing N optimization steps on the network parameters using a mean-square-error loss function, and performing B optimization steps on the Actor network and the convolutional neural network, until the objective function is minimized and the network converges.
Preferably, performing N optimization steps on the network parameters using a mean-square-error loss function and performing B optimization steps on the Actor network and the convolutional neural network comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; evaluating all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, and calculating the ratio p2/p1 from the probability values p1 and p2; and calculating the error of the Actor network, updating the parameters using a back-propagation mechanism, and training the model to convergence, thereby realizing the construction of the CNN-PPO agent.
The invention discloses the following technical effects:
the invention provides a military chess deduction intelligent body construction method based on CNN-PPO, which is characterized in that potential association mining is carried out on initial situation data based on a convolutional neural network to obtain influence characteristic information, the influence characteristic and the initial situation data are input into a PPO algorithm model together for learning, a hybrid neural network model is formed by adopting the Convolutional Neural Network (CNN) and near-end strategy optimization (PPO), and the characteristic formed by an influence map is artificially added in the aspect of characteristic processing. This makes the convolutional neural network converge faster when processing the feature data, and the action selection given by the whole agent is more careful. The comprehension degree of the intelligent agent to the situation is increased, and the fighting intensity of the intelligent agent is increased to a certain degree.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the invention provides a construction method of a chess deduction intelligent agent based on CNN-PPO, comprising the following steps:
acquiring initial situation data of a military chess deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output;
and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative training until the objective function is minimized and the network converges, thereby realizing the construction of the CNN-PPO agent.
Preprocessing the initial situation data comprises screening it and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprises redundant data, data with missing format, null values and error information.
The overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network;
the convolutional neural network is used for mining potential relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network and the Critic network each use a three-layer fully connected neural network.
Before inputting into the hybrid neural network model for iterative training, the method further comprises inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output; and splicing the Actor network's output with the convolutional neural network's output and inputting the result into the Critic network to obtain the Critic network's output.
Inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output comprises inputting the convolutional neural network's output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution from the two parameter values to represent the distribution of actions, wherein μ is the mean of the normal distribution and σ is its variance; and sampling an action from the normal distribution, the action interacting with the environment to obtain the reward value given by the environment and the state at the next moment.
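The action-sampling step above can be sketched in numpy. This is a minimal illustration, not the patent's implementation: the function name and seeding are assumptions, and σ is treated as the standard deviation of the normal distribution.

```python
import numpy as np

def sample_action(mu, sigma, rng=None):
    # Sample a continuous action from the Normal(mu, sigma) policy head
    # produced by the Actor_new network; sigma is treated as the standard
    # deviation (an assumption about the patent's wording).
    rng = rng if rng is not None else np.random.default_rng(0)
    action = rng.normal(mu, sigma)
    # Log-probability of the sampled action under N(mu, sigma); such
    # probabilities are later reused to form the PPO ratio p2/p1.
    log_prob = (-0.5 * ((action - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return action, log_prob
```

The sampled action is sent to the environment, which returns the reward value and the next state.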
Splicing the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network's output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_ and calculating the discounted reward value; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error between the discounted reward values R and V_, and updating the Critic network using a back-propagation mechanism. Here V_ is the estimated value, namely the discounted reward obtained by taking action a in state S.
Inputting into the hybrid neural network model for iterative training means performing N optimization steps on the network parameters using a mean-square-error loss function and performing B optimization steps on the Actor network and the convolutional neural network, until the objective function is minimized and the network converges.
Performing N optimization steps on the network parameters using a mean-square-error loss function and performing B optimization steps on the Actor network and the convolutional neural network comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; evaluating all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, and calculating the ratio p2/p1 from the probability values p1 and p2; and calculating the error of the Actor network, updating the parameters using a back-propagation mechanism, and training the model to convergence, thereby realizing the construction of the CNN-PPO agent.
Example one
As shown in fig. 1, the construction method of a chess deduction intelligent agent based on CNN-PPO provided by the invention comprises the following steps:
Step 1: run the wargame deduction platform, create a wargame deduction scenario, and obtain the situation data returned by the platform. The situation data are generated by the actions of a randomly initialized Actor_new network model against robots built into the environment. Specifically:
1.1 Rule-based agents are built into the wargame deduction platform and can be used for human-vs-machine and machine-vs-machine adversarial training. The Actor_new network fights the built-in agent and generates situation data. The Actor_new network is a three-layer fully connected neural network.
Step 2: screen the situation data returned by the platform in step 1 and remove irregular data. Irregular data mainly refers to redundant data, data with missing format and the like, which are eliminated. In the data generated from fighting the built-in robot, a minority of the reward values are positive and the majority are negative; experiences with positive rewards are preferentially collected.
The step 2 specifically comprises the following steps:
2.1 The situation data mainly comprise own entity attributes, attributes of enemy entities that have been discovered, map attributes and scoreboard information.
2.2 The non-standard data mainly refer to null values, error messages and the like.
The invention adopts the ideas of reinforcement learning and of the influence map. Reinforcement learning casts the problem as a Markov decision process and solves it by iteration. The influence map divides the situation features into primary features and secondary features. The primary features comprise attribute information of own combat entities and attribute information of enemy combat entities; the secondary features comprise map intervisibility information, scoreboard information and influence map information.
Step 3: input the screened data into the influence map module. The input of the influence map module is situation information comprising own/enemy entity information and map information; the output is the influence feature at a given point of the map.
The step 3 specifically comprises the following steps:
3.1 Construction of the influence map module. The influence map module further extracts features from the situation data; the influence within a certain range around an own entity is given by the following formula:
e=ine+high+da+di
where ine is the intervisibility coefficient (two coordinates are intervisible if there is no occlusion between them, and non-intervisible otherwise); high is the elevation, i.e. the altitude in the everyday sense; da is the danger factor; and di is the distance to the command point.
3.2 The output is generally the map point locations of a certain area around the own entity. Taking the hexagonal grid as an example, the output is the influence coefficient of every hexagon within n hexagons of the own entity.
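The influence computation can be sketched as follows. The cell layout and the per-cell feature tuples are hypothetical; the patent only fixes the unweighted sum e = ine + high + da + di.

```python
def influence(ine, high, da, di):
    # e = ine + high + da + di: intervisibility coefficient, elevation,
    # danger factor, and distance to the command point, summed unweighted
    # as in the patent's formula.
    return ine + high + da + di

def influence_ring(cells):
    # cells: hypothetical mapping of hex-grid coordinate -> (ine, high, da, di)
    # for every hex within n rings of the own entity; returns the influence
    # coefficient of each hex.
    return {c: influence(*feats) for c, feats in cells.items()}
```

For example, a two-cell neighborhood `{(0, 1): (1, 0, 0, 0), (1, 0): (0, 0, -2, 1)}` yields influence coefficients 1 and -1.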
When an own entity is in a region where the influence is negative, the reward function may give a negative value as a penalty to the agent; when it is in a region where the influence is positive, a positive value as a reward.
The form of the reward function is as follows:
R = r_a + r_c + r_d + a
where r_a represents the score of the currently surviving own units; r_c the score of the seized control points; r_d the score for eliminated enemy units; and a the current situation score, namely the hit points lost to enemy strikes at the last moment or the effective score of strikes on the enemy.
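A minimal sketch of the reward computation, assuming plain summation as in the formula above. The influence-based shaping term and its weight w are assumptions drawn from the penalty/bonus description of the influence map, not values from the patent.

```python
def reward(r_a, r_c, r_d, a):
    # R = r_a + r_c + r_d + a: surviving own-unit score, seized
    # control-point score, eliminated enemy-unit score, and the current
    # situation score, per the patent's reward formula.
    return r_a + r_c + r_d + a

def shaped_reward(r_a, r_c, r_d, a, influence_e, w=0.1):
    # Optional shaping: penalise (reward) the agent when the own entity
    # sits in a negative- (positive-) influence region. The weight w is
    # a hypothetical choice, not given in the patent.
    sign = (influence_e > 0) - (influence_e < 0)
    return reward(r_a, r_c, r_d, a) + w * sign
```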
Step 4: construct the hybrid neural network, which adopts the proximal policy optimization (PPO) architecture.
The step 4 specifically comprises the following steps:
4.1 Construct a convolutional neural network to mine potential connections between situation data.
4.2 Construct the overall hybrid neural network architecture according to the PPO algorithm architecture. The overall architecture is a CNN-PPO architecture composed of four neural networks: a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network.
The convolutional neural network is used to extract hidden features. The CNN uses convolution kernels of three different sizes, each attending to different potential features. The CNN computation is:
x_t' = σ_cnn(w_cnn ⊙ x_t + b_cnn)
where x_t represents the current state feature, w_cnn the filter weights, b_cnn the bias parameter, and σ_cnn the activation function.
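A numpy sketch of the convolutional feature extraction described above. ReLU is an assumed choice for the unnamed activation σ_cnn, and the kernels are placeholders.

```python
import numpy as np

def cnn_feature(x, w, b=0.0):
    # x' = sigma_cnn(w (conv) x + b); ReLU stands in for the patent's
    # unspecified activation sigma_cnn (an assumption).
    return np.maximum(np.convolve(x, w, mode="valid") + b, 0.0)

def multi_scale_features(x, kernels):
    # Convolution kernels of three different sizes, each attending to
    # different potential features, concatenated into one feature vector.
    return np.concatenate([cnn_feature(x, w) for w in kernels])
```

For a 6-dimensional input and kernel sizes 2, 3 and 4, the concatenated feature vector has 5 + 4 + 3 = 12 entries.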
The Actor network obtains the values of μ and σ from the current state s_t, establishes a normal distribution N from them, samples an action a from N, obtains the reward value r given by the environment, and observes the next state s_{t+1} after the environment changes. The Actor network is then updated with the policy gradient formed from the importance ratio P_θ'(a_t|s_t)/P_θ(a_t|s_t), where P_θ(a_t|s_t) is the sampling policy and P_θ'(a_t|s_t) is the policy after the parameter update.
The Critic network calculates the action value function Q(s_t, a_t) from the input state s_t and action a_t. The Critic network's loss is computed as:
loss = (r + γ·max Q(s', a') − Q(s, a))²
where r is the reward value given by the environment, γ is the discount factor, and Q(s, a) is the action value function, representing the benefit of taking action a in state s.
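The Critic loss can be computed directly from this formula; the sketch below assumes scalar inputs and an externally supplied max over next-state action values.

```python
def critic_loss(r, gamma, q_next_max, q_sa):
    # loss = (r + gamma * max_a' Q(s', a') - Q(s, a))^2, the squared
    # temporal-difference error used to update the Critic.
    td_target = r + gamma * q_next_max
    return (td_target - q_sa) ** 2
```

For example, with r = 1.0, γ = 0.9, max Q(s', a') = 2.0 and Q(s, a) = 2.5, the TD error is 0.3 and the loss is 0.09.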
Construct the Actor_new network, Actor_old network and Critic network according to the PPO algorithm architecture. The Actor_new network uses a three-layer fully connected neural network with 42 neurons in the first layer, 128 in the second and 15 in the third. The Critic network uses a three-layer fully connected neural network with 57 neurons in the first layer, 64 in the second and 1 in the third. The Actor_old network has the same structure as the Actor_new network. After the model is constructed, the network parameters are randomly initialized.
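A numpy sketch of the stated layer widths (42-128-15 for Actor_new, 57-64-1 for Critic), with random initialization as described. The tanh hidden activation and the initialization scale are assumptions; the patent does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(widths):
    # Three fully connected layers with the given neuron counts; each
    # entry is an (in, out) weight matrix with a zero bias, randomly
    # initialised as in the patent.
    return [(0.1 * rng.standard_normal((i, o)), np.zeros(o))
            for i, o in zip(widths[:-1], widths[1:])]

def forward(params, x):
    # tanh hidden activation is an assumed choice; the output layer is linear.
    *hidden, (w_out, b_out) = params
    for w, b in hidden:
        x = np.tanh(x @ w + b)
    return x @ w_out + b_out

actor_new = mlp([42, 128, 15])   # Actor_new layer widths from the patent
critic = mlp([57, 64, 1])        # Critic layer widths from the patent
```

Actor_old would be created the same way and periodically synchronised with Actor_new's parameters.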
Step 5: splice the situation information with the influence features output by the influence module in step 3, and input the result into the convolutional neural network to obtain its output. The input of the convolutional neural network is an 80-dimensional vector formed by splicing the 26-dimensional initial situation with the 54-dimensional influence features; the output is a 42-dimensional vector.
The step 5 specifically comprises the following steps:
5.1 Splice the initial situation information with the feature information extracted by the influence map module and input the merged result into the convolutional neural network. The splicing modes are direct concatenation and element-wise addition.
5.2 The convolutional neural network uses several convolution kernels of different sizes to attend to different potential features.
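The splicing of the 26-dimensional initial situation with the 54-dimensional influence features into the 80-dimensional CNN input can be sketched as direct concatenation (the zero vectors below are placeholders for real situation data):

```python
import numpy as np

def splice(situation, influence_feat):
    # Direct concatenation of the 26-dimensional initial situation with
    # the 54-dimensional influence features into the 80-dimensional
    # input vector of the convolutional neural network.
    return np.concatenate([situation, influence_feat])

spliced = splice(np.zeros(26), np.zeros(54))
```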
Step 6: input the output of the convolutional neural network into the Actor network of the PPO architecture and obtain the Actor network's output.
The step 6 specifically comprises the following steps: 6.1, input the output of the convolutional neural network into the Actor_new network to obtain the two values μ and σ, establish a normal distribution from them to represent the distribution of actions, and sample an action from this distribution.
Step 7: splice the output of the Actor network with the output of the convolutional neural network and input the result into the Critic network to obtain the Critic network's output. Perform N optimization steps on the network parameters using a mean-square-error loss function, and B optimization steps on the Actor network and the convolutional neural network, until the objective function is minimized and the network converges.
The overall flow of data through the four networks is: input the initial situation data into the influence map module to obtain the secondary influence features; splice the initial situation data with the secondary influence features and input them into the convolutional neural network to obtain its output; input the convolutional neural network's output into the Actor_new network to obtain the two values μ and σ, and establish a normal distribution from them to represent the distribution of actions; sample an action from the normal distribution; the action interacts with the environment to obtain the reward value given by the environment and the state at the next moment; input the situation data of the next moment into the Critic network to obtain the network output V_, and calculate the discounted reward value. Input the state values of T moments into the Critic network to obtain T values of V_; calculate the mean square error between the discounted reward values R and V_; then update the Critic network using a back-propagation mechanism.
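The discounted-reward and Critic-update quantities in this flow can be sketched as follows. γ = 0.99 is an assumed discount factor; the patent does not state one.

```python
import numpy as np

def discounted_rewards(rewards, v_last, gamma=0.99):
    # Discounted reward-to-go R_t = r_t + gamma * R_{t+1}, bootstrapped
    # from the Critic's value estimate of the final observed state.
    out, running = [], v_last
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return np.array(out[::-1])

def critic_mse(returns, values):
    # Mean squared error between the discounted rewards R and the
    # Critic outputs V_, minimised by back-propagation.
    return np.mean((np.asarray(returns) - np.asarray(values)) ** 2)
```

With rewards [1, 1], a bootstrap value of 0 and γ = 0.5, the discounted rewards are [1.5, 1.0].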
Input all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; evaluate all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, then calculate the ratio p2/p1 from p1 and p2; then calculate the error of the Actor network from this ratio, and update the parameters using a back-propagation mechanism.
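A hedged sketch of the ratio-based Actor update: the clipped-surrogate form and ε = 0.2 are the standard PPO formulation, which the patent builds on but does not spell out, and the advantage values are assumed to be supplied externally.

```python
import numpy as np

def ppo_actor_loss(p1, p2, advantages, eps=0.2):
    # Clipped surrogate loss built from the probability ratio p2/p1,
    # where p1 comes from Actor_old (N2's counterpart N1... here: the
    # sampling policy) and p2 from Actor_new (the updated policy).
    # Minimising the negative mean surrogate is the gradient-ascent step.
    ratio = np.asarray(p2) / np.asarray(p1)
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

With p1 = [1, 1], p2 = [2, 0.5] and unit advantages, the ratios 2 and 0.5 are clipped to [0.8, 1.2], giving a surrogate of (1.2 + 0.5)/2 = 0.85 and a loss of -0.85.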
The model is trained until convergence, at which point the construction of the CNN-PPO agent is complete.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
Claims (8)
1. A method for constructing a chess deduction agent based on CNN-PPO, characterized by comprising the following steps:
acquiring initial situation data of a chess deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output;
and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative training until the objective function is minimized and the network converges, thereby realizing the construction of the CNN-PPO agent.
2. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 1, wherein
preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprises redundant data, data with missing format, null values and error information.
3. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 1, wherein
the overall architecture of the hybrid neural network model is the CNN-PPO architecture, comprising a convolutional neural network, an Actor_new network, an Actor_old network, and a Critic network;
the convolutional neural network is used for mining potential relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network, and the Critic network each use a three-layer fully connected neural network.
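The patent gives no implementation for the three-layer fully connected networks named in claim 3. The following is a minimal pure-Python sketch of such a network; the layer widths, uniform initialisation, and tanh activation are illustrative assumptions, not details from the patent:

```python
import math
import random

random.seed(0)

def dense(in_dim, out_dim):
    """Randomly initialised weights and biases for one fully connected layer."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return w, b

def forward(layer, x, activation=math.tanh):
    """Apply one fully connected layer: activation(Wx + b)."""
    w, b = layer
    return [activation(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

class ThreeLayerNet:
    """Three fully connected layers, as used by Actor_new, Actor_old and Critic."""
    def __init__(self, in_dim, hidden, out_dim):
        self.l1 = dense(in_dim, hidden)
        self.l2 = dense(hidden, hidden)
        self.l3 = dense(hidden, out_dim)

    def __call__(self, x):
        h = forward(self.l1, x)
        h = forward(self.l2, h)
        return forward(self.l3, h, activation=lambda v: v)  # linear output head

# Per claims 5 and 6: the Actor head outputs two values (mu, sigma),
# the Critic head outputs a single value V_.
actor = ThreeLayerNet(in_dim=8, hidden=16, out_dim=2)
critic = ThreeLayerNet(in_dim=8, hidden=16, out_dim=1)
features = [0.1] * 8  # stand-in for CNN-extracted hidden features
mu_sigma = actor(features)
v_estimate = critic(features)
```

In a full implementation these heads would consume the CNN feature vector described in claim 2's situation-data pipeline; the 8-dimensional input here is a placeholder.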
4. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 3, wherein
before the iterative training of the hybrid neural network model, the method further comprises: inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output; and concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output.
5. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 4, wherein
inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output comprises: inputting the output of the convolutional neural network into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution from these two parameters to represent the distribution of actions, wherein μ is the mean of the normal distribution and σ is its variance; sampling an action from the normal distribution; and obtaining, through interaction of the action with the environment, the reward value given by the environment and the state at the next moment.
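The sampling step of claim 5 can be sketched as follows. The values of μ and σ are hypothetical, and σ is treated here as a standard deviation for sampling (the claim labels it the variance; the two differ only by a square):

```python
import math
import random

random.seed(42)

# Hypothetical Actor_new outputs: mu (mean) and sigma (spread) of the
# action distribution established in claim 5.
mu, sigma = 0.5, 0.2

# Sample an action from the normal distribution N(mu, sigma^2).
action = random.gauss(mu, sigma)

def normal_log_prob(x, mu, sigma):
    """Log-density of x under N(mu, sigma^2); needed later for the PPO ratio."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

logp = normal_log_prob(action, mu, sigma)
# The sampled action would then be executed in the environment to obtain
# the reward value and the state at the next moment.
```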
6. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 5, wherein
concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output comprises: inputting the situation data of the next moment into the Critic network to obtain the network output V_; inputting the state values of T moments into the Critic network to obtain T V_ values; calculating the mean square error between the discounted reward value R and the V_ values, and updating the Critic network through back-propagation; wherein V_ is the estimated value, and R is the discounted reward obtained by taking action a in state s.
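The discounted reward R and the mean square error against the Critic estimates V_ described in claim 6 can be sketched as follows; the discount factor γ and the reward and value numbers are illustrative assumptions:

```python
def discounted_returns(rewards, gamma=0.9):
    """Discounted return R_t = r_t + gamma * R_{t+1}, computed backwards over T steps."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def mse(returns, values):
    """Mean square error between discounted rewards R and Critic estimates V_."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)

rewards = [1.0, 0.0, 2.0]        # hypothetical rewards over T = 3 moments
v_values = [2.0, 1.5, 2.0]       # hypothetical Critic outputs V_
R = discounted_returns(rewards)  # with gamma = 0.9: [2.62, 1.8, 2.0]
loss = mse(R, v_values)          # this loss drives the back-propagation update
```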
7. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 1, wherein
the iterative training of the hybrid neural network model comprises optimizing the network parameters N times using a mean square error loss function, and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
8. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 7, wherein
optimizing the network parameters N times using the mean square error loss function and optimizing the Actor network and the convolutional neural network B times comprises: inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the normal action distributions N1 and N2; evaluating all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, and calculating the value of p2/p1 from the probability values p1 and p2; and calculating the error of the Actor network, updating the parameters through back-propagation, and training the model until convergence, thereby completing the construction of the CNN-PPO agent.
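The probability ratio and Actor error of claim 8 can be sketched as follows. The patent does not spell out the Actor loss; the clipped surrogate with ε = 0.2 shown here is the standard PPO-clip form, and the mapping of the ratio to new/old policy densities, like all the numbers, is an assumption:

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of x under N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-clip objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Densities of one stored action under the distributions from Actor_new (N1)
# and Actor_old (N2); the means and spread are illustrative only.
action = 0.4
p_new = normal_pdf(action, mu=0.4, sigma=0.2)   # from N1
p_old = normal_pdf(action, mu=0.1, sigma=0.2)   # from N2
ratio = p_new / p_old
actor_loss = -clipped_surrogate(ratio, advantage=1.0)  # negated: optimizers minimize
```

Clipping bounds the policy update so that the new distribution cannot move too far from the old one in a single optimization round, which is the motivation for keeping both Actor_new and Actor_old networks.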
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210232129.XA CN114722998B (en) | 2022-03-09 | 2022-03-09 | Construction method of soldier chess deduction intelligent body based on CNN-PPO |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722998A true CN114722998A (en) | 2022-07-08 |
CN114722998B CN114722998B (en) | 2024-02-02 |
Family
ID=82238024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210232129.XA Active CN114722998B (en) | 2022-03-09 | 2022-03-09 | Construction method of soldier chess deduction intelligent body based on CNN-PPO |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722998B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115829034A (en) * | 2023-01-09 | 2023-03-21 | 白杨时代(北京)科技有限公司 | Method and device for constructing knowledge rule execution framework |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130325774A1 (en) * | 2012-06-04 | 2013-12-05 | Brain Corporation | Learning stochastic apparatus and methods |
US20150100530A1 (en) * | 2013-10-08 | 2015-04-09 | Google Inc. | Methods and apparatus for reinforcement learning |
US20170024643A1 (en) * | 2015-07-24 | 2017-01-26 | Google Inc. | Continuous control with deep reinforcement learning |
CN108171796A (en) * | 2017-12-25 | 2018-06-15 | 燕山大学 | A kind of inspection machine human visual system and control method based on three-dimensional point cloud |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN111605565A (en) * | 2020-05-08 | 2020-09-01 | 昆山小眼探索信息科技有限公司 | Automatic driving behavior decision method based on deep reinforcement learning |
CN112861442A (en) * | 2021-03-10 | 2021-05-28 | 中国人民解放军国防科技大学 | Multi-machine collaborative air combat planning method and system based on deep reinforcement learning |
CN113222106A (en) * | 2021-02-10 | 2021-08-06 | 西北工业大学 | Intelligent military chess deduction method based on distributed reinforcement learning |
CN113947022A (en) * | 2021-10-20 | 2022-01-18 | 哈尔滨工业大学(深圳) | Near-end strategy optimization method based on model |
Non-Patent Citations (3)
Title |
---|
OGUZHAN DOGRU et al.: "Actor–Critic Reinforcement Learning and Application in Developing Computer-Vision-Based Interface Tracking", ENGINEERING, vol. 7, no. 9, pages 1248-1261 *
CUI, Wenhua; LI, Dong; TANG, Yubo; LIU, Shaojun: "A Decision-Making Framework for Wargame Deduction Based on Deep Reinforcement Learning", National Defense Technology, no. 02, pages 118-126 *
XUE, Ao: "Research and Implementation of Intelligent Confrontation in Wargame Deduction Based on Reinforcement Learning", China Master's Theses Full-text Database, Social Sciences I, no. 01, page 6 *
Also Published As
Publication number | Publication date |
---|---|
CN114722998B (en) | 2024-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112329348B (en) | Intelligent decision-making method for military countermeasure game under incomplete information condition | |
CN110929394B (en) | Combined combat system modeling method based on super network theory and storage medium | |
CN113392521B (en) | Method and system for constructing resource marshalling model for air-sea joint combat mission | |
CN113222106B (en) | Intelligent soldier chess deduction method based on distributed reinforcement learning | |
Gmytrasiewicz et al. | Bayesian update of recursive agent models | |
CN112364972A (en) | Unmanned fighting vehicle team fire power distribution method based on deep reinforcement learning | |
CN114722998B (en) | Construction method of soldier chess deduction intelligent body based on CNN-PPO | |
CN114485665A (en) | Unmanned aerial vehicle flight path planning method based on sparrow search algorithm | |
CN113282100A (en) | Unmanned aerial vehicle confrontation game training control method based on reinforcement learning | |
Wang et al. | Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction | |
CN116661503B (en) | Cluster track automatic planning method based on multi-agent safety reinforcement learning | |
CN116596343A (en) | Intelligent soldier chess deduction decision method based on deep reinforcement learning | |
CN113283574B (en) | Method and device for controlling intelligent agent in group confrontation, electronic equipment and storage medium | |
CN113988301B (en) | Tactical strategy generation method and device, electronic equipment and storage medium | |
CN115909027A (en) | Situation estimation method and device | |
CN113128698B (en) | Reinforced learning method for multi-unmanned aerial vehicle cooperative confrontation decision | |
CN114662655A (en) | Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device | |
CN115220458A (en) | Distributed decision-making method for multi-robot multi-target enclosure based on reinforcement learning | |
CN114202175A (en) | Combat mission planning method and system based on artificial intelligence | |
CN113255893A (en) | Self-evolution generation method of multi-agent action strategy | |
CN114239834A (en) | Adversary relationship reasoning method and device based on multi-round confrontation attribute sharing | |
CN113324545A (en) | Multi-unmanned aerial vehicle collaborative task planning method based on hybrid enhanced intelligence | |
Tran et al. | Adaptation of a mamdani fuzzy inference system using neuro-genetic approach for tactical air combat decision support system | |
CN114611669B (en) | Intelligent decision-making method for chess deduction based on double experience pool DDPG network | |
CN117252081A (en) | Method for dynamically allocating air defense weapon-target to be driven |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||