CN114611669B - Intelligent decision-making method for chess deduction based on double experience pool DDPG network - Google Patents

Intelligent decision-making method for chess deduction based on double experience pool DDPG network

Info

Publication number
CN114611669B
CN114611669B
Authority
CN
China
Prior art keywords
data
deduction
network
experience
chess
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210244709.0A
Other languages
Chinese (zh)
Other versions
CN114611669A (en)
Inventor
张震 (Zhang Zhen)
臧兆祥 (Zang Zhaoxiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202210244709.0A priority Critical patent/CN114611669B/en
Publication of CN114611669A publication Critical patent/CN114611669A/en
Application granted granted Critical
Publication of CN114611669B publication Critical patent/CN114611669B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application discloses an intelligent decision-making method for chess deduction based on a double experience pool DDPG network, which comprises the following steps: obtaining chess deduction data and constructing a double experience pool DDPG model; preprocessing the chess deduction data and vectorizing the preprocessed data to obtain vectorized data; and inputting the vectorized data into the double experience pool DDPG model for training, completing training when the double experience pool DDPG model reaches a preset degree of convergence, and generating intelligent chess deduction decisions based on the trained double experience pool DDPG model. Compared with a general reinforcement learning architecture, the method converges faster, saves training time, and learns the overall strategy more quickly. The double experience pool DDPG structure is applied to chess deduction, and the double experience pools increase the training speed, so that a usable neural network model is trained faster. By screening and using high-quality samples, the dependence of model performance on sample quality is alleviated to some extent.

Description

Intelligent decision-making method for chess deduction based on double experience pool DDPG network
Technical Field
The application belongs to the field of intelligent decision making, and particularly relates to an intelligent decision-making method for chess deduction based on a double experience pool DDPG network.
Background
The purpose of intelligent decision making is to solve complex decision problems with artificial intelligence methods, drawing on human knowledge and relying on computers. A typical complex decision problem is chess deduction (wargaming). Chess deduction is a common confrontation pattern in military exercises: a sand table replaces the battlefield, different chess pieces represent different arms and forces, and field combat is simulated as realistically as possible on the basis of a background database and electronic situation information, so that strategies and tactics can be tested and commanders can be inspired in tactical planning. With the development of artificial intelligence technology, the fusion of intelligent decision making and chess deduction has become a research hotspot in the fields of chess deduction and artificial intelligence, and research on intelligent decision making for chess deduction has already produced many achievements, which are expected to practically improve the combat effectiveness of troops and deepen the intelligentization of military affairs.
Existing intelligent decision-making methods are mainly divided into two types:
Rule type: for example, the decision tree method solves decision problems by specifying in advance the coping strategies to be adopted in different situations. The main problem of this technique is that the situation complexity in chess deduction is high: a rule-based agent needs far too many branches to choose an action by judging the situation, and the complexity of the whole decision tree grows exponentially as the complexity of the problem rises.
Learning type: a network model is built with deep learning and reinforcement learning techniques; the battlefield situation is used as the network input, the actions to be taken by one's own forces are used as the network output, and the network parameters are updated according to some evaluation, so that the whole decision framework is learned. After a certain amount of training, the network model can be used directly for combat. The main limitation of this type of technique is that the convergence rate of the network model is greatly affected by sample quality and is therefore not guaranteed.
Disclosure of Invention
The application aims to provide an intelligent decision-making method for chess deduction based on a double experience pool DDPG network, so as to solve the problems in the prior art.
In order to achieve the above purpose, the application provides an intelligent decision-making method for chess deduction based on a double experience pool DDPG network, comprising the following steps:
obtaining chess deduction data and constructing a double experience pool DDPG model;
preprocessing the chess deduction data, vectorizing the preprocessed data, and obtaining vectorized data;
and inputting the vectorized data into the double experience pool DDPG model for training, completing the training when the double experience pool DDPG model reaches a preset degree of convergence, and generating intelligent chess deduction decisions based on the trained double experience pool DDPG model.
Optionally, the step of obtaining the chess deduction data includes running a chess deduction environment and obtaining the chess deduction data in the chess deduction environment;
the chess deduction data comprises: own entity attribute information, entity attribute information of which an enemy has been found, deduction time, map attribute information, and scoreboard information;
wherein the own entity attribute information comprises the residual blood volume of own units, the positions of own units and the residual ammunition of own units;
the entity attribute information of enemies that have been discovered includes the enemy residual blood volume and the enemy location;
the map attribute information comprises an elevation and a number;
the scoreboard information includes score information that is currently obtained.
Optionally, in preprocessing the chess deduction data, data cleaning is adopted as the preprocessing mode, and the data cleaning includes:
carrying out data extraction on the acquired chess deduction data to obtain normalized data;
and classifying the normalized data and removing redundant data.
Optionally, the process of carrying out data extraction on the acquired chess deduction data to obtain normalized data includes:
when the chess deduction data are extracted, removing non-canonical data from the deduction data to obtain normalized data;
the non-canonical data includes: blank data and scrambled data.
Optionally, the process of classifying the normalized data and removing the redundant data includes:
dividing the normalized data into the own entity attribute information, the entity attribute information of enemies that have been discovered, the deduction time and the scoreboard information;
and eliminating redundant data in the classified data, wherein the redundant data comprises information which is useless for decision.
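As an illustration of the data cleaning described above, the following is a minimal Python sketch; the record fields and the example of a redundant field are assumptions for illustration only.

```python
# Minimal data-cleaning sketch (the field names and the example redundant
# field "aggregated" are assumptions, not the application's actual format).
REQUIRED_KEYS = {"own_units", "enemy_units", "deduction_time", "scoreboard"}
REDUNDANT_FIELDS = {"aggregated"}   # assumed example of information useless for decision making

def is_canonical(record: dict) -> bool:
    """Keep a record only if it is non-empty and contains every required key."""
    return bool(record) and REQUIRED_KEYS.issubset(record.keys())

def clean(records: list) -> list:
    cleaned = []
    for rec in records:
        if not is_canonical(rec):                        # drop blank or scrambled records
            continue
        cleaned.append({k: v for k, v in rec.items()     # classify and keep only useful fields
                        if k in REQUIRED_KEYS and k not in REDUNDANT_FIELDS})
    return cleaned
```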
Optionally, the vectorizing the preprocessed data includes:
encoding the deduction time, the own entity attribute information and the entity attribute information of discovered enemies in a one-hot encoding mode;
and taking the scoreboard information directly as part of the vectorized data, without encoding the map attribute information or the scoreboard information.
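A minimal sketch of this vectorization step is given below; the field names, value ranges and dimensions are assumptions for illustration and are not specified by the application.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    """Standard one-hot encoding of a discrete value."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def vectorize(record: dict) -> np.ndarray:
    # Deduction time and discrete unit attributes are one-hot encoded;
    # MAX_STEPS and the position-index range are assumed values.
    MAX_STEPS, NUM_HEXES = 100, 64
    time_vec = one_hot(record["deduction_time"], MAX_STEPS)
    own_vec = np.concatenate([one_hot(u["pos_id"], NUM_HEXES) for u in record["own_units"]])
    enemy_vec = np.concatenate([one_hot(u["pos_id"], NUM_HEXES) for u in record["enemy_units"]])
    # Scoreboard information is used directly, without encoding.
    score_vec = np.asarray(record["scoreboard"], dtype=np.float32)
    return np.concatenate([time_vec, own_vec, enemy_vec, score_vec])
```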
Optionally, the process of constructing the dual experience pool DDPG model includes:
constructing a DDPG neural network based on the DDPG algorithm architecture, wherein the DDPG neural network comprises an Actor network, a Critic network, an Actor_target network and a Critic_target network;
constructing two experience pools for storing experiences generated in the training process, wherein the experience pools are multidimensional arrays;
and constructing the double experience pool DDPG model based on the DDPG neural network and the two experience pools.
Optionally, inputting the vectorized data into the double experience pool DDPG model for training comprises:
inputting the vectorized data into the Actor network, and inputting the obtained value into the Critic network for processing;
updating the Actor_target network based on the parameters of the Actor network at every preset time step, and updating the Critic_target network based on the parameters of the Critic network;
and, each time a training step is completed, storing the current experience into a first experience pool, and, if the reward obtained in the current experience is larger than the average reward in the first experience pool, also storing the current experience into a second experience pool.
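The storage rule for the two experience pools can be sketched as follows; the list-based buffer is a simplification for illustration, but the write rule follows the description above.

```python
import numpy as np

class DualExperiencePool:
    """Pool A stores every experience; pool B additionally stores an experience
    only when its reward exceeds the current average reward in pool A."""

    def __init__(self):
        self.pool_a = []
        self.pool_b = []

    def store(self, state, action, reward, next_state):
        experience = (state, action, reward, next_state)
        self.pool_a.append(experience)
        average_reward = np.mean([e[2] for e in self.pool_a])
        if reward > average_reward:
            self.pool_b.append(experience)
```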
Optionally, in the process of updating the Actor_target network based on the parameters of the Actor network, the Actor network is updated by a gradient descent method; in the process of updating the Critic_target network based on the parameters of the Critic network, the Critic network is also updated by a gradient descent method, and in the updating process the loss function of the Critic network uses the mean square error loss.
The application has the technical effects that:
compared with a general reinforcement learning architecture, the method has the advantages that the convergence speed is higher, the training time is saved, and the whole strategy is learned faster. The double experience pool DDPG structure is applied to chess deduction, and the training speed is improved by the double experience pools, so that an available neural network model is trained faster. By screening and utilizing high quality samples, the problem of model performance dependence on sample quality is ameliorated to some extent.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method in an embodiment of the application;
fig. 2 is a schematic diagram of a training process in an embodiment of the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
As shown in FIGS. 1-2, this embodiment provides an intelligent decision-making method for chess deduction based on a double experience pool DDPG network, which comprises the following steps:
Step 1: running a chess deduction environment and collecting data;
Step 2: data cleaning, which comprises data extraction, data classification and redundant data removal;
Step 3: vectorizing the text data;
Step 4: constructing the double experience pool DDPG model;
Step 5: inputting the data into the model and filling the two experience pools;
Step 6: training until the model converges.
The step 1 specifically comprises the following steps:
1.1 The chess game used here is a real-time tactical chess game, as distinguished from turn-based chess games.
1.2 The data mainly comprise own entity attribute information, entity attribute information of enemies that have been discovered, map attribute information and scoreboard information. The own entity attribute information comprises the residual blood volume, position, remaining ammunition and the like of each own unit; the entity attribute information of discovered enemies comprises their residual blood volume and position, but not their ammunition; the map attribute information comprises elevation, number and the like; and the scoreboard information is the score information obtained so far.
The step 2 specifically comprises the following steps:
2.1 Each piece of collected data is extracted, and non-canonical data such as blank lines and garbled characters are removed, so that normalized data are obtained.
2.2 Redundant data are removed: according to expert experience, data that are useless for decision making are discarded to reduce the state space. For example, the "aggregated or not" information given by the environment is not helpful to the agent's decision and can be deleted.
The step 3 specifically comprises the following steps:
3.1 The formatted data output by the environment are converted into vector format. The deduction time, own unit information and the obtained enemy unit information are encoded in a one-hot encoding mode; the scoreboard information can be used directly and does not need to be converted by one-hot encoding. "Formatted data" here means that the data format given by the environment is fixed, and the position of each item of data is predetermined by its category.
3.2 To eliminate the influence of different scales on the data, the data are normalized. The normalization formula is as follows:
x'_ij = (x_ij − min(x_i)) / (max(x_i) − min(x_i))
where x'_ij is the value of x_ij after normalization, x_ij is the j-th dimension of the i-th column feature, x_i is the i-th column feature, min(x_i) is the minimum value over all dimensions of column i, and max(x_i) is the maximum value over all dimensions of column i.
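A NumPy equivalent of this column-wise min-max normalization (a sketch, assuming the samples are stacked into a 2-D array with one column per feature):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """x'_ij = (x_ij - min(x_i)) / (max(x_i) - min(x_i)), applied column by column."""
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    return (x - col_min) / (col_max - col_min + 1e-8)   # small epsilon avoids division by zero
```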
The step 4 specifically comprises the following steps:
the neural network is built according to a DDPG algorithm architecture, and the DDPG architecture needs to build 4 neural networks, namely an Actor network, a Critic network, an actor_target network and a cirtic_target network.
4.1 The convolution layer uses multiple convolution kernels; different convolution kernels focus on different features, so features are extracted separately for the more important attributes such as blood volume and coordinates. The update formula of the convolutional neural network is as follows:
x_t = σ_cnn(w_cnn ⊙ x_t + b_cnn)
where x_t represents the current state feature, w_cnn represents the filter weights, b_cnn represents the bias parameter, and σ_cnn is the activation function.
4.2 The Actor network is a three-layer neural network: the first layer is a convolution layer whose number of neurons is determined by the dimension of the situation information, the second layer has 128 neurons, and the number of neurons in the third layer is determined by the dimension of the action variables.
4.3 The Critic network is a three-layer fully connected neural network: the number of neurons in the first layer is determined jointly by the dimension of the Actor network's input variables and the dimension of the Actor network's output variables, the second layer has 128 neurons, and the third layer has 1 neuron.
4.4 The Actor_target network has the same structure as the Actor network, and the Critic_target network has the same structure as the Critic network.
4.5 Two experience pools are constructed for storing the experiences obtained in the deduction process. Each experience pool is a multidimensional array with the following structure:
(s_t^i, a_t^i, r_t^i, s_{t+1}^i), where s_t^i is the state at the current moment, a_t^i is the action taken at time t, r_t^i is the reward obtained, s_{t+1}^i is the state at time t+1, and i denotes the i-th experience.
The dimensions of the experience pool are determined by the following formula:
dim=2*state_dim+action_dim+1
where dim represents the dimension of the experience pool, state_dim is the dimension of the situation information, action_dim is the dimension of the action vector, and 1 additional dimension is required for the reward value.
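A minimal PyTorch sketch of this step is given below. The layer sizes follow the description above; the use of plain linear layers in place of the convolution layer of 4.1, and the buffer capacity, are simplifications and assumptions for illustration.

```python
import copy
import numpy as np
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor: situation vector -> action vector (sizes state_dim -> 128 -> action_dim).
    The convolutional first layer of 4.1/4.2 is simplified to a linear layer here."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, action_dim)

    def forward(self, state):
        return torch.tanh(self.fc2(torch.relu(self.fc1(state))))

class Critic(nn.Module):
    """Critic: (state, action) -> scalar value (sizes state_dim + action_dim -> 128 -> 1)."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, state, action):
        return self.fc2(torch.relu(self.fc1(torch.cat([state, action], dim=-1))))

def build_double_pool_ddpg(state_dim: int, action_dim: int, capacity: int = 10000):
    actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)
    actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
    # Each experience pool is a multidimensional array with
    # dim = 2 * state_dim + action_dim + 1 columns (s_t, a_t, r_t, s_{t+1}).
    dim = 2 * state_dim + action_dim + 1
    pool_a = np.zeros((capacity, dim), dtype=np.float32)
    pool_b = np.zeros((capacity, dim), dtype=np.float32)
    return actor, critic, actor_target, critic_target, pool_a, pool_b
```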
The step 5 specifically comprises the following steps:
and 5.1, inputting the vectorized data in the step 3 into an Actor network, and mining potential links in situation information. The output value of the Actor network.
And 5.2, splicing the output value of the convolutional neural network and the output value of the Actor network together, and inputting the spliced output value and the output value into the Critic network to obtain the output of the Critic network. And updating the actor_target network and the critic_target network respectively by using the parameters of the Actor network and the parameters of the Critic network at fixed time steps. The method for updating the parameters of the target network is soft-update, and is carried out according to the following formula:
θ_targ ← ρ·θ_targ + (1 − ρ)·θ
Here θ_targ refers to the parameters of the target network; ρ generally takes a large value to ensure that the parameters are updated slowly, which is more robust than the hard update of directly copying the parameters (a short sketch of this soft update follows step 5.3).
5.3 Each experience tuple provided by the environment is stored in experience pool A; if its reward R_t^i is greater than the average reward in experience pool A, the experience is also copied into experience pool B.
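A short sketch of the soft update described in 5.2, assuming the PyTorch networks sketched in step 4:

```python
def soft_update(target_net, source_net, rho: float = 0.95):
    """theta_targ <- rho * theta_targ + (1 - rho) * theta (soft update of target parameters)."""
    for targ_param, param in zip(target_net.parameters(), source_net.parameters()):
        targ_param.data.copy_(rho * targ_param.data + (1.0 - rho) * param.data)
```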
The step 6 specifically comprises the following steps:
6.1 The Actor network is updated using the gradient descent method.
6.2 The loss function of the Critic network uses the mean square error (MSE) loss, and the Critic network is also updated using the gradient descent method.
6.3 The data in the experience pools are updated by overwriting: when a pool is full, old experiences are overwritten by new ones.
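A minimal sketch of the updates in step 6, assuming the networks sketched in step 4, a batch sampled from the experience pools, and a standard DDPG target; the discount factor value is an assumption.

```python
import torch
import torch.nn.functional as F

def update_networks(actor, critic, actor_target, critic_target,
                    actor_optim, critic_optim, batch, gamma: float = 0.9):
    # Tensors drawn from the experience pools; rewards shaped (batch, 1).
    states, actions, rewards, next_states = batch

    # 6.2 Critic update: mean square error loss, gradient descent.
    with torch.no_grad():
        targets = rewards + gamma * critic_target(next_states, actor_target(next_states))
    critic_loss = F.mse_loss(critic(states, actions), targets)
    critic_optim.zero_grad()
    critic_loss.backward()
    critic_optim.step()

    # 6.1 Actor update: gradient descent on the negated Critic value.
    actor_loss = -critic(states, actor(states)).mean()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()
```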
Example two
As shown in FIGS. 1-2, this embodiment provides an intelligent decision-making method for chess deduction based on a double experience pool DDPG network: a chess deduction environment is run and data are collected; data cleaning is performed, comprising data extraction, data classification and redundant data removal; the text data are vectorized; the double experience pool DDPG model is constructed; the experience pools are filled; the network parameters are updated; and training continues until the model converges. The method specifically comprises the following steps:
Step 1: A chess deduction environment is run and combat data are collected, including the state at each step, the actions taken, the partial score and so on. These data can be generated by letting a manually written hard-coded strategy fight against the bot built into the deduction environment.
Step 2: normalizing collected data
x′ ij Is x ij Normalized x ij The value thereafter is the ith column, the jth dimension, the feature, x i Is the ith column feature, min (x i ) Is the minimum of the values in all dimensions of column i, max (x i ) Is the numerical maximum in all dimensions of column i.
Step 3: the DDPG network model of the double experience pool designed by the application is composed of 5 parts, namely an Actor network, a Critic network, an Actor-target network, a Critic-target network and a double experience pool. The input of the Actor network is the current observed state, and the output of the Actor network is the action of each unit of the current state; the input of the Critic network is the current observed state and the current actions of each unit, and the output is an estimated value, and the main process is shown in the figure.
Step 4: the observed state is observable situation information in the environment, and comprises a current score; step number; position of own unit, blood volume, ammunition allowance; the location of the observed enemy units, blood volume; the position of the robotically controlled point. The actions of the above units mainly include movement, masking, shooting, and the like.
In this embodiment, the number of own units is 3, and the number of enemy unit operators is also 3. The specific state information is a 36-dimensional vector composed of the own unit states, the enemy unit states and the scoreboard information returned by the deduction environment. This vector represents the information on the current battlefield that needs attention and is used as the input of the Actor network. The action of each unit is a 15-dimensional vector composed of actions such as maneuvering, maneuver targets, control point seizing and shooting targets.
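One possible way to assemble these vectors is sketched below; the per-unit feature layout is not specified above, so the fields used here are assumptions for illustration and do not reproduce the exact 36-dimensional encoding.

```python
import numpy as np

def build_state_vector(own_units, enemy_units, scoreboard) -> np.ndarray:
    """Concatenate own-unit states, observed enemy-unit states and scoreboard
    information into a single situation vector used as the Actor input.
    The per-unit fields below are assumed for illustration only."""
    own = np.concatenate([[u["x"], u["y"], u["hp"], u["ammo"]] for u in own_units])   # 3 own units
    enemy = np.concatenate([[u["x"], u["y"], u["hp"]] for u in enemy_units])          # 3 enemy units
    score = np.asarray(scoreboard, dtype=np.float32)
    return np.concatenate([own, enemy, score]).astype(np.float32)
```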
Step 5: according to the interaction between the network model and the environment, the current time state S is obtained t Action A at the present moment t The next time state S t+1 Rewarding information R given by score board t . The experience pool was constructed accordingly as follows:
wherein the subscript i represents the ith experience.The double experience pools are marked as experience pool A and experience pool B, wherein the experience pool A normally stores combat data, and the experience pool B only receives R in the experience pool A i t Experience with values above the average prize value.
The reward value is defined as:
R = Σ_i γ^i · r(s_i, a_i)
where γ is the discount factor and r(s_i, a_i) is the reward value obtained by taking action a_i in state s_i.
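A short sketch of this cumulative discounted reward, assuming the per-step rewards of a finished deduction are available as a list (the discount factor value is an assumption):

```python
def discounted_return(rewards, gamma: float = 0.9) -> float:
    """R = sum_i gamma**i * r(s_i, a_i): cumulative discounted reward over an episode."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```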
Step 6: after the data in the experience pool reaches a certain number, experience is extracted from the experience pools A and B according to different proportions to train the network. Wherein the number of neurons of the first layer of the Actor network is 36, the number of neurons of the second layer is 128, and the number of neurons of the third layer is 15. The Critic network has a first layer of neurons 51, a second layer of neurons 128, and a third layer of neurons 1.
Parameters of the Actor network are updated layer by layer using a gradient descent algorithm.
The Actor network update formula is:
∇_θ J(θ) ≈ (1/N) Σ_i ∇_a Q(s_i, a | φ)|_{a=μ_θ(s_i)} · ∇_θ μ_θ(s_i)
where μ_θ denotes the Actor (policy) network with parameters θ and Q(·,· | φ) denotes the Critic network with parameters φ.
The Critic network parameters are likewise updated layer by layer, using a mean square error loss function and the gradient descent method. The Critic network update formula is:
L(φ) = (1/N) Σ_i ( r_i + γ·Q_targ(s_{i+1}, μ_targ(s_{i+1})) − Q(s_i, a_i | φ) )²
where Q_targ and μ_targ denote the Critic_target and Actor_target networks.
the following formula is used when updating the actor_target network:
θ_targ ← ρ·θ_targ + (1 − ρ)·θ
the following formula is used when updating the critic_target network:
φ_targ ← ρ·φ_targ + (1 − ρ)·φ
in this embodiment, ρ is set to 0.95.
The present application is not limited to the above embodiments. Any change or substitution that can readily be conceived by those skilled in the art within the technical scope disclosed by the present application is intended to fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. An intelligent decision-making method for chess deduction based on a double experience pool DDPG network, characterized by comprising the following steps:
obtaining chess deduction data and constructing a double experience pool DDPG model;
preprocessing the chess deduction data, vectorizing the preprocessed data, and obtaining vectorized data;
inputting the vectorized data into the double experience pool DDPG model for training, completing training when the double experience pool DDPG model reaches a preset convergence degree, and generating a chess deduction intelligent decision based on the trained double experience pool DDPG model;
the process of constructing the double experience pool DDPG model comprises the following steps:
constructing a DDPG neural network based on the DDPG algorithm architecture, wherein the DDPG neural network comprises an Actor network, a Critic network, an Actor_target network and a Critic_target network;
constructing two experience pools for storing experiences generated in the training process, wherein the experience pools are multidimensional arrays;
constructing a double experience pool DDPG model based on the DDPG neural network and the two experience pools;
the process of inputting the vectorized data into the double experience pool DDPG model for training comprises the following steps:
inputting the vectorized data into the Actor network, and inputting the obtained value into the Critic network for processing;
updating the Actor_target network based on the parameters of the Actor network at every preset time step, and updating the Critic_target network based on the parameters of the Critic network;
when each training step is completed, storing the current experience into a first experience pool, and, if the reward obtained in the current experience is larger than the average reward in the first experience pool, also storing the current experience into a second experience pool;
in the process of updating the Actor_target network based on the parameters of the Actor network, the Actor network is updated by a gradient descent method; in the process of updating the Critic_target network based on the parameters of the Critic network, the Critic network is also updated by a gradient descent method, and in the updating process the loss function of the Critic network uses the mean square error loss.
2. The method of claim 1, wherein the step of obtaining the chess deduction data comprises operating a chess deduction environment and obtaining the chess deduction data in the chess deduction environment;
the chess deduction data comprises: own entity attribute information, entity attribute information of which an enemy has been found, deduction time, map attribute information, and scoreboard information;
wherein the own entity attribute information comprises the residual blood volume of own units, the positions of own units and the residual ammunition of own units;
the entity attribute information of enemies that have been discovered includes the enemy residual blood volume and the enemy location;
the map attribute information comprises an elevation and a number;
the scoreboard information includes score information that is currently obtained.
3. The method according to claim 2, wherein in preprocessing the chess deduction data, data cleaning is adopted as the preprocessing mode, and the data cleaning includes:
carrying out data extraction on the acquired chess deduction data to obtain normalized data;
and classifying the normalized data and removing redundant data.
4. The method according to claim 3, wherein the process of carrying out data extraction on the acquired chess deduction data to obtain normalized data comprises:
when the chess deduction data are extracted, removing non-canonical data from the deduction data to obtain normalized data;
the non-canonical data includes: blank data and scrambled data.
5. A method according to claim 3, wherein classifying the normalized data and eliminating redundant data comprises:
dividing the normalized data into the own entity attribute information, the entity attribute information of enemies that have been discovered, the deduction time and the scoreboard information;
and eliminating redundant data in the classified data, wherein the redundant data comprises information which is useless for decision.
6. The method of claim 2, wherein vectorizing the preprocessed data comprises:
coding deduction time, own entity attribute information and entity attribute information of discovered enemies based on a one-hot coding mode;
and directly taking the scoreboard information as one of the vectorized data without encoding the map attribute information and the scoreboard information.
CN202210244709.0A 2022-03-14 2022-03-14 Intelligent decision-making method for chess deduction based on double experience pool DDPG network Active CN114611669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210244709.0A CN114611669B (en) 2022-03-14 2022-03-14 Intelligent decision-making method for chess deduction based on double experience pool DDPG network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210244709.0A CN114611669B (en) 2022-03-14 2022-03-14 Intelligent decision-making method for chess deduction based on double experience pool DDPG network

Publications (2)

Publication Number Publication Date
CN114611669A CN114611669A (en) 2022-06-10
CN114611669B true CN114611669B (en) 2023-10-13

Family

ID=81863363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210244709.0A Active CN114611669B (en) 2022-03-14 2022-03-14 Intelligent decision-making method for chess deduction based on double experience pool DDPG network

Country Status (1)

Country Link
CN (1) CN114611669B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784490B (en) * 2019-02-02 2020-07-03 北京地平线机器人技术研发有限公司 Neural network training method and device and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451195A (en) * 2017-07-03 2017-12-08 三峡大学 One kind is based on the visual war game simulation system of big data
CN112131786A (en) * 2020-09-14 2020-12-25 中国人民解放军军事科学院评估论证研究中心 Target detection and distribution method and device based on multi-agent reinforcement learning
CN112801249A (en) * 2021-02-09 2021-05-14 中国人民解放军国防科技大学 Intelligent chess and card identification and positioning device for tabletop chess deduction and use method thereof
CN113222106A (en) * 2021-02-10 2021-08-06 西北工业大学 Intelligent military chess deduction method based on distributed reinforcement learning
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN113723013A (en) * 2021-09-10 2021-11-30 中国人民解放军国防科技大学 Multi-agent decision method for continuous space chess deduction

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Combat Unit Selection Based on Hybrid Neural Network in Real-Time Strategy Games; Zhaoxiang Zang et al.; ICONIP 2021: Neural Information Processing; pp. 344-352 *
Multi-critic DDPG Method and Double Experience Replay; Jiao Wu et al.; 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC); pp. 165-171 *
Research on Deep Reinforcement Learning Exploration Strategy in Wargame Deduction; Tongfei Shang et al.; 2019 2nd International Conference on Information Systems and Computer Aided Education (ICISCAE); pp. 622-625 *
Research on Key Technologies of Personalized Recommendation Systems Based on Machine Learning; Ma Mengdi; China Masters' Theses Full-text Database (Information Science and Technology); pp. I138-1445 *
A Decision-Making Method Framework for Wargame Deduction Based on Deep Reinforcement Learning; Cui Wenhua; Li Dong; Tang Yubo; Liu Shaojun; National Defense Technology (02); pp. 118-126 *

Also Published As

Publication number Publication date
CN114611669A (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN112329348B (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN108819948B (en) Driver behavior modeling method based on reverse reinforcement learning
CN108791302B (en) Driver behavior modeling system
CN111461294B (en) Intelligent aircraft brain cognitive learning method facing dynamic game
CN116757497B (en) Multi-mode military intelligent auxiliary combat decision-making method based on graph-like perception transducer
CN113625569B (en) Small unmanned aerial vehicle prevention and control decision method and system based on hybrid decision model
CN112906888B (en) Task execution method and device, electronic equipment and storage medium
CN113298260B (en) Confrontation simulation deduction method based on deep reinforcement learning
Jaafra et al. A review of meta-reinforcement learning for deep neural networks architecture search
CN108891421B (en) Method for constructing driving strategy
CN115511069A (en) Neural network training method, data processing method, device and storage medium
CN108944940B (en) Driver behavior modeling method based on neural network
Qi et al. Battle damage assessment based on an improved Kullback-Leibler divergence sparse autoencoder
CN114611669B (en) Intelligent decision-making method for chess deduction based on double experience pool DDPG network
CN114722998B (en) Construction method of soldier chess deduction intelligent body based on CNN-PPO
Zhang et al. Intelligent battlefield situation comprehension method based on deep learning in wargame
CN114004282A (en) Method for extracting deep reinforcement learning emergency control strategy of power system
CN112906871A (en) Temperature prediction method and system based on hybrid multilayer neural network model
CN112926729B (en) Man-machine confrontation intelligent agent strategy making method
CN115238832B (en) CNN-LSTM-based air formation target intention identification method and system
CN112295232B (en) Navigation decision making method, AI model training method, server and medium
CN117786469A (en) Air combat rule generation method and device based on network interpretability analysis
CN112329948B (en) Multi-agent strategy prediction method and device
Puebla et al. Learning Relational Rules from Rewards

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant