CN114722998A - Method for constructing chess deduction intelligent body based on CNN-PPO - Google Patents
- Publication number
- CN114722998A CN114722998A CN202210232129.XA CN202210232129A CN114722998A CN 114722998 A CN114722998 A CN 114722998A CN 202210232129 A CN202210232129 A CN 202210232129A CN 114722998 A CN114722998 A CN 114722998A
- Authority
- CN
- China
- Prior art keywords
- network
- actor
- output
- ppo
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a method for constructing a wargame deduction agent based on CNN-PPO, comprising the following steps: acquiring initial situation data from a wargame deduction platform and preprocessing it to obtain target situation data; constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output; and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and iteratively training the model until the objective function is minimized and the network converges, thereby completing the construction of the CNN-PPO agent. The invention increases the agent's comprehension of the situation and, to a certain degree, its fighting strength.
Description
Technical Field
The invention belongs to the technical field of computers, and in particular relates to a method for constructing a wargame deduction agent based on CNN-PPO.
Background
Wargame deduction uses experience and rules summarized from combat practice to analyze the course of an engagement by simulation. With the rapid growth of computing power, new technologies have been applied to wargaming; computer wargame deduction has become a main branch of the field, and countries worldwide regard it as a means of improving military capability.
In concrete wargame deduction, the problem is generally simplified as follows: under the constraints of certain objective rules, achieve a given objective through force deployment, maneuver, attack and similar actions, for example seizing a control point or destroying enemy forces. The goal of constructing a wargame deduction agent is to obtain a commander that can autonomously make action decisions according to the current battlefield situation. Agents are classified as rule-based or learning-based according to whether they have learning ability. A rule-based agent is realized by hard-coded programming: many conditional branches specify which action the agent takes in which situation, a commonly used technique being the behavior tree. A learning-based agent, typified by machine-learning models, has autonomous learning ability; the model can update its network parameters during play, yielding increasingly strong models.
Existing agent construction methods are mainly rule-based models and neural network models. Because the state space in wargame deduction is enormous, rules drawn from expert experience can hardly cover all situations and can only classify states coarsely, so a rule-based agent makes rigid decisions and cannot respond flexibly to the unexpected. The main difficulties faced by neural network models are that the sparse rewards given by the environment make it hard to update network parameters effectively, and dimensionality explosion.
Disclosure of Invention
In order to solve the above problems, the present invention provides the following solution: a CNN-PPO-based method for constructing a wargame deduction agent, comprising the following steps:
acquiring initial situation data of a chess deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output;
and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative training until the objective function is minimized and the network converges, thereby realizing the construction of the CNN-PPO agent.
Preferably, preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprises redundant data, data with missing format, null values and error information.
Preferably, the overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network;
the convolutional neural network is used for mining potential relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network and the Critic network each use a three-layer fully connected neural network.
Preferably, before inputting into the hybrid neural network model for iterative training, the method further includes inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output; and splicing the Actor network's output with the convolutional neural network's output and inputting the result into the Critic network to obtain the Critic network's output.
Preferably, inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output comprises inputting the convolutional neural network's output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values to represent the distribution of actions, wherein μ is the mean of the normal distribution and σ is its variance; and sampling an action from the normal distribution, the action interacting with the environment to obtain the reward value given by the environment and the state at the next moment.
Preferably, splicing the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network's output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_ and calculating the discounted reward value; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error between the discounted reward values R and V_, and updating the Critic network using a back-propagation mechanism. Here V_ is the estimated value, namely the discounted reward obtained by taking action a in state S.
Preferably, inputting into the hybrid neural network model for iterative training includes performing N optimization steps on the network parameters using a mean-square-error loss function, and performing B optimization steps on the Actor network and the convolutional neural network, until the objective function is minimized and the network converges.
Preferably, performing N optimization steps on the network parameters using a mean-square-error loss function and performing B optimization steps on the Actor network and the convolutional neural network comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; evaluating all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, and calculating the ratio p2/p1 from the probability values p1 and p2; and calculating the error of the Actor network, updating the parameters using a back-propagation mechanism, and training the model to convergence, thereby realizing the construction of the CNN-PPO agent.
The invention discloses the following technical effects:
the invention provides a military chess deduction intelligent body construction method based on CNN-PPO, which is characterized in that potential association mining is carried out on initial situation data based on a convolutional neural network to obtain influence characteristic information, the influence characteristic and the initial situation data are input into a PPO algorithm model together for learning, a hybrid neural network model is formed by adopting the Convolutional Neural Network (CNN) and near-end strategy optimization (PPO), and the characteristic formed by an influence map is artificially added in the aspect of characteristic processing. This makes the convolutional neural network converge faster when processing the feature data, and the action selection given by the whole agent is more careful. The comprehension degree of the intelligent agent to the situation is increased, and the fighting intensity of the intelligent agent is increased to a certain degree.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the invention provides a construction method of a chess deduction intelligent agent based on CNN-PPO, comprising the following steps:
acquiring initial situation data of a military chess deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output;
and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative training until the objective function is minimized and the network converges, thereby realizing the construction of the CNN-PPO agent.
Preprocessing the initial situation data comprises screening it and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprises redundant data, data with missing format, null values and error information.
The overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network;
the convolutional neural network is used for mining potential relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network and the Critic network each use a three-layer fully connected neural network.
Before inputting into the hybrid neural network model for iterative training, the method further comprises inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output; and splicing the Actor network's output with the convolutional neural network's output and inputting the result into the Critic network to obtain the Critic network's output.
Inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network's output comprises inputting the convolutional neural network's output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution from the two parameter values to represent the distribution of actions, wherein μ is the mean of the normal distribution and σ is its variance; and sampling an action from the normal distribution, the action interacting with the environment to obtain the reward value given by the environment and the state at the next moment.
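The action-sampling step above can be sketched in numpy. This is a minimal illustration, not the patent's implementation: the function name and seeding are assumptions, and σ is treated as the standard deviation of the normal distribution.

```python
import numpy as np

def sample_action(mu, sigma, rng=None):
    # Sample a continuous action from the Normal(mu, sigma) policy head
    # produced by the Actor_new network; sigma is treated as the standard
    # deviation (an assumption about the patent's wording).
    rng = rng if rng is not None else np.random.default_rng(0)
    action = rng.normal(mu, sigma)
    # Log-probability of the sampled action under N(mu, sigma); such
    # probabilities are later reused to form the PPO ratio p2/p1.
    log_prob = (-0.5 * ((action - mu) / sigma) ** 2
                - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return action, log_prob
```

The sampled action is sent to the environment, which returns the reward value and the next state.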
Splicing the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network's output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_ and calculating the discounted reward value; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error between the discounted reward values R and V_, and updating the Critic network using a back-propagation mechanism. Here V_ is the estimated value, namely the discounted reward obtained by taking action a in state S.
Inputting into the hybrid neural network model for iterative training means performing N optimization steps on the network parameters using a mean-square-error loss function and performing B optimization steps on the Actor network and the convolutional neural network, until the objective function is minimized and the network converges.
Performing N optimization steps on the network parameters using a mean-square-error loss function and performing B optimization steps on the Actor network and the convolutional neural network comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; evaluating all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, and calculating the ratio p2/p1 from the probability values p1 and p2; and calculating the error of the Actor network, updating the parameters using a back-propagation mechanism, and training the model to convergence, thereby realizing the construction of the CNN-PPO agent.
Example one
As shown in fig. 1, the construction method of a chess deduction intelligent agent based on CNN-PPO provided by the invention comprises the following steps:
Step 1: run the wargame deduction platform, create a wargame deduction scenario, and obtain the situation data returned by the platform. The situation data are generated by the actions of a randomly initialized Actor_new network model against robots built into the environment. Specifically:
1.1 Rule-based agents are built into the wargame deduction platform and can be used for human-vs-machine and machine-vs-machine adversarial training. The Actor_new network fights the built-in agent and generates situation data. The Actor_new network is a three-layer fully connected neural network.
Step 2: screen the situation data returned by the platform in step 1 and remove irregular data. Irregular data mainly refers to redundant data, data with missing format and the like, which are eliminated. In the data generated from fighting the built-in robot, a minority of the reward values are positive and the majority are negative; experiences with positive rewards are preferentially collected.
The step 2 specifically comprises the following steps:
2.1 The situation data mainly comprise own entity attributes, attributes of enemy entities that have been discovered, map attributes and scoreboard information.
2.2 The non-standard data mainly refer to null values, error messages and the like.
The invention adopts the ideas of reinforcement learning and of the influence map. Reinforcement learning casts the problem as a Markov decision process and solves it by iteration. The influence map divides the situation features into primary features and secondary features. The primary features comprise attribute information of own combat entities and attribute information of enemy combat entities; the secondary features comprise map intervisibility information, scoreboard information and influence map information.
Step 3: input the screened data into the influence map module. The input of the influence map module is situation information comprising own/enemy entity information and map information; the output is the influence feature at a given point of the map.
The step 3 specifically comprises the following steps:
3.1 Construction of the influence map module. The influence map module further extracts features from the situation data; the influence within a certain range around an own entity is given by the following formula:
e=ine+high+da+di
where ine is the intervisibility coefficient (two coordinates are intervisible if there is no occlusion between them, and non-intervisible otherwise); high is the elevation, i.e. the altitude in the everyday sense; da is the danger factor; and di is the distance to the command point.
3.2 The output is generally the map point locations of a certain area around the own entity. Taking the hexagonal grid as an example, the output is the influence coefficient of every hexagon within n hexagons of the own entity.
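The influence computation can be sketched as follows. The cell layout and the per-cell feature tuples are hypothetical; the patent only fixes the unweighted sum e = ine + high + da + di.

```python
def influence(ine, high, da, di):
    # e = ine + high + da + di: intervisibility coefficient, elevation,
    # danger factor, and distance to the command point, summed unweighted
    # as in the patent's formula.
    return ine + high + da + di

def influence_ring(cells):
    # cells: hypothetical mapping of hex-grid coordinate -> (ine, high, da, di)
    # for every hex within n rings of the own entity; returns the influence
    # coefficient of each hex.
    return {c: influence(*feats) for c, feats in cells.items()}
```

For example, a two-cell neighborhood `{(0, 1): (1, 0, 0, 0), (1, 0): (0, 0, -2, 1)}` yields influence coefficients 1 and -1.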
When an own entity is in a region where the influence is negative, the reward function may give a negative value as a penalty to the agent; when it is in a region where the influence is positive, a positive value as a reward.
The form of the reward function is as follows:
R = r_a + r_c + r_d + a
where r_a represents the score of the currently surviving own units; r_c the score of the seized control points; r_d the score for eliminated enemy units; and a the current situation score, namely the hit points lost to enemy strikes at the last moment or the effective score of strikes on the enemy.
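A minimal sketch of the reward computation, assuming plain summation as in the formula above. The influence-based shaping term and its weight w are assumptions drawn from the penalty/bonus description of the influence map, not values from the patent.

```python
def reward(r_a, r_c, r_d, a):
    # R = r_a + r_c + r_d + a: surviving own-unit score, seized
    # control-point score, eliminated enemy-unit score, and the current
    # situation score, per the patent's reward formula.
    return r_a + r_c + r_d + a

def shaped_reward(r_a, r_c, r_d, a, influence_e, w=0.1):
    # Optional shaping: penalise (reward) the agent when the own entity
    # sits in a negative- (positive-) influence region. The weight w is
    # a hypothetical choice, not given in the patent.
    sign = (influence_e > 0) - (influence_e < 0)
    return reward(r_a, r_c, r_d, a) + w * sign
```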
Step 4: construct the hybrid neural network, which adopts the proximal policy optimization (PPO) architecture.
The step 4 specifically comprises the following steps:
4.1 Construct a convolutional neural network to mine potential connections between situation data.
4.2 Construct the overall hybrid neural network architecture according to the PPO algorithm architecture. The overall architecture is a CNN-PPO architecture composed of four neural networks: a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network.
The convolutional neural network is used to extract hidden features. The CNN uses convolution kernels of three different sizes, each attending to different potential features. The CNN computation is:
x_t' = σ_cnn(w_cnn ⊙ x_t + b_cnn)
where x_t represents the current state feature, w_cnn the filter weights, b_cnn the bias parameter, and σ_cnn the activation function.
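A numpy sketch of the convolutional feature extraction described above. ReLU is an assumed choice for the unnamed activation σ_cnn, and the kernels are placeholders.

```python
import numpy as np

def cnn_feature(x, w, b=0.0):
    # x' = sigma_cnn(w (conv) x + b); ReLU stands in for the patent's
    # unspecified activation sigma_cnn (an assumption).
    return np.maximum(np.convolve(x, w, mode="valid") + b, 0.0)

def multi_scale_features(x, kernels):
    # Convolution kernels of three different sizes, each attending to
    # different potential features, concatenated into one feature vector.
    return np.concatenate([cnn_feature(x, w) for w in kernels])
```

For a 6-dimensional input and kernel sizes 2, 3 and 4, the concatenated feature vector has 5 + 4 + 3 = 12 entries.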
The Actor network obtains the values of μ and σ from the current state s_t, establishes a normal distribution N from them, samples an action a from N, obtains the reward value r given by the environment, and observes the next state s_{t+1} after the environment changes. The Actor network is then updated with the policy gradient formed from the importance ratio P_θ'(a_t|s_t)/P_θ(a_t|s_t), where P_θ(a_t|s_t) is the sampling policy and P_θ'(a_t|s_t) is the policy after the parameter update.
The Critic network calculates the action value function Q(s_t, a_t) from the input state s_t and action a_t. The Critic network's loss is computed as:
loss = (r + γ·max Q(s', a') − Q(s, a))²
where r is the reward value given by the environment, γ is the discount factor, and Q(s, a) is the action value function, representing the benefit of taking action a in state s.
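The Critic loss can be computed directly from this formula; the sketch below assumes scalar inputs and an externally supplied max over next-state action values.

```python
def critic_loss(r, gamma, q_next_max, q_sa):
    # loss = (r + gamma * max_a' Q(s', a') - Q(s, a))^2, the squared
    # temporal-difference error used to update the Critic.
    td_target = r + gamma * q_next_max
    return (td_target - q_sa) ** 2
```

For example, with r = 1.0, γ = 0.9, max Q(s', a') = 2.0 and Q(s, a) = 2.5, the TD error is 0.3 and the loss is 0.09.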
Construct the Actor_new network, Actor_old network and Critic network according to the PPO algorithm architecture. The Actor_new network uses a three-layer fully connected neural network with 42 neurons in the first layer, 128 in the second and 15 in the third. The Critic network uses a three-layer fully connected neural network with 57 neurons in the first layer, 64 in the second and 1 in the third. The Actor_old network has the same structure as the Actor_new network. After the model is constructed, the network parameters are randomly initialized.
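A numpy sketch of the stated layer widths (42-128-15 for Actor_new, 57-64-1 for Critic), with random initialization as described. The tanh hidden activation and the initialization scale are assumptions; the patent does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(widths):
    # Three fully connected layers with the given neuron counts; each
    # entry is an (in, out) weight matrix with a zero bias, randomly
    # initialised as in the patent.
    return [(0.1 * rng.standard_normal((i, o)), np.zeros(o))
            for i, o in zip(widths[:-1], widths[1:])]

def forward(params, x):
    # tanh hidden activation is an assumed choice; the output layer is linear.
    *hidden, (w_out, b_out) = params
    for w, b in hidden:
        x = np.tanh(x @ w + b)
    return x @ w_out + b_out

actor_new = mlp([42, 128, 15])   # Actor_new layer widths from the patent
critic = mlp([57, 64, 1])        # Critic layer widths from the patent
```

Actor_old would be created the same way and periodically synchronised with Actor_new's parameters.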
Step 5: splice the situation information with the influence features output by the influence module in step 3, and input the result into the convolutional neural network to obtain its output. The input of the convolutional neural network is an 80-dimensional vector formed by splicing the 26-dimensional initial situation with the 54-dimensional influence features; the output is a 42-dimensional vector.
The step 5 specifically comprises the following steps:
5.1 Splice the initial situation information with the feature information extracted by the influence map module and input the merged result into the convolutional neural network. The splicing modes are direct concatenation and element-wise addition.
5.2 The convolutional neural network uses several convolution kernels of different sizes to attend to different potential features.
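The splicing of the 26-dimensional initial situation with the 54-dimensional influence features into the 80-dimensional CNN input can be sketched as direct concatenation (the zero vectors below are placeholders for real situation data):

```python
import numpy as np

def splice(situation, influence_feat):
    # Direct concatenation of the 26-dimensional initial situation with
    # the 54-dimensional influence features into the 80-dimensional
    # input vector of the convolutional neural network.
    return np.concatenate([situation, influence_feat])

spliced = splice(np.zeros(26), np.zeros(54))
```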
Step 6: input the output of the convolutional neural network into the Actor network of the PPO architecture and obtain the Actor network's output.
The step 6 specifically comprises the following steps: 6.1, input the output of the convolutional neural network into the Actor_new network to obtain the two values μ and σ, establish a normal distribution from them to represent the distribution of actions, and sample an action from this distribution.
Step 7: splice the output of the Actor network with the output of the convolutional neural network and input the result into the Critic network to obtain the Critic network's output. Perform N optimization steps on the network parameters using a mean-square-error loss function, and B optimization steps on the Actor network and the convolutional neural network, until the objective function is minimized and the network converges.
The overall flow of data through the four networks is: input the initial situation data into the influence map module to obtain the secondary influence features; splice the initial situation data with the secondary influence features and input them into the convolutional neural network to obtain its output; input the convolutional neural network's output into the Actor_new network to obtain the two values μ and σ, and establish a normal distribution from them to represent the distribution of actions; sample an action from the normal distribution; the action interacts with the environment to obtain the reward value given by the environment and the state at the next moment; input the situation data of the next moment into the Critic network to obtain the network output V_, and calculate the discounted reward value. Input the state values of T moments into the Critic network to obtain T values of V_; calculate the mean square error between the discounted reward values R and V_; then update the Critic network using a back-propagation mechanism.
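The discounted-reward and Critic-update quantities in this flow can be sketched as follows. γ = 0.99 is an assumed discount factor; the patent does not state one.

```python
import numpy as np

def discounted_rewards(rewards, v_last, gamma=0.99):
    # Discounted reward-to-go R_t = r_t + gamma * R_{t+1}, bootstrapped
    # from the Critic's value estimate of the final observed state.
    out, running = [], v_last
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return np.array(out[::-1])

def critic_mse(returns, values):
    # Mean squared error between the discounted rewards R and the
    # Critic outputs V_, minimised by back-propagation.
    return np.mean((np.asarray(returns) - np.asarray(values)) ** 2)
```

With rewards [1, 1], a bootstrap value of 0 and γ = 0.5, the discounted rewards are [1.5, 1.0].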
Input all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; evaluate all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, then calculate the ratio p2/p1 from p1 and p2; then calculate the error of the Actor network from this ratio, and update the parameters using a back-propagation mechanism.
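A hedged sketch of the ratio-based Actor update: the clipped-surrogate form and ε = 0.2 are the standard PPO formulation, which the patent builds on but does not spell out, and the advantage values are assumed to be supplied externally.

```python
import numpy as np

def ppo_actor_loss(p1, p2, advantages, eps=0.2):
    # Clipped surrogate loss built from the probability ratio p2/p1,
    # where p1 comes from Actor_old (N2's counterpart N1... here: the
    # sampling policy) and p2 from Actor_new (the updated policy).
    # Minimising the negative mean surrogate is the gradient-ascent step.
    ratio = np.asarray(p2) / np.asarray(p1)
    adv = np.asarray(advantages)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

With p1 = [1, 1], p2 = [2, 0.5] and unit advantages, the ratios 2 and 0.5 are clipped to [0.8, 1.2], giving a surrogate of (1.2 + 0.5)/2 = 0.85 and a loss of -0.85.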
The model is trained until convergence, at which point the construction of the CNN-PPO agent is complete.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
Claims (8)
1. A method for constructing a chess deduction agent based on CNN-PPO, characterized by comprising the following steps:
acquiring initial situation data of a chess deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and obtaining influence features as output;
and constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, splicing the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative training until the objective function is minimized and the network converges, thereby realizing the construction of the CNN-PPO agent.
2. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 1, wherein
preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprises attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprises redundant data, data with missing format, null values and error information.
3. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 1, wherein
the overall architecture of the hybrid neural network model is the CNN-PPO architecture, comprising a convolutional neural network, an Actor_new network, an Actor_old network, and a Critic network;
the convolutional neural network is used for mining potential relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network, and the Critic network each use a three-layer fully connected neural network.
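The patent gives no implementation for the three-layer fully connected networks named in claim 3. The following is a minimal pure-Python sketch of such a network; the layer widths, uniform initialisation, and tanh activation are illustrative assumptions, not details from the patent:

```python
import math
import random

random.seed(0)

def dense(in_dim, out_dim):
    """Randomly initialised weights and biases for one fully connected layer."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return w, b

def forward(layer, x, activation=math.tanh):
    """Apply one fully connected layer: activation(Wx + b)."""
    w, b = layer
    return [activation(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

class ThreeLayerNet:
    """Three fully connected layers, as used by Actor_new, Actor_old and Critic."""
    def __init__(self, in_dim, hidden, out_dim):
        self.l1 = dense(in_dim, hidden)
        self.l2 = dense(hidden, hidden)
        self.l3 = dense(hidden, out_dim)

    def __call__(self, x):
        h = forward(self.l1, x)
        h = forward(self.l2, h)
        return forward(self.l3, h, activation=lambda v: v)  # linear output head

# Per claims 5 and 6: the Actor head outputs two values (mu, sigma),
# the Critic head outputs a single value V_.
actor = ThreeLayerNet(in_dim=8, hidden=16, out_dim=2)
critic = ThreeLayerNet(in_dim=8, hidden=16, out_dim=1)
features = [0.1] * 8  # stand-in for CNN-extracted hidden features
mu_sigma = actor(features)
v_estimate = critic(features)
```

In a full implementation these heads would consume the CNN feature vector described in claim 2's situation-data pipeline; the 8-dimensional input here is a placeholder.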
4. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 3, wherein
before the iterative training of the hybrid neural network model, the method further comprises: inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output; and concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output.
5. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 4, wherein
inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output comprises: inputting the output of the convolutional neural network into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution from these two parameters to represent the distribution of actions, wherein μ is the mean of the normal distribution and σ is its variance; sampling an action from the normal distribution; and obtaining, through interaction of the action with the environment, the reward value given by the environment and the state at the next moment.
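The sampling step of claim 5 can be sketched as follows. The values of μ and σ are hypothetical, and σ is treated here as a standard deviation for sampling (the claim labels it the variance; the two differ only by a square):

```python
import math
import random

random.seed(42)

# Hypothetical Actor_new outputs: mu (mean) and sigma (spread) of the
# action distribution established in claim 5.
mu, sigma = 0.5, 0.2

# Sample an action from the normal distribution N(mu, sigma^2).
action = random.gauss(mu, sigma)

def normal_log_prob(x, mu, sigma):
    """Log-density of x under N(mu, sigma^2); needed later for the PPO ratio."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

logp = normal_log_prob(action, mu, sigma)
# The sampled action would then be executed in the environment to obtain
# the reward value and the state at the next moment.
```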
6. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 5, wherein
concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output comprises: inputting the situation data of the next moment into the Critic network to obtain the network output V_; inputting the state values of T moments into the Critic network to obtain T V_ values; calculating the mean square error between the discounted reward value R and the V_ values, and updating the Critic network through back-propagation; wherein V_ is the estimated value, and R is the discounted reward obtained by taking action a in state s.
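The discounted reward R and the mean square error against the Critic estimates V_ described in claim 6 can be sketched as follows; the discount factor γ and the reward and value numbers are illustrative assumptions:

```python
def discounted_returns(rewards, gamma=0.9):
    """Discounted return R_t = r_t + gamma * R_{t+1}, computed backwards over T steps."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def mse(returns, values):
    """Mean square error between discounted rewards R and Critic estimates V_."""
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / len(returns)

rewards = [1.0, 0.0, 2.0]        # hypothetical rewards over T = 3 moments
v_values = [2.0, 1.5, 2.0]       # hypothetical Critic outputs V_
R = discounted_returns(rewards)  # with gamma = 0.9: [2.62, 1.8, 2.0]
loss = mse(R, v_values)          # this loss drives the back-propagation update
```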
7. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 1, wherein
the iterative training of the hybrid neural network model comprises optimizing the network parameters N times using a mean square error loss function, and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
8. The method for constructing a chess deduction agent based on CNN-PPO as claimed in claim 7, wherein
optimizing the network parameters N times using the mean square error loss function and optimizing the Actor network and the convolutional neural network B times comprises: inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the normal action distributions N1 and N2; evaluating all actions in the experience pool under N1 and N2 to obtain the probabilities p1 and p2, and calculating the value of p2/p1 from the probability values p1 and p2; and calculating the error of the Actor network, updating the parameters through back-propagation, and training the model until convergence, thereby completing the construction of the CNN-PPO agent.
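The probability ratio and Actor error of claim 8 can be sketched as follows. The patent does not spell out the Actor loss; the clipped surrogate with ε = 0.2 shown here is the standard PPO-clip form, and the mapping of the ratio to new/old policy densities, like all the numbers, is an assumption:

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of x under N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-clip objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Densities of one stored action under the distributions from Actor_new (N1)
# and Actor_old (N2); the means and spread are illustrative only.
action = 0.4
p_new = normal_pdf(action, mu=0.4, sigma=0.2)   # from N1
p_old = normal_pdf(action, mu=0.1, sigma=0.2)   # from N2
ratio = p_new / p_old
actor_loss = -clipped_surrogate(ratio, advantage=1.0)  # negated: optimizers minimize
```

Clipping bounds the policy update so that the new distribution cannot move too far from the old one in a single optimization round, which is the motivation for keeping both Actor_new and Actor_old networks.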
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210232129.XA CN114722998B (en) | 2022-03-09 | 2022-03-09 | Construction method of soldier chess deduction intelligent body based on CNN-PPO |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722998A true CN114722998A (en) | 2022-07-08 |
CN114722998B CN114722998B (en) | 2024-02-02 |
Family
ID=82238024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210232129.XA Active CN114722998B (en) | 2022-03-09 | 2022-03-09 | Construction method of soldier chess deduction intelligent body based on CNN-PPO |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722998B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115829034A (en) * | 2023-01-09 | 2023-03-21 | 白杨时代(北京)科技有限公司 | Method and device for constructing knowledge rule execution framework |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130325774A1 (en) * | 2012-06-04 | 2013-12-05 | Brain Corporation | Learning stochastic apparatus and methods |
US20150100530A1 (en) * | 2013-10-08 | 2015-04-09 | Google Inc. | Methods and apparatus for reinforcement learning |
US20170024643A1 (en) * | 2015-07-24 | 2017-01-26 | Google Inc. | Continuous control with deep reinforcement learning |
CN108171796A (en) * | 2017-12-25 | 2018-06-15 | 燕山大学 | A kind of inspection machine human visual system and control method based on three-dimensional point cloud |
CN109948642A (en) * | 2019-01-18 | 2019-06-28 | 中山大学 | Multiple agent cross-module state depth deterministic policy gradient training method based on image input |
CN111605565A (en) * | 2020-05-08 | 2020-09-01 | 昆山小眼探索信息科技有限公司 | Automatic driving behavior decision method based on deep reinforcement learning |
CN112861442A (en) * | 2021-03-10 | 2021-05-28 | 中国人民解放军国防科技大学 | Multi-machine collaborative air combat planning method and system based on deep reinforcement learning |
CN113222106A (en) * | 2021-02-10 | 2021-08-06 | 西北工业大学 | Intelligent military chess deduction method based on distributed reinforcement learning |
CN113947022A (en) * | 2021-10-20 | 2022-01-18 | 哈尔滨工业大学(深圳) | Near-end strategy optimization method based on model |
Non-Patent Citations (3)
Title |
---|
OGUZHAN DOGRU et al.: "Actor–Critic Reinforcement Learning and Application in Developing Computer-Vision-Based Interface Tracking", ENGINEERING, vol. 7, no. 9, pages 1248-1261 *
CUI, Wenhua; LI, Dong; TANG, Yubo; LIU, Shaojun: "A Decision-Making Framework for Wargame Deduction Based on Deep Reinforcement Learning", National Defense Technology, no. 02, pages 118-126 *
XUE, Ao: "Research and Implementation of Intelligent Confrontation in Wargame Deduction Based on Reinforcement Learning", China Master's Theses Full-text Database, Social Sciences I, no. 01, page 6 *
Also Published As
Publication number | Publication date |
---|---|
CN114722998B (en) | 2024-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112329348B (en) | Intelligent decision-making method for military countermeasure game under incomplete information condition | |
CN110929394B (en) | Combined combat system modeling method based on super network theory and storage medium | |
CN113392521B (en) | Method and system for constructing resource marshalling model for air-sea joint combat mission | |
CN113222106B (en) | Intelligent soldier chess deduction method based on distributed reinforcement learning | |
Gmytrasiewicz et al. | Bayesian update of recursive agent models | |
CN112364972A (en) | Unmanned fighting vehicle team fire power distribution method based on deep reinforcement learning | |
CN114722998B (en) | Construction method of soldier chess deduction intelligent body based on CNN-PPO | |
CN114485665A (en) | Unmanned aerial vehicle flight path planning method based on sparrow search algorithm | |
CN113282100A (en) | Unmanned aerial vehicle confrontation game training control method based on reinforcement learning | |
Wang et al. | Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction | |
CN116661503B (en) | Cluster track automatic planning method based on multi-agent safety reinforcement learning | |
CN116596343A (en) | Intelligent soldier chess deduction decision method based on deep reinforcement learning | |
CN113283574B (en) | Method and device for controlling intelligent agent in group confrontation, electronic equipment and storage medium | |
CN113988301B (en) | Tactical strategy generation method and device, electronic equipment and storage medium | |
CN115909027A (en) | Situation estimation method and device | |
CN113128698B (en) | Reinforced learning method for multi-unmanned aerial vehicle cooperative confrontation decision | |
CN114662655A (en) | Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device | |
CN115220458A (en) | Distributed decision-making method for multi-robot multi-target enclosure based on reinforcement learning | |
CN114202175A (en) | Combat mission planning method and system based on artificial intelligence | |
CN113255893A (en) | Self-evolution generation method of multi-agent action strategy | |
CN114239834A (en) | Adversary relationship reasoning method and device based on multi-round confrontation attribute sharing | |
CN113324545A (en) | Multi-unmanned aerial vehicle collaborative task planning method based on hybrid enhanced intelligence | |
Tran et al. | Adaptation of a mamdani fuzzy inference system using neuro-genetic approach for tactical air combat decision support system | |
CN114611669B (en) | Intelligent decision-making method for chess deduction based on double experience pool DDPG network | |
CN117252081A (en) | Method for dynamically allocating air defense weapon-target to be driven |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||