CN114722998A - Method for constructing a wargame deduction agent based on CNN-PPO - Google Patents

Method for constructing a wargame deduction agent based on CNN-PPO

Info

Publication number
CN114722998A
CN114722998A
Authority
CN
China
Prior art keywords
network
actor
output
ppo
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210232129.XA
Other languages
Chinese (zh)
Other versions
CN114722998B (en)
Inventor
张震
臧兆祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202210232129.XA priority Critical patent/CN114722998B/en
Publication of CN114722998A publication Critical patent/CN114722998A/en
Application granted granted Critical
Publication of CN114722998B publication Critical patent/CN114722998B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a method for constructing a wargame deduction agent based on CNN-PPO, which comprises the following steps: acquiring initial situation data from a wargame deduction platform and preprocessing the data to obtain target situation data; constructing an influence map module, inputting the target situation data into the influence map module, and outputting influence features; constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative model training until the objective function is minimized and the network converges, thereby constructing the CNN-PPO agent. The invention increases the agent's understanding of the situation and, to a certain degree, its combat strength.

Description

Method for constructing a wargame deduction agent based on CNN-PPO
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method for constructing a wargame deduction agent based on CNN-PPO.
Background
Wargame deduction mainly uses the experience and rules summarized from combat practice to deduce and analyze the course of an engagement. With the rapid development of computing power, various new technologies have been applied to wargame deduction; computer wargame deduction has become a main branch of the field, and countries around the world regard it as a means of improving military capability.
In specific wargame deduction, the task can generally be simplified to the following problem: under the constraints of certain objective rules, achieve a given goal, such as seizing a control point or destroying enemy forces, through actions such as force deployment, maneuver and attack. The purpose of constructing a wargame deduction agent is to obtain a commander that can autonomously make appropriate action decisions according to the current battlefield situation. Agents are classified as rule-based or learning-based according to whether they possess learning ability. A rule-based agent is implemented by hard-coded programming, with many conditional branches specifying that the agent takes a certain action at a certain time; a commonly used technique is the behavior tree. Learning-based agents are agents with autonomous learning ability, typified by machine learning models that can update their network parameters during combat, so that better models are obtained.
Existing agent construction methods mainly comprise rule-based models and neural network models. Because the state space in wargame deduction is enormous, rules formulated from expert experience can hardly cover all situations and can only classify states coarsely, so a rule-based agent makes rigid decisions and cannot respond flexibly to unexpected events. The main difficulties faced by neural network models are that the sparse rewards given by the environment make it hard to update network parameters effectively, dimension explosion, and the like.
Disclosure of Invention
In order to solve the above problems, the present invention provides the following solution: a method for constructing a wargame deduction agent based on CNN-PPO, comprising the following steps:
acquiring initial situation data from a wargame deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and outputting influence features;
constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative model training until the objective function is minimized and the network converges, thereby constructing the CNN-PPO agent.
Preferably, preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprise attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprise redundant data, data with missing formats, null values and error information.
Preferably, the overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network;
the convolutional neural network is used for mining the latent relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network and the Critic network all use three-layer fully connected neural networks.
Preferably, before inputting data into the hybrid neural network model for iterative training, the method further comprises inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output, and concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output.
Preferably, inputting the output of the convolutional neural network into the Actor network of the PPO architecture and obtaining the Actor network output comprises inputting the convolutional neural network output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values to represent the distribution of actions, where μ is the mean of the normal distribution and σ is its variance; sampling an action from the normal distribution; and obtaining, through interaction between the action and the environment, the reward value given by the environment and the state at the next moment.
Preferably, concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_ and calculating the discounted reward value; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error of the discounted reward values R and V_ and updating the Critic network with a back-propagation mechanism, where V_ is the estimated value and the discounted reward value is the return obtained by taking action a in state S.
Preferably, inputting data into the hybrid neural network model for iterative training comprises optimizing the network parameters N times with a mean-square-error loss function and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
Preferably, optimizing the network parameters N times with the mean-square-error loss function and optimizing the Actor network and the convolutional neural network B times comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; inputting all actions in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and calculating the value of p2/p1 from p1 and p2; and calculating the error of the Actor network, updating the parameters with a back-propagation mechanism, and training the model until convergence, thereby constructing the CNN-PPO agent.
The invention discloses the following technical effects:
the invention provides a military chess deduction intelligent body construction method based on CNN-PPO, which is characterized in that potential association mining is carried out on initial situation data based on a convolutional neural network to obtain influence characteristic information, the influence characteristic and the initial situation data are input into a PPO algorithm model together for learning, a hybrid neural network model is formed by adopting the Convolutional Neural Network (CNN) and near-end strategy optimization (PPO), and the characteristic formed by an influence map is artificially added in the aspect of characteristic processing. This makes the convolutional neural network converge faster when processing the feature data, and the action selection given by the whole agent is more careful. The comprehension degree of the intelligent agent to the situation is increased, and the fighting intensity of the intelligent agent is increased to a certain degree.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the invention provides a method for constructing a wargame deduction agent based on CNN-PPO, comprising the following steps:
acquiring initial situation data from a wargame deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and outputting influence features;
constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative model training until the objective function is minimized and the network converges, thereby constructing the CNN-PPO agent.
Preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data.
The initial situation data comprise attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information.
The non-standard data comprise redundant data, data with missing formats, null values and error information.
The overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network.
The convolutional neural network is used for mining the latent relations among the target situation data and extracting hidden features.
The Actor_new network, the Actor_old network and the Critic network all use three-layer fully connected neural networks.
Before inputting data into the hybrid neural network model for iterative training, the method further comprises inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output, and concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output.
Inputting the output of the convolutional neural network into the Actor network of the PPO architecture and obtaining the Actor network output comprises inputting the convolutional neural network output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values to represent the distribution of actions, where μ is the mean of the normal distribution and σ is its variance; sampling an action from the normal distribution; and obtaining, through interaction between the action and the environment, the reward value given by the environment and the state at the next moment. A sketch of this sampling step follows.
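The following is a minimal sketch of sampling an action from the Normal(μ, σ) policy head using PyTorch's Normal distribution. The gym-style env.step() interface, the tensor shapes, and the assumption that σ is already positive are illustrative and not specified by the patent.

```python
import torch
from torch.distributions import Normal

def sample_and_step(mu: torch.Tensor, sigma: torch.Tensor, env):
    """Build N(mu, sigma) from the Actor_new outputs, sample an action, and step the environment once."""
    dist = Normal(mu, sigma)                  # normal distribution of actions (sigma assumed positive)
    action = dist.sample()                    # action sampled from the distribution
    next_state, reward, done, info = env.step(action.numpy())  # hypothetical gym-style environment
    return action, reward, next_state, done
```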
Concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_ and calculating the discounted reward value; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error of the discounted reward values R and V_ and updating the Critic network with a back-propagation mechanism, where V_ is the estimated value and the discounted reward value is the return obtained by taking action a in state S.
Inputting data into the hybrid neural network model for iterative training comprises optimizing the network parameters N times with a mean-square-error loss function and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
Optimizing the network parameters N times with the mean-square-error loss function and optimizing the Actor network and the convolutional neural network B times comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; inputting all actions in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and calculating the value of p2/p1 from p1 and p2; and calculating the error of the Actor network, updating the parameters with a back-propagation mechanism, and training the model until convergence, thereby constructing the CNN-PPO agent.
Example one
As shown in FIG. 1, the method for constructing a wargame deduction agent based on CNN-PPO provided by the invention comprises the following steps:
step 1: and operating the weapon and chess deduction platform, creating a weapon and chess deduction scene, and obtaining situation data returned by the platform. The situation data are generated by the action of a randomly initialized neural network model Actor _ new network and robots built in the environment. The method specifically comprises the following steps:
1.1 Rule-based agents are built into the wargame deduction platform and can be used for human-machine and machine-machine adversarial training. The Actor_new network fights against the built-in agent to generate situation data. The Actor_new network is a three-layer fully connected neural network.
Step 2: Screen the situation data returned by the platform in Step 1 and remove irregular data. Irregular data mainly refers to redundant data, data with missing formats, and the like; such data are eliminated. In the data generated by fighting the built-in robot, a small portion of the reward values are positive and the majority are negative, so experiences with positive rewards are collected preferentially (a sketch of this screening step appears after 2.2 below).
The step 2 specifically comprises the following steps:
2.1 The situation data mainly comprise own entity attributes, the attributes of detected enemy entities, map attributes and scoreboard information.
2.2 The non-standard data mainly refer to null values, error messages and the like.
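As referenced in Step 2, the following is a minimal sketch of the screening and positive-reward-first collection. The record field names ("state", "action", "reward", "next_state") are illustrative assumptions, since the patent does not fix a storage format.

```python
def screen_experiences(raw_experiences,
                       required_keys=("state", "action", "reward", "next_state")):
    """Remove malformed records and collect positive-reward experiences first."""
    cleaned = []
    for exp in raw_experiences:
        # remove redundant / format-missing / null / erroneous records
        if not all(key in exp and exp[key] is not None for key in required_keys):
            continue
        cleaned.append(exp)
    # positive-reward experiences are collected preferentially (stable sort keeps order otherwise)
    cleaned.sort(key=lambda exp: exp["reward"] < 0)
    return cleaned
```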
The invention adopts the ideas of reinforcement learning and of the influence map. Reinforcement learning formulates the problem as a Markov decision process and solves it through iteration. The influence map divides the situation features into primary features and secondary features. The primary features comprise the attribute information of own combat entities and of enemy combat entities; the secondary features comprise map intervisibility information, scoreboard information and influence map information.
Step 3: Input the screened data into the influence map module. The input of the influence map module is the situation information, comprising own/enemy entity information and map information; the output is the influence feature of a given map point.
The step 3 specifically comprises the following steps:
3.1 Construction of the influence map module. The influence map module further extracts features from the situation data; the influence within a certain range around an own entity is given by the following formula:
e = ine + high + da + di
In the formula, ine is the intervisibility coefficient, where intervisibility means whether there is occlusion between two coordinates: no occlusion is called intervisible and occlusion is called non-intervisible; high is the elevation, i.e., altitude in the everyday sense; da is the danger coefficient; and di is the distance to the command point.
3.2 The output is generally set as the map locations of a certain area around the own entity. Taking hexagonal grids as an example, the output is the influence coefficient of every hexagon within n hexagons of the own entity, as sketched below.
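A minimal sketch of computing e = ine + high + da + di for every hexagon within n hexagons of an own entity. The terrain lookup structure and the list of neighbouring hexes are assumptions introduced only for illustration.

```python
def influence(ine, high, da, di):
    """Influence of one map point: intervisibility coefficient + elevation + danger coefficient + distance to command point."""
    return ine + high + da + di

def influence_features(neighbour_hexes, terrain):
    """Influence coefficients of all hexagons within n hexagons of the own entity.

    `terrain` is assumed to map a hex id to a dict with its ine/high/da/di values."""
    return [influence(terrain[h]["ine"], terrain[h]["high"],
                      terrain[h]["da"], terrain[h]["di"])
            for h in neighbour_hexes]
```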
When the own entity is in a region where the influence is negative, the reward function may give a negative value as a penalty to the agent; when the own entity is in a region where the influence is positive, a positive value is given as a reward.
The form of the reward function is as follows:
R = r_a + r_c + r_d + a
where r_a is the score of the currently surviving own units; r_c is the score for seized control points; r_d is the score for eliminated enemy units; and a is the current situation score, i.e., the hit points lost to strikes at the previous moment or the effective score for striking the enemy.
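A minimal sketch of the reward R = r_a + r_c + r_d + a described above; the argument names are placeholders for score values assumed to be supplied by the deduction platform.

```python
def reward(r_a, r_c, r_d, a):
    """R = r_a (surviving own units) + r_c (seized control points) + r_d (eliminated enemy units) + a (situation score)."""
    return r_a + r_c + r_d + a

# example with placeholder scores: reward(12.0, 30.0, 8.0, -2.5) -> 47.5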
Step 4: Construct the hybrid neural network, which is built on the proximal policy optimization (PPO) architecture.
The step 4 specifically comprises the following steps:
4.1 Construct a convolutional neural network to mine the latent relations among the situation data.
4.2 Construct the overall architecture of the hybrid neural network according to the PPO algorithm architecture. The overall architecture is a CNN-PPO architecture composed of four neural networks: a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network.
The convolutional neural network is used to extract hidden features. The CNN uses convolution kernels of three different sizes, each attending to different latent features. The CNN is computed as:
x_t = σ_cnn(w_cnn ⊙ x_t + b_cnn)
where x_t is the current state feature, w_cnn the filter weights, b_cnn the bias parameter, and σ_cnn the activation function.
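A minimal PyTorch sketch of a feature extractor with convolution kernels of three different sizes over the spliced 80-dimensional input, projected to the 42-dimensional output. The kernel sizes, channel count and projection layer are assumptions; the patent fixes only the input and output dimensions.

```python
import torch
import torch.nn as nn

class SituationCNN(nn.Module):
    def __init__(self, in_dim=80, out_dim=42, channels=8, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # three 1-D convolutions with different kernel sizes, each attending to different latent features
        self.convs = nn.ModuleList(
            [nn.Conv1d(1, channels, k, padding=k // 2) for k in kernel_sizes]
        )
        self.act = nn.ReLU()                                    # plays the role of sigma_cnn
        self.proj = nn.Linear(channels * len(kernel_sizes) * in_dim, out_dim)

    def forward(self, x):                                       # x: (batch, 80)
        x = x.unsqueeze(1)                                      # (batch, 1, 80) for Conv1d
        feats = [self.act(conv(x)) for conv in self.convs]
        flat = torch.cat([f.flatten(1) for f in feats], dim=1)  # concatenate all kernel responses
        return self.proj(flat)                                  # (batch, 42)
```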
The Actor network obtains the values of μ and σ from the current state s_t, establishes a normal distribution N from them, samples an action a from N, obtains the reward value r given by the environment, and observes the next state s_{t+1} after the environment changes. The gradient is given as:
[equation image in the original: the policy-gradient objective expressed with the probability ratio P_θ'(a_t|s_t) / P_θ(a_t|s_t)]
The Actor network is then updated using this gradient.
where P_θ(a_t|s_t) is the sampling policy and P_θ'(a_t|s_t) is the sampling policy after the parameter update.
The Critic network computes the action-value function Q(s_t, a_t) from the input state s_t and action a_t. The Critic loss is computed as:
loss = (r + γ·max Q(s', a') - Q(s, a))²
where r is the reward value given by the environment, γ is the discount factor, and Q(s, a) is the action-value function, representing the return of taking action a in state s.
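A minimal sketch of this loss for a critic that outputs Q-values over a discrete set of candidate actions; the critic module itself, the batch shapes, and γ = 0.99 are assumptions used only to make the formula concrete.

```python
import torch

def critic_loss(critic, s, a, r, s_next, gamma=0.99):
    """loss = (r + gamma * max_a' Q(s', a') - Q(s, a))^2, averaged over a batch."""
    q_sa = critic(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = critic(s_next).max(dim=1).values                  # max_a' Q(s', a')
    td_error = r + gamma * q_next - q_sa
    return (td_error ** 2).mean()
```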
Construct the Actor_new, Actor_old and Critic networks according to the PPO algorithm architecture. The Actor_new network uses a three-layer fully connected neural network with 42 neurons in the first layer, 128 in the second and 15 in the third. The Critic network uses a three-layer fully connected neural network with 57 neurons in the first layer, 64 in the second and 1 in the third. The Actor_old network has the same structure as the Actor_new network. After the model is constructed, the network parameters are randomly initialized.
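A minimal PyTorch sketch of these fully connected networks with the layer widths given above (Actor: 42-128-15; Critic: 57-64-1, where 57 matches the concatenation of the 42-dimensional CNN output and the 15-dimensional Actor output). The ReLU activations are an assumption, since the patent specifies only the neuron counts.

```python
import torch.nn as nn

def make_actor():
    # Actor_new / Actor_old: 42 -> 128 -> 15 fully connected layers
    return nn.Sequential(nn.Linear(42, 128), nn.ReLU(), nn.Linear(128, 15))

def make_critic():
    # Critic: 57 -> 64 -> 1 fully connected layers
    return nn.Sequential(nn.Linear(57, 64), nn.ReLU(), nn.Linear(64, 1))

actor_new, actor_old, critic = make_actor(), make_actor(), make_critic()
actor_old.load_state_dict(actor_new.state_dict())   # Actor_old mirrors Actor_new's structure (copying weights is an assumption)
```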
Step 5: Concatenate the situation information with the influence features output by the influence module in Step 3 and input the result into the convolutional neural network to obtain the output of the convolutional neural network. The input of the convolutional neural network is an 80-dimensional vector formed by concatenating the 26-dimensional initial situation with the 54-dimensional influence features, and the output is a 42-dimensional vector.
The step 5 specifically comprises the following steps:
5.1 Concatenate the initial situation information with the feature information extracted by the influence map module and input the merged result into the convolutional neural network. The merging modes are direct concatenation and element-wise addition.
5.2 The convolutional neural network uses several convolution kernels of different sizes to attend to different latent features; the merging step of 5.1 is sketched below.
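A minimal sketch of the two merging modes in 5.1: direct concatenation of the 26-dimensional initial situation and the 54-dimensional influence features into the 80-dimensional CNN input, and element-wise addition for equally sized feature vectors. The random tensors are placeholders only.

```python
import torch

situation = torch.randn(1, 26)        # preprocessed initial situation (placeholder values)
influence = torch.randn(1, 54)        # influence-map features (placeholder values)

cnn_input = torch.cat([situation, influence], dim=1)   # direct concatenation -> shape (1, 80)

a = torch.randn(1, 42)
b = torch.randn(1, 42)
merged = a + b                         # element-wise addition of equally sized feature vectors
```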
Step 6: Input the output of the convolutional neural network into the Actor network of the PPO architecture and obtain the output of the Actor network.
The step 6 specifically comprises the following steps:
6.1 Construct an experience pool and store each piece of experience information in the following format:
[image in the original: the storage format of each experience record]
Step 7: Concatenate the output of the Actor network with the output of the convolutional neural network and input the result into the Critic network to obtain the Critic network output. Optimize the network parameters N times using the mean-square-error loss function, and optimize the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
The overall flow of data through the four networks is as follows: input the initial situation data into the influence map module to obtain the secondary influence features; concatenate the initial situation data with the secondary influence features and input them into the convolutional neural network to obtain its output; input the convolutional neural network output into the Actor_new network to obtain the two values μ and σ, and establish a normal distribution from them to represent the distribution of actions; sample an action from the normal distribution; let the action interact with the environment to obtain the reward value given by the environment and the state at the next moment; input the situation data of the next moment into the Critic network to obtain the network output V_, and calculate the discounted reward value; input the state values of T moments into the Critic network to obtain T values of V_; calculate the mean square error of the discounted reward values R and V_; and then update the Critic network using a back-propagation mechanism. A sketch of this Critic-side update follows.
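A minimal sketch of the Critic-side update in this flow: compute the discounted reward values from the T stored rewards, evaluate V_ for the T states, and minimize their mean-square error by back-propagation. The optimizer, γ = 0.99 and the bootstrap value for the last state are assumptions.

```python
import torch

def discounted_returns(rewards, v_last, gamma=0.99):
    """Discounted reward values R, bootstrapped from the value of the last observed state.

    `rewards` is a list of floats for T consecutive moments; `v_last` is a float."""
    out, running = [], v_last
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def update_critic(critic, optimizer, states, rewards, v_last):
    returns = discounted_returns(rewards, v_last)   # discounted reward values R for T moments
    v = critic(states).squeeze(-1)                  # T values of V_
    loss = ((returns - v) ** 2).mean()              # mean-square error of R and V_
    optimizer.zero_grad()
    loss.backward()                                 # back-propagation mechanism
    optimizer.step()
    return loss.item()
```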
Input all state values in the experience pool into the Actor_new and Actor_old networks respectively to obtain the action distributions N1 and N2; input all action values a in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and then calculate the value of p2/p1 from p1 and p2; then calculate the error of the Actor network with the following formula and update the parameters using a back-propagation mechanism.
[equation image in the original: the Actor-network error computed from the probability ratio p2/p1]
Train the model until convergence; the CNN-PPO agent is then constructed. A sketch of this Actor update follows.
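A minimal sketch of this Actor update: the probabilities p1 and p2 of the stored actions under Actor_old and Actor_new give the ratio p2/p1, which weights the advantage in a clipped surrogate loss (the standard PPO form; the exact error formula appears only as an image in the original) minimized by back-propagation. The actor interface returning (μ, σ) and the precomputed advantage values are assumptions.

```python
import torch
from torch.distributions import Normal

def update_actor(actor_new, actor_old, optimizer, states, actions, advantages, clip_eps=0.2):
    mu_new, sigma_new = actor_new(states)                   # assumed to return (mu, sigma)
    with torch.no_grad():
        mu_old, sigma_old = actor_old(states)
    dist_new, dist_old = Normal(mu_new, sigma_new), Normal(mu_old, sigma_old)
    log_p2 = dist_new.log_prob(actions).sum(-1)             # log-probability under Actor_new
    log_p1 = dist_old.log_prob(actions).sum(-1)             # log-probability under Actor_old
    ratio = torch.exp(log_p2 - log_p1)                      # p2 / p1
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    loss = -surrogate.mean()                                 # Actor error, minimized by back-propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```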
The above-described embodiments merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements to the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the invention, and such modifications fall within the protection scope defined by the claims.

Claims (8)

1. A method for constructing a wargame deduction agent based on CNN-PPO, characterized by comprising the following steps:
acquiring initial situation data from a wargame deduction platform, and preprocessing the initial situation data to obtain target situation data;
constructing an influence map module, inputting the target situation data into the influence map module, and outputting influence features;
constructing a hybrid neural network model based on a convolutional neural network and proximal policy optimization, concatenating the target situation data with the influence features, inputting the spliced data into the hybrid neural network model, and performing iterative model training until the objective function is minimized and the network converges, thereby constructing the CNN-PPO agent.
2. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 1, wherein
preprocessing the initial situation data comprises screening the initial situation data and removing non-standard data to obtain the target situation data;
the initial situation data comprise attribute information of own combat entities, attribute information of enemy combat entities, map intervisibility attribute information and scoreboard information;
the non-standard data comprise redundant data, data with missing formats, null values and error information.
3. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 1, wherein
the overall architecture of the hybrid neural network model is a CNN-PPO architecture comprising a convolutional neural network, an Actor_new network, an Actor_old network and a Critic network;
the convolutional neural network is used for mining the latent relations among the target situation data and extracting hidden features;
the Actor_new network, the Actor_old network and the Critic network all use three-layer fully connected neural networks.
4. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 3, wherein
before inputting data into the hybrid neural network model for iterative training, the method further comprises inputting the output of the convolutional neural network into the Actor network of the PPO architecture to obtain the Actor network output; and concatenating the Actor network output with the convolutional neural network output and inputting the result into the Critic network to obtain the Critic network output.
5. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 4, wherein
inputting the output of the convolutional neural network into the Actor network of the PPO architecture and obtaining the Actor network output comprises inputting the convolutional neural network output into the Actor_new network to obtain the two parameter values μ and σ; establishing a normal distribution based on the two parameter values to represent the distribution of actions, where μ is the mean of the normal distribution and σ is its variance; sampling an action from the normal distribution; and obtaining, through interaction between the action and the environment, the reward value given by the environment and the state at the next moment.
6. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 5, wherein
concatenating the output of the Actor network with the output of the convolutional neural network and inputting the result into the Critic network to obtain the Critic network output comprises inputting the situation data of the next moment into the Critic network to obtain the network output V_; inputting the state values of T moments into the Critic network to obtain T values of V_; and calculating the mean square error of the discounted reward value R and the V_ values and updating the Critic network with a back-propagation mechanism; where V_ is the estimated value and the discounted reward value is the return obtained by taking action a in state S.
7. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 1, wherein
inputting data into the hybrid neural network model for iterative training comprises optimizing the network parameters N times with a mean-square-error loss function and optimizing the Actor network and the convolutional neural network B times, until the objective function is minimized and the network converges.
8. The method for constructing a wargame deduction agent based on CNN-PPO as claimed in claim 7, wherein
optimizing the network parameters N times with the mean-square-error loss function and optimizing the Actor network and the convolutional neural network B times comprises inputting all state values in the experience pool into the Actor_new network and the Actor_old network respectively to obtain the action distributions N1 and N2; inputting all actions in the experience pool into N1 and N2 to obtain the probabilities p1 and p2, and calculating the value of p2/p1 from p1 and p2; and calculating the error of the Actor network, updating the parameters with a back-propagation mechanism, and training the model until convergence, thereby constructing the CNN-PPO agent.
CN202210232129.XA 2022-03-09 2022-03-09 Method for constructing a wargame deduction agent based on CNN-PPO Active CN114722998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210232129.XA CN114722998B (en) 2022-03-09 2022-03-09 Method for constructing a wargame deduction agent based on CNN-PPO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210232129.XA CN114722998B (en) 2022-03-09 2022-03-09 Method for constructing a wargame deduction agent based on CNN-PPO

Publications (2)

Publication Number Publication Date
CN114722998A true CN114722998A (en) 2022-07-08
CN114722998B CN114722998B (en) 2024-02-02

Family

ID=82238024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210232129.XA Active CN114722998B (en) 2022-03-09 2022-03-09 Method for constructing a wargame deduction agent based on CNN-PPO

Country Status (1)

Country Link
CN (1) CN114722998B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325774A1 (en) * 2012-06-04 2013-12-05 Brain Corporation Learning stochastic apparatus and methods
US20150100530A1 (en) * 2013-10-08 2015-04-09 Google Inc. Methods and apparatus for reinforcement learning
US20170024643A1 (en) * 2015-07-24 2017-01-26 Google Inc. Continuous control with deep reinforcement learning
CN108171796A (en) * 2017-12-25 2018-06-15 燕山大学 A kind of inspection machine human visual system and control method based on three-dimensional point cloud
CN109948642A (en) * 2019-01-18 2019-06-28 中山大学 Multiple agent cross-module state depth deterministic policy gradient training method based on image input
CN111605565A (en) * 2020-05-08 2020-09-01 昆山小眼探索信息科技有限公司 Automatic driving behavior decision method based on deep reinforcement learning
CN113222106A (en) * 2021-02-10 2021-08-06 西北工业大学 Intelligent military chess deduction method based on distributed reinforcement learning
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN113947022A (en) * 2021-10-20 2022-01-18 哈尔滨工业大学(深圳) Near-end strategy optimization method based on model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
OGUZHAN DOGRU et al.: "Actor-Critic Reinforcement Learning and Application in Developing Computer-Vision-Based Interface Tracking", ENGINEERING, vol. 7, no. 9, pages 1248 - 1261 *
CUI, Wenhua; LI, Dong; TANG, Yubo; LIU, Shaojun: "A decision-making method framework for wargame deduction based on deep reinforcement learning", National Defense Technology, no. 02, pages 118 - 126 *
XUE, Ao: "Research and implementation of intelligent confrontation in wargame deduction based on reinforcement learning", China Master's Theses Full-text Database, Social Sciences I, no. 01, page 6 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115829034A (en) * 2023-01-09 2023-03-21 白杨时代(北京)科技有限公司 Method and device for constructing knowledge rule execution framework

Also Published As

Publication number Publication date
CN114722998B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN112329348B (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
CN110929394B (en) Combined combat system modeling method based on super network theory and storage medium
CN113392521B (en) Method and system for constructing resource marshalling model for air-sea joint combat mission
CN113222106B (en) Intelligent soldier chess deduction method based on distributed reinforcement learning
Gmytrasiewicz et al. Bayesian update of recursive agent models
CN112364972A (en) Unmanned fighting vehicle team fire power distribution method based on deep reinforcement learning
CN114722998B (en) Construction method of soldier chess deduction intelligent body based on CNN-PPO
CN114485665A (en) Unmanned aerial vehicle flight path planning method based on sparrow search algorithm
CN113282100A (en) Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
Wang et al. Deep reinforcement learning-based air combat maneuver decision-making: literature review, implementation tutorial and future direction
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN116596343A (en) Intelligent soldier chess deduction decision method based on deep reinforcement learning
CN113283574B (en) Method and device for controlling intelligent agent in group confrontation, electronic equipment and storage medium
CN113988301B (en) Tactical strategy generation method and device, electronic equipment and storage medium
CN115909027A (en) Situation estimation method and device
CN113128698B (en) Reinforced learning method for multi-unmanned aerial vehicle cooperative confrontation decision
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN115220458A (en) Distributed decision-making method for multi-robot multi-target enclosure based on reinforcement learning
CN114202175A (en) Combat mission planning method and system based on artificial intelligence
CN113255893A (en) Self-evolution generation method of multi-agent action strategy
CN114239834A (en) Adversary relationship reasoning method and device based on multi-round confrontation attribute sharing
CN113324545A (en) Multi-unmanned aerial vehicle collaborative task planning method based on hybrid enhanced intelligence
Tran et al. Adaptation of a mamdani fuzzy inference system using neuro-genetic approach for tactical air combat decision support system
CN114611669B (en) Intelligent decision-making method for chess deduction based on double experience pool DDPG network
CN117252081A (en) Method for dynamically allocating air defense weapon-target to be driven

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant