CN109598342A - Decision network model self-game training method and system - Google Patents

Decision network model self-game training method and system

Info

Publication number
CN109598342A
CN109598342A
Authority
CN
China
Prior art keywords
network
game
module
variation
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811410380.0A
Other languages
Chinese (zh)
Other versions
CN109598342B (en)
Inventor
任金磊
路鹰
张耀磊
李君
黄虎
郑本昌
张佳
晁鲁静
倪越
吕静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academy of Launch Vehicle Technology CALT
Original Assignee
China Academy of Launch Vehicle Technology CALT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Academy of Launch Vehicle Technology CALT
Priority to CN201811410380.0A
Publication of CN109598342A
Application granted
Publication of CN109598342B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A self-play training method for a decision network model includes the following steps. Step 1: mutate the initial network parameters of the EN network using a simulated annealing algorithm to obtain a red-side EN network and a blue-side EN network. Step 2: place the red-side EN network and the blue-side EN network of step 1 into a confrontation environment for game confrontation, and record the decision data and EN values at key confrontation nodes. Step 3: save the decision data and EN values of the winning side of the confrontation in step 2 as effective samples, and discard the losing side's data. Step 4: train the EN network on the effective samples of step 3 to obtain optimized network parameters, and use the optimized network parameters as the new initial network parameters. Step 5: repeat steps 1 to 4 in a loop to realize self-play training. By using this self-play training method, the present invention can form a hierarchical AI intelligent decision-making agent and provide high-level decision support for game commanders.

Description

Decision network model self-game training method and system
Technical field
The present invention relates to a self-game (self-play) training method and system for decision network models, and belongs to the field of artificial intelligence technology.
Background art
In recent years, artificial intelligence technology has developed rapidly and made great progress in autonomous gaming, reaching or surpassing the highest human level in fields such as board-game confrontation, image and speech recognition, and simple game confrontation. Military powers, with the United States as a representative, have invested large amounts of research funding in AI-based equipment combat command and confrontation control. It is foreseeable that artificial intelligence will play an increasingly important role in the decision-making domain: intelligent simulation deduction can effectively improve the training level of commanders, and the use of intelligent decision support is an inevitable trend of future development. Representative training methods at present include the AlphaGo Zero self-play training strategy, the error backpropagation learning algorithm, and Monte Carlo tree search (MCTS).
Self-play training technology has achieved epoch-making, widely noted results in the field of Go. A core technique of AlphaGo Zero, developed by DeepMind, is self-play: two intertwined copies of the model fight each other repeatedly, and the model continuously evolves on its own.
In addition, supervised learning training represented by the error backpropagation learning algorithm (BP algorithm for short) has become the standard procedure for training deep neural network models. In terms of network structure, a deep neural network has more hidden layers, and more neurons per layer, than a traditional artificial neural network.
The Monte Carlo tree search (MCTS) strategy is only applicable to tree-structured games such as Go, in which self-play training can randomly select a single path from among many candidate paths.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the deficiencies of the prior art and to provide a self-play training method and system for a decision network model. By mutating the parameters of a single-output decision network within a self-play training loop, the method effectively improves parameter search efficiency across game iterations and solves the problems of sample scarcity and of game confrontation faced by single-output decision networks in intelligent decision making.
The object of the invention is achieved by the following technical solution:
A self-play training method for a decision network model includes the following steps:
Step 1: mutate the initial network parameters of the EN network using a simulated annealing algorithm to obtain a red-side EN network and a blue-side EN network;
Step 2: place the red-side EN network and the blue-side EN network of step 1 into a confrontation environment for game confrontation, and record the decision data and EN values at key confrontation nodes;
Step 3: save the decision data and EN values of the winning side of the confrontation in step 2 as effective samples, and discard the losing side's data;
Step 4: train the EN network on the effective samples of step 3 to obtain optimized network parameters, and use the optimized network parameters as the new initial network parameters;
Step 5: repeat steps 1 to 4 in a loop to realize self-play training.
In the above self-play training method for a decision network model, the confrontation environment in step 2 is a symmetric game confrontation scene under incomplete-information conditions.
In the above self-play training method for a decision network model, the effective samples in step 4 are learned using a backpropagation algorithm and then fed into the EN network for training.
In the above self-play training method for a decision network model, the initial network parameters in step 1 are mutated using a simulated annealing algorithm, and the mutation of the initial network parameters is a random mutation.
In the above self-play training method for a decision network model, the EN network is composed of multiple EN sub-networks; the feature input of each EN sub-network is of the same type, and the network structure of each EN sub-network is identical.
In the above self-play training method for a decision network model, the number of loop repetitions in step 5 is greater than or equal to 100,000.
In the above self-play training method for a decision network model, the decision network model is a single-output decision network model. The training loop is sketched below.
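The loop of steps 1 to 5 can be summarized in the following minimal sketch. The Gaussian form of the random mutation, the geometric annealing schedule, and the placeholder interfaces fight and train_backprop (standing in for the confrontation environment and the backpropagation training routine) are assumptions; the disclosure does not fix them at this level.

    import random

    def self_play_training(initial_params, fight, train_backprop,
                           t0=1.0, decay=0.5, iterations=100_000):
        # sketch of steps 1 to 5; all interfaces are hypothetical
        params, temperature = list(initial_params), t0
        for _ in range(iterations):
            # step 1: mutate the shared parameters twice, giving the
            # red-side and blue-side EN networks
            red = [w + random.gauss(0.0, temperature) for w in params]
            blue = [w + random.gauss(0.0, temperature) for w in params]
            # steps 2 and 3: game confrontation; keep only the winning
            # side's decision data and EN values as effective samples
            effective_samples = fight(red, blue)
            # step 4: train on the effective samples; the optimized
            # parameters become the new initial parameters
            params = train_backprop(params, effective_samples)
            temperature *= decay      # lower the annealing temperature
        return params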
A self-play training system for a decision network model includes a network parameter mutation module, a game confrontation module, a data selection module, a network training module, and a loop repetition module;
The network parameter mutation module mutates the initial network parameters of the EN network using a simulated annealing algorithm to obtain a mutated red-side EN network and blue-side EN network, which are then output to the game confrontation module;
The game confrontation module places the mutated red-side EN network and blue-side EN network into a confrontation environment for game confrontation, records the decision data and EN values at key confrontation nodes, and then outputs them to the data selection module;
The data selection module saves the decision data and EN values of the winning side of the confrontation as effective samples, and then outputs the saved decision data and EN values to the network training module;
The network training module trains the EN network on the effective samples to obtain optimized network parameters, and outputs the optimized network parameters, as the new initial network parameters, to the loop repetition module;
The loop repetition module outputs the initial network parameters to the network parameter mutation module, realizing self-play training.
In the above self-play training system for a decision network model, the network training module learns the effective samples using a backpropagation algorithm and then feeds them into the EN network for training.
In the above self-play training system for a decision network model, the network parameter mutation module mutates the initial network parameters using a simulated annealing algorithm, and the mutation of the initial network parameters is a random mutation.
In the above self-play training system for a decision network model, the confrontation environment in the game confrontation module is a symmetric game confrontation scene under incomplete-information conditions.
Compared with the prior art, the present invention has the following beneficial effects:
(1) By using the self-play training method, the present invention can form a hierarchical AI intelligent decision-making agent and provide high-level decision support for game commanders;
(2) The quantitative evaluation method for the situational strength of the two confrontation sides can be effectively applied to situation analysis and assessment problems in complex confrontation environments, achieving accurate and rapid strength judgment in the self-play environment;
(3) The present invention covers symmetric game confrontation scenes under incomplete-information conditions, so it is widely applicable and highly practical;
(4) The method and system of the present invention obtain valid data from a large number of training confrontations, so their reliability and accuracy are high.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of the method of the present invention;
Fig. 2 is a diagram of self-play training sample generation according to the present invention;
Fig. 3 is a structure diagram of the EN network of the present invention;
Fig. 4 is a structure diagram of the EN1 network of the present invention, with hp as input;
Fig. 5 is a structure diagram of the EN2 network of the present invention, with detection status as input.
Specific embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.
A self-play training method for a decision network model, as shown in Fig. 1, in which the decision network model is a single-output decision network model, includes the following steps:
Step 1: randomly mutate the initial network parameters of the EN network using a simulated annealing algorithm to obtain a red-side EN network and a blue-side EN network. The EN network is composed of multiple EN sub-networks; the feature input of each EN sub-network is of the same type, and the network structure of each EN sub-network is identical.
Step 2: place the red-side EN network and the blue-side EN network of step 1 into a symmetric game confrontation scene under incomplete-information conditions for game confrontation, and record the decision data and EN values at key confrontation nodes.
Step 3: save the decision data and EN values of the winning side of the confrontation in step 2 as effective samples, and discard the losing side's data.
Step 4: learn the effective samples of step 3 using the backpropagation algorithm, then feed them into the EN network for training to obtain optimized network parameters, and use the optimized network parameters as the new initial network parameters (a sketch of this training step follows the step list below).
Step 5: repeat steps 1 to 4 in a loop to realize self-play training; the number of loop repetitions is greater than or equal to 100,000.
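A minimal sketch of the backpropagation training in step 4 is given below for one EN sub-network of the four-layer fully connected form described in the embodiment (input layer, two hidden layers, one-dimensional output). The mean-squared-error loss, the tanh activation, the learning rate, and the absence of bias terms are assumptions; the disclosure states only that the error backpropagation algorithm is applied to the effective samples.

    import numpy as np

    def train_backprop(W, samples, lr=0.01, epochs=10):
        # W: [W0 (1,h), W1 (h,h), W2 (h,1)], weights of a four-layer tanh MLP
        # samples: (feature_value, en_value) pairs recorded from the winning side
        for _ in range(epochs):
            for x, y in samples:
                a0 = np.array([[float(x)]])       # input layer
                a1 = np.tanh(a0 @ W[0])           # hidden layer a(2)
                a2 = np.tanh(a1 @ W[1])           # hidden layer a(3)
                out = a2 @ W[2]                   # scalar EN output
                d_out = out - y                   # gradient of 0.5 * (out - y)**2
                d_z2 = (d_out @ W[2].T) * (1.0 - a2 ** 2)
                d_z1 = (d_z2 @ W[1].T) * (1.0 - a1 ** 2)
                grads = (a0.T @ d_z1, a1.T @ d_z2, a2.T @ d_out)
                for Wi, g in zip(W, grads):
                    Wi -= lr * g                  # in-place gradient-descent update
        return W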
A self-play training system for a decision network model includes a network parameter mutation module, a game confrontation module, a data selection module, a network training module, and a loop repetition module;
The network parameter mutation module randomly mutates the initial network parameters of the EN network using a simulated annealing algorithm to obtain a mutated red-side EN network and blue-side EN network, which are then output to the game confrontation module;
The game confrontation module places the mutated red-side EN network and blue-side EN network into a symmetric game confrontation scene under incomplete-information conditions for game confrontation, records the decision data and EN values at key confrontation nodes, and then outputs them to the data selection module;
The data selection module saves the decision data and EN values of the winning side of the confrontation as effective samples, and then outputs the saved decision data and EN values to the network training module;
The network training module learns the effective samples using the backpropagation algorithm, then feeds them into the EN network for training to obtain optimized network parameters, and outputs the optimized network parameters, as the new initial network parameters, to the loop repetition module;
The loop repetition module outputs the initial network parameters to the network parameter mutation module, realizing self-play training.
Embodiment:
The present invention addresses the dynamic decision-making requirements of strong confrontation games under the incomplete-information conditions of complex scenes. Based on self-play training technology and dynamic non-cooperative game theory, with the value evaluation network (hereinafter, EN network) as the core, red-side and blue-side initial networks are generated as mirror images. Before each confrontation, the network parameters of the red-side and blue-side initial networks are randomly mutated using a simulated annealing algorithm, and the mutated red-side AI and blue-side AI carry out game confrontation in a grafted, symmetric confrontation scene. Once enough confrontation samples have been collected, the decision data and EN values of the winning side are retained as effective samples, and the invalid samples of the losing side are picked out and discarded. The effective samples are learned by the backpropagation algorithm, training a strengthened version of the EN network after one round of self-play; the new EN network replaces the initial network, mutation and self-play training continue, and this cycle achieves the purpose of evolution. Self-play training sample generation is shown in Fig. 2.
(1) EN network
The entire EN network is composed of several sub-networks {EN1, EN2, ..., ENn}. Each EN sub-network has a feature input of the same type and an identical network structure, and the sub-network outputs serve as the inputs from which the output of the entire EN is finally obtained. The present invention is illustrated with a fully connected network structure as an example (it can likewise be extended to network structures such as convolutional neural networks). This splitting method allows each EN sub-network to be trained individually, which improves training efficiency. Another advantage is ease of subsequent expansion: when a new understanding of the confrontation scene requires adding a new feature type to the EN, the corresponding sub-network can be trained first and then, once it performs well, added to the training of the whole network, improving the accuracy of the EN.
In the present invention, the EN network is composed of two sub-networks {EN1, EN2}, as shown in Fig. 3. The network input feature of EN1 is the combat-unit hit-point value (hp), and the network is a four-layer network with two hidden layers. The input hp feature takes values in the discretized space {0, 1, 2, 3, 4, 5, 6}, and the output is a one-dimensional real value representing the strength of our side or of the enemy as determined by the combat-unit hp: the higher the hp, the larger the EN1 output, and as hp decreases, EN1 gradually decreases. The EN1 network structure is designed as shown in Fig. 4, where hp is the input layer, a(2) and a(3) are the hidden layers, and EN is the output layer.
The network input feature of EN2 is the detection status (Ship_Detect) of a combat unit, and its network structure is identical to that of EN1. EN2 represents the strength of our side or of the enemy as determined by whether the combat units have been detected; the detection status is inversely related to the EN2 value, i.e., when no ship has been detected, EN2 is large, and as combat units are detected one by one, EN2 gradually decreases. The EN2 network structure is designed as shown in Fig. 5. An illustrative sketch of the two sub-networks follows.
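For illustration only, the two sub-networks and their combination might be organized as in the sketch below. The hidden-layer width, the tanh activation, and the plain summation used to merge the EN1 and EN2 outputs into the overall EN value are assumptions; the disclosure fixes only the four-layer depth, the same-type feature inputs, the [-10, 10] parameter initialization, and the one-dimensional real output.

    import numpy as np

    class ENSubNet:
        # one EN sub-network: input layer, hidden layers a(2) and a(3), output EN
        def __init__(self, hidden=8, rng=None):
            rng = rng if rng is not None else np.random.default_rng()
            # parameters drawn uniformly from [-10, 10], per the embodiment
            self.W = [rng.uniform(-10.0, 10.0, size=s)
                      for s in [(1, hidden), (hidden, hidden), (hidden, 1)]]

        def forward(self, feature):
            a = np.array([[float(feature)]])      # e.g. hp in {0, 1, ..., 6}
            a = np.tanh(a @ self.W[0])            # hidden layer a(2)
            a = np.tanh(a @ self.W[1])            # hidden layer a(3)
            return (a @ self.W[2]).item()         # one-dimensional real output

    # the overall EN value combines the sub-network outputs; the combination
    # is not spelled out in the disclosure, so a plain sum is assumed here
    def en_value(en1, en2, hp, ship_detect):
        return en1.forward(hp) + en2.forward(ship_detect)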
(2) Self-play training method
The present invention learns the effective samples by the backpropagation algorithm to train a strengthened version of the EN network, replaces the initial network with the new EN network, and continues mutation and self-play training, cycling in this way to achieve evolution. The specific training steps are as follows:
Step 1: generate the self-play networks;
Step 2: for the initial-state network EN0, randomly draw the network parameters from [-10, 10] according to the network model, and denote them W0;
Step 3: mutate W0 according to the simulated annealing initial temperature t0 to generate two mutated value networks EN0A and EN0B, whose parameters are denoted W0A and W0B;
Step 4: place the two mutated networks into the countermeasure system for self-play confrontation, retain the winning side's samples as effective samples, and discard the losing side's data;
Step 5: train the network parameters W0 of EN0 on the effective samples using the error backpropagation algorithm to obtain the evolved network EN1, whose corresponding network parameters are W1;
Step 6: take a new temperature according to the temperature drop coefficient a = 0.5 and mutate EN1;
Step 7: repeat steps 4 to 6 above in a loop to realize self-play training. The mutation schedule is illustrated in the sketch below.
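The mutation schedule of steps 2, 3 and 6 can be illustrated as follows. The [-10, 10] initialization range and the temperature drop coefficient a = 0.5 come from the steps above; the Gaussian perturbation, the parameter count, and the numeric value of the initial temperature t0 are assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def anneal_mutate(w, temperature):
        # random mutation whose magnitude scales with the current temperature
        return w + rng.normal(0.0, temperature, size=w.shape)

    w0 = rng.uniform(-10.0, 10.0, size=128)   # step 2: initial parameters W0
    t = 2.0                                   # step 3: initial temperature t0 (assumed value)
    a = 0.5                                   # step 6: temperature drop coefficient

    for _ in range(3):
        w0a = anneal_mutate(w0, t)            # mutated red-side parameters W0A
        w0b = anneal_mutate(w0, t)            # mutated blue-side parameters W0B
        # steps 4 and 5: self-play confrontation and backpropagation
        # training would update w0 here using the winning side's samples
        t = a * t                             # step 6: take the new temperature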
The method of the present invention was applied to ship position prediction analysis, and the trained network model was subjected to confrontation testing; the test results are shown in Table 1 below. The results show that the trained ship-position prediction network model achieves an average accuracy of 81.8% in predicting enemy ships, whereas situation prediction without the ship-position prediction network model has an average accuracy of 50.07%.
Table 1
Model                                         Average prediction accuracy
With ship-position prediction network         81.8%
Without ship-position prediction network      50.07%
In addition, the present invention has the following features:
(1) Implementing the self-play training method of the present invention requires a simulation platform;
(2) Achieving the effect of the present invention requires training over at least 100,000 rounds of confrontation;
(3) The application scenario of the present invention must be a symmetric game confrontation scene.
The difference between the self-play training technique of AlphaGo Zero and that of the present invention is that the present invention mainly solves the training of single-output decision network models.
The difference between the present invention and the BP algorithm is that the present invention mainly solves game confrontation under symmetric scenes and is a training method realized without pre-existing samples.
At the same time, the self-play training of the present invention requires mutating the parameters of the value evaluation network, for which MCTS is clearly unsuitable.
Content not described in detail in this description belongs to techniques well known to those skilled in the art.

Claims (11)

1. A self-play training method for a decision network model, characterized by comprising the following steps:
Step 1: mutating the initial network parameters of an EN network using a simulated annealing algorithm to obtain a red-side EN network and a blue-side EN network after mutation;
Step 2: placing the red-side EN network and the blue-side EN network of step 1 into a confrontation environment for game confrontation, and recording the decision data and EN values at key confrontation nodes;
Step 3: saving the decision data and EN values of the winning side of the confrontation in step 2 as effective samples, and discarding the losing side's data;
Step 4: training the EN network on the effective samples of step 3 to obtain optimized network parameters, and using the optimized network parameters as the new initial network parameters;
Step 5: repeating steps 1 to 4 in a loop to realize self-play training.
2. The self-play training method for a decision network model according to claim 1, characterized in that: the confrontation environment in step 2 is a symmetric game confrontation scene under incomplete-information conditions.
3. The self-play training method for a decision network model according to claim 1, characterized in that: the effective samples in step 4 are learned using a backpropagation algorithm and then fed into the EN network for training.
4. The self-play training method for a decision network model according to claim 1, characterized in that: in step 1, the initial network parameters are mutated using a simulated annealing algorithm, and the mutation of the initial network parameters is a random mutation.
5. The self-play training method for a decision network model according to claim 1, characterized in that: the EN network is composed of multiple EN sub-networks, the feature input of each EN sub-network is of the same type, and the network structure of each EN sub-network is identical.
6. The self-play training method for a decision network model according to claim 1, characterized in that: the number of loop repetitions in step 5 is greater than or equal to 100,000.
7. The self-play training method for a decision network model according to claim 1, characterized in that: the decision network model is a single-output decision network model.
8. A self-play training system for a decision network model, characterized by comprising a network parameter mutation module, a game confrontation module, a data selection module, a network training module, and a loop repetition module;
the network parameter mutation module mutates the initial network parameters of an EN network using a simulated annealing algorithm to obtain a mutated red-side EN network and blue-side EN network, which are then output to the game confrontation module;
the game confrontation module places the mutated red-side EN network and blue-side EN network into a confrontation environment for game confrontation, records the decision data and EN values at key confrontation nodes, and then outputs them to the data selection module;
the data selection module saves the decision data and EN values of the winning side of the confrontation as effective samples, and then outputs the saved decision data and EN values to the network training module;
the network training module trains the EN network on the effective samples to obtain optimized network parameters, and outputs the optimized network parameters, as the new initial network parameters, to the loop repetition module;
the loop repetition module outputs the initial network parameters to the network parameter mutation module, realizing self-play training.
9. The self-play training system for a decision network model according to claim 8, characterized in that: the network training module learns the effective samples using a backpropagation algorithm and then feeds them into the EN network for training.
10. The self-play training system for a decision network model according to claim 8, characterized in that: the network parameter mutation module mutates the initial network parameters using a simulated annealing algorithm, and the mutation of the initial network parameters is a random mutation.
11. The self-play training system for a decision network model according to claim 8, characterized in that: the confrontation environment in the game confrontation module is a symmetric game confrontation scene under incomplete-information conditions.
CN201811410380.0A 2018-11-23 2018-11-23 Decision network model self-game training method and system Active CN109598342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811410380.0A CN109598342B (en) 2018-11-23 2018-11-23 Decision network model self-game training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811410380.0A CN109598342B (en) 2018-11-23 2018-11-23 Decision network model self-game training method and system

Publications (2)

Publication Number Publication Date
CN109598342A (en) 2019-04-09
CN109598342B CN109598342B (en) 2021-07-13

Family

ID=65958708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811410380.0A Active CN109598342B (en) 2018-11-23 2018-11-23 Decision network model self-game training method and system

Country Status (1)

Country Link
CN (1) CN109598342B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782039A (en) * 2019-10-11 2020-02-11 南京摄星智能科技有限公司 Artificial intelligence instant combat guide platform based on layered structure and multiple modules
CN110852436A (en) * 2019-10-18 2020-02-28 桂林力港网络科技股份有限公司 Data processing method, device and storage medium for electronic poker game
CN110841295A (en) * 2019-11-07 2020-02-28 腾讯科技(深圳)有限公司 Data processing method based on artificial intelligence and related device
CN111598253A (en) * 2019-05-13 2020-08-28 谷歌有限责任公司 Training machine learning models using teacher annealing
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN112380780A (en) * 2020-11-27 2021-02-19 中国运载火箭技术研究院 Symmetric scene grafting method for asymmetric confrontation scene self-game training
CN112434791A (en) * 2020-11-13 2021-03-02 北京圣涛平试验工程技术研究院有限责任公司 Multi-agent strong countermeasure simulation method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426969A (en) * 2015-08-11 2016-03-23 浙江大学 Game strategy generation method of non-complete information
US20160252902A1 (en) * 2011-11-08 2016-09-01 United States Of America, As Represented By The Secretary Of The Navy System and Method for Predicting An Adequate Ratio of Unmanned Vehicles to Operators
CN106503642A (en) * 2016-10-18 2017-03-15 长园长通新材料股份有限公司 A kind of model of vibration method for building up for being applied to optical fiber sensing system
CN107729953A (en) * 2017-09-18 2018-02-23 清华大学 Robot plume method for tracing based on continuous state behavior domain intensified learning
CN107958206A (en) * 2017-11-07 2018-04-24 北京临近空间飞行器系统工程研究所 A kind of aircraft surface heat flux unit temp measurement data preprocess method
CN108170531A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center request stream scheduling method based on depth belief network
US20180247107A1 (en) * 2015-09-30 2018-08-30 Siemens Healthcare Gmbh Method and system for classification of endoscopic images using deep decision networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160252902A1 (en) * 2011-11-08 2016-09-01 United States Of America, As Represented By The Secretary Of The Navy System and Method for Predicting An Adequate Ratio of Unmanned Vehicles to Operators
CN105426969A (en) * 2015-08-11 2016-03-23 浙江大学 Game strategy generation method of non-complete information
US20180247107A1 (en) * 2015-09-30 2018-08-30 Siemens Healthcare Gmbh Method and system for classification of endoscopic images using deep decision networks
CN106503642A (en) * 2016-10-18 2017-03-15 长园长通新材料股份有限公司 A kind of model of vibration method for building up for being applied to optical fiber sensing system
CN107729953A (en) * 2017-09-18 2018-02-23 清华大学 Robot plume method for tracing based on continuous state behavior domain intensified learning
CN107958206A (en) * 2017-11-07 2018-04-24 北京临近空间飞行器系统工程研究所 A kind of aircraft surface heat flux unit temp measurement data preprocess method
CN108170531A (en) * 2017-12-26 2018-06-15 北京工业大学 A kind of cloud data center request stream scheduling method based on depth belief network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
U. Janjarassuk et al.: "A Simulated Annealing Algorithm to the Stochastic Network Interdiction Problem", IEEE *
顾佼佼 et al.: "Air combat maneuver decision framework based on game theory and the Memetic algorithm", Electronics Optics & Control (《电光与控制》) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598253A (en) * 2019-05-13 2020-08-28 谷歌有限责任公司 Training machine learning models using teacher annealing
CN110782039A (en) * 2019-10-11 2020-02-11 南京摄星智能科技有限公司 Artificial intelligence instant combat guide platform based on layered structure and multiple modules
CN110782039B (en) * 2019-10-11 2021-10-01 南京摄星智能科技有限公司 Artificial intelligence instant combat guide platform based on layered structure and multiple modules
CN110852436A (en) * 2019-10-18 2020-02-28 桂林力港网络科技股份有限公司 Data processing method, device and storage medium for electronic poker game
CN110852436B (en) * 2019-10-18 2023-08-01 桂林力港网络科技股份有限公司 Data processing method, device and storage medium for electronic poker game
CN110841295A (en) * 2019-11-07 2020-02-28 腾讯科技(深圳)有限公司 Data processing method based on artificial intelligence and related device
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN112434791A (en) * 2020-11-13 2021-03-02 北京圣涛平试验工程技术研究院有限责任公司 Multi-agent strong countermeasure simulation method and device and electronic equipment
CN112380780A (en) * 2020-11-27 2021-02-19 中国运载火箭技术研究院 Symmetric scene grafting method for asymmetric confrontation scene self-game training

Also Published As

Publication number Publication date
CN109598342B (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN109598342A (en) A kind of decision networks model is from game training method and system
CN113420326B (en) Deep reinforcement learning-oriented model privacy protection method and system
CN114358141A (en) Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
CN114757351B (en) Defense method for resisting attack by deep reinforcement learning model
Chong et al. Observing the evolution of neural networks learning to play the game of Othello
CN113052289B (en) Method for selecting cluster hitting position of unmanned ship based on game theory
CN113392396A (en) Strategy protection defense method for deep reinforcement learning
Hou et al. Advances in memetic automaton: Toward human-like autonomous agents in complex multi-agent learning problems
CN117078182A (en) Air defense and reflection conductor system cooperative method, device and equipment of heterogeneous network
CN102955948B (en) A kind of distributed mode recognition methods based on multiple agent
CN107169561A (en) Towards the hybrid particle swarm impulsive neural networks mapping method of power consumption
Sasaki et al. A neural network program of tsume-go
Liu et al. An improved minimax-Q algorithm based on generalized policy iteration to solve a Chaser-Invader game
CN113255883A (en) Weight initialization method based on power law distribution
Zhang et al. Tactical reward shaping: Bypassing reinforcement learning with strategy-based goals
Teixeira et al. A new hybrid nature-inspired metaheuristic for problem solving based on the social interaction genetic algorithm employing fuzzy systems
Yin et al. Computer Assisted Operational Agent Training Method through Deep Learning and Artificial Intelligence Technology
Ikuta et al. Multi-layer perceptron with glial network for solving two-spiral problem
CN112380780A (en) Symmetric scene grafting method for asymmetric confrontation scene self-game training
Niu et al. A neural-evolutionary model for case-based planning in real time strategy games
Tao et al. Design and Application of Computer Games Algorithm of Checkers
CN117454966A (en) Multi-domain collaborative reinforcement learning solution method oriented to large-scale decision space
Bossuyt et al. Introduction: The EU and China in Central Asia:(Un) natural partners?
Mo et al. Research on virtual human swarm football collaboration technology based on reinforcement learning
Battaglia SMaILE game: application of search and learning algorithm within combinatorial game theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant