CN110166428A - Intelligent defense decision-making method and device based on reinforcement learning and attack-defense game - Google Patents
Intelligent defense decision-making method and device based on reinforcement learning and attack-defense game
- Publication number
- CN110166428A CN110166428A CN201910292304.2A CN201910292304A CN110166428A CN 110166428 A CN110166428 A CN 110166428A CN 201910292304 A CN201910292304 A CN 201910292304A CN 110166428 A CN110166428 A CN 110166428A
- Authority
- CN
- China
- Prior art keywords
- attacking
- defending
- defense
- reinforcement learning
- defender
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/145—Network analysis or design involving simulating, designing, planning or modelling of a network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/20—Network architectures or network communication protocols for network security for managing network security; network security policies in general
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention belongs to the field of network security, and in particular provides an intelligent defense decision-making method and device based on reinforcement learning and attack-defense game. The method comprises: constructing an attack-defense game model under a bounded-rationality constraint, and generating a host-centric attack-defense graph for extracting the network states and attack-defense actions of the game model, where the nodes of the graph yield the network states and the edges describe the attack and defense actions; when the network-state transition probabilities are unknown, the defender obtains defense payoffs through online learning, so that the defender automatically selects the optimal defense strategy when facing different attackers. The invention effectively compresses the game state space, reducing storage and computation overhead; the defender performs reinforcement learning from environmental feedback while confronting the attacker and can adaptively make the optimal choice against different attacks; the defender's learning speed is increased, the defense payoff is improved, reliance on historical data is reduced, and the real-time performance and intelligence of the defender's decisions are effectively enhanced.
Description
Technical field
The invention belongs to the field of network security, and in particular relates to an intelligent defense decision-making method and device based on reinforcement learning and attack-defense game.
Background technique
In recent years, information-security incidents have kept increasing and have caused huge losses to network security. According to statistics, in 2017 Alibaba Cloud alone suffered roughly 1.6 billion attacks every day. For a given attacker, each attack-defense scenario may occur only once, but a defender on the scale of Alibaba Cloud faces a large number of identical attack-defense scenarios every day. Because network hardware resources are limited, the defender must weigh defense costs against benefits, take maximizing the defense payoff as the goal, and reach a balance between risk and investment; this requires online learning and updating of payoffs across a large number of identical attack-defense scenarios. Under such conditions, security administrators face the dilemma that "the optimal strategy is hard to choose". Game theory fits network attack-defense well: both exhibit goal antagonism, non-cooperative relations, and strategy interdependence. Existing game-theoretic defense decision-making methods can be divided into two classes, based respectively on complete-rationality and bounded-rationality assumptions. The first class assumes completely rational attack-defense participants: each participant can rationally select the strategy that maximizes its own payoff and can predict the strategy choices of the other participants. Applied to wireless-sensor security, a non-cooperative game model between the attacker and trusted sensor nodes yields optimal attack strategies from the Nash equilibrium, allowing the efficiency of worm attacks and defense strategies to be analyzed; a repeated game model between an intrusion-detection system and wireless sensor nodes allows the packet-forwarding strategies of nodes to be analyzed. The second class assumes bounded rationality: the two sides do not find the optimal strategy at the outset, but can learn during the attack-defense game, and the side with the more suitable learning mechanism wins the game. Such methods are mainly developed around evolutionary games, which take a population as the research object, use biological evolution mechanisms, and learn by imitating the dominant strategies of other members. In evolutionary games the information exchanged among participants is excessive, and the research mainly concerns the adjustment process, trend, and stability of collective attack-defense strategies, which is not conducive to guiding the real-time strategy selection of an individual member. How to adopt a better learning mechanism to model the attack-defense process and improve the accuracy and timeliness of defense decisions has therefore become an urgent technical problem.
Summary of the invention
To this end, the present invention provides an intelligent defense decision-making method and device based on reinforcement learning and attack-defense game, suitable for real attack-defense network environments, realizing intelligent defense decisions with online learning capability, and possessing strong practicability and operability.
According to the design scheme provided by the present invention, an intelligent defense decision-making method based on reinforcement learning and attack-defense game comprises the following:
A) Construct an attack-defense game model under a bounded-rationality constraint, and generate a host-centric attack-defense graph for extracting the network states and attack-defense actions of the game model, where graph nodes yield the network states and graph edges describe the attack-defense actions;
B) Based on the network states and attack-defense actions, and relying on the attack-defense game model, apply reinforcement learning to the attack-defense game process; according to the system feedback during the confrontation between the two sides, the boundedly rational defender automatically selects the optimal defense strategy when facing different attackers.
In the above, in A), the attack-defense game model is represented by a six-tuple, AD-SGM = (N, S, D, R, Q, π), where N denotes the players participating in the game, S denotes the stochastic-game state set, D denotes the defender's action set, R denotes the defender's immediate reward, Q denotes the defender's state-action payoff function, and π denotes the defender's defense strategy.
In the above, the attack-defense graph is represented by a two-tuple, G = (S, E), where S denotes the set of node security states and E denotes the node-state transitions caused by attack or defense actions.
Preferably, when generating the attack-defense graph, the target network is first scanned to obtain the network security elements; attack instantiation is then performed by combining them with attack templates, and defense instantiation by combining them with defense templates, finally producing the attack-defense graph, where the state set of the attack-defense game model is extracted from the graph nodes and the defense action set from the graph edges.
In the above, in B), reinforcement learning uses the model-free WoLF-PHC (Win or Learn Fast Policy Hill-Climbing) mechanism: rewards and state-transition knowledge are obtained by interacting with the environment and are represented by the payoff; a high policy-learning rate is set for the defender when it is losing, so that it adapts to the attacker's strategy; and the defender's optimal defense strategy is determined by updating the payoff.
Preferably, the payoff is expressed as
Qd(s, d) ← (1 - α)Qd(s, d) + α(Rd(s, d, s') + γ max_d' Qd(s', d'))
where α is the payoff learning rate, γ is the discount factor, and Rd(s, d, s') denotes the defender's immediate reward after executing defense action d in state s and the network transitioning to state s'; the strategy of reinforcement learning raises the selection probability of the action with the highest payoff by the policy-learning rate and lowers those of the other actions accordingly.
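The payoff update just described can be sketched as a plain Q-table update. This is a minimal illustration under assumed parameter values; the state and action encodings are hypothetical, not the patent's:

```python
from collections import defaultdict

alpha, gamma = 0.2, 0.9      # payoff learning rate and discount factor (assumed values)
Q = defaultdict(float)       # Q[(s, d)]: defender's state-action payoff, initialized to 0

def update_payoff(s, d, reward, s_next, next_actions):
    # Qd(s,d) <- (1-alpha)*Qd(s,d) + alpha*(Rd(s,d,s') + gamma * max_d' Qd(s',d'))
    best_next = max(Q[(s_next, d2)] for d2 in next_actions)
    Q[(s, d)] = (1 - alpha) * Q[(s, d)] + alpha * (reward + gamma * best_next)
```

With all payoffs at zero, a single update with reward 10 moves Q(s, d) to alpha times the reward, i.e. 2.0.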
Further, the average strategy π̄d is used as the criterion for winning and losing: the defender is winning when Σd πd(s, d)Qd(s, d) > Σd π̄d(s, d)Qd(s, d), with the average strategy updated as π̄d(s, d) ← π̄d(s, d) + (πd(s, d) - π̄d(s, d))/C(s), where C(s) counts the visits to state s.
Further, in the model-free reinforcement-learning mechanism, a state-action eligibility trace is introduced to track recent visits, and the payoff is updated using the eligibility trace so that the current reward is distributed to recently visited state-action pairs.
Further, in reinforcement learning, the eligibility trace of each state-action pair is defined as e(s, a). If the current network state is s*, the eligibility trace is updated as e(s, a) ← γλe(s, a) + 1 for the state-action pair executed in s*, and e(s, a) ← γλe(s, a) otherwise, so that the current reward is distributed to recently visited state-action pairs, where γ is the discount factor and λ is the trace decay factor.
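A minimal sketch of this trace update follows; the function names and parameter values are assumptions for illustration:

```python
gamma, lam = 0.9, 0.5    # discount factor and trace decay factor (assumed values)
e = {}                   # e[(s, a)]: eligibility of each state-action pair

def update_traces(s_star, a_star):
    # decay every trace by gamma*lambda, then bump the pair just visited in state s*
    for key in list(e):
        e[key] *= gamma * lam
    e[(s_star, a_star)] = e.get((s_star, a_star), 0.0) + 1.0

def distribute_reward(delta, alpha, Q):
    # assign the current update delta to recently visited state-action pairs
    for key, trace in e.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * trace
```

After two consecutive visits, the older pair keeps a decayed trace of gamma*lambda = 0.45 and therefore receives a proportionally smaller share of the next reward.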
Further, an intelligent defense decision-making device based on reinforcement learning and attack-defense game comprises:
an attack-defense graph generation module, for constructing the attack-defense game model under a bounded-rationality constraint and generating the host-centric attack-defense graph used to extract the network states and attack-defense actions of the game model, where graph nodes yield the network states and graph edges describe the attack-defense actions;
a defense-strategy selection module, which, based on the network states and attack-defense actions and combined with the attack-defense game model, applies reinforcement learning to the attack-defense game process; according to the environmental feedback during the confrontation between the two sides, the boundedly rational defender automatically selects the optimal defense strategy when facing different attackers.
Beneficial effects of the present invention:
The host-centric attack-defense graph model of the present invention extracts the network states and attack-defense actions and effectively compresses the game state space. The defender uses a reinforcement-learning mechanism and learns from environmental feedback while confronting the attacker, so that a boundedly rational defender automatically makes the optimal choice when facing different attackers. Eligibility traces are added to the decision-making device, increasing the defender's learning speed, reducing the dependence on historical data, and effectively improving the real-time performance and intelligence of the defender's decisions.
Detailed description of the drawings:
Fig. 1 is a schematic diagram of the intelligent defense decision process in the embodiment;
Fig. 2 is a schematic diagram of attack-defense state transitions in the embodiment;
Fig. 3 is a schematic diagram of the reinforcement-learning mechanism in the embodiment;
Fig. 4 is the experimental network structure in the embodiment;
Fig. 5 is a schematic diagram of network vulnerability information in the embodiment;
Fig. 6 is the attack graph in the embodiment;
Fig. 7 is the defense-action graph in the embodiment;
Fig. 8 is the defense-action description in the embodiment;
Fig. 9 shows the experimental parameter settings in the embodiment;
Fig. 10 shows the defense decisions in the embodiment;
Fig. 11 shows the defense payoffs in the embodiment.
Specific embodiments:
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and technical solutions. The technical terms involved in the embodiments are as follows:
Reinforcement learning is a classic online learning method in which a participant learns independently from environmental feedback. Compared with learning through biological evolution, its learning speed is fast, matching the rapid and time-critical nature of attack-defense confrontation. The non-cooperative, goal-antagonistic, and strategy-interdependent character of games matches the essential characteristics of network attack-defense. The embodiment of the present invention, as shown in Fig. 1, provides an intelligent defense decision-making method based on reinforcement learning and attack-defense game, comprising:
constructing an attack-defense game model under a bounded-rationality constraint, and generating a host-centric attack-defense graph for extracting the network states and attack-defense actions of the game model, where graph nodes yield the network states and graph edges describe the attack-defense actions;
applying reinforcement learning to the attack-defense game model based on the network states and attack-defense actions; according to the system feedback during the confrontation between the two sides, the boundedly rational defender automatically selects the optimal defense strategy when facing different attackers.
Dynamic threat tracking and analysis based on attribute attack graphs has clear advantages in attack-path inference, threat transition probabilities, forward and backward inference, loop resolution, real-time analysis, multi-path synthesis, privilege escalation, and access relations. The reinforcement-learning mechanism is introduced into the attack-defense game: the attack-defense game model is constructed under a bounded-rationality constraint, the host-centric attack-defense graph is generated for extracting the network states and attack-defense actions of the game model, and real-time, automated online defense decisions are realized through reinforcement learning.
The network attack-defense game model describes the randomness of network-state transitions with probability values. Since the current network state mainly depends on the previous network state, the state-transition relation is represented as a first-order Markov process, as shown in Fig. 2, with transition probability P(st, at, dt, st+1), where s is the network state and (a, d) are the attack and defense actions. Because the two sides have antagonistic goals and do not cooperate, both deliberately hide their key information, so the transition probabilities are treated as information unknown to either side. The game model is constructed on this basis. In another embodiment of the present invention, the attack-defense stochastic game model (AD-SGM) is represented by a six-tuple AD-SGM = (N, S, D, R, Q, π), where N = (attacker, defender) are the two players participating in the game, representing the network attacker and defender respectively; S = (s1, s2, ..., sn) is the stochastic-game state set, composed of network states; D = (D1, D2, ..., Dn) is the defender's action set, where Dk = {d1, d2, ..., dm} is the defender's action set in game state sk; Rd(si, d, sj) is the defender's immediate reward after executing defense action d in state si and the network transitioning to state sj; Qd(si, d) denotes the defender's expected payoff after taking action d in state si; and πd(sk) is the defender's defense strategy in state sk.
A defense strategy and a defense action are two different concepts: a defense strategy is a distribution over defense actions. The defense strategy defines, in the form of a probability vector, what the defender selects in each network state; for example, πd(sk) = (πd(sk, d1), ..., πd(sk, dm)) is the defender's strategy in network state sk, and πd(sk, dm) is the probability of selecting action dm, where the probabilities sum to 1.
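As an illustration, sampling a defense action from such a probability vector can be sketched as follows; this is a minimal example, and the function name and action labels are hypothetical:

```python
import random

def choose_action(prob_vector, actions, rng=random):
    # prob_vector = (pi(s,d1), ..., pi(s,dm)), summing to 1
    r, acc = rng.random(), 0.0
    for d, p in zip(actions, prob_vector):
        acc += p
        if r < acc:
            return d
    return actions[-1]   # guard against floating-point rounding
```

A degenerate strategy such as (1.0, 0.0) always returns the first action, which is useful for checking the sampler.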
By creating the network attack-defense graph G, network states are extracted from the nodes of G and attack-defense actions are analyzed on the edges of G, for extracting the attack-defense strategies. In another embodiment of the present invention, the attack-defense graph is represented as a two-tuple G = (S, E), where S = {s1, s2, ..., sn} is the node security-state set, si = <host, privilege>, host is the unique identifier of the node, and privilege = {none, user, root} indicates no permission, normal-user permission, and administrator permission respectively. E = (Ea, Ed) are the directed edges, indicating that attack or defense actions cause node-state transitions, with ek = (sr, v/d, sd), k = a, d, where sr is the source node and sd is the destination node.
Further, when generating the attack-defense graph, the target network is first scanned to obtain the network security elements; attack instantiation is performed by combining them with attack templates, then defense instantiation by combining them with defense templates, finally producing the attack-defense graph. The state set of the attack-defense stochastic game model is extracted from the graph nodes, and the defense action set from the graph edges. The specific steps can be designed as shown in Algorithm 1:
Algorithm 1. Attack-defense graph generation algorithm
Here, step 1) generates all possible state nodes from the network security elements and initializes the edges; steps 2)-11) perform attack instantiation and generate all attack edges; steps 12)-18) perform defense instantiation and generate all defense edges; steps 19)-23) remove all isolated nodes; step 24) outputs the attack-defense graph.
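The listing of Algorithm 1 itself is not reproduced in this text; the following is a minimal sketch of the step sequence described above. The `Template` fields and the matching rules are assumptions for illustration, not the patent's definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:              # node security state <host, privilege>
    host: str
    privilege: str        # "none" | "user" | "root"

@dataclass
class Template:           # hypothetical attack/defense template
    action: str
    pre_privilege: str    # privilege required before the action
    post_privilege: str   # privilege holding after the action

def build_graph(hosts, attack_templates, defense_templates):
    # 1) generate all possible state nodes, initialize the edge sets
    states = {State(h, p) for h in hosts for p in ("none", "user", "root")}
    Ea, Ed = [], []
    # 2)-11) attack instantiation: generate all attack edges
    for t in attack_templates:
        for s in states:
            if s.privilege == t.pre_privilege:
                Ea.append((s, t.action, State(s.host, t.post_privilege)))
    # 12)-18) defense instantiation: defense edges reverse attack transitions
    for t in defense_templates:
        for (_, _, dst) in Ea:
            if dst.privilege == t.pre_privilege:
                Ed.append((dst, t.action, State(dst.host, t.post_privilege)))
    # 19)-23) remove all isolated nodes
    used = {s for e in Ea + Ed for s in (e[0], e[2])}
    # 24) output the attack-defense graph G = (S, E)
    return used, (Ea, Ed)
```

With one host, one attack template, and one defense template, the graph contains exactly one attack edge, one defense edge, and no isolated states.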
In the embodiment of the present invention, the reinforcement-learning mechanism is introduced into the attack-defense game to describe the learning and improvement process of attack-defense strategies. WoLF-PHC is a typical model-free reinforcement-learning algorithm, whose learning mechanism is shown in Fig. 3. In another embodiment of the present invention, in reinforcement learning the agent obtains rewards and state-transition knowledge by interacting with the environment; this knowledge is represented by the payoff Qd, and learning proceeds by updating Qd. Its payoff function Qd is
Qd(s, d) ← (1 - α)Qd(s, d) + α(Rd(s, d, s') + γ max_d' Qd(s', d'))    (1)
In formula (1), α is the payoff learning rate and γ is the discount factor. The strategy of reinforcement learning raises the selection probability of the action with the highest payoff, πd(s, d) being increased by δ for d = argmax_d' Qd(s, d') and decreased by δ/(m - 1) for each of the other m - 1 actions.
Further, WoLF-PHC ("Win or Learn Fast" policy hill-climbing) gives the defender two different policy-learning rates through the WoLF mechanism: a low policy-learning rate δw is used when winning and a high policy-learning rate δl is used when losing, as shown in formula (5). The two learning rates enable the defender to adapt quickly to the attacker's strategy when performing worse than expected, and to learn cautiously when performing better than expected, while ensuring convergence. The WoLF-PHC algorithm uses the average strategy as the criterion for winning and losing, as shown in formulas (6) and (7): the defender is winning when Σd πd(s, d)Qd(s, d) > Σd π̄d(s, d)Qd(s, d), with
π̄d(s, d) ← π̄d(s, d) + (πd(s, d) - π̄d(s, d))/C(s)    (6)
C(s) = C(s) + 1    (7)
where C(s) counts the visits to state s.
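The win/lose test and the choice between the two learning rates can be sketched as follows; this is a minimal illustration, and the δw and δl values are assumptions:

```python
def select_policy_learning_rate(s, pi, avg_pi, Q, actions, dw=0.05, dl=0.2):
    # winning: the current strategy's expected payoff beats the average strategy's
    current = sum(pi[d] * Q[(s, d)] for d in actions)
    average = sum(avg_pi[d] * Q[(s, d)] for d in actions)
    return dw if current > average else dl   # cautious when winning, fast when losing
```

When the current strategy puts more weight than the average strategy on the high-payoff action, the defender counts as winning and learns at the slow rate.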
In order to increase the learning speed of the WoLF-PHC algorithm and reduce its dependence on the amount of data, in another embodiment of the present invention an eligibility trace is introduced into WoLF-PHC. The eligibility trace tracks the particular state-action trajectory of recent visits and then distributes the current reward to the recently visited state-action pairs. Further, the eligibility trace of each state-action pair is defined as e(s, a); if the current network state is s*, the eligibility trace is updated as in formula (8), e(s, a) ← γλe(s, a) + 1 for the pair executed in s* and e(s, a) ← γλe(s, a) otherwise, where λ is the trace decay factor.
For the WoLF-PHC-based defense decision-making method to achieve good results, the four parameters α, δ, λ, and γ must be set reasonably. 1) The payoff learning rate α ranges over 0 < α < 1: the larger α, the more weight the later accumulated rewards carry and the faster the learning; the smaller α, the better the stability of the algorithm. 2) The policy learning rate δ ranges over 0 < δ < 1; experiments show that better results are obtained when the losing rate δl is a suitable multiple of the winning rate δw. 3) The eligibility-trace decay factor λ ranges over 0 < λ < 1 and is responsible for distributing credit among state-action pairs; it can be regarded as a time scale, and the larger λ, the more credit is assigned to historical state-action pairs. 4) The discount factor γ ranges over 0 < γ < 1 and represents the defender's preference between immediate and future rewards: when γ is close to 0, future rewards matter little and immediate rewards are valued more; when γ is close to 1, immediate rewards matter little and future rewards are valued more.
The agent in WoLF-PHC, as shown in Fig. 3, corresponds to the defender in the attack-defense stochastic game model AD-SGM: the state of the agent corresponds to the game state in AD-SGM, the behavior of the agent corresponds to the defense action in AD-SGM, the immediate reward of the agent corresponds to the immediate reward in AD-SGM, and the strategy of the agent corresponds to the defense strategy in AD-SGM. On this basis, the specific defense decision-making algorithm may be designed as shown in Algorithm 2:
Algorithm 2. Defense decision-making algorithm
Step 1) initializes the attack-defense stochastic game model AD-SGM and the relevant parameters, where the network states and attack-defense actions are extracted by Algorithm 1; in step 2) the defender detects the current network state; steps 3)-22) perform defense decisions and online learning, where steps 4)-5) choose a defense action according to the current strategy, steps 6)-14) update the payoff Qd using the eligibility trace, and steps 15)-21) update the defense strategy πd from the new payoff Qd using the hill-climbing algorithm. The space complexity of the algorithm is concentrated in the storage of Rd(s, d, s'), e(s, d), πd(s, d), π̄d(s, d), and Qd(s, d); if |S| is the number of states and |D| the number of defense actions per state, the space complexity is O(4|S|·|D| + |S|²·|D|). The algorithm does not need to solve for a game equilibrium, which greatly reduces the computational complexity compared with existing stochastic game models and enhances the practicality of the algorithm.
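The decision-and-learning step of Algorithm 2 can be sketched end to end as follows. This is a hedged reconstruction under assumed parameter values, with a hypothetical `env` interface (`env.state()`, `env.act(d)`) standing in for network-state detection and defense execution; it is not the patent's literal listing:

```python
import random
from collections import defaultdict

def defense_step(env, D, Q, pi, avg_pi, C, e,
                 alpha=0.2, gamma=0.9, lam=0.5, dw=0.05, dl=0.2):
    s = env.state()                                         # 2) detect current network state
    acts = D[s]
    d = random.choices(acts, weights=[pi[s][a] for a in acts])[0]   # 4)-5) sample from pi
    r, s2 = env.act(d)                                      # execute defense, observe transition
    # 6)-14) update Qd via the eligibility trace
    delta = r + gamma * max(Q[(s2, a)] for a in D[s2]) - Q[(s, d)]
    for k in list(e):
        e[k] *= gamma * lam
    e[(s, d)] = e.get((s, d), 0.0) + 1.0
    for k, tr in e.items():
        Q[k] += alpha * delta * tr
    # 15)-21) update the average strategy, pick the learning rate, hill-climb pi
    C[s] += 1
    for a in acts:
        avg_pi[s][a] += (pi[s][a] - avg_pi[s][a]) / C[s]
    winning = (sum(pi[s][a] * Q[(s, a)] for a in acts)
               > sum(avg_pi[s][a] * Q[(s, a)] for a in acts))
    dp = dw if winning else dl
    if len(acts) > 1:
        best = max(acts, key=lambda a: Q[(s, a)])
        for a in acts:
            if a == best:
                pi[s][a] += dp
            else:
                pi[s][a] -= min(pi[s][a], dp / (len(acts) - 1))
        z = sum(pi[s].values())                             # keep pi a probability vector
        for a in acts:
            pi[s][a] /= z
    return d
```

Run against a toy single-state environment that rewards only one of two defense actions, the strategy concentrates on the rewarded action within a few hundred steps.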
Based on the above intelligent defense decision-making method, the embodiment of the present invention also provides an intelligent defense decision-making device based on reinforcement learning and attack-defense game, comprising:
an attack-defense graph generation module, for constructing the attack-defense game model under a bounded-rationality constraint and generating the host-centric attack-defense graph used to extract the network states and attack-defense actions of the game model, where graph nodes yield the network states and graph edges describe the attack-defense actions;
a defense-strategy selection module, which, based on the network states and attack-defense actions and combined with the attack-defense game model, applies reinforcement learning to the attack-defense game process; according to the environmental feedback during the confrontation between the two sides, the boundedly rational defender automatically selects the optimal defense strategy when facing different attackers.
Defense strategies for the target network are selected intelligently using the above intelligent defense decision-making method based on reinforcement learning and attack-defense game.
To further verify the validity of the technical solution in the embodiment of the present invention, experiments are conducted on a typical enterprise network built as shown in Fig. 4. Attack-defense events occur in the intranet, and the attacker comes from the external network; the network administrator, as the defender, is responsible for the security of the intranet. Owing to the settings of firewall 1 and firewall 2, normal external users can only access the Web server, while the Web server can access the database server, the FTP server, and the e-mail server. The experimental network is scanned with the Nessus tool; its vulnerability information is shown in Fig. 5.
The attack and defense templates are built with reference to the MIT Lincoln Laboratory attack-defense behavior database. A identifies the attacker host, W the Web server, D the database server, F the FTP server, and E the e-mail server. The network attack-defense graph is constructed with the attack-defense graph generation device; for ease of presentation and description, the attack-defense graph is divided into an attack graph and a defense graph, shown in Fig. 6 and Fig. 7 respectively. The descriptions of the defense actions in the defense graph are shown in Fig. 8. The attack-defense game model of the experimental scenario is constructed as follows:
1. N = (attacker, defender) are the players participating in the game, representing the network attacker and defender respectively;
2. the stochastic-game state set is S = (s0, s1, s2, s3, s4, s5, s6); the stochastic-game states are composed of network states, extracted from the nodes in Fig. 6 and Fig. 7;
3. the defender's action sets are D = (D0, D1, D2, D3, D4, D5, D6), where D0 = {NULL}, D1 = {d1, d2}, D2 = {d3, d4}, D3 = {d1, d5, d6}, D4 = {d1, d5, d6}, D5 = {d1, d2, d7}, D6 = {d3, d4}, extracted from the edges of Fig. 7;
4. the defender's immediate rewards Rd(si, d, sj) are quantized as:
(Rd(s0,NULL,s0), Rd(s0,NULL,s1), Rd(s0,NULL,s2)) = (0, -40, -59)
(Rd(s1,d1,s0), Rd(s1,d1,s1), Rd(s1,d1,s2); Rd(s1,d2,s0), Rd(s1,d2,s1), Rd(s1,d2,s2)) = (40, 0, -29; 5, -15, -32)
(Rd(s2,d3,s0), Rd(s2,d3,s1), Rd(s2,d3,s2), Rd(s2,d3,s3), Rd(s2,d3,s4), Rd(s2,d3,s5); Rd(s2,d4,s0), Rd(s2,d4,s1), Rd(s2,d4,s2), Rd(s2,d4,s3), Rd(s2,d4,s4), Rd(s2,d4,s5)) = (24, 9, -15, -55, -49, -65; 19, 5, -21, -61, -72, -68)
(Rd(s3,d1,s2), Rd(s3,d1,s3), Rd(s3,d1,s6); Rd(s3,d5,s2), Rd(s3,d5,s3), Rd(s3,d5,s6); Rd(s3,d6,s2), Rd(s3,d6,s3), Rd(s3,d6,s6)) = (21, -16, -72; 15, -23, -81; -21, -36, -81)
(Rd(s4,d1,s2), Rd(s4,d1,s4), Rd(s4,d1,s6); Rd(s4,d5,s2), Rd(s4,d5,s4), Rd(s4,d5,s6); Rd(s4,d6,s2), Rd(s4,d6,s4), Rd(s4,d6,s6)) = (26, 0, -62; 11, -23, -75; 9, -25, -87)
(Rd(s5,d1,s2), Rd(s5,d1,s5), Rd(s5,d1,s6); Rd(s5,d2,s2), Rd(s5,d2,s5), Rd(s5,d2,s6); Rd(s5,d7,s2), Rd(s5,d7,s5), Rd(s5,d7,s6)) = (29, 0, -63; 11, -21, -76; 2, -27, -88)
(Rd(s6,d3,s3), Rd(s6,d3,s4), Rd(s6,d3,s5), Rd(s6,d3,s6); Rd(s6,d4,s3), Rd(s6,d4,s4), Rd(s6,d4,s5), Rd(s6,d4,s6)) = (-23, -21, -19, -42; -28, -31, -24, -49)
5. to more fully test the learning performance of the algorithm, the defender's state-action payoffs Qd(si, d) are uniformly initialized to 0, introducing no additional prior knowledge;
6. the defender's defense strategy πd is initialized to the uniform strategy, i.e. πd(sk, d1) = πd(sk, d2) = ... = πd(sk, dm) with the probabilities summing to 1, introducing no additional prior knowledge.
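The initialization of items 3, 5, and 6 can be sketched as follows; the action sets are taken from the experimental scenario above, while the variable names are illustrative assumptions:

```python
from collections import defaultdict

# defender action sets Dk of the experimental scenario
D = {"s0": ["NULL"], "s1": ["d1", "d2"], "s2": ["d3", "d4"],
     "s3": ["d1", "d5", "d6"], "s4": ["d1", "d5", "d6"],
     "s5": ["d1", "d2", "d7"], "s6": ["d3", "d4"]}

Q = defaultdict(float)                        # Qd(si, d) uniformly 0: no prior knowledge
pi = {s: {d: 1.0 / len(ds) for d in ds}       # uniform initial defense strategy
      for s, ds in D.items()}
avg_pi = {s: dict(p) for s, p in pi.items()}  # average strategy starts equal to pi
```

Each per-state strategy is a valid probability vector, and every payoff reads as 0 until learning begins.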
The influence of different parameter settings on the algorithm is tested, taking state s2 of Fig. 6 and Fig. 7 as an example; the attacker's initial strategy in the experiment is a random strategy. Since different parameter values affect the speed and effect of learning, six different parameter settings are tested; the specific settings are shown in Fig. 9.
The defender's selection probabilities of defense actions d3 and d4 in state s2 are shown in Fig. 10, from which the learning speed and convergence of the algorithm under the different parameter settings can be observed. Fig. 10 shows that settings 1, 3, and 6 learn fast: under these three settings the algorithm reaches the optimal strategy within 1500 learning episodes, but settings 3 and 6 converge poorly. Although settings 3 and 6 can learn the optimal strategy, they oscillate afterwards and are not as stable as setting 1.
The defense payoff represents the degree to which the algorithm optimizes the strategy. To ensure that a payoff value does not merely reflect a single defense result, the average of 1000 defense payoffs is taken; the variation of each 1000-episode average payoff is shown in Fig. 11. Fig. 11 shows that the payoff of setting 3 is significantly lower than the others, while the relative merits of the remaining settings are hard to distinguish. Therefore, among the six parameter groups, setting 1 is the most suitable for this scenario.
The computational overhead introduced by the eligibility trace is also tested: the time for 100,000 defense decisions is measured 20 times each with and without the eligibility trace, with 20-run averages of 9.51 s with the eligibility trace and 3.74 s without it. Although introducing the eligibility trace increases the decision time by nearly a factor of 2.5, 100,000 decisions still take only 9.51 s after its introduction, which satisfies the real-time requirement.
The above experiments further verify that the present invention constructs the attack-defense stochastic game model under a bounded-rationality constraint and generates the network attack-defense graph used for extracting the network states and attack-defense strategies, effectively compressing the game state space; through learning, the defender can obtain the optimal defense strategy against the current attack, improving the rapid automated defense capability against unknown attacks, with strong practicability and operability.
Each embodiment in this specification is described in a progressive manner; each embodiment highlights its differences from the others, and the same or similar parts of the embodiments may refer to each other. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts refer to the description of the method.
The units and method steps of the examples described in connection with the embodiments disclosed herein can be realized with electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps in the above method can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or part of the steps of the above embodiments can also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be implemented in the form of hardware or in the form of a software functional module. The present invention is not limited to any particular form of combination of hardware and software.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. An intelligent defense decision-making method based on reinforcement learning and the attack-defense game, characterized by comprising:
A) constructing an attack-defense game model under the bounded-rationality constraint, and generating an attack-defense graph for extracting the network states and attack-defense actions in the game model, wherein the attack-defense graph is host-centric, the graph nodes extract the network states, and the graph edges represent the attack-defense actions;
B) performing reinforcement learning on the attack-defense game process based on the network states and attack-defense actions, in combination with the attack-defense game model and according to the environmental feedback in the confrontation between the attacker and the defender, so that under bounded rationality the defender automatically selects the optimal defense strategy when facing different attackers.
2. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 1, characterized in that in A), the attack-defense game model is represented by a six-tuple, i.e. AD-SGM = (N, S, D, R, Q, π), where N denotes the players participating in the game, S denotes the stochastic game state set, D denotes the defender's action set, R denotes the defender's immediate return, Q denotes the defender's state-action income function, and π denotes the defender's defense strategy.
3. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 1, characterized in that the attack-defense graph is represented by a two-tuple, i.e. G = (S, E), where S denotes the set of network node security states and E denotes the node-state transitions caused by attack or defense actions.
4. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 3, characterized in that when generating the attack-defense graph, the target network is first scanned to obtain network security elements; attack instantiation is then performed in combination with the attack template, and defense instantiation is performed in combination with the defense template, finally producing the attack-defense graph, wherein the state set of the attack-defense game model is extracted from the nodes of the attack-defense graph, and the defense action set is extracted from the edges of the attack-defense graph.
5. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 1, characterized in that in B), the reinforcement learning uses the model-free mechanism WoLF-PHC (Win or Learn Fast Policy Hill-Climbing), obtaining returns and environment state-transition knowledge through interaction with the environment, where the knowledge is represented by the income; high and low policy learning rates are set for the defender to adapt to different attacker strategies, and the income update process uses the reinforcement learning mechanism to determine the defender's optimal defense strategy.
6. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 5, characterized in that the income is updated as Q(s, d) ← (1 − α)Q(s, d) + α(Rd(s, d, s′) + γ max_d′ Q(s′, d′)), and the reinforcement-learning strategy is updated as π(s, d) ← π(s, d) + Δ(s, d), where Δ(s, d) raises the probability of the action maximizing Q(s, d) and lowers the probabilities of the remaining actions; α is the income learning rate; γ is the discount factor; and Rd(s, d, s′) denotes the defender's immediate return after defense action d is executed in state s and the network transitions to state s′.
7. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 6, characterized in that the average strategy is used as the criterion for winning and losing, expressed by the formula: the defender is winning if Σ_d π(s, d)Q(s, d) > Σ_d π̄(s, d)Q(s, d) and losing otherwise, where π̄(s, d) denotes the average strategy.
8. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 6, characterized in that the model-free reinforcement learning mechanism introduces a state-action eligibility trace for tracking recent visits, distributes the current return to the recently visited state-action pairs, and updates the income using the eligibility trace.
9. The intelligent defense decision-making method based on reinforcement learning and the attack-defense game according to claim 8, characterized in that in the reinforcement learning, the eligibility trace of each state-action pair is defined as e(s, a); if the current network state is s*, the eligibility trace is updated as e(s, a) ← γλe(s, a) + 1 for s = s*, and e(s, a) ← γλe(s, a) otherwise, distributing the current return to the recently visited state-action pairs, where γ is the discount factor and λ is the trace-decay factor.
10. An intelligent defense decision-making device based on reinforcement learning and the attack-defense game, characterized by comprising:
an attack-defense graph generation module, for constructing the attack-defense game model under the bounded-rationality constraint and generating the attack-defense graph for extracting the network states and attack-defense actions in the game model, wherein the attack-defense graph is host-centric, the graph nodes extract the network states, and the graph edges represent the attack-defense actions;
a defense strategy selection module, for performing reinforcement learning on the attack-defense game process based on the network states and attack-defense actions, in combination with the attack-defense game model and according to the environmental feedback in the confrontation between the attacker and the defender, so that under bounded rationality the defender automatically selects the optimal defense strategy when facing different attackers.
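The learning mechanism of claims 5 to 9 (a model-free WoLF-PHC income update with eligibility traces and an average-strategy win/lose test) can be sketched as follows. This is a minimal illustration with assumed parameter values (α, γ, λ, and win/lose policy learning rates with δ_win < δ_lose); the class and method names are hypothetical, not from the patent.

```python
import random
from collections import defaultdict

class WoLFPHCDefender:
    """Minimal WoLF-PHC defender sketch (claims 5-9); names are illustrative."""

    def __init__(self, actions, alpha=0.2, gamma=0.9, lam=0.5,
                 delta_win=0.01, delta_lose=0.04):
        self.actions = list(actions)
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.delta_win, self.delta_lose = delta_win, delta_lose
        self.Q = defaultdict(float)   # state-action income Q(s, d)
        self.e = defaultdict(float)   # eligibility trace e(s, d)
        self.pi = {}                  # current mixed strategy per state
        self.avg_pi = {}              # average strategy (win/lose test)
        self.visits = defaultdict(int)

    def _policies(self, s):
        # Lazily initialize both strategies to uniform for a new state.
        if s not in self.pi:
            n = len(self.actions)
            self.pi[s] = {d: 1.0 / n for d in self.actions}
            self.avg_pi[s] = {d: 1.0 / n for d in self.actions}
        return self.pi[s], self.avg_pi[s]

    def choose(self, s):
        pi, _ = self._policies(s)
        return random.choices(self.actions,
                              weights=[pi[d] for d in self.actions])[0]

    def update(self, s, d, reward, s_next):
        pi, avg_pi = self._policies(s)
        # Income update toward the greedy successor value (claim 6).
        best_next = max(self.Q[(s_next, d2)] for d2 in self.actions)
        delta = reward + self.gamma * best_next - self.Q[(s, d)]
        # Eligibility traces distribute the return over recent visits
        # and decay by gamma * lambda (claims 8-9).
        self.e[(s, d)] += 1.0
        for key in list(self.e):
            self.Q[key] += self.alpha * delta * self.e[key]
            self.e[key] *= self.gamma * self.lam
        # Average strategy tracks the running mean of pi (claim 7).
        self.visits[s] += 1
        for a in self.actions:
            avg_pi[a] += (pi[a] - avg_pi[a]) / self.visits[s]
        # Winning if the current strategy outperforms the average strategy.
        winning = (sum(pi[a] * self.Q[(s, a)] for a in self.actions)
                   > sum(avg_pi[a] * self.Q[(s, a)] for a in self.actions))
        lr = self.delta_win if winning else self.delta_lose  # learn fast when losing
        # Hill-climb toward the greedy action, keeping pi a distribution.
        greedy = max(self.actions, key=lambda a: self.Q[(s, a)])
        for a in self.actions:
            step = lr if a == greedy else -lr / (len(self.actions) - 1)
            pi[a] = min(1.0, max(0.0, pi[a] + step))
        total = sum(pi.values())
        for a in self.actions:
            pi[a] /= total
```

With two defense actions and a positive return for one of them, a single update raises both the income Q of that state-action pair and its probability under the current strategy, while the eligibility trace lets later returns keep crediting it.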
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910292304.2A CN110166428B (en) | 2019-04-12 | 2019-04-12 | Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910292304.2A CN110166428B (en) | 2019-04-12 | 2019-04-12 | Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110166428A true CN110166428A (en) | 2019-08-23 |
CN110166428B CN110166428B (en) | 2021-05-07 |
Family
ID=67639176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910292304.2A Active CN110166428B (en) | 2019-04-12 | 2019-04-12 | Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110166428B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110659492A (en) * | 2019-09-24 | 2020-01-07 | 北京信息科技大学 | Multi-agent reinforcement learning-based malicious software detection method and device |
CN111988415A (en) * | 2020-08-26 | 2020-11-24 | 绍兴文理学院 | Mobile sensing equipment calculation task safety unloading method based on fuzzy game |
CN112221160A (en) * | 2020-10-22 | 2021-01-15 | 厦门渊亭信息科技有限公司 | Role distribution system based on random game |
CN113132398A (en) * | 2021-04-23 | 2021-07-16 | 中国石油大学(华东) | Array honeypot system defense strategy prediction method based on Q learning |
CN113810406A (en) * | 2021-09-15 | 2021-12-17 | 浙江工业大学 | Network space security defense method based on dynamic defense graph and reinforcement learning |
CN114844668A (en) * | 2022-03-17 | 2022-08-02 | 清华大学 | Defense resource configuration method, device, equipment and readable medium |
CN115296850A (en) * | 2022-07-08 | 2022-11-04 | 中电信数智科技有限公司 | Network attack and defense exercise distributed learning method based on artificial intelligence |
CN115348064A (en) * | 2022-07-28 | 2022-11-15 | 南京邮电大学 | Power distribution network defense strategy design method based on dynamic game under network attack |
CN116032653A (en) * | 2023-02-03 | 2023-04-28 | 中国海洋大学 | Method, device, equipment and storage medium for constructing network security game strategy |
CN116708042A (en) * | 2023-08-08 | 2023-09-05 | 中国科学技术大学 | Strategy space exploration method for network defense game decision |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014100738A1 (en) * | 2012-12-21 | 2014-06-26 | InsideSales.com, Inc. | Instance weighted learning machine learning model |
CN104994569A (en) * | 2015-06-25 | 2015-10-21 | 厦门大学 | Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method |
CN107135224A (en) * | 2017-05-12 | 2017-09-05 | 中国人民解放军信息工程大学 | Cyber-defence strategy choosing method and its device based on Markov evolutionary Games |
CN108512837A (en) * | 2018-03-16 | 2018-09-07 | 西安电子科技大学 | A kind of method and system of the networks security situation assessment based on attacking and defending evolutionary Game |
CN108809979A (en) * | 2018-06-11 | 2018-11-13 | 中国人民解放军战略支援部队信息工程大学 | Automatic intrusion response decision-making technique based on Q-learning |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110659492A (en) * | 2019-09-24 | 2020-01-07 | 北京信息科技大学 | Multi-agent reinforcement learning-based malicious software detection method and device |
CN110659492B (en) * | 2019-09-24 | 2021-10-15 | 北京信息科技大学 | Multi-agent reinforcement learning-based malicious software detection method and device |
CN111988415A (en) * | 2020-08-26 | 2020-11-24 | 绍兴文理学院 | Mobile sensing equipment calculation task safety unloading method based on fuzzy game |
CN111988415B (en) * | 2020-08-26 | 2021-04-02 | 绍兴文理学院 | Mobile sensing equipment calculation task safety unloading method based on fuzzy game |
CN112221160A (en) * | 2020-10-22 | 2021-01-15 | 厦门渊亭信息科技有限公司 | Role distribution system based on random game |
CN113132398A (en) * | 2021-04-23 | 2021-07-16 | 中国石油大学(华东) | Array honeypot system defense strategy prediction method based on Q learning |
CN113810406A (en) * | 2021-09-15 | 2021-12-17 | 浙江工业大学 | Network space security defense method based on dynamic defense graph and reinforcement learning |
CN114844668A (en) * | 2022-03-17 | 2022-08-02 | 清华大学 | Defense resource configuration method, device, equipment and readable medium |
CN115296850A (en) * | 2022-07-08 | 2022-11-04 | 中电信数智科技有限公司 | Network attack and defense exercise distributed learning method based on artificial intelligence |
CN115348064A (en) * | 2022-07-28 | 2022-11-15 | 南京邮电大学 | Power distribution network defense strategy design method based on dynamic game under network attack |
CN115348064B (en) * | 2022-07-28 | 2023-09-26 | 南京邮电大学 | Dynamic game-based power distribution network defense strategy design method under network attack |
CN116032653A (en) * | 2023-02-03 | 2023-04-28 | 中国海洋大学 | Method, device, equipment and storage medium for constructing network security game strategy |
CN116708042A (en) * | 2023-08-08 | 2023-09-05 | 中国科学技术大学 | Strategy space exploration method for network defense game decision |
CN116708042B (en) * | 2023-08-08 | 2023-11-17 | 中国科学技术大学 | Strategy space exploration method for network defense game decision |
Also Published As
Publication number | Publication date |
---|---|
CN110166428B (en) | 2021-05-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110166428A (en) | Intelligence defence decision-making technique and device based on intensified learning and attacking and defending game | |
CN111966698B (en) | Block chain-based trusted federation learning method, system, device and medium | |
Zhang et al. | Gan enhanced membership inference: A passive local attack in federated learning | |
CN108833401A (en) | Network active defensive strategy choosing method and device based on Bayes's evolutionary Game | |
CN108833402A (en) | A kind of optimal defence policies choosing method of network based on game of bounded rationality theory and device | |
CN107566387B (en) | Network defense action decision method based on attack and defense evolution game analysis | |
CN110191083A (en) | Safety defense method, device and the electronic equipment threatened towards advanced duration | |
CN107135224A (en) | Cyber-defence strategy choosing method and its device based on Markov evolutionary Games | |
CN110300106A (en) | Mobile target based on Markov time game defends decision choosing method, apparatus and system | |
CN110035066B (en) | Attack and defense behavior quantitative evaluation method and system based on game theory | |
CN110460572A (en) | Mobile target defence policies choosing method and equipment based on Markov signaling games | |
CN109327427A (en) | A kind of dynamic network variation decision-making technique and its system in face of unknown threat | |
CN107483486A (en) | Cyber-defence strategy choosing method based on random evolution betting model | |
Guo et al. | Adversarial policy learning in two-player competitive games | |
CN107070956A (en) | APT Attack Prediction methods based on dynamic bayesian game | |
CN110417733B (en) | Attack prediction method, device and system based on QBD attack and defense random evolution game model | |
CN109589607A (en) | A kind of game anti-cheating method and game anti-cheating system based on block chain | |
CN110099045A (en) | Network security threats method for early warning and device based on qualitative differential game and evolutionary Game | |
CN108696534A (en) | Real-time network security threat early warning analysis method and its device | |
Xenopoulos et al. | Graph neural networks to predict sports outcomes | |
Keegan et al. | Sic transit gloria mundi virtuali? Promise and peril in the computational social science of clandestine organizing | |
He et al. | Group password strength meter based on attention mechanism | |
Han et al. | Multiresolution tensor decomposition for multiple spatial passing networks | |
Yang et al. | Designing better strategies against human adversaries in network security games. | |
Moskal et al. | Simulating attack behaviors in enterprise networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |