CN110166428B - Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game - Google Patents


Info

Publication number
CN110166428B
CN110166428B CN201910292304.2A
Authority
CN
China
Prior art keywords
defense
attack
reinforcement learning
state
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910292304.2A
Other languages
Chinese (zh)
Other versions
CN110166428A (en)
Inventor
胡浩
张玉臣
杨峻楠
谢鹏程
刘玉岭
马博文
冷强
张畅
陈周文
林野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201910292304.2A priority Critical patent/CN110166428B/en
Publication of CN110166428A publication Critical patent/CN110166428A/en
Application granted granted Critical
Publication of CN110166428B publication Critical patent/CN110166428B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of network security, and particularly relates to an intelligent defense decision-making method and device based on reinforcement learning and attack and defense gaming. The method comprises: constructing an attack and defense game model under the constraint of bounded rationality and generating a host-centric attack and defense graph for extracting the network states and attack and defense actions of the game model, the nodes of the graph yielding the network states and its edges describing the attack and defense actions; and, when the network state transition probabilities are unknown, letting the defender learn the defense payoffs online so that it automatically selects the optimal defense strategy when facing different attackers. The game state space is effectively compressed, reducing storage and computation overhead; the defender performs reinforcement learning from environmental feedback during the confrontation with the attacker and can adaptively make the optimal choice against different attacks; the defender's learning speed and defense payoff are improved, the dependence on historical data is reduced, and the real-time performance and intelligence of defense decision-making are effectively improved.

Description

Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
Technical Field
The invention belongs to the technical field of network security, and particularly relates to an intelligent defense decision method and device based on reinforcement learning and attack and defense gaming.
Background
In recent years, information security incidents have become frequent and have caused huge losses to network security. According to statistics, Alibaba Cloud alone suffered roughly 1.6 billion attacks per day in 2017. For a given attacker, each attack and defense scenario may occur only once, but a defender such as Alibaba Cloud faces a large number of identical attack and defense scenarios every day. Given that the hardware resources of network equipment are limited, the defender must weigh defense cost and payoff comprehensively so as to maximize defense payoff and strike a balance between risk and investment, and must learn and update payoffs online across a large number of identical attack and defense scenarios; while maintaining an appropriate level of security, security administrators therefore face the dilemma that 'the optimal strategy is difficult to select'. Game theory closely matches the adversarial goals, non-cooperative relationship and strategy interdependence of network attack and defense. Existing game-theoretic defense decision methods fall into two categories, based respectively on the complete-rationality assumption and the bounded-rationality assumption. The first category assumes completely rational attack and defense participants: each participant can intelligently select the strategy that maximizes its own payoff and can predict the strategy choices of the other participants. Applied to wireless sensor security, such methods analyse the efficiency of worm attack and defense strategies by building a non-cooperative game model between the attacker and the sensor trust nodes and deriving the optimal attack strategy from the Nash equilibrium, or analyse node packet-forwarding strategies by building a repeated game model between the intrusion detection system and the wireless sensor nodes. The second category assumes boundedly rational participants: neither side can find the optimal strategy at the outset but can learn during the attack and defense game, and a suitable learning mechanism is the key to winning. Work in this category mainly revolves around evolutionary games, which take a population as the research object, adopt a biological evolution mechanism, and learn by imitating the dominant strategies of other members. Evolutionary games require too much information exchange among participants and mainly study the adjustment process, trend and stability of group strategies, which is of little help in guiding the real-time strategy selection of an individual member. How to adopt a better learning mechanism to model the attack and defense process and improve the accuracy and timeliness of defense decisions has therefore become an urgent technical problem.
Disclosure of Invention
Therefore, the invention provides an intelligent defense decision-making method and device based on reinforcement learning and attack and defense gaming that suit the actual attack and defense network environment, realize intelligent defense decision-making with online learning capability, and offer strong practicability and operability.
According to the design scheme provided by the invention, the intelligent defense decision-making method based on reinforcement learning and attack and defense gaming comprises the following contents:
A) constructing an attack and defense game model under the constraint of bounded rationality and generating an attack and defense graph for extracting the network states and attack and defense actions of the game model, where the attack and defense graph is host-centric, its nodes yield the network states and its edges describe the attack and defense actions;
B) based on the network states and attack and defense actions, performing reinforcement learning on the attack and defense game process relying on the attack and defense game model, so that the boundedly rational defender, guided by system feedback during the confrontation between the two sides, automatically selects the optimal defense strategy when facing different attackers.
In the above description, in A), the attack and defense game model is represented by a six-tuple, i.e., AD-SGM = (N, S, D, R, Q, π), where N denotes the players participating in the game, S the set of stochastic game states, D the defender's action set, R the defender's immediate reward, Q the defender's state-action payoff function, and π the defender's defense strategy.
In the above description, the attack and defense graph is represented by a two-tuple, i.e., G = (S, E), where S denotes the set of node security states and E denotes the node-state transitions caused by attack or defense actions.
Preferably, when the attack and defense graph is generated, the target network is scanned to obtain the network security elements, attack instantiation is then performed by combining them with an attack template, defense instantiation is performed by combining the attack template with a defense template, and the attack and defense graph is finally generated, wherein the state set of the attack and defense game model is extracted from the nodes of the attack and defense graph and the defense action set is extracted from its edges.
In the above step B), the reinforcement learning adopts the model-free Win or Learn Fast policy hill-climbing (WoLF-PHC) mechanism: knowledge of rewards and environment state transitions is acquired through interaction with the environment and expressed as payoffs, the defender's policy learning rate is set so as to adapt to the attacker's strategy, learning proceeds by updating the payoffs, and the defender's optimal defense strategy is thereby determined.
Preferably, the payoff is updated as
Q_d(s,d) ← (1-α)Q_d(s,d) + α[R_d(s,d,s') + γ max_{d'} Q_d(s',d')]
The strategy of reinforcement learning is as follows:
π_d(s,d) ← π_d(s,d) + Δ_{sd}, where Δ_{sd} = -δ_{sd} if d ≠ argmax_{d'} Q_d(s,d') and Δ_{sd} = Σ_{d'≠d} δ_{sd'} otherwise, with δ_{sd} = min(π_d(s,d), δ/(|D_s|-1))
where α is the gain learning rate, γ is the discount factor, and R_d(s,d,s') denotes the defender's immediate reward when the network transitions to state s' after defense action d is executed in state s.
Furthermore, the average strategy is adopted as the criterion for winning or losing: the defender is judged to be winning when Σ_d π_d(s,d)Q_d(s,d) > Σ_d π̄_d(s,d)Q_d(s,d) and losing otherwise, the average strategy π̄_d being updated as π̄_d(s,d) ← π̄_d(s,d) + (π_d(s,d) - π̄_d(s,d))/C(s), where C(s) counts the visits to state s.
furthermore, in the model-free reinforcement learning mechanism, an eligibility trace for tracking the recently visited state-action track is introduced, the current reward is distributed to the recently visited state-action, and the earnings are updated by utilizing the eligibility trace.
Furthermore, in the reinforcement learning the eligibility trace of each state-action pair is defined as e(s,a). With the current network state denoted s, the trace is updated as
e(s,a) ← γλe(s,a) + 1 for the state-action pair just executed, and e(s',a') ← γλe(s',a') for all other pairs,
so that the current reward is assigned to the most recently visited state-action pairs, where γ is the discount factor and λ is the trace decay factor.
Furthermore, an intelligent defense decision-making device based on reinforcement learning and attack and defense gaming comprises:
the attack and defense graph generation module, configured to construct an attack and defense game model under the bounded-rationality constraint and to generate an attack and defense graph for extracting the network states and attack and defense actions of the game model, the attack and defense graph being host-centric, its nodes yielding the network states and its edges describing the attack and defense actions;
the defense strategy selection module, configured to perform reinforcement learning on the attack and defense game process based on the network states and attack and defense actions in combination with the attack and defense game model, so that the boundedly rational defender, using environmental feedback during the confrontation between the two sides, automatically selects the optimal defense strategy when facing different attackers.
The invention has the beneficial effects that:
in the invention, a host-centric attack and defense graph model is used to extract the network states and attack and defense actions, which effectively compresses the game state space; the defender adopts a reinforcement learning mechanism and learns from environmental feedback during the confrontation with the attacker, so that the boundedly rational defender automatically makes the optimal choice when facing different attackers; and adding the eligibility trace to the decision procedure increases the defender's learning speed, reduces the dependence on historical data, and effectively improves the real-time performance and intelligence of defense decision-making.
Description of the drawings:
FIG. 1 is a schematic diagram of an embodiment of an intelligent defense decision flow;
FIG. 2 is a schematic diagram illustrating an exemplary attack/defense state transition;
FIG. 3 is a schematic diagram of an embodiment reinforcement learning mechanism;
FIG. 4 shows an experimental network structure in an example;
FIG. 5 is a diagram illustrating exemplary network vulnerability information;
FIG. 6 is the attack graph in the example;
FIG. 7 is the defense graph in the example;
FIG. 8 is a description of the defense actions in the embodiment;
FIG. 9 is a table of the parameter settings used in the experiments;
FIG. 10 is an embodiment defense decision situation diagram;
FIG. 11 is a defense revenue situation diagram in an embodiment.
Detailed description of the embodiments:
in order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions. The technical terms involved in the examples are as follows:
Reinforcement learning is a classical online learning method in which the participant learns independently from environmental feedback; compared with a biology-style evolutionary learning mode, it offers faster learning and better timeliness. The non-cooperative relationship, adversarial goals and strategy interdependence of games match the basic characteristics of network attack and defense. As shown in FIG. 1, an embodiment of the invention provides an intelligent defense decision method based on reinforcement learning and the attack and defense game, comprising the following steps:
constructing an attack and defense game model under the constraint of bounded rationality and generating an attack and defense graph for extracting the network states and attack and defense actions of the game model, where the attack and defense graph is host-centric, its nodes yield the network states and its edges describe the attack and defense actions;
performing reinforcement learning on the attack and defense game process based on the network states and attack and defense actions and relying on the attack and defense game model, so that the boundedly rational defender automatically selects the optimal defense strategy against different attackers according to system feedback during the confrontation between the two sides.
Dynamic threat tracking and analysis based on the attribute attack graph has obvious advantages in attack-path inference, threat transition probability, context inference, loop elimination, real-time analysis, multi-path synthesis, privilege escalation and access relations.
A reinforcement learning mechanism is introduced into the attack and defense game: an attack and defense game model is constructed under the bounded-rationality constraint, a host-centric attack and defense graph is generated for extracting the network states and attack and defense actions of the game model, and online, real-time, automatic defense decision-making is realized through reinforcement learning.
The network attack and defense game model uses probability values to describe the randomness of network state transitions. Because the current network state depends mainly on the previous network state, the state-transition relationship is expressed as a first-order Markov process; as shown in FIG. 2, the transition probability is P(s_t, a_t, d_t, s_{t+1}), where s is the network state and (a, d) are the attack and defense actions. Since the two sides have adversarial goals and a non-cooperative relationship, each deliberately hides its key information, so the transition probabilities are treated as information unknown to both sides. The game model is constructed on this basis. In another embodiment of the invention, the attack and defense stochastic game model is represented by a six-tuple AD-SGM = (N, S, D, R, Q, π), where N = (attacker, defender) are the two players participating in the game, representing the network attacker and defender respectively; S = (s_1, s_2, …, s_n) is the set of stochastic game states, composed of network states; D = (D_1, D_2, …, D_n) is the set of defense actions, where D_k = {d_1, d_2, …, d_m} is the defender's action set in game state s_k; R_d(s_i, d, s_j) is the defender's immediate reward when the network transitions to state s_j after the defender executes defense action d in state s_i; Q_d(s_i, d) is the defender's expected payoff after taking action d in state s_i; and π_d(s_k) is the defender's defense strategy in state s_k.
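As a concrete illustration only, the AD-SGM six-tuple could be held in memory as sketched below; the Python names (ADSGM, init_no_prior) and the dictionary layouts are assumptions of this sketch, not definitions from the patent.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class ADSGM:
    """Minimal container for the attack and defense stochastic game model
    AD-SGM = (N, S, D, R, Q, pi). Transition probabilities are deliberately
    not stored, since they are assumed unknown to both players."""
    players: tuple          # N, e.g. ("attacker", "defender")
    states: list            # S = [s1, ..., sn]
    actions: dict           # D: state -> list of defense actions D_k
    reward: dict            # R_d[(s_i, d, s_j)] -> immediate reward
    q: dict = field(default_factory=dict)       # Q_d[(s, d)] -> expected payoff
    policy: dict = field(default_factory=dict)  # pi_d[s] -> probability vector over actions[s]

    def init_no_prior(self):
        """Zero payoffs and a uniform (average) strategy, i.e. no prior knowledge."""
        for s in self.states:
            self.policy[s] = np.ones(len(self.actions[s])) / len(self.actions[s])
            for d in self.actions[s]:
                self.q[(s, d)] = 0.0

The zero payoffs and uniform strategy produced by init_no_prior mirror the initialization used later in the experimental scenario.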
A defense strategy and a defense action are two different concepts: a defense strategy is a rule over defense actions that specifies which action the defender chooses in each network state. For example, expressed as a probability vector, π_d(s_k) = (π_d(s_k,d_1), …, π_d(s_k,d_m)) is the defender's strategy in network state s_k, where π_d(s_k,d_m) is the probability of selecting action d_m and
Σ_{i=1}^{m} π_d(s_k,d_i) = 1.
By creating the network attack and defense graph G, the network states are extracted from the nodes of G, the attack and defense actions are analysed from the edges of G, and the attack and defense strategies are extracted. In another embodiment of the invention, the attack and defense graph is represented as a two-tuple G = (S, E), where S = {s_1, s_2, …, s_n} is the set of node security states; each s_i uniquely identifies a node and records the privilege held on it: no privilege, normal user privilege, or administrator privilege. E = (E_a, E_d) is the set of directed edges, indicating that the occurrence of an attack or defense action causes a transition of the node state; e_k = (s_r, v/d, s_d), k ∈ {a, d}, where s_r is the source node, s_d is the destination node, and v/d is the attack (vulnerability exploitation) or defense action labelling the edge.
Further, when the attack and defense graph is generated, the target network is scanned to obtain the network security elements, attack instantiation is then performed by combining them with an attack template, defense instantiation is performed by combining the attack template with a defense template, and the attack and defense graph is finally generated. The state set of the attack and defense stochastic game model is extracted from the nodes of the attack and defense graph, and the defense action set is extracted from its edges. The specific steps can be designed as shown in Algorithm 1:
algorithm 1. attack and defense graph generation algorithm
(The pseudocode of Algorithm 1 is provided as an image in the original document.)
Step 1) generates all possible state nodes from the network security elements and initializes the edges; steps 2) to 11) perform attack instantiation and generate all attack edges; steps 12) to 18) perform defense instantiation and generate all defense edges; steps 19) to 23) remove all isolated nodes; step 24) outputs the attack and defense graph.
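Purely for illustration, the flow of Algorithm 1 might be sketched as follows; the patent specifies the steps only at the level just described, so the input formats (security_elements, attack_template, defense_template) and the matching rules inside the loops are assumptions of this sketch, not the patent's algorithm.

import itertools

def generate_attack_defense_graph(security_elements, attack_template, defense_template):
    """Rough sketch of Algorithm 1: build host-centric state nodes from the scanned
    security elements, instantiate attack and defense edges from the templates,
    and drop isolated nodes. Node = (host, privilege); edge = (src, action, dst)."""
    # Step 1): all candidate state nodes, empty edge sets.
    privileges = ("none", "user", "root")
    nodes = {(h, p) for h in security_elements["hosts"] for p in privileges}
    attack_edges, defense_edges = set(), set()

    # Steps 2)-11): attack instantiation. An attack rule is assumed to fire when its
    # vulnerability exists on the destination host and the privilege levels match.
    for rule in attack_template:
        for src, dst in itertools.product(nodes, nodes):
            if (rule["vuln"] in security_elements["vulns"].get(dst[0], [])
                    and src[1] == rule["pre_priv"] and dst[1] == rule["post_priv"]):
                attack_edges.add((src, rule["attack"], dst))

    # Steps 12)-18): defense instantiation. Each defense listed against an
    # instantiated attack is modelled here as restoring the pre-attack state.
    for (src, attack, dst) in attack_edges:
        for defense in defense_template.get(attack, []):
            defense_edges.add((dst, defense, src))

    # Steps 19)-23): remove isolated nodes.
    used = {n for e in attack_edges | defense_edges for n in (e[0], e[2])}
    nodes &= used

    # Step 24): output the attack and defense graph G = (S, E).
    return nodes, (attack_edges, defense_edges)

The returned node set and edge sets correspond to the state set S and the action sets D of the AD-SGM model.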
In the embodiment of the invention, a reinforcement learning mechanism is introduced into the attack and defense game to describe how the attack and defense strategies are learned and improved. WoLF-PHC is a typical model-free reinforcement learning algorithm; its learning mechanism is shown in FIG. 3. In another embodiment of the invention, during reinforcement learning the Agent acquires knowledge of rewards and environment state transitions through interaction with the environment; this knowledge is expressed by the payoff Q_d, and learning is performed by updating Q_d. The payoff function Q_d is:
Q_d(s,d) ← (1-α)Q_d(s,d) + α[R_d(s,d,s') + γ max_{d'} Q_d(s',d')]    (1)
In formula (1), α is the gain learning rate and γ is the discount factor. The strategy of reinforcement learning is as follows:
π_d(s,d) ← π_d(s,d) + Δ_{sd}, where Δ_{sd} = -δ_{sd} if d ≠ argmax_{d'} Q_d(s,d') and Δ_{sd} = Σ_{d'≠d} δ_{sd'} otherwise, with δ_{sd} = min(π_d(s,d), δ/(|D_s|-1))    (2)
Further, WoLF-PHC (Win or Learn Fast policy hill climbing) gives the defender two different policy learning rates through the WoLF mechanism: a low learning rate δ_w is used when winning and a high learning rate δ_l when losing, as shown in formula (5). The two rates let the defender adapt quickly to the attacker's strategy when it performs worse than expected and learn cautiously when it performs better than expected, while the convergence of the algorithm is preserved. The WoLF-PHC algorithm adopts the average strategy as the criterion for winning or losing, as shown in formulas (6) and (7).
δ = δ_w when Σ_d π_d(s,d)Q_d(s,d) > Σ_d π̄_d(s,d)Q_d(s,d), and δ = δ_l otherwise    (5)
π̄_d(s,d) ← π̄_d(s,d) + (π_d(s,d) - π̄_d(s,d))/C(s)    (6)
C(s)=C(s)+1 (7)
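For readers who prefer code, the update rules (1)-(2) and (5)-(7) can be sketched as follows; the function name wolf_phc_update, the dictionary-based containers and the arbitrary tie-breaking in the argmax are assumptions of this sketch rather than the patent's prescription.

import numpy as np

def wolf_phc_update(q, policy, avg_policy, counts, s, d, r, s_next,
                    actions, alpha, gamma, delta_w, delta_l):
    """One WoLF-PHC step for the defender: payoff update (1), win/lose test (5),
    average-strategy update (6)-(7) and policy hill climbing (2)."""
    acts = actions[s]

    # (1): payoff update with the sampled immediate reward r = R_d(s, d, s').
    q_next = max(q[(s_next, d2)] for d2 in actions[s_next])
    q[(s, d)] = (1 - alpha) * q[(s, d)] + alpha * (r + gamma * q_next)

    # (6)-(7): update the visit counter C(s) and the average strategy.
    counts[s] += 1
    avg_policy[s] += (policy[s] - avg_policy[s]) / counts[s]

    # (5): pick the policy learning rate, low when winning, high when losing.
    q_vec = np.array([q[(s, d2)] for d2 in acts])
    winning = policy[s] @ q_vec > avg_policy[s] @ q_vec
    delta = delta_w if winning else delta_l

    # (2): policy hill climbing towards the currently greedy defense action.
    if len(acts) > 1:
        best = int(np.argmax(q_vec))
        step = np.minimum(policy[s], delta / (len(acts) - 1))
        change = -step
        change[best] = step.sum() - step[best]
        policy[s] = np.clip(policy[s] + change, 0.0, None)  # numerical safety only
        policy[s] /= policy[s].sum()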
In order to improve the learning speed of WoLF-PHC and reduce its dependence on the amount of data, in another embodiment of the invention an eligibility trace is introduced into WoLF-PHC. The eligibility trace tracks the most recently visited state-action trajectory and assigns the current reward to the recently visited state-action pairs. Further, the eligibility trace of each state-action pair is defined as e(s,a); with the current network state denoted s, the trace is updated as shown in formula (8), where λ is the trace decay factor.
e(s,a) ← γλe(s,a) + 1 for the state-action pair just executed, and e(s',a') ← γλe(s',a') for all other state-action pairs    (8)
In order to obtain good results with the WoLF-PHC-based defense decision method, the four parameters α, δ, λ and γ must be set reasonably. 1) The gain learning rate α lies in the range 0 < α < 1: a larger α gives more weight to the accumulated reward and speeds up learning, while a smaller α gives the algorithm better stability. 2) The policy learning rate δ lies in the range 0 < δ < 1; experiments show that better results are obtained when the losing-case rate δ_l is kept larger than the winning-case rate δ_w (the specific relation used is given as a formula image in the original document). 3) The eligibility-trace decay factor λ lies in the range 0 < λ < 1; it is responsible for assigning credit to states and actions and can be regarded as a time scale, with a larger λ assigning more credit to historical states and actions. 4) The discount factor γ lies in the range 0 < γ < 1 and expresses the defender's preference between immediate and future rewards: γ close to 0 de-emphasizes future rewards and stresses the immediate reward, while γ close to 1 de-emphasizes the immediate reward and stresses future rewards.
As shown in FIG. 3, the Agent in WoLF-PHC corresponds to the defender in the attack and defense stochastic game model AD-SGM, the Agent's state corresponds to the game state in AD-SGM, the Agent's behaviour corresponds to the defense action in AD-SGM, the Agent's immediate reward corresponds to the immediate reward in AD-SGM, and the Agent's strategy corresponds to the defense strategy in AD-SGM. On this basis, a specific defense decision algorithm can be designed as shown in Algorithm 2:
algorithm 2. defense decision algorithm
(The pseudocode of Algorithm 2 is provided as images in the original document.)
Step 1) initializes the attack and defense stochastic game model AD-SGM and the related parameters, with the network states and attack and defense actions extracted by Algorithm 1; in step 2) the defender detects the current network state; steps 3)-22) perform defense decision-making and online learning, in which steps 4)-5) select a defense action according to the current strategy, steps 6)-14) update the payoff Q_d using the eligibility trace, and steps 15)-21) update the defense strategy π_d from the new payoff Q_d by hill climbing. The space complexity of the algorithm is dominated by storing R_d(s,d,s'), e(s,d), π_d(s,d), π̄_d(s,d) and Q_d(s,d); if |S| is the number of states and |D| is the number of defense actions per state, the space complexity is O(4|S||D| + |S|²|D|). The algorithm does not need to solve for a game equilibrium, which greatly reduces the computational complexity compared with existing stochastic game models and enhances the practicality of the algorithm.
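In the same illustrative spirit, one decision-and-learning step of Algorithm 2, combining the WoLF-PHC update above with the eligibility trace of formula (8), might look like the following sketch; the class name DefenseDecision and its interface are assumptions made for the example.

import numpy as np

class DefenseDecision:
    """Sketch of Algorithm 2: WoLF-PHC with eligibility traces for online defense
    decision-making. States, per-state action lists and rewards are supplied
    externally, e.g. extracted from the attack and defense graph by Algorithm 1."""

    def __init__(self, states, actions, alpha, gamma, lam, delta_w, delta_l):
        self.actions = actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.delta_w, self.delta_l = delta_w, delta_l
        self.q = {(s, d): 0.0 for s in states for d in actions[s]}   # Q_d
        self.e = {(s, d): 0.0 for s in states for d in actions[s]}   # eligibility traces
        self.pi = {s: np.ones(len(actions[s])) / len(actions[s]) for s in states}
        self.avg_pi = {s: self.pi[s].copy() for s in states}         # average strategy
        self.c = {s: 0 for s in states}                              # visit counts C(s)

    def choose_action(self, s):
        # Steps 4)-5): sample a defense action from the current mixed strategy.
        return np.random.choice(self.actions[s], p=self.pi[s])

    def learn(self, s, d, r, s_next):
        # Steps 6)-14): payoff update with the eligibility trace of formula (8).
        best_next = max(self.q[(s_next, d2)] for d2 in self.actions[s_next])
        td = r + self.gamma * best_next - self.q[(s, d)]
        for key in self.e:                       # decay every trace by gamma * lambda
            self.e[key] *= self.gamma * self.lam
        self.e[(s, d)] += 1.0                    # reinforce the pair just visited
        for key in self.e:                       # credit recently visited state-actions
            self.q[key] += self.alpha * td * self.e[key]

        # Steps 15)-21): WoLF-PHC strategy update, formulas (2) and (5)-(7).
        acts = self.actions[s]
        self.c[s] += 1
        self.avg_pi[s] += (self.pi[s] - self.avg_pi[s]) / self.c[s]
        q_vec = np.array([self.q[(s, d2)] for d2 in acts])
        delta = self.delta_w if self.pi[s] @ q_vec > self.avg_pi[s] @ q_vec else self.delta_l
        if len(acts) > 1:
            best = int(np.argmax(q_vec))
            step = np.minimum(self.pi[s], delta / (len(acts) - 1))
            change = -step
            change[best] = step.sum() - step[best]
            self.pi[s] = np.clip(self.pi[s] + change, 0.0, None)
            self.pi[s] /= self.pi[s].sum()

A driver loop would detect the current network state (step 2)), call choose_action, apply the selected defense, observe the next state and the immediate reward R_d(s,d,s'), and then call learn, repeating online.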
Based on the above intelligent defense decision method, an embodiment of the present invention further provides an intelligent defense decision device based on reinforcement learning and attack and defense gaming, including:
the attack and defense graph generation module, configured to construct an attack and defense game model under the bounded-rationality constraint and to generate an attack and defense graph for extracting the network states and attack and defense actions of the game model, the attack and defense graph being host-centric, its nodes yielding the network states and its edges describing the attack and defense actions;
the defense strategy selection module, configured to perform reinforcement learning on the attack and defense game process based on the network states and attack and defense actions in combination with the attack and defense game model, so that the boundedly rational defender, using environmental feedback during the confrontation between the two sides, automatically selects the optimal defense strategy when facing different attackers.
The intelligent defense decision method based on reinforcement learning and the attack and defense game described above is adopted to intelligently select the defense strategy for the target network.
In order to further verify the effectiveness of the technical scheme of the embodiments, experiments were carried out on the typical enterprise network shown in FIG. 4. The attack and defense events occur in the internal network, and the attacker comes from the external network; the network administrator, acting as the defender, is responsible for the security of the intranet. Owing to the configuration of firewall 1 and firewall 2, a normal extranet user can only access the Web server, while the Web server can access the database server, the FTP server and the e-mail server. The experimental network was scanned with the Nessus tool, and its vulnerability information is shown in FIG. 5.
An attack and defense template was constructed with reference to the MIT Lincoln Laboratory attack and defense behaviour database, with the attacker host identified together with the Web server (W), the database server (D), the FTP server (F) and the e-mail server (E). The network attack and defense graph was built with the attack and defense graph generation device and, for convenience of presentation and description, is split into an attack graph and a defense graph, shown in FIG. 6 and FIG. 7 respectively. The defense actions in the defense graph are described in FIG. 8. The attack and defense game model of the experimental scenario is constructed as follows:
① The players participating in the game are N = (attacker, defender), representing the network attacker and the network defender respectively;
② The stochastic game state set is S = (s_0, s_1, s_2, s_3, s_4, s_5, s_6); the stochastic game states are composed of network states and are extracted from the nodes in FIG. 6 and FIG. 7;
③ The defender's action set is D = (D_0, D_1, D_2, D_3, D_4, D_5, D_6), where D_0 = {NULL}, D_1 = {d_1, d_2}, D_2 = {d_3, d_4}, D_3 = {d_1, d_5, d_6}, D_4 = {d_1, d_5, d_6}, D_5 = {d_1, d_2, d_7}, D_6 = {d_3, d_4}, extracted from the edges of FIG. 7;
④ The quantized immediate rewards R_d(s_i, d, s_j) of the defender are:
(R_d(s_0,NULL,s_0), R_d(s_0,NULL,s_1), R_d(s_0,NULL,s_2)) = (0, -40, -59)
(R_d(s_1,d_1,s_0), R_d(s_1,d_1,s_1), R_d(s_1,d_1,s_2); R_d(s_1,d_2,s_0), R_d(s_1,d_2,s_1), R_d(s_1,d_2,s_2)) = (40, 0, -29; 5, -15, -32)
(R_d(s_2,d_3,s_0), R_d(s_2,d_3,s_1), R_d(s_2,d_3,s_2), R_d(s_2,d_3,s_3), R_d(s_2,d_3,s_4), R_d(s_2,d_3,s_5); R_d(s_2,d_4,s_0), R_d(s_2,d_4,s_1), R_d(s_2,d_4,s_2), R_d(s_2,d_4,s_3), R_d(s_2,d_4,s_4), R_d(s_2,d_4,s_5)) = (24, 9, -15, -55, -49, -65; 19, 5, -21, -61, -72, -68)
(R_d(s_3,d_1,s_2), R_d(s_3,d_1,s_3), R_d(s_3,d_1,s_6); R_d(s_3,d_5,s_2), R_d(s_3,d_5,s_3), R_d(s_3,d_5,s_6); R_d(s_3,d_6,s_2), R_d(s_3,d_6,s_3), R_d(s_3,d_6,s_6)) = (21, -16, -72; 15, -23, -81; -21, -36, -81)
(R_d(s_4,d_1,s_2), R_d(s_4,d_1,s_4), R_d(s_4,d_1,s_6); R_d(s_4,d_5,s_2), R_d(s_4,d_5,s_4), R_d(s_4,d_5,s_6); R_d(s_4,d_6,s_2), R_d(s_4,d_6,s_4), R_d(s_4,d_6,s_6)) = (26, 0, -62; 11, -23, -75; 9, -25, -87)
(R_d(s_5,d_1,s_2), R_d(s_5,d_1,s_5), R_d(s_5,d_1,s_6); R_d(s_5,d_2,s_2), R_d(s_5,d_2,s_5), R_d(s_5,d_2,s_6); R_d(s_5,d_7,s_2), R_d(s_5,d_7,s_5), R_d(s_5,d_7,s_6)) = (29, 0, -63; 11, -21, -76; 2, -27, -88)
(R_d(s_6,d_3,s_3), R_d(s_6,d_3,s_4), R_d(s_6,d_3,s_5), R_d(s_6,d_3,s_6); R_d(s_6,d_4,s_3), R_d(s_6,d_4,s_4), R_d(s_6,d_4,s_5), R_d(s_6,d_4,s_6)) = (-23, -21, -19, -42; -28, -31, -24, -49)
⑤ To test the learning performance of the algorithm more fully, the defender's state-action payoff Q_d(s_i, d) is initialized to 0, so that no additional prior knowledge is introduced.
⑥ The defender's defense strategy π_d is initialized with the average strategy, i.e. π_d(s_k, d_1) = π_d(s_k, d_2) = … = π_d(s_k, d_m) with Σ_{i=1}^{m} π_d(s_k, d_i) = 1, so that, again, no additional prior knowledge is introduced.
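For illustration, the experimental model above could be wired into the DefenseDecision sketch from Algorithm 2 as follows; only the rewards of states s_0 and s_1 are shown, and the numeric parameter values below are placeholders, not the settings listed in FIG. 9.

states = ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]
actions = {"s0": ["NULL"], "s1": ["d1", "d2"], "s2": ["d3", "d4"],
           "s3": ["d1", "d5", "d6"], "s4": ["d1", "d5", "d6"],
           "s5": ["d1", "d2", "d7"], "s6": ["d3", "d4"]}
# Immediate rewards R_d(s_i, d, s_j); only the s0 and s1 entries are reproduced here.
# A simulated environment would look these up when supplying r to learn().
reward = {("s0", "NULL", "s0"): 0, ("s0", "NULL", "s1"): -40, ("s0", "NULL", "s2"): -59,
          ("s1", "d1", "s0"): 40, ("s1", "d1", "s1"): 0, ("s1", "d1", "s2"): -29,
          ("s1", "d2", "s0"): 5, ("s1", "d2", "s1"): -15, ("s1", "d2", "s2"): -32}
# Q_d starts at 0 and pi_d at the average strategy inside DefenseDecision, matching
# the "no prior knowledge" initialization of the embodiment; the parameter values
# below are illustrative placeholders only.
agent = DefenseDecision(states, actions, alpha=0.2, gamma=0.9,
                        lam=0.5, delta_w=0.01, delta_l=0.04)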
To test the influence of different parameter settings on the algorithm, state s_2 in FIG. 6 and FIG. 7 is taken as an example, and the attacker's initial strategy in the experiment is a random strategy. To analyse how different parameter values affect the speed and quality of learning, six different parameter settings were tested; the specific settings are shown in FIG. 9.
The probabilities with which the defender selects defense actions d_3 and d_4 in state s_2 are shown in FIG. 10, from which the learning speed and convergence of the algorithm under the different parameter settings can be observed. FIG. 10 shows that settings 1, 3 and 6 learn quickly: under these three settings the algorithm obtains the optimal strategy within 1500 decisions. However, the convergence of settings 3 and 6 is poor: although they learn the optimal strategy, oscillations appear later, and their stability is not as good as that of setting 1.
The defense payoff reflects how well the algorithm optimizes the strategy. To avoid judging from a single defense outcome, the defense payoff is averaged over every 1000 decisions, and the evolution of this average payoff is shown in FIG. 11. FIG. 11 shows that the payoff of setting 3 is clearly lower than the others, while the remaining settings are hard to tell apart. Overall, setting 1 among the six parameter sets is the most suitable for this scenario.
To measure the computational overhead introduced by the eligibility trace, the time taken by the algorithm for 100,000 defense decisions was recorded with and without the eligibility trace, each repeated 20 times; the averages were 9.51 s with the eligibility trace and 3.74 s without it. Although introducing the eligibility trace increases the decision time by nearly a factor of 2.5, 100,000 decisions still take only 9.51 s, which satisfies the real-time requirement.
The experiments above further verify that constructing an attack and defense stochastic game model under the bounded-rationality constraint and generating a network attack and defense graph for extracting the network states and attack and defense strategies effectively compresses the game state space; through learning, the defender obtains the optimal defense strategy for the current attack, its capability for rapid automatic defense against unknown attacks is improved, and the scheme has strong practicability and operability.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The elements of the various examples and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and the components and steps of the examples have been described in a functional generic sense in the foregoing description for clarity of hardware and software interchangeability. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Those skilled in the art will appreciate that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, such as: read-only memory, magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An intelligent defense decision-making method based on reinforcement learning and attack and defense gaming is characterized by comprising the following contents:
A) constructing an attack and defense game model under the constraint of bounded rationality and generating an attack and defense graph for extracting the network states and attack and defense actions of the game model, where the attack and defense graph is host-centric, its nodes yield the network states and its edges describe the attack and defense actions;
B) performing reinforcement learning on the attack and defense game process based on the network states and attack and defense actions and relying on the attack and defense game model, so that the boundedly rational defender automatically selects the optimal defense strategy against different attackers according to system feedback during the confrontation between the two sides;
in B), the reinforcement learning adopts the model-free Win or Learn Fast policy hill-climbing (WoLF-PHC) mechanism: knowledge of rewards and environment state transitions is acquired through interaction with the environment and expressed as payoffs, the defender's policy learning rate is set so as to adapt to the attacker's strategy, reinforcement learning proceeds by updating the payoffs, and the defender's optimal defense strategy is determined;
the payoff being updated as
Q_d(s,d) ← (1-α)Q_d(s,d) + α[R_d(s,d,s') + γ max_{d'} Q_d(s',d')]
The strategy of reinforcement learning is as follows:
π_d(s,d) ← π_d(s,d) + Δ_{sd}, where Δ_{sd} = -δ_{sd} if d ≠ argmax_{d'} Q_d(s,d') and Δ_{sd} = Σ_{d'≠d} δ_{sd'} otherwise, with δ_{sd} = min(π_d(s,d), δ/(|D_s|-1))
where α is the gain learning rate, γ is the discount factor, and R_d(s,d,s') denotes the defender's immediate reward when the network transitions to state s' after defense action d is executed in state s.
2. The intelligent defense decision method based on reinforcement learning and attack and defense game as claimed in claim 1, wherein in A) the attack and defense game model is represented by a six-tuple, i.e., AD-SGM = (N, S, D, R, Q, π), where N denotes the players participating in the game, S the set of stochastic game states, D the defender's action set, R the defender's immediate reward, Q the defender's state-action payoff function, and π the defender's defense strategy.
3. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 1, characterized in that the attack and defense graph is represented by a two-tuple, i.e., G = (S, E), where S denotes the set of node security states and E denotes the node-state transitions caused by attack or defense actions.
4. The intelligent defense decision method based on reinforcement learning and attack and defense gaming of claim 3, characterized in that, when the attack and defense graph is generated, the network security elements are obtained by scanning the target network, attack instantiation is then performed in combination with the attack template, defense instantiation is performed in combination with the defense template, and the attack and defense graph is finally generated, wherein the state set of the attack and defense gaming model is extracted from the nodes of the attack and defense graph and the defense action set is extracted from its edges.
5. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 1, characterized in that the average strategy is adopted as the criterion for winning or losing, expressed as: the defender is winning when Σ_d π_d(s,d)Q_d(s,d) > Σ_d π̄_d(s,d)Q_d(s,d) and losing otherwise, the average strategy being updated as π̄_d(s,d) ← π̄_d(s,d) + (π_d(s,d) - π̄_d(s,d))/C(s) with C(s) ← C(s) + 1 counting the visits to state s.
6. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 1, characterized in that, in the model-free reinforcement learning mechanism, an eligibility trace that tracks the most recently visited state-action trajectory is introduced, the current reward is distributed over the recently visited state-action pairs, and the payoff is updated using the eligibility trace.
7. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 6, characterized in that, in the reinforcement learning, the eligibility trace of each state-action pair is defined as e(s,a) and, with the current network state denoted s, the trace is updated as e(s,a) ← γλe(s,a) + 1 for the state-action pair just executed and e(s',a') ← γλe(s',a') for all other pairs, so that the current reward is assigned to the most recently visited state-action pairs, where γ is the discount factor and λ is the trace decay factor.
8. An intelligent defense decision-making device based on reinforcement learning and attack and defense games, characterized in that the intelligent defense decision-making method based on reinforcement learning and attack and defense games according to any one of claims 1 to 7 is adopted to intelligently select the defense strategy of a target network.
CN201910292304.2A 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game Active CN110166428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910292304.2A CN110166428B (en) 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910292304.2A CN110166428B (en) 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game

Publications (2)

Publication Number Publication Date
CN110166428A CN110166428A (en) 2019-08-23
CN110166428B true CN110166428B (en) 2021-05-07

Family

ID=67639176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910292304.2A Active CN110166428B (en) 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game

Country Status (1)

Country Link
CN (1) CN110166428B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659492B (en) * 2019-09-24 2021-10-15 北京信息科技大学 Multi-agent reinforcement learning-based malicious software detection method and device
CN111988415B (en) * 2020-08-26 2021-04-02 绍兴文理学院 Mobile sensing equipment calculation task safety unloading method based on fuzzy game
CN112221160B (en) * 2020-10-22 2022-05-17 厦门渊亭信息科技有限公司 Role distribution system based on random game
CN113132398B (en) * 2021-04-23 2022-05-31 中国石油大学(华东) Array honeypot system defense strategy prediction method based on Q learning
CN113810406B (en) * 2021-09-15 2023-04-07 浙江工业大学 Network space security defense method based on dynamic defense graph and reinforcement learning
CN114844668A (en) * 2022-03-17 2022-08-02 清华大学 Defense resource configuration method, device, equipment and readable medium
CN115296850A (en) * 2022-07-08 2022-11-04 中电信数智科技有限公司 Network attack and defense exercise distributed learning method based on artificial intelligence
CN115348064B (en) * 2022-07-28 2023-09-26 南京邮电大学 Dynamic game-based power distribution network defense strategy design method under network attack
CN116032653A (en) * 2023-02-03 2023-04-28 中国海洋大学 Method, device, equipment and storage medium for constructing network security game strategy
CN116708042B (en) * 2023-08-08 2023-11-17 中国科学技术大学 Strategy space exploration method for network defense game decision

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014100738A1 (en) * 2012-12-21 2014-06-26 InsideSales.com, Inc. Instance weighted learning machine learning model
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method
CN107135224A (en) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 Cyber-defence strategy choosing method and its device based on Markov evolutionary Games
CN108512837A (en) * 2018-03-16 2018-09-07 西安电子科技大学 A kind of method and system of the networks security situation assessment based on attacking and defending evolutionary Game
CN108809979A (en) * 2018-06-11 2018-11-13 中国人民解放军战略支援部队信息工程大学 Automatic intrusion response decision-making technique based on Q-learning


Also Published As

Publication number Publication date
CN110166428A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110166428B (en) Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
CN111966698B (en) Block chain-based trusted federation learning method, system, device and medium
CN107483486B (en) Network defense strategy selection method based on random evolution game model
CN107135224B (en) Network defense strategy selection method and device based on Markov evolution game
CN107566387B (en) Network defense action decision method based on attack and defense evolution game analysis
Song et al. Training genetic programming on half a million patterns: an example from anomaly detection
CN110460572A (en) Mobile target defence policies choosing method and equipment based on Markov signaling games
Zennaro et al. Modelling penetration testing with reinforcement learning using capture‐the‐flag challenges: Trade‐offs between model‐free learning and a priori knowledge
CN108809979A (en) Automatic intrusion response decision-making technique based on Q-learning
Huang et al. Markov differential game for network defense decision-making method
CN113505855B (en) Training method for challenge model
Chen et al. Marnet: Backdoor attacks against cooperative multi-agent reinforcement learning
CN113033822A (en) Antagonistic attack and defense method and system based on prediction correction and random step length optimization
CN110493262A (en) It is a kind of to improve the network attack detecting method classified and system
Chen et al. Smoothing matters: Momentum transformer for domain adaptive semantic segmentation
Zhang et al. Building robust ensembles via margin boosting
CN116582349A (en) Attack path prediction model generation method and device based on network attack graph
CN115580430A (en) Attack tree-pot deployment defense method and device based on deep reinforcement learning
Xenopoulos et al. Graph neural networks to predict sports outcomes
Li et al. Robust moving target defense against unknown attacks: A meta-reinforcement learning approach
CN116707870A (en) Defensive strategy model training method, defensive strategy determining method and equipment
CN116192424A (en) Method for attacking global data distribution in federation learning scene
Moskal et al. Simulating attack behaviors in enterprise networks
Guan et al. A Bayesian Improved Defense Model for Deceptive Attack in Honeypot-Enabled Networks
CN112583844A (en) Big data platform defense method for advanced sustainable threat attack

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant