CN110166428B - Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game - Google Patents


Info

Publication number
CN110166428B
CN110166428B CN201910292304.2A
Authority
CN
China
Prior art keywords
defense
attack
reinforcement learning
state
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910292304.2A
Other languages
Chinese (zh)
Other versions
CN110166428A (en)
Inventor
胡浩
张玉臣
杨峻楠
谢鹏程
刘玉岭
马博文
冷强
张畅
陈周文
林野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201910292304.2A priority Critical patent/CN110166428B/en
Publication of CN110166428A publication Critical patent/CN110166428A/en
Application granted granted Critical
Publication of CN110166428B publication Critical patent/CN110166428B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of network security, and particularly relates to an intelligent defense decision-making method and device based on reinforcement learning and attack and defense gaming. The method comprises: constructing an attack and defense game model under the constraint of bounded rationality and generating a host-centric attack and defense graph for extracting the network states and attack and defense actions of the game model, the nodes of the graph yielding the network states and its edges describing the attack and defense actions; and, when the network state transition probabilities are unknown, letting the defender learn the defense payoffs online so that it automatically selects the optimal defense strategy when facing different attackers. The game state space is effectively compressed, reducing storage and computation overhead; the defender performs reinforcement learning from environmental feedback during the confrontation with the attacker and can adaptively make the optimal choice against different attacks; the defender's learning speed and defense payoff are improved, the dependence on historical data is reduced, and the real-time performance and intelligence of defense decision-making are effectively improved.

Description

Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
Technical Field
The invention belongs to the technical field of network security, and particularly relates to an intelligent defense decision method and device based on reinforcement learning and attack and defense gaming.
Background
In recent years, information security incidents have become frequent and have caused huge losses to network security. According to statistics, Alibaba Cloud alone suffered roughly 1.6 billion attacks per day in 2017. For a given attacker, each attack and defense scenario may occur only once, but a defender such as Alibaba Cloud faces a large number of identical attack and defense scenarios every day. Given that the hardware resources of network equipment are limited, the defender must weigh defense cost and payoff comprehensively so as to maximize defense payoff and strike a balance between risk and investment, and must learn and update payoffs online across a large number of identical attack and defense scenarios; while maintaining an appropriate level of security, security administrators therefore face the dilemma that 'the optimal strategy is difficult to select'. Game theory closely matches the adversarial goals, non-cooperative relationship and strategy interdependence of network attack and defense. Existing game-theoretic defense decision methods fall into two categories, based respectively on the complete-rationality assumption and the bounded-rationality assumption. The first category assumes completely rational attack and defense participants: each participant can intelligently select the strategy that maximizes its own payoff and can predict the strategy choices of the other participants. Applied to wireless sensor security, such methods analyse the efficiency of worm attack and defense strategies by building a non-cooperative game model between the attacker and the sensor trust nodes and deriving the optimal attack strategy from the Nash equilibrium, or analyse node packet-forwarding strategies by building a repeated game model between the intrusion detection system and the wireless sensor nodes. The second category assumes boundedly rational participants: neither side can find the optimal strategy at the outset but can learn during the attack and defense game, and a suitable learning mechanism is the key to winning. Work in this category mainly revolves around evolutionary games, which take a population as the research object, adopt a biological evolution mechanism, and learn by imitating the dominant strategies of other members. Evolutionary games require too much information exchange among participants and mainly study the adjustment process, trend and stability of group strategies, which is of little help in guiding the real-time strategy selection of an individual member. How to adopt a better learning mechanism to model the attack and defense process and improve the accuracy and timeliness of defense decisions has therefore become an urgent technical problem.
Disclosure of Invention
Therefore, the invention provides an intelligent defense decision-making method and device based on reinforcement learning and attack and defense gaming that suit the actual attack and defense network environment, realize intelligent defense decision-making with online learning capability, and offer strong practicability and operability.
According to the design scheme provided by the invention, the intelligent defense decision-making method based on reinforcement learning and attack and defense gaming comprises the following contents:
A) constructing an attack and defense game model under the constraint of bounded rationality and generating an attack and defense graph for extracting the network states and attack and defense actions of the game model, where the attack and defense graph is host-centric, its nodes yield the network states and its edges describe the attack and defense actions;
B) based on the network states and attack and defense actions, performing reinforcement learning on the attack and defense game process relying on the attack and defense game model, so that the boundedly rational defender, guided by system feedback during the confrontation between the two sides, automatically selects the optimal defense strategy when facing different attackers.
In the above description, in A), the attack and defense game model is represented by a six-tuple, i.e., AD-SGM = (N, S, D, R, Q, π), where N denotes the players participating in the game, S the set of stochastic game states, D the defender's action set, R the defender's immediate reward, Q the defender's state-action payoff function, and π the defender's defense strategy.
In the above description, the attack and defense graph is represented by a two-tuple, i.e., G = (S, E), where S denotes the set of node security states and E denotes the node-state transitions caused by attack or defense actions.
Preferably, when the attack and defense graph is generated, the target network is scanned to obtain the network security elements, attack instantiation is then performed by combining them with an attack template, defense instantiation is performed by combining the attack template with a defense template, and the attack and defense graph is finally generated, wherein the state set of the attack and defense game model is extracted from the nodes of the attack and defense graph and the defense action set is extracted from its edges.
In the above step B), the reinforcement learning adopts the model-free Win or Learn Fast policy hill-climbing (WoLF-PHC) mechanism: knowledge of rewards and environment state transitions is acquired through interaction with the environment and expressed as payoffs, the defender's policy learning rate is set so as to adapt to the attacker's strategy, learning proceeds by updating the payoffs, and the defender's optimal defense strategy is thereby determined.
Preferably, the payoff is updated as
Q_d(s,d) ← (1-α)Q_d(s,d) + α[R_d(s,d,s') + γ max_{d'} Q_d(s',d')]
The strategy of reinforcement learning is as follows:
π_d(s,d) ← π_d(s,d) + Δ_{sd}, where Δ_{sd} = -δ_{sd} if d ≠ argmax_{d'} Q_d(s,d') and Δ_{sd} = Σ_{d'≠d} δ_{sd'} otherwise, with δ_{sd} = min(π_d(s,d), δ/(|D_s|-1))
where α is the gain learning rate, γ is the discount factor, and R_d(s,d,s') denotes the defender's immediate reward when the network transitions to state s' after defense action d is executed in state s.
Furthermore, the average strategy is adopted as the criterion for winning or losing: the defender is judged to be winning when Σ_d π_d(s,d)Q_d(s,d) > Σ_d π̄_d(s,d)Q_d(s,d) and losing otherwise, the average strategy π̄_d being updated as π̄_d(s,d) ← π̄_d(s,d) + (π_d(s,d) - π̄_d(s,d))/C(s), where C(s) counts the visits to state s.
furthermore, in the model-free reinforcement learning mechanism, an eligibility trace for tracking the recently visited state-action track is introduced, the current reward is distributed to the recently visited state-action, and the earnings are updated by utilizing the eligibility trace.
Furthermore, in the reinforcement learning the eligibility trace of each state-action pair is defined as e(s,a). With the current network state denoted s, the trace is updated as
e(s,a) ← γλe(s,a) + 1 for the state-action pair just executed, and e(s',a') ← γλe(s',a') for all other pairs,
so that the current reward is assigned to the most recently visited state-action pairs, where γ is the discount factor and λ is the trace decay factor.
Furthermore, an intelligent defense decision-making device based on reinforcement learning and attack and defense gaming comprises:
the attack and defense graph generation module, configured to construct an attack and defense game model under the bounded-rationality constraint and to generate an attack and defense graph for extracting the network states and attack and defense actions of the game model, the attack and defense graph being host-centric, its nodes yielding the network states and its edges describing the attack and defense actions;
the defense strategy selection module, configured to perform reinforcement learning on the attack and defense game process based on the network states and attack and defense actions in combination with the attack and defense game model, so that the boundedly rational defender, using environmental feedback during the confrontation between the two sides, automatically selects the optimal defense strategy when facing different attackers.
The invention has the beneficial effects that:
in the invention, a host-centric attack and defense graph model is used to extract the network states and attack and defense actions, which effectively compresses the game state space; the defender adopts a reinforcement learning mechanism and learns from environmental feedback during the confrontation with the attacker, so that the boundedly rational defender automatically makes the optimal choice when facing different attackers; and adding the eligibility trace to the decision procedure increases the defender's learning speed, reduces the dependence on historical data, and effectively improves the real-time performance and intelligence of defense decision-making.
Description of the drawings:
FIG. 1 is a schematic diagram of an embodiment of an intelligent defense decision flow;
FIG. 2 is a schematic diagram illustrating an exemplary attack/defense state transition;
FIG. 3 is a schematic diagram of an embodiment reinforcement learning mechanism;
FIG. 4 shows an experimental network structure in an example;
FIG. 5 is a diagram illustrating exemplary network vulnerability information;
FIG. 6 is the attack graph in the example;
FIG. 7 is the defense graph in the example;
FIG. 8 is a description of the defense actions in the embodiment;
FIG. 9 is a table of the parameter settings used in the experiments;
FIG. 10 is an embodiment defense decision situation diagram;
FIG. 11 is a defense revenue situation diagram in an embodiment.
Detailed description of the embodiments:
in order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described in detail below with reference to the accompanying drawings and technical solutions. The technical terms involved in the examples are as follows:
Reinforcement learning is a classical online learning method in which the participant learns independently from environmental feedback; compared with a biology-style evolutionary learning mode, it offers faster learning and better timeliness. The non-cooperative relationship, adversarial goals and strategy interdependence of games match the basic characteristics of network attack and defense. As shown in FIG. 1, an embodiment of the invention provides an intelligent defense decision method based on reinforcement learning and the attack and defense game, comprising the following steps:
constructing an attack and defense game model under the constraint of bounded rationality and generating an attack and defense graph for extracting the network states and attack and defense actions of the game model, where the attack and defense graph is host-centric, its nodes yield the network states and its edges describe the attack and defense actions;
performing reinforcement learning on the attack and defense game process based on the network states and attack and defense actions and relying on the attack and defense game model, so that the boundedly rational defender automatically selects the optimal defense strategy against different attackers according to system feedback during the confrontation between the two sides.
Dynamic threat tracking and analysis based on the attribute attack graph has obvious advantages in attack-path inference, threat transition probability, context inference, loop elimination, real-time analysis, multi-path synthesis, privilege escalation and access relations.
A reinforcement learning mechanism is introduced into the attack and defense game: an attack and defense game model is constructed under the bounded-rationality constraint, a host-centric attack and defense graph is generated for extracting the network states and attack and defense actions of the game model, and online, real-time, automatic defense decision-making is realized through reinforcement learning.
The network attack and defense game model uses probability values to describe the randomness of network state transitions. Because the current network state depends mainly on the previous network state, the state-transition relationship is expressed as a first-order Markov process; as shown in FIG. 2, the transition probability is P(s_t, a_t, d_t, s_{t+1}), where s is the network state and (a, d) are the attack and defense actions. Since the two sides have adversarial goals and a non-cooperative relationship, each deliberately hides its key information, so the transition probabilities are treated as information unknown to both sides. The game model is constructed on this basis. In another embodiment of the invention, the attack and defense stochastic game model is represented by a six-tuple AD-SGM = (N, S, D, R, Q, π), where N = (attacker, defender) are the two players participating in the game, representing the network attacker and defender respectively; S = (s_1, s_2, …, s_n) is the set of stochastic game states, composed of network states; D = (D_1, D_2, …, D_n) is the set of defense actions, where D_k = {d_1, d_2, …, d_m} is the defender's action set in game state s_k; R_d(s_i, d, s_j) is the defender's immediate reward when the network transitions to state s_j after the defender executes defense action d in state s_i; Q_d(s_i, d) is the defender's expected payoff after taking action d in state s_i; and π_d(s_k) is the defender's defense strategy in state s_k.
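As a concrete illustration only, the AD-SGM six-tuple could be held in memory as sketched below; the Python names (ADSGM, init_no_prior) and the dictionary layouts are assumptions of this sketch, not definitions from the patent.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class ADSGM:
    """Minimal container for the attack and defense stochastic game model
    AD-SGM = (N, S, D, R, Q, pi). Transition probabilities are deliberately
    not stored, since they are assumed unknown to both players."""
    players: tuple          # N, e.g. ("attacker", "defender")
    states: list            # S = [s1, ..., sn]
    actions: dict           # D: state -> list of defense actions D_k
    reward: dict            # R_d[(s_i, d, s_j)] -> immediate reward
    q: dict = field(default_factory=dict)       # Q_d[(s, d)] -> expected payoff
    policy: dict = field(default_factory=dict)  # pi_d[s] -> probability vector over actions[s]

    def init_no_prior(self):
        """Zero payoffs and a uniform (average) strategy, i.e. no prior knowledge."""
        for s in self.states:
            self.policy[s] = np.ones(len(self.actions[s])) / len(self.actions[s])
            for d in self.actions[s]:
                self.q[(s, d)] = 0.0

The zero payoffs and uniform strategy produced by init_no_prior mirror the initialization used later in the experimental scenario.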
A defense strategy and a defense action are two different concepts: a defense strategy is a rule over defense actions that specifies which action the defender chooses in each network state. For example, expressed as a probability vector, π_d(s_k) = (π_d(s_k,d_1), …, π_d(s_k,d_m)) is the defender's strategy in network state s_k, where π_d(s_k,d_m) is the probability of selecting action d_m and
Σ_{i=1}^{m} π_d(s_k,d_i) = 1.
By creating the network attack and defense graph G, the network states are extracted from the nodes of G, the attack and defense actions are analysed from the edges of G, and the attack and defense strategies are extracted. In another embodiment of the invention, the attack and defense graph is represented as a two-tuple G = (S, E), where S = {s_1, s_2, …, s_n} is the set of node security states; each s_i uniquely identifies a node and records the privilege held on it: no privilege, normal user privilege, or administrator privilege. E = (E_a, E_d) is the set of directed edges, indicating that the occurrence of an attack or defense action causes a transition of the node state; e_k = (s_r, v/d, s_d), k ∈ {a, d}, where s_r is the source node, s_d is the destination node, and v/d is the attack (vulnerability exploitation) or defense action labelling the edge.
Further, when the attack and defense graph is generated, the target network is scanned to obtain the network security elements, attack instantiation is then performed by combining them with an attack template, defense instantiation is performed by combining the attack template with a defense template, and the attack and defense graph is finally generated. The state set of the attack and defense stochastic game model is extracted from the nodes of the attack and defense graph, and the defense action set is extracted from its edges. The specific steps can be designed as shown in Algorithm 1:
algorithm 1. attack and defense graph generation algorithm
(The pseudocode of Algorithm 1 is provided as an image in the original document.)
Step 1) generates all possible state nodes from the network security elements and initializes the edges; steps 2) to 11) perform attack instantiation and generate all attack edges; steps 12) to 18) perform defense instantiation and generate all defense edges; steps 19) to 23) remove all isolated nodes; step 24) outputs the attack and defense graph.
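Purely for illustration, the flow of Algorithm 1 might be sketched as follows; the patent specifies the steps only at the level just described, so the input formats (security_elements, attack_template, defense_template) and the matching rules inside the loops are assumptions of this sketch, not the patent's algorithm.

import itertools

def generate_attack_defense_graph(security_elements, attack_template, defense_template):
    """Rough sketch of Algorithm 1: build host-centric state nodes from the scanned
    security elements, instantiate attack and defense edges from the templates,
    and drop isolated nodes. Node = (host, privilege); edge = (src, action, dst)."""
    # Step 1): all candidate state nodes, empty edge sets.
    privileges = ("none", "user", "root")
    nodes = {(h, p) for h in security_elements["hosts"] for p in privileges}
    attack_edges, defense_edges = set(), set()

    # Steps 2)-11): attack instantiation. An attack rule is assumed to fire when its
    # vulnerability exists on the destination host and the privilege levels match.
    for rule in attack_template:
        for src, dst in itertools.product(nodes, nodes):
            if (rule["vuln"] in security_elements["vulns"].get(dst[0], [])
                    and src[1] == rule["pre_priv"] and dst[1] == rule["post_priv"]):
                attack_edges.add((src, rule["attack"], dst))

    # Steps 12)-18): defense instantiation. Each defense listed against an
    # instantiated attack is modelled here as restoring the pre-attack state.
    for (src, attack, dst) in attack_edges:
        for defense in defense_template.get(attack, []):
            defense_edges.add((dst, defense, src))

    # Steps 19)-23): remove isolated nodes.
    used = {n for e in attack_edges | defense_edges for n in (e[0], e[2])}
    nodes &= used

    # Step 24): output the attack and defense graph G = (S, E).
    return nodes, (attack_edges, defense_edges)

The returned node set and edge sets correspond to the state set S and the action sets D of the AD-SGM model.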
In the embodiment of the invention, a reinforcement learning mechanism is introduced into the attack and defense game to describe how the attack and defense strategies are learned and improved. WoLF-PHC is a typical model-free reinforcement learning algorithm; its learning mechanism is shown in FIG. 3. In another embodiment of the invention, during reinforcement learning the Agent acquires knowledge of rewards and environment state transitions through interaction with the environment; this knowledge is expressed by the payoff Q_d, and learning is performed by updating Q_d. The payoff function Q_d is:
Q_d(s,d) ← (1-α)Q_d(s,d) + α[R_d(s,d,s') + γ max_{d'} Q_d(s',d')]    (1)
In formula (1), α is the gain learning rate and γ is the discount factor. The strategy of reinforcement learning is as follows:
π_d(s,d) ← π_d(s,d) + Δ_{sd}, where Δ_{sd} = -δ_{sd} if d ≠ argmax_{d'} Q_d(s,d') and Δ_{sd} = Σ_{d'≠d} δ_{sd'} otherwise, with δ_{sd} = min(π_d(s,d), δ/(|D_s|-1))    (2)
Further, WoLF-PHC (Win or Learn Fast policy hill climbing) gives the defender two different policy learning rates through the WoLF mechanism: a low learning rate δ_w is used when winning and a high learning rate δ_l when losing, as shown in formula (5). The two rates let the defender adapt quickly to the attacker's strategy when it performs worse than expected and learn cautiously when it performs better than expected, while the convergence of the algorithm is preserved. The WoLF-PHC algorithm adopts the average strategy as the criterion for winning or losing, as shown in formulas (6) and (7).
δ = δ_w when Σ_d π_d(s,d)Q_d(s,d) > Σ_d π̄_d(s,d)Q_d(s,d), and δ = δ_l otherwise    (5)
π̄_d(s,d) ← π̄_d(s,d) + (π_d(s,d) - π̄_d(s,d))/C(s)    (6)
C(s)=C(s)+1 (7)
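For readers who prefer code, the update rules (1)-(2) and (5)-(7) can be sketched as follows; the function name wolf_phc_update, the dictionary-based containers and the arbitrary tie-breaking in the argmax are assumptions of this sketch rather than the patent's prescription.

import numpy as np

def wolf_phc_update(q, policy, avg_policy, counts, s, d, r, s_next,
                    actions, alpha, gamma, delta_w, delta_l):
    """One WoLF-PHC step for the defender: payoff update (1), win/lose test (5),
    average-strategy update (6)-(7) and policy hill climbing (2)."""
    acts = actions[s]

    # (1): payoff update with the sampled immediate reward r = R_d(s, d, s').
    q_next = max(q[(s_next, d2)] for d2 in actions[s_next])
    q[(s, d)] = (1 - alpha) * q[(s, d)] + alpha * (r + gamma * q_next)

    # (6)-(7): update the visit counter C(s) and the average strategy.
    counts[s] += 1
    avg_policy[s] += (policy[s] - avg_policy[s]) / counts[s]

    # (5): pick the policy learning rate, low when winning, high when losing.
    q_vec = np.array([q[(s, d2)] for d2 in acts])
    winning = policy[s] @ q_vec > avg_policy[s] @ q_vec
    delta = delta_w if winning else delta_l

    # (2): policy hill climbing towards the currently greedy defense action.
    if len(acts) > 1:
        best = int(np.argmax(q_vec))
        step = np.minimum(policy[s], delta / (len(acts) - 1))
        change = -step
        change[best] = step.sum() - step[best]
        policy[s] = np.clip(policy[s] + change, 0.0, None)  # numerical safety only
        policy[s] /= policy[s].sum()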
In order to improve the learning speed of WoLF-PHC and reduce its dependence on the amount of data, in another embodiment of the invention an eligibility trace is introduced into WoLF-PHC. The eligibility trace tracks the most recently visited state-action trajectory and assigns the current reward to the recently visited state-action pairs. Further, the eligibility trace of each state-action pair is defined as e(s,a); with the current network state denoted s, the trace is updated as shown in formula (8), where λ is the trace decay factor.
e(s,a) ← γλe(s,a) + 1 for the state-action pair just executed, and e(s',a') ← γλe(s',a') for all other state-action pairs    (8)
In order to obtain good results with the WoLF-PHC-based defense decision method, the four parameters α, δ, λ and γ must be set reasonably. 1) The gain learning rate α lies in the range 0 < α < 1: a larger α gives more weight to the accumulated reward and speeds up learning, while a smaller α gives the algorithm better stability. 2) The policy learning rate δ lies in the range 0 < δ < 1; experiments show that better results are obtained when the losing-case rate δ_l is kept larger than the winning-case rate δ_w (the specific relation used is given as a formula image in the original document). 3) The eligibility-trace decay factor λ lies in the range 0 < λ < 1; it is responsible for assigning credit to states and actions and can be regarded as a time scale, with a larger λ assigning more credit to historical states and actions. 4) The discount factor γ lies in the range 0 < γ < 1 and expresses the defender's preference between immediate and future rewards: γ close to 0 de-emphasizes future rewards and stresses the immediate reward, while γ close to 1 de-emphasizes the immediate reward and stresses future rewards.
As shown in FIG. 3, the Agent in WoLF-PHC corresponds to the defender in the attack and defense stochastic game model AD-SGM, the Agent's state corresponds to the game state in AD-SGM, the Agent's behaviour corresponds to the defense action in AD-SGM, the Agent's immediate reward corresponds to the immediate reward in AD-SGM, and the Agent's strategy corresponds to the defense strategy in AD-SGM. On this basis, a specific defense decision algorithm can be designed as shown in Algorithm 2:
algorithm 2. defense decision algorithm
(The pseudocode of Algorithm 2 is provided as images in the original document.)
Step 1) initializes the attack and defense stochastic game model AD-SGM and the related parameters, with the network states and attack and defense actions extracted by Algorithm 1; in step 2) the defender detects the current network state; steps 3)-22) perform defense decision-making and online learning, in which steps 4)-5) select a defense action according to the current strategy, steps 6)-14) update the payoff Q_d using the eligibility trace, and steps 15)-21) update the defense strategy π_d from the new payoff Q_d by hill climbing. The space complexity of the algorithm is dominated by storing R_d(s,d,s'), e(s,d), π_d(s,d), π̄_d(s,d) and Q_d(s,d); if |S| is the number of states and |D| is the number of defense actions per state, the space complexity is O(4|S||D| + |S|²|D|). The algorithm does not need to solve for a game equilibrium, which greatly reduces the computational complexity compared with existing stochastic game models and enhances the practicality of the algorithm.
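In the same illustrative spirit, one decision-and-learning step of Algorithm 2, combining the WoLF-PHC update above with the eligibility trace of formula (8), might look like the following sketch; the class name DefenseDecision and its interface are assumptions made for the example.

import numpy as np

class DefenseDecision:
    """Sketch of Algorithm 2: WoLF-PHC with eligibility traces for online defense
    decision-making. States, per-state action lists and rewards are supplied
    externally, e.g. extracted from the attack and defense graph by Algorithm 1."""

    def __init__(self, states, actions, alpha, gamma, lam, delta_w, delta_l):
        self.actions = actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.delta_w, self.delta_l = delta_w, delta_l
        self.q = {(s, d): 0.0 for s in states for d in actions[s]}   # Q_d
        self.e = {(s, d): 0.0 for s in states for d in actions[s]}   # eligibility traces
        self.pi = {s: np.ones(len(actions[s])) / len(actions[s]) for s in states}
        self.avg_pi = {s: self.pi[s].copy() for s in states}         # average strategy
        self.c = {s: 0 for s in states}                              # visit counts C(s)

    def choose_action(self, s):
        # Steps 4)-5): sample a defense action from the current mixed strategy.
        return np.random.choice(self.actions[s], p=self.pi[s])

    def learn(self, s, d, r, s_next):
        # Steps 6)-14): payoff update with the eligibility trace of formula (8).
        best_next = max(self.q[(s_next, d2)] for d2 in self.actions[s_next])
        td = r + self.gamma * best_next - self.q[(s, d)]
        for key in self.e:                       # decay every trace by gamma * lambda
            self.e[key] *= self.gamma * self.lam
        self.e[(s, d)] += 1.0                    # reinforce the pair just visited
        for key in self.e:                       # credit recently visited state-actions
            self.q[key] += self.alpha * td * self.e[key]

        # Steps 15)-21): WoLF-PHC strategy update, formulas (2) and (5)-(7).
        acts = self.actions[s]
        self.c[s] += 1
        self.avg_pi[s] += (self.pi[s] - self.avg_pi[s]) / self.c[s]
        q_vec = np.array([self.q[(s, d2)] for d2 in acts])
        delta = self.delta_w if self.pi[s] @ q_vec > self.avg_pi[s] @ q_vec else self.delta_l
        if len(acts) > 1:
            best = int(np.argmax(q_vec))
            step = np.minimum(self.pi[s], delta / (len(acts) - 1))
            change = -step
            change[best] = step.sum() - step[best]
            self.pi[s] = np.clip(self.pi[s] + change, 0.0, None)
            self.pi[s] /= self.pi[s].sum()

A driver loop would detect the current network state (step 2)), call choose_action, apply the selected defense, observe the next state and the immediate reward R_d(s,d,s'), and then call learn, repeating online.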
Based on the above intelligent defense decision method, an embodiment of the present invention further provides an intelligent defense decision device based on reinforcement learning and attack and defense gaming, including:
the attack and defense graph generation module, configured to construct an attack and defense game model under the bounded-rationality constraint and to generate an attack and defense graph for extracting the network states and attack and defense actions of the game model, the attack and defense graph being host-centric, its nodes yielding the network states and its edges describing the attack and defense actions;
the defense strategy selection module, configured to perform reinforcement learning on the attack and defense game process based on the network states and attack and defense actions in combination with the attack and defense game model, so that the boundedly rational defender, using environmental feedback during the confrontation between the two sides, automatically selects the optimal defense strategy when facing different attackers.
The intelligent defense decision method based on reinforcement learning and the attack and defense game described above is adopted to intelligently select the defense strategy for the target network.
In order to further verify the effectiveness of the technical scheme of the embodiments, experiments were carried out on the typical enterprise network shown in FIG. 4. The attack and defense events occur in the internal network, and the attacker comes from the external network; the network administrator, acting as the defender, is responsible for the security of the intranet. Owing to the configuration of firewall 1 and firewall 2, a normal extranet user can only access the Web server, while the Web server can access the database server, the FTP server and the e-mail server. The experimental network was scanned with the Nessus tool, and its vulnerability information is shown in FIG. 5.
An attack and defense template was constructed with reference to the MIT Lincoln Laboratory attack and defense behaviour database, with the attacker host identified together with the Web server (W), the database server (D), the FTP server (F) and the e-mail server (E). The network attack and defense graph was built with the attack and defense graph generation device and, for convenience of presentation and description, is split into an attack graph and a defense graph, shown in FIG. 6 and FIG. 7 respectively. The defense actions in the defense graph are described in FIG. 8. The attack and defense game model of the experimental scenario is constructed as follows:
① The players participating in the game are N = (attacker, defender), representing the network attacker and the network defender respectively;
② The stochastic game state set is S = (s_0, s_1, s_2, s_3, s_4, s_5, s_6); the stochastic game states are composed of network states and are extracted from the nodes in FIG. 6 and FIG. 7;
③ The defender's action set is D = (D_0, D_1, D_2, D_3, D_4, D_5, D_6), where D_0 = {NULL}, D_1 = {d_1, d_2}, D_2 = {d_3, d_4}, D_3 = {d_1, d_5, d_6}, D_4 = {d_1, d_5, d_6}, D_5 = {d_1, d_2, d_7}, D_6 = {d_3, d_4}, extracted from the edges of FIG. 7;
④ The quantized immediate rewards R_d(s_i, d, s_j) of the defender are:
(R_d(s_0,NULL,s_0), R_d(s_0,NULL,s_1), R_d(s_0,NULL,s_2)) = (0, -40, -59)
(R_d(s_1,d_1,s_0), R_d(s_1,d_1,s_1), R_d(s_1,d_1,s_2); R_d(s_1,d_2,s_0), R_d(s_1,d_2,s_1), R_d(s_1,d_2,s_2)) = (40, 0, -29; 5, -15, -32)
(R_d(s_2,d_3,s_0), R_d(s_2,d_3,s_1), R_d(s_2,d_3,s_2), R_d(s_2,d_3,s_3), R_d(s_2,d_3,s_4), R_d(s_2,d_3,s_5); R_d(s_2,d_4,s_0), R_d(s_2,d_4,s_1), R_d(s_2,d_4,s_2), R_d(s_2,d_4,s_3), R_d(s_2,d_4,s_4), R_d(s_2,d_4,s_5)) = (24, 9, -15, -55, -49, -65; 19, 5, -21, -61, -72, -68)
(R_d(s_3,d_1,s_2), R_d(s_3,d_1,s_3), R_d(s_3,d_1,s_6); R_d(s_3,d_5,s_2), R_d(s_3,d_5,s_3), R_d(s_3,d_5,s_6); R_d(s_3,d_6,s_2), R_d(s_3,d_6,s_3), R_d(s_3,d_6,s_6)) = (21, -16, -72; 15, -23, -81; -21, -36, -81)
(R_d(s_4,d_1,s_2), R_d(s_4,d_1,s_4), R_d(s_4,d_1,s_6); R_d(s_4,d_5,s_2), R_d(s_4,d_5,s_4), R_d(s_4,d_5,s_6); R_d(s_4,d_6,s_2), R_d(s_4,d_6,s_4), R_d(s_4,d_6,s_6)) = (26, 0, -62; 11, -23, -75; 9, -25, -87)
(R_d(s_5,d_1,s_2), R_d(s_5,d_1,s_5), R_d(s_5,d_1,s_6); R_d(s_5,d_2,s_2), R_d(s_5,d_2,s_5), R_d(s_5,d_2,s_6); R_d(s_5,d_7,s_2), R_d(s_5,d_7,s_5), R_d(s_5,d_7,s_6)) = (29, 0, -63; 11, -21, -76; 2, -27, -88)
(R_d(s_6,d_3,s_3), R_d(s_6,d_3,s_4), R_d(s_6,d_3,s_5), R_d(s_6,d_3,s_6); R_d(s_6,d_4,s_3), R_d(s_6,d_4,s_4), R_d(s_6,d_4,s_5), R_d(s_6,d_4,s_6)) = (-23, -21, -19, -42; -28, -31, -24, -49)
⑤ To test the learning performance of the algorithm more fully, the defender's state-action payoff Q_d(s_i, d) is initialized to 0, so that no additional prior knowledge is introduced.
⑥ The defender's defense strategy π_d is initialized with the average strategy, i.e. π_d(s_k, d_1) = π_d(s_k, d_2) = … = π_d(s_k, d_m) with Σ_{i=1}^{m} π_d(s_k, d_i) = 1, so that, again, no additional prior knowledge is introduced.
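For illustration, the experimental model above could be wired into the DefenseDecision sketch from Algorithm 2 as follows; only the rewards of states s_0 and s_1 are shown, and the numeric parameter values below are placeholders, not the settings listed in FIG. 9.

states = ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]
actions = {"s0": ["NULL"], "s1": ["d1", "d2"], "s2": ["d3", "d4"],
           "s3": ["d1", "d5", "d6"], "s4": ["d1", "d5", "d6"],
           "s5": ["d1", "d2", "d7"], "s6": ["d3", "d4"]}
# Immediate rewards R_d(s_i, d, s_j); only the s0 and s1 entries are reproduced here.
# A simulated environment would look these up when supplying r to learn().
reward = {("s0", "NULL", "s0"): 0, ("s0", "NULL", "s1"): -40, ("s0", "NULL", "s2"): -59,
          ("s1", "d1", "s0"): 40, ("s1", "d1", "s1"): 0, ("s1", "d1", "s2"): -29,
          ("s1", "d2", "s0"): 5, ("s1", "d2", "s1"): -15, ("s1", "d2", "s2"): -32}
# Q_d starts at 0 and pi_d at the average strategy inside DefenseDecision, matching
# the "no prior knowledge" initialization of the embodiment; the parameter values
# below are illustrative placeholders only.
agent = DefenseDecision(states, actions, alpha=0.2, gamma=0.9,
                        lam=0.5, delta_w=0.01, delta_l=0.04)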
To test the influence of different parameter settings on the algorithm, state s_2 in FIG. 6 and FIG. 7 is taken as an example, and the attacker's initial strategy in the experiment is a random strategy. To analyse how different parameter values affect the speed and quality of learning, six different parameter settings were tested; the specific settings are shown in FIG. 9.
The probabilities with which the defender selects defense actions d_3 and d_4 in state s_2 are shown in FIG. 10, from which the learning speed and convergence of the algorithm under the different parameter settings can be observed. FIG. 10 shows that settings 1, 3 and 6 learn quickly: under these three settings the algorithm obtains the optimal strategy within 1500 decisions. However, the convergence of settings 3 and 6 is poor: although they learn the optimal strategy, oscillations appear later, and their stability is not as good as that of setting 1.
The defense payoff reflects how well the algorithm optimizes the strategy. To avoid judging from a single defense outcome, the defense payoff is averaged over every 1000 decisions, and the evolution of this average payoff is shown in FIG. 11. FIG. 11 shows that the payoff of setting 3 is clearly lower than the others, while the remaining settings are hard to tell apart. Overall, setting 1 among the six parameter sets is the most suitable for this scenario.
To measure the computational overhead introduced by the eligibility trace, the time taken by the algorithm for 100,000 defense decisions was recorded with and without the eligibility trace, each repeated 20 times; the averages were 9.51 s with the eligibility trace and 3.74 s without it. Although introducing the eligibility trace increases the decision time by nearly a factor of 2.5, 100,000 decisions still take only 9.51 s, which satisfies the real-time requirement.
The experiments above further verify that constructing an attack and defense stochastic game model under the bounded-rationality constraint and generating a network attack and defense graph for extracting the network states and attack and defense strategies effectively compresses the game state space; through learning, the defender obtains the optimal defense strategy for the current attack, its capability for rapid automatic defense against unknown attacks is improved, and the scheme has strong practicability and operability.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The elements of the various examples and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and the components and steps of the examples have been described in a functional generic sense in the foregoing description for clarity of hardware and software interchangeability. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Those skilled in the art will appreciate that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, such as: read-only memory, magnetic or optical disk, and the like. Alternatively, all or part of the steps of the foregoing embodiments may also be implemented by using one or more integrated circuits, and accordingly, each module/unit in the foregoing embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. An intelligent defense decision-making method based on reinforcement learning and attack and defense gaming is characterized by comprising the following contents:
A) constructing an attack and defense game model under the constraint of bounded rationality and generating an attack and defense graph for extracting the network states and attack and defense actions of the game model, where the attack and defense graph is host-centric, its nodes yield the network states and its edges describe the attack and defense actions;
B) performing reinforcement learning on the attack and defense game process based on the network states and attack and defense actions and relying on the attack and defense game model, so that the boundedly rational defender automatically selects the optimal defense strategy against different attackers according to system feedback during the confrontation between the two sides;
in B), the reinforcement learning adopts the model-free Win or Learn Fast policy hill-climbing (WoLF-PHC) mechanism: knowledge of rewards and environment state transitions is acquired through interaction with the environment and expressed as payoffs, the defender's policy learning rate is set so as to adapt to the attacker's strategy, reinforcement learning proceeds by updating the payoffs, and the defender's optimal defense strategy is determined;
the payoff being updated as
Q_d(s,d) ← (1-α)Q_d(s,d) + α[R_d(s,d,s') + γ max_{d'} Q_d(s',d')]
The strategy of reinforcement learning is as follows:
π_d(s,d) ← π_d(s,d) + Δ_{sd}, where Δ_{sd} = -δ_{sd} if d ≠ argmax_{d'} Q_d(s,d') and Δ_{sd} = Σ_{d'≠d} δ_{sd'} otherwise, with δ_{sd} = min(π_d(s,d), δ/(|D_s|-1))
where α is the gain learning rate, γ is the discount factor, and R_d(s,d,s') denotes the defender's immediate reward when the network transitions to state s' after defense action d is executed in state s.
2. The intelligent defense decision method based on reinforcement learning and attack and defense game as claimed in claim 1, wherein in A) the attack and defense game model is represented by a six-tuple, i.e., AD-SGM = (N, S, D, R, Q, π), where N denotes the players participating in the game, S the set of stochastic game states, D the defender's action set, R the defender's immediate reward, Q the defender's state-action payoff function, and π the defender's defense strategy.
3. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 1, characterized in that the attack and defense graph is represented by a two-tuple, i.e., G = (S, E), where S denotes the set of node security states and E denotes the node-state transitions caused by attack or defense actions.
4. The intelligent defense decision method based on reinforcement learning and attack and defense gaming of claim 3, characterized in that, when the attack and defense graph is generated, the network security elements are obtained by scanning the target network, attack instantiation is then performed in combination with the attack template, defense instantiation is performed in combination with the defense template, and the attack and defense graph is finally generated, wherein the state set of the attack and defense gaming model is extracted from the nodes of the attack and defense graph and the defense action set is extracted from its edges.
5. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 1, characterized in that the average strategy is adopted as the criterion for winning or losing, expressed as: the defender is winning when Σ_d π_d(s,d)Q_d(s,d) > Σ_d π̄_d(s,d)Q_d(s,d) and losing otherwise, the average strategy being updated as π̄_d(s,d) ← π̄_d(s,d) + (π_d(s,d) - π̄_d(s,d))/C(s) with C(s) ← C(s) + 1 counting the visits to state s.
6. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 1, characterized in that, in the model-free reinforcement learning mechanism, an eligibility trace that tracks the most recently visited state-action trajectory is introduced, the current reward is distributed over the recently visited state-action pairs, and the payoff is updated using the eligibility trace.
7. The intelligent defense decision method based on reinforcement learning and attack and defense gaming according to claim 6, characterized in that, in the reinforcement learning, the eligibility trace of each state-action pair is defined as e(s,a) and, with the current network state denoted s, the trace is updated as e(s,a) ← γλe(s,a) + 1 for the state-action pair just executed and e(s',a') ← γλe(s',a') for all other pairs, so that the current reward is assigned to the most recently visited state-action pairs, where γ is the discount factor and λ is the trace decay factor.
8. An intelligent defense decision-making device based on reinforcement learning and attack and defense games, characterized in that the intelligent defense decision-making method based on reinforcement learning and attack and defense games according to any one of claims 1 to 7 is adopted to intelligently select the defense strategy of a target network.
CN201910292304.2A 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game Active CN110166428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910292304.2A CN110166428B (en) 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910292304.2A CN110166428B (en) 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game

Publications (2)

Publication Number Publication Date
CN110166428A CN110166428A (en) 2019-08-23
CN110166428B true CN110166428B (en) 2021-05-07

Family

ID=67639176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910292304.2A Active CN110166428B (en) 2019-04-12 2019-04-12 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game

Country Status (1)

Country Link
CN (1) CN110166428B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659492B (en) * 2019-09-24 2021-10-15 北京信息科技大学 Multi-agent reinforcement learning-based malicious software detection method and device
CN111988415B (en) * 2020-08-26 2021-04-02 绍兴文理学院 Mobile sensing equipment calculation task safety unloading method based on fuzzy game
CN112221160B (en) * 2020-10-22 2022-05-17 厦门渊亭信息科技有限公司 Role distribution system based on random game
CN113132398B (en) * 2021-04-23 2022-05-31 中国石油大学(华东) Array honeypot system defense strategy prediction method based on Q learning
CN113810406B (en) * 2021-09-15 2023-04-07 浙江工业大学 Network space security defense method based on dynamic defense graph and reinforcement learning
CN114844668A (en) * 2022-03-17 2022-08-02 清华大学 Defense resource configuration method, device, equipment and readable medium
CN115296850A (en) * 2022-07-08 2022-11-04 中电信数智科技有限公司 Network attack and defense exercise distributed learning method based on artificial intelligence
CN115348064B (en) * 2022-07-28 2023-09-26 南京邮电大学 Dynamic game-based power distribution network defense strategy design method under network attack
CN116032653A (en) * 2023-02-03 2023-04-28 中国海洋大学 Method, device, equipment and storage medium for constructing network security game strategy
CN116708042B (en) * 2023-08-08 2023-11-17 中国科学技术大学 Strategy space exploration method for network defense game decision

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014100738A1 (en) * 2012-12-21 2014-06-26 InsideSales.com, Inc. Instance weighted learning machine learning model
CN104994569A (en) * 2015-06-25 2015-10-21 厦门大学 Multi-user reinforcement learning-based cognitive wireless network anti-hostile interference method
CN107135224A (en) * 2017-05-12 2017-09-05 中国人民解放军信息工程大学 Cyber-defence strategy choosing method and its device based on Markov evolutionary Games
CN108512837A (en) * 2018-03-16 2018-09-07 西安电子科技大学 A kind of method and system of the networks security situation assessment based on attacking and defending evolutionary Game
CN108809979A (en) * 2018-06-11 2018-11-13 中国人民解放军战略支援部队信息工程大学 Automatic intrusion response decision-making technique based on Q-learning


Also Published As

Publication number Publication date
CN110166428A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110166428B (en) Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
CN111966698B (en) Block chain-based trusted federation learning method, system, device and medium
CN107483486B (en) Network defense strategy selection method based on random evolution game model
CN107135224B (en) Network defense strategy selection method and device based on Markov evolution game
CN107566387B (en) Network defense action decision method based on attack and defense evolution game analysis
Song et al. Training genetic programming on half a million patterns: an example from anomaly detection
CN110460572A (en) Mobile target defence policies choosing method and equipment based on Markov signaling games
Zennaro et al. Modelling penetration testing with reinforcement learning using capture‐the‐flag challenges: Trade‐offs between model‐free learning and a priori knowledge
CN108809979A (en) Automatic intrusion response decision-making technique based on Q-learning
Huang et al. Markov differential game for network defense decision-making method
CN113505855B (en) Training method for challenge model
Chen et al. Marnet: Backdoor attacks against cooperative multi-agent reinforcement learning
CN113033822A (en) Antagonistic attack and defense method and system based on prediction correction and random step length optimization
CN110493262A (en) It is a kind of to improve the network attack detecting method classified and system
Chen et al. Smoothing matters: Momentum transformer for domain adaptive semantic segmentation
Zhang et al. Building robust ensembles via margin boosting
CN116582349A (en) Attack path prediction model generation method and device based on network attack graph
CN115580430A (en) Attack tree-pot deployment defense method and device based on deep reinforcement learning
Xenopoulos et al. Graph neural networks to predict sports outcomes
Li et al. Robust moving target defense against unknown attacks: A meta-reinforcement learning approach
CN116707870A (en) Defensive strategy model training method, defensive strategy determining method and equipment
CN116192424A (en) Method for attacking global data distribution in federation learning scene
Moskal et al. Simulating attack behaviors in enterprise networks
Guan et al. A Bayesian Improved Defense Model for Deceptive Attack in Honeypot-Enabled Networks
CN112583844A (en) Big data platform defense method for advanced sustainable threat attack

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant