CN115580430A - Attack tree honeypot deployment defense method and device based on deep reinforcement learning - Google Patents

Attack tree honeypot deployment defense method and device based on deep reinforcement learning

Info

Publication number
CN115580430A
CN115580430A (application CN202211054557.4A)
Authority
CN
China
Prior art keywords
attack
node
vulnerability
deployment
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211054557.4A
Other languages
Chinese (zh)
Inventor
陈晋音 (Chen Jinyin)
胡书隆 (Hu Shulong)
李晓豪 (Li Xiaohao)
宣琦 (Xuan Qi)
郑雅羽 (Zheng Yayu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211054557.4A priority Critical patent/CN115580430A/en
Publication of CN115580430A publication Critical patent/CN115580430A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433Vulnerability analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12Discovery or management of network topologies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/1491Countermeasures against malicious traffic using deception as countermeasure, e.g. honeypots, honeynets, decoys or entrapment

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an attack tree honeypot deployment defense method and device based on deep reinforcement learning. The method comprises: acquiring network topology information and constructing an attack tree accordingly; combining a convolutional neural network with the Q-Learning algorithm to create a DQN model and training an agent based on the DQN model, wherein the agent acts as the penetration attacker and the training objective is to generate the current optimal penetration attack path; taking all nodes on the current optimal penetration attack path as possibly-attacked nodes, and determining the vulnerable nodes by monitoring the traffic change of each possibly-attacked node; calculating the difficulty of attacking the vulnerability corresponding to each vulnerable node; obtaining the node with the highest attack probability from the traffic change of the vulnerable nodes and the difficulty of attacking their vulnerabilities, and deploying a honeypot at that node; and continuously updating the honeypot deployment so as to induce attackers to fall into the deployed honeypots, thereby completing the defense.

Description

Attack tree honeypot deployment defense method and device based on deep reinforcement learning
Technical Field
The invention belongs to the field of cyberspace security and deep-reinforcement-learning-based defense, and particularly relates to an attack tree honeypot deployment defense method and device based on deep reinforcement learning.
Background
Automated penetration testing converts the manual penetration testing process into an automated one that does not require human participation throughout, thereby reducing, to a certain extent, the labor and material cost of manual penetration. Early automated penetration testing methods mainly took the form of attack graphs, which model a system and how it is affected by specific vulnerabilities. In an attack graph, a node is typically a state of the system, determined by the current system configuration, i.e., operating system, permissions, network connections, etc., and the edges connecting the nodes are known vulnerabilities. Once the attack graph is constructed, the attacker's action sequence (such as exploiting a vulnerability) can be searched; the nodes the attacker passes through from the starting point to the end point, together with the action taken at each node, are called the attack path, and the attacker approaches the target system along this path. In addition, finding the attack path may incorporate path planning techniques based on artificial intelligence. The main problem with this approach, however, is that it requires prior knowledge of the complete network topology and the configuration of every machine, which is impractical from an attacker's perspective, and it also requires manually setting up a graph for each new system being evaluated.
The attack tree evolved from the attack graph. Methods based on the attack tree allow the actions taken in automated penetration testing to closely approximate those of penetration security experts, and they play an important role in the field of automated penetration testing. This way of modeling the security threats a given system may face was proposed by Schneier in 1999 and represents attacks on a target in the form of a tree structure. By analyzing the attack tree, the relationship between the different attack modes can be better understood. Similarly, it is feasible to apply Reinforcement Learning (RL) to analyze the attack tree, such as using Q-learning to find the attack path, but the action space and sample space that such methods can handle remain too small. Compared with reinforcement learning, Deep Reinforcement Learning (DRL) is better suited to analyzing attack trees: it combines deep learning with reinforcement learning and finds the optimal solution by trial and error, so it can perform penetration testing on attack trees built from larger-scale networks.
Honeypot technology is a trap technique in network defense that studies and learns the adversary's attack goals and attack means by attracting and luring attackers and recording their attack behaviors, so as to protect real service resources. Honeynet technology is derived from honeypot technology: a honeynet composed of multiple honeypots can carry out more efficient active defense. A honeynet consists of a group of honeypots deployed in a centralized manner by a decoy service module and is a highly interactive, research-oriented honeypot technique. It uses a number of pre-designed honeypot hosts to lure the attacker into attacking, so that the attacker mistakes the attack target for a real machine; this confuses the attacker while the attacker's behavior and situational information are captured, analyzed and evaluated. This mechanism is a very effective active defense mechanism.
However, traditional honeypot technology suffers from static configuration, fixed deployment and similar defects, and can easily be identified and bypassed by attackers, losing its decoy value. Therefore, how to improve the dynamicity and decoy capability of honeypots has become a key problem in the honeypot field. Similarly, a traditional honeynet requires physical machine deployment, so the deployment process is complex to implement, costly, and difficult to control in terms of traffic. One way to address this problem is to apply Deep Reinforcement Learning (DRL) to the construction of the attack tree, and then use a DRL algorithm to dynamically deploy honeypots according to the vulnerability and traffic information of each node of the attack tree, so as to make the honeypot deployment process intelligent.
Since its introduction, deep reinforcement learning has been one of the most closely watched directions in artificial intelligence, and with the rapid development and application of reinforcement learning, it has been widely used in robot control, game playing, computer vision, autonomous driving and other fields. Reinforcement Learning (RL) is an artificial intelligence optimization technique whose key advantage is that no environment model is needed to generate an attack strategy; the optimal policy is learned through interaction with the environment. Deep reinforcement learning makes full use of neural networks as the parametric structure and combines the perception capability of deep learning with the decision-making capability of reinforcement learning to optimize the policy. By deploying honeypots with deep reinforcement learning, the deployment position of the honeypots can be optimized while the attack tree is dynamically updated, and the penetration attacker is led into a honeypot, thereby achieving the purpose of defending against penetration attacks.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an attack tree honeypot deployment defense method and device based on deep reinforcement learning.
In order to achieve this purpose, the technical scheme of the invention is as follows: the first aspect of the embodiments of the invention provides an attack tree honeypot deployment defense method based on deep reinforcement learning, which specifically comprises the following steps:
S1, acquiring network topology information and constructing an attack tree accordingly;
S2, combining a convolutional neural network with the Q-Learning algorithm to create a DQN model, and training an agent based on the DQN model, wherein the agent acts as the penetration attacker and the training objective is to generate the current optimal penetration attack path;
S3, according to the current optimal penetration attack path obtained in step S2, taking all nodes on that path as possibly-attacked nodes, and determining the vulnerable nodes by monitoring the traffic change of each possibly-attacked node; calculating the difficulty of attacking the vulnerability corresponding to each vulnerable node based on the access vector, access complexity and authentication indexes; obtaining the node with the highest attack probability from the traffic change of the vulnerable nodes and the difficulty of attacking their vulnerabilities, and deploying a honeypot at that node; and continuously updating the deployment of honeypots, so that attackers are induced to fall into the deployed honeypots and the defense is completed.
Further, the acquiring of network topology information in step S1 includes: performing port scanning on the target network system with the vulnerability scanning tool Shodan, so as to obtain the IP address of each device, the operating system type of each device, the ports opened at each IP address, the list of services running on each port, the communication protocol used by each port, and other key information for constructing the real network topology, including the communication relationships among different subnets.
Further, the process of constructing the attack tree in step S1 is specifically: generating an attack tree with MulVAL based on the acquired network topology information, wherein the attack tree contains all information of the network topology nodes; this information includes the vulnerabilities present at each node and the operations executable according to the vulnerability exploitation relationships; the executable operations include penetration attack, lateral movement, obtaining user privileges, and escalating to Root privileges.
Further, the process of training the agent based on the DQN model in step S2 is specifically:
S201, updating the parameter θ of the current network in the DQN model in real time, copying the parameters of the current network into the target network every N iterations, and then updating the network parameters by minimizing the mean square error between the current Q value and the target Q value, where the mean square error function is:
$L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(Y_i - Q(s,a \mid \theta_i)\right)^2\right]$
where $Y_i = r + \gamma \max_{a'} Q\left(s', a' \mid \theta_i^-\right)$ is the target Q value, and
r represents the immediate reward of the DQN model;
calculating the network gradient according to the following formula, and updating the parameter θ of the current network:
$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(Y_i - Q(s,a \mid \theta_i)\right)\nabla_{\theta_i} Q(s,a \mid \theta_i)\right]$
S202, converting vulnerability information such as the vulnerable service, vulnerability attributes and vulnerability type corresponding to each network topology node of the attack tree into a simplified matrix, which is used as the state input of the DQN model trained in step S2. According to the Bellman optimality equation, as long as step S201 is iterated continuously, the target Q value approaches the current Q value, so that training is finally completed and the training objective is obtained, namely the current optimal penetration attack path is generated:
$Q^*(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s',a') \mid s,a\right]$
further, the calculation process of the instant prize r of the DQN model specifically comprises: the basic scoring and penetration feasibility scoring indexes of the vulnerability corresponding to each network topology node of the attack tree are combined with a general vulnerability scoring system to award different vulnerabilities which successfully penetrate different nodes, and the instant award r of the DQN model is the award Score vul The method comprises the following steps:
Figure BDA0003824446450000034
wherein baseScore represents the basal score and expletablitysscore represents the penetration feasibility score.
Further, the step S3 specifically includes the following steps:
S301, according to the current optimal attack path obtained by training in step S2, all nodes on that path are possibly-attacked nodes, and the possibly-attacked nodes are screened to obtain the vulnerable nodes;
S302, scoring the vulnerability of each vulnerability according to the vulnerability information corresponding to each node in the attack tree, combined with three indexes of the CVSS vulnerability scoring system, namely the access vector AV, the access complexity AC and the authentication AU; the ranges of the three indexes are (0, 1), and they are recorded as V, C and U respectively;
S303, calculating the difficulty D of attacking the vulnerability from the V, C and U indexes obtained in step S302, where the calculation formula is:
(the formula for D in terms of V, C and U is given as an image in the original document)
S304, combining the traffic change corresponding to the vulnerable nodes obtained in step S301 with the vulnerability analysis of the vulnerable nodes, thereby obtaining the node with the highest attack probability, and deploying a honeypot at that node;
S305, in the defense process of intelligently deploying honeypots, giving a positive reward r if the agent falls into a honeypot previously deployed in step S304, and giving a negative reward -r if it does not fall into a honeypot;
S306, storing the state transition process in the experience replay buffer as the training data set of the DQN model; sampling N training samples from the experience buffer, and using them to update and train the current Q network and the target Q network in the DQN model;
S307, repeating the training process of steps S301-S306 for the DQN model, continuously updating the deployment of the honeypots, and inducing penetration attackers to fall into the deployed honeypots, thereby achieving the purpose of defending against penetration attacks.
Further, the step S301 includes the following process of screening the possibly-attacked nodes: the generated attack tree is combined with a software defined network, the traffic change of each possibly-attacked node is monitored, the traffic matrices of each possibly-attacked node at the current time and the next time are calculated and compared dimension by dimension, and if the change in every dimension exceeds a self-defined threshold, the possibly-attacked node is regarded as a vulnerable node.
Further, the step S305 of giving a positive reward r if the agent falls into a honeypot previously deployed in step S304 and giving a negative reward -r if it does not fall into a honeypot includes:
if the agent falls into a honeypot set in advance, the agent is given a reward of 2D, with D as set in step S303; if the agent does not fall into the honeypot but bypasses it and performs a penetration attack on the hosts around the honeypot, the agent is given a negative reward of -2D.
A second aspect of the embodiments of the present invention provides an attack tree honeypot deployment defense device based on deep reinforcement learning, comprising one or more processors configured to implement the above attack tree honeypot deployment defense method based on deep reinforcement learning.
A third aspect of the embodiments of the present invention provides a computer-readable storage medium on which a program is stored, where the program, when executed by a processor, implements the above attack tree honeypot deployment defense method based on deep reinforcement learning.
The invention has the following beneficial effects: the method combines attack tree generation with honeypot deployment; while dynamically constructing the attack tree, it deploys virtual honeypot hosts in a combined dynamic and static manner according to the vulnerability information and traffic information of the tree nodes, uses deep reinforcement learning to optimize both the selection of the penetration path and the deployment of the honeypots, and prevents the target system from being attacked by leading the penetration attacker into a honeypot deployed according to the vulnerability information of the attack tree nodes and the traffic changes of the nodes, thereby improving the security of the network system and achieving the purpose of defending against penetration attacks.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention;
fig. 2 is a schematic diagram of the DQN algorithm structure used in the method of the invention;
FIG. 3 is a schematic view of the apparatus of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
The overall idea of the invention is as follows:
1) Construct the real network topology from the vulnerability information obtained by Shodan scanning, and build an attack tree with MulVAL to optimize the penetration path.
2) Model the penetration path optimization process as an MDP and train it with the DQN algorithm; the state input is the vulnerability information of each node, the reward is set from the base score and exploitability score of the different vulnerabilities according to CVSS, and the output action is the attack path selected by the DQN model.
3) Combine an SDN flow controller with the attack tree; after the optimal attack path is generated, detect the change of the traffic matrix at each node on the optimal attack path to infer the nodes that are easy to attack, combine this with the vulnerability analysis of the corresponding nodes to determine the node that is easiest to attack, and deploy a honeypot at that node.
4) Model the honeypot deployment process as an MDP; the state input is the traffic load of each node, and the reward is set according to whether the attacker falls into the honeypot: a double CVSS-score reward is given if the attacker falls into the honeypot, and a negative reward is given otherwise, so that the penetration attacker is trapped in the deployed honeypot and the purpose of defending against penetration attacks is achieved.
5) The method first combines the attack tree with SDN flow control, infers the vulnerable nodes from the penetration attack path obtained by training on the attack tree, and further infers them from the real-time changes of the traffic matrix at each node, so that honeypots are deployed on the attack tree in a combined dynamic and static manner.
The attack tree honeypot deployment defense method and device based on deep reinforcement learning of the invention are explained in detail below with reference to the accompanying drawings. The features of the following examples and embodiments may be combined with each other without conflict.
As shown in fig. 1 and fig. 2, an embodiment of the present invention provides an attack tree honeypot deployment defense method based on deep reinforcement learning, which specifically includes the following steps:
1) Acquiring network topology information to construct an attack tree:
1.1) Obtain network topology information: use the vulnerability scanning tool Shodan to perform port scanning on the target network system, so as to obtain the IP address of each device, the operating system type of each device, the ports opened at each IP address, the list of services running on each port, the communication protocol used by each port, the communication relationships among different subnets, and other key information for constructing the real network topology;
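This scanning step can be scripted; the following is a minimal sketch assuming the official shodan Python client and a valid API key (the key, the target IP list and the selection of fields are illustrative assumptions, not details taken from the patent):

```python
# Minimal sketch of the topology-information gathering step, assuming the
# "shodan" Python client and a valid API key; field names such as 'ports',
# 'os' and 'data' follow the Shodan host API, while the target IP list is
# hypothetical.
import shodan

API_KEY = "YOUR_SHODAN_API_KEY"   # placeholder
api = shodan.Shodan(API_KEY)

def collect_host_info(ip):
    """Query Shodan for one host and keep the fields used to build the topology."""
    host = api.host(ip)
    return {
        "ip": ip,
        "os": host.get("os"),                      # operating system type
        "ports": host.get("ports", []),            # open ports
        "services": [                              # per-port service / protocol information
            {
                "port": item.get("port"),
                "transport": item.get("transport"),  # communication protocol (tcp/udp)
                "product": item.get("product"),      # service running on the port
            }
            for item in host.get("data", [])
        ],
    }

topology_info = [collect_host_info(ip) for ip in ["192.0.2.10", "192.0.2.11"]]
```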
1.2) Construct the attack tree: feed the acquired network topology information to the MulVAL tool to generate the actual attack tree corresponding to the given network topology and to find all possible penetration paths for the given input topology. The attack tree contains all information of the network topology nodes, such as the vulnerabilities present at each node and the operations executable according to the vulnerability exploitation relationships, for example penetration attack and lateral movement, thereby obtaining user privileges and further escalating to Root privileges, among other related operations.
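For the later steps it is convenient to hold the MulVAL output in a graph structure. The sketch below assumes MulVAL has already been run and has produced its usual VERTICES.CSV and ARCS.CSV files; the file names and column order are assumptions about a typical MulVAL setup rather than details given in the patent:

```python
# Sketch of loading a MulVAL attack graph/tree into a graph structure for the
# later DQN steps (file names and column order are assumed, including the edge
# direction convention in ARCS.CSV).
import csv
import networkx as nx

def load_mulval_graph(vertices_csv="VERTICES.CSV", arcs_csv="ARCS.CSV"):
    g = nx.DiGraph()
    with open(vertices_csv, newline="") as f:
        for node_id, label, node_type, *_ in csv.reader(f):
            # 'label' carries the vulnerability / exploit description of the node
            g.add_node(int(node_id), label=label, type=node_type)
    with open(arcs_csv, newline="") as f:
        for dst, src, *_ in csv.reader(f):
            g.add_edge(int(src), int(dst))
    return g

attack_tree = load_mulval_graph()
```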
2) Combine a convolutional neural network with the Q-Learning algorithm to create a DQN model, and train an agent based on the DQN model, wherein the agent acts as the penetration attacker and the training objective is to generate the current optimal penetration attack path.
In the embodiment of the present invention, the DQN model is specifically as follows: a convolutional neural network is combined with the Q-Learning algorithm to create the DQN model, a representative value-based approach. Its input is the current state, which passes through the nonlinear transformations of 3 convolutional layers and 2 fully-connected layers, and the output layer finally produces a Q value for each action. The DQN model uses a target network mechanism: on top of the current value network, a target value network with the same structure is built to form the overall DQN framework. During training, the predicted Q value output by the current value network is used to select the action a, while the target value network is used to calculate the target Q value.
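A minimal PyTorch sketch of such a value network (3 convolutional layers followed by 2 fully-connected layers, one Q value per action) is given below; the layer widths, kernel sizes and state shape are illustrative assumptions, since the patent does not specify them:

```python
# Sketch of the current value network and its same-structure target network.
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, in_channels, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.LazyLinear(256), nn.ReLU(),   # fully-connected layer 1
            nn.Linear(256, n_actions),       # fully-connected layer 2: one Q value per action
        )

    def forward(self, state):
        return self.head(self.features(state))

current_net = QNetwork(in_channels=1, n_actions=16)   # action count is an assumption
target_net = copy.deepcopy(current_net)                # target network with the same structure
```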
In particular, compared with the Q-Learning algorithm, the DQN model alleviates the instability of representing the value function with a nonlinear network. For example, the DQN model uses an experience replay buffer to store transition samples. At each time step t, the transition sample obtained by the agent interacting with the environment is stored in the experience replay buffer. During training, a small batch of transition samples is randomly selected, and the stochastic gradient descent (SGD) algorithm is used to update the network parameters θ.
Specifically, compared with the Q-Learning algorithm, the DQN model also modifies the way the Q value is calculated. In the DQN model, $Q(s,a \mid \theta_i)$ represents the output of the current value network and is used to evaluate the value function of the current state-action pair, while $Q(s,a \mid \theta_i^-)$ represents the output of the target value network; in general,
$Y_i = r + \gamma \max_{a'} Q\left(s', a' \mid \theta_i^-\right)$
is used as the target Q value.
where s represents the current state, $\theta_i$ represents the parameters of the current network in the DQN model, s' represents the next state reached after taking action a, $\theta_i^-$ represents the parameters of the target network in the DQN model, and a' is a possible action in state s'. r represents the instantaneous reward of the DQN model. $\gamma$ is the discount factor; a larger discount factor places more emphasis on long-term return.
The process of training the agent based on the DQN model specifically comprises the following steps:
2.1) The parameters θ of the current network in the DQN model are updated in real time, and the parameters of the current network are copied into the target network every N iterations. The network parameters are then updated by minimizing the mean square error between the current Q value and the target Q value. The mean square error function is:
$L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(Y_i - Q(s,a \mid \theta_i)\right)^2\right]$
where $Y_i = r + \gamma \max_{a'} Q\left(s', a' \mid \theta_i^-\right)$, and
r represents the instantaneous reward of the DQN model.
The calculation of the instant reward r of the DQN model is specifically as follows: the base score and exploitability (penetration feasibility) score indexes of the vulnerability corresponding to each network topology node of the attack tree are combined with CVSS (Common Vulnerability Scoring System) to reward the successful penetration of different vulnerabilities at different nodes; the instant reward r of the DQN model is set to the reward Score_vul:
(the formula combining the two scores is given as an image in the original document)
where baseScore denotes the base score and exploitabilityScore denotes the exploitability (penetration feasibility) score.
In this way, the influence of each change of the Q value on the network parameters is reduced, that is, the correlation between the target Q value and the current Q value is weakened, which also improves the stability of policy training. The network gradient is calculated as follows, and the parameter θ of the current network is updated:
$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(Y_i - Q(s,a \mid \theta_i)\right)\nabla_{\theta_i} Q(s,a \mid \theta_i)\right]$
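Put together, the update of step 2.1) can be sketched as follows; it reuses the current_net/target_net pair from the network sketch above, and the hyper-parameters (γ, learning rate, N, batch size) are illustrative assumptions:

```python
# Sketch of one DQN update: minimize the mean square error between the current
# Q value and the target Q value, and copy the current network into the target
# network every N iterations.
import torch
import torch.nn.functional as F

gamma, N = 0.9, 100
optimizer = torch.optim.SGD(current_net.parameters(), lr=1e-3)

def train_step(step, batch):
    states, actions, rewards, next_states = batch            # tensors sampled from the replay buffer
    q_current = current_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                     # target Q value Y_i from the target network
        y = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_current, y)                           # L(theta_i)
    optimizer.zero_grad()
    loss.backward()                                           # gradient of the mean square error
    optimizer.step()
    if step % N == 0:                                         # periodic copy into the target network
        target_net.load_state_dict(current_net.state_dict())
    return loss.item()
```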
2.2) Vulnerability information such as the vulnerable service, vulnerability attributes and vulnerability type corresponding to each network topology node of the attack tree is converted into a simplified matrix, which serves as the state input of the DQN model trained in step 2). According to the Bellman optimality equation, as long as step 2.1) is iterated continuously, the target Q value approaches the current Q value, so that training is finally completed and the training objective is obtained, namely the current optimal penetration attack path is generated:
$Q^*(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s',a') \mid s,a\right]$
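How the per-node vulnerability information is flattened into the "simplified matrix" is not spelled out in the patent; one plausible encoding is sketched below, where the service and type vocabularies and the normalisation are assumptions made for illustration:

```python
# Sketch of encoding per-node vulnerability information into a state matrix.
import numpy as np

SERVICES = ["ssh", "http", "smb", "mysql"]          # assumed vocabulary
VULN_TYPES = ["rce", "privilege_escalation", "info_leak"]

def encode_node(vuln):
    """vuln: dict with 'service', 'type' and a numeric CVSS 'score'."""
    row = np.zeros(len(SERVICES) + len(VULN_TYPES) + 1, dtype=np.float32)
    if vuln.get("service") in SERVICES:
        row[SERVICES.index(vuln["service"])] = 1.0
    if vuln.get("type") in VULN_TYPES:
        row[len(SERVICES) + VULN_TYPES.index(vuln["type"])] = 1.0
    row[-1] = vuln.get("score", 0.0) / 10.0          # normalised CVSS score
    return row

def encode_state(node_vulns):
    """Stack one row per attack-tree node into the state matrix."""
    return np.stack([encode_node(v) for v in node_vulns])
```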
3) According to the current optimal penetration attack path obtained in step 2), all nodes on that path are taken as possibly-attacked nodes, and the vulnerable nodes are determined by monitoring the traffic change of each possibly-attacked node; the difficulty of attacking the vulnerability corresponding to each vulnerable node is calculated based on the access vector, access complexity and authentication indexes; the node with the highest attack probability is obtained from the traffic change of the vulnerable nodes and the difficulty of attacking their vulnerabilities, and a honeypot is deployed at that node; the honeypot deployment is continuously updated so that attackers are induced to fall into the deployed honeypots and the defense is completed.
3.1) According to the current optimal attack path obtained by training in step 2), all nodes on that path are possibly-attacked nodes, and the possibly-attacked nodes are screened to obtain the vulnerable nodes.
The process of screening the possibly-attacked nodes is as follows: the generated attack tree is combined with a Software Defined Network (SDN) to obtain the load traffic information of each possibly-attacked node; the traffic change of each possibly-attacked node is monitored, the traffic matrices of each possibly-attacked node at the current time and the next time are calculated and compared dimension by dimension, and if the change of every dimension exceeds a self-defined threshold (in the embodiment of the present invention, the threshold is set to 50%), the possibly-attacked node is regarded as a vulnerable node.
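The screening rule itself is straightforward to express in code. The sketch below applies the dimension-by-dimension comparison with the 50% threshold mentioned above; the layout of the traffic matrices is an assumption:

```python
# Sketch of the vulnerable-node screening rule of step 3.1).
import numpy as np

THRESHOLD = 0.5   # the 50% self-defined threshold of the embodiment

def is_vulnerable(traffic_now, traffic_next, eps=1e-8):
    """traffic_now / traffic_next: per-node traffic matrices of the same shape."""
    relative_change = np.abs(traffic_next - traffic_now) / (np.abs(traffic_now) + eps)
    return bool(np.all(relative_change > THRESHOLD))   # every dimension must exceed the threshold

def screen_nodes(path_nodes, traffic_now, traffic_next):
    """path_nodes: node ids on the optimal attack path; traffic_*: dict node -> matrix."""
    return [n for n in path_nodes if is_vulnerable(traffic_now[n], traffic_next[n])]
```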
3.2) According to the vulnerability information corresponding to each node in the attack tree generated by MulVAL, combined with three indexes of the CVSS vulnerability scoring system, the vulnerability of each vulnerability is scored; the three indexes are the access vector AV (Access Vector), the access complexity AC (Access Complexity) and the authentication AU (Authentication), their ranges are (0, 1), and they are recorded as V, C and U respectively;
3.3) The difficulty D of attacking the vulnerability is calculated from the indexes V, C and U obtained in step 3.2); the larger the value of D, the greater the attack difficulty. The calculation formula is:
(the formula for D in terms of V, C and U is given as an image in the original document)
3.4) The traffic change corresponding to the vulnerable nodes obtained in step 3.1) is combined with the vulnerability analysis of the vulnerable nodes, thereby obtaining the node with the highest attack probability, and a honeypot is deployed at that node.
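The patent does not give the exact rule for combining the traffic change with the vulnerability analysis, so the following sketch is only one plausible choice: rank the vulnerable nodes by relative traffic change and break ties in favour of a lower attack difficulty D:

```python
# Illustrative (assumed) combination rule for step 3.4): larger traffic change
# and lower difficulty D are treated as indicating a more likely target.
def pick_honeypot_node(vulnerable_nodes, traffic_change, difficulty):
    """traffic_change / difficulty: dict node -> scalar."""
    return max(vulnerable_nodes,
               key=lambda n: (traffic_change[n], -difficulty[n]))

# honeypot_node = pick_honeypot_node(nodes, change_by_node, d_by_node)
```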
3.5) In the defense process of intelligently deploying honeypots, the initial state s at the current moment is the traffic load matrix of each node at the current moment; action a refers to the selection of a penetration attack path; a positive reward r is given if the agent falls into a honeypot previously deployed in step 3.4), and a negative reward -r is given if it does not fall into a honeypot; the next state s' refers to the traffic load matrix at the next moment.
The positive reward r is set specifically as follows: if the agent falls into a honeypot set in advance, it is given a reward of 2D, with D as set in step 3.3); if the agent does not fall into the honeypot but bypasses it and performs a penetration attack on the hosts around the honeypot, it is given a negative reward of -2D.
3.6) The state transition (state s, action a, reward r, next state s') is stored in the experience replay buffer as the training data set of the DQN model; N training samples are sampled from the experience buffer and used to update and train the current Q network and the target Q network in the DQN model.
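A compact sketch of the experience replay of step 3.6) together with the ±2D reward of step 3.5) is given below; the buffer capacity and batch size are illustrative assumptions:

```python
# Sketch of the experience replay buffer and the honeypot reward.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def honeypot_reward(fell_into_honeypot, d):
    """Reward of step 3.5): +2D when the agent falls into a deployed honeypot,
    -2D when it bypasses the honeypot and attacks surrounding hosts."""
    return 2.0 * d if fell_into_honeypot else -2.0 * d

buffer = ReplayBuffer()
# buffer.push(s, a, honeypot_reward(trapped, d_of_node), s_next)
# batch = buffer.sample(32)   # then fed to train_step() above
```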
3.7) The training process of steps 3.1)-3.6) is repeated for the DQN model, the deployment of the honeypots is continuously updated, and penetration attackers are induced to fall into the deployed honeypots, thereby achieving the purpose of defending against penetration attacks.
Corresponding to the embodiment of the attack tree honeypot deployment defense method based on deep reinforcement learning, the invention also provides an embodiment of an attack tree honeypot deployment defense device based on deep reinforcement learning.
Referring to fig. 3, the attack tree honeypot deployment defense device based on deep reinforcement learning according to an embodiment of the present invention includes one or more processors configured to implement the attack tree honeypot deployment defense method based on deep reinforcement learning of the foregoing embodiment.
The attack tree honeypot deployment defense device based on deep reinforcement learning of the embodiment of the invention can be applied to any device with data processing capability, such as a computer or another device or apparatus. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus in the logical sense is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from the non-volatile memory into memory and running them. In terms of hardware, fig. 3 shows a hardware structure diagram of a device with data processing capability in which the attack tree honeypot deployment defense apparatus based on deep reinforcement learning of the present invention is located; in addition to the processor, memory, network interface and non-volatile memory shown in fig. 3, the device in which the apparatus of this embodiment is located may generally include other hardware according to the actual function of the device, which is not described again here.
The specific details of the implementation process of the functions and actions of each unit in the above device are the implementation processes of the corresponding steps in the above method, and are not described herein again.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the attack tree honey pot deployment defense method based on deep reinforcement learning in the foregoing embodiments.
The computer readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash Card provided on the device. Further, the computer readable storage medium may include both the internal storage unit and the external storage device of the device with data processing capability. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
The above embodiments are only used for illustrating the design idea and features of the present invention, and the purpose of the present invention is to enable those skilled in the art to understand the content of the present invention and implement the present invention accordingly, and the protection scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes and modifications made in accordance with the principles and concepts disclosed herein are intended to be included within the scope of the present invention.

Claims (10)

1. An attack tree honeypot deployment defense method based on deep reinforcement learning, characterized by comprising the following steps:
S1, acquiring network topology information and constructing an attack tree accordingly;
S2, combining a convolutional neural network with the Q-Learning algorithm to create a DQN model, and training an agent based on the DQN model, wherein the agent acts as the penetration attacker and the training objective is to generate the current optimal penetration attack path;
S3, according to the current optimal penetration attack path obtained in step S2, taking all nodes on that path as possibly-attacked nodes, and determining the vulnerable nodes by monitoring the traffic change of each possibly-attacked node; calculating the difficulty of attacking the vulnerability corresponding to each vulnerable node based on the access vector, access complexity and authentication indexes; obtaining the node with the highest attack probability from the traffic change of the vulnerable nodes and the difficulty of attacking their vulnerabilities, and deploying a honeypot at that node; and continuously updating the deployment of honeypots, so that attackers are induced to fall into the deployed honeypots and the defense is completed.
2. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 1, wherein the obtaining of network topology information in step S1 comprises: performing port scanning on the target network system with the vulnerability scanning tool Shodan, thereby obtaining the IP address of each device, the operating system type of each device, the ports opened at each IP address, the list of services running on each port, the communication protocol used by each port, and other key information for constructing the real network topology, including the communication relationships among different subnets.
3. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 1, wherein the process of constructing the attack tree in step S1 is specifically: generating an attack tree with MulVAL based on the acquired network topology information, wherein the attack tree contains all information of the network topology nodes; this information includes the vulnerabilities present at each node and the operations executable according to the vulnerability exploitation relationships; the executable operations include penetration attack, lateral movement, obtaining user privileges, and escalating to Root privileges.
4. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 1, wherein the process of training the agent based on the DQN model in step S2 is specifically:
S201, updating the parameter θ of the current network in the DQN model in real time, copying the parameters of the current network into the target network every N iterations, and then updating the network parameters by minimizing the mean square error between the current Q value and the target Q value, where the mean square error function is:
$L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(Y_i - Q(s,a \mid \theta_i)\right)^2\right]$
where $Y_i = r + \gamma \max_{a'} Q\left(s', a' \mid \theta_i^-\right)$ is the target Q value, and
r represents the immediate reward of the DQN model;
calculating the network gradient according to the following formula, and updating the parameter theta of the current network:
$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[\left(Y_i - Q(s,a \mid \theta_i)\right)\nabla_{\theta_i} Q(s,a \mid \theta_i)\right]$
S202, converting vulnerability information such as the vulnerable service, vulnerability attributes and vulnerability type corresponding to each network topology node of the attack tree into a simplified matrix, which is used as the state input of the DQN model trained in step S2; according to the Bellman optimality equation, as long as step S201 is iterated continuously, the target Q value approaches the current Q value, so that training is finally completed and the training objective is obtained, namely the current optimal penetration attack path is generated:
$Q^*(s,a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s',a') \mid s,a\right]$
5. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 4, wherein the calculation of the instant reward r of the DQN model is specifically as follows: the base score and exploitability score indexes of the vulnerability corresponding to each network topology node of the attack tree are combined with the Common Vulnerability Scoring System to reward the successful exploitation of different vulnerabilities at different nodes, and the instant reward r of the DQN model is set to the reward Score_vul as follows:
(the formula combining the two scores is given as an image in the original document)
where baseScore denotes the base score and exploitabilityScore denotes the exploitability (penetration feasibility) score.
6. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 1, wherein the step S3 specifically comprises the following steps:
S301, according to the current optimal attack path obtained by training in step S2, all nodes on that path are possibly-attacked nodes, and the possibly-attacked nodes are screened to obtain the vulnerable nodes;
S302, scoring the vulnerability of each vulnerability according to the vulnerability information corresponding to each node in the attack tree, combined with three indexes of the CVSS vulnerability scoring system, namely the access vector AV, the access complexity AC and the authentication AU; the ranges of the three indexes are (0, 1), and they are recorded as V, C and U respectively;
S303, calculating the difficulty D of attacking the vulnerability from the V, C and U indexes obtained in step S302, where the calculation formula is:
(the formula for D in terms of V, C and U is given as an image in the original document)
S304, combining the traffic change corresponding to the vulnerable nodes obtained in step S301 with the vulnerability analysis of the vulnerable nodes, thereby obtaining the node with the highest attack probability, and deploying a honeypot at that node;
S305, in the defense process of intelligently deploying honeypots, giving a positive reward r if the agent falls into a honeypot previously deployed in step S304, and giving a negative reward -r if it does not fall into a honeypot;
S306, storing the state transition process in the experience replay buffer as the training data set of the DQN model; sampling N training samples from the experience buffer, and using them to update and train the current Q network and the target Q network in the DQN model;
S307, repeating the training process of steps S301-S306 for the DQN model, continuously updating the deployment of the honeypots, and inducing penetration attackers to fall into the deployed honeypots, thereby achieving the purpose of defending against penetration attacks.
7. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 6, wherein the step S301 comprises the following process of screening the possibly-attacked nodes: the generated attack tree is combined with a software defined network, the traffic change of each possibly-attacked node is monitored, the traffic matrices of each possibly-attacked node at the current time and the next time are calculated and compared dimension by dimension, and if the change in every dimension exceeds a self-defined threshold, the possibly-attacked node is regarded as a vulnerable node.
8. The attack tree honeypot deployment defense method based on deep reinforcement learning according to claim 6, wherein the step S305 of giving a positive reward r if the agent falls into a honeypot previously deployed in step S304 and giving a negative reward -r if it does not fall into a honeypot comprises:
if the agent falls into a honeypot set in advance, the agent is given a reward of 2D, with D as set in step S303; if the agent does not fall into the honeypot but bypasses it and performs a penetration attack on the hosts around the honeypot, the agent is given a negative reward of -2D.
9. An attack tree honeypot deployment defense device based on deep reinforcement learning, characterized by comprising one or more processors configured to implement the attack tree honeypot deployment defense method based on deep reinforcement learning according to any one of claims 1 to 8.
10. A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, implements the attack tree honeypot deployment defense method based on deep reinforcement learning according to any one of claims 1 to 8.
CN202211054557.4A 2022-08-31 2022-08-31 Attack tree honeypot deployment defense method and device based on deep reinforcement learning Pending CN115580430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211054557.4A CN115580430A (en) 2022-08-31 2022-08-31 Attack tree honeypot deployment defense method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211054557.4A CN115580430A (en) 2022-08-31 2022-08-31 Attack tree honeypot deployment defense method and device based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115580430A true CN115580430A (en) 2023-01-06

Family

ID=84579731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211054557.4A Pending CN115580430A (en) 2022-08-31 2022-08-31 Attack tree honeypot deployment defense method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115580430A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116112278A (en) * 2023-02-17 2023-05-12 西安电子科技大学 Q-learning-based network optimal attack path prediction method and system
CN117081855A (en) * 2023-10-13 2023-11-17 深圳市前海新型互联网交换中心有限公司 Honeypot optimization method, honeypot protection method and honeypot optimization system
CN117081855B (en) * 2023-10-13 2024-02-02 深圳市前海新型互联网交换中心有限公司 Honeypot optimization method, honeypot protection method and honeypot optimization system

Similar Documents

Publication Publication Date Title
Wang et al. Deep reinforcement learning for green security games with real-time information
Miehling et al. A POMDP approach to the dynamic defense of large-scale cyber networks
CN115580430A (en) Attack tree-pot deployment defense method and device based on deep reinforcement learning
CN110768987A (en) SDN-based dynamic deployment method and system for virtual honey network
Ferguson-Walter et al. Game theory for adaptive defensive cyber deception
CN112073411A (en) Network security deduction method, device, equipment and storage medium
Durkota et al. Case studies of network defense with attack graph games
Shen et al. Adaptive Markov game theoretic data fusion approach for cyber network defense
Zennaro et al. Modelling penetration testing with reinforcement learning using capture‐the‐flag challenges: Trade‐offs between model‐free learning and a priori knowledge
Khoury et al. A hybrid game theory and reinforcement learning approach for cyber-physical systems security
CN114499982B (en) Honey net dynamic configuration strategy generation method, configuration method and storage medium
CN113810406A (en) Network space security defense method based on dynamic defense graph and reinforcement learning
CN114944939B (en) Network attack situation prediction model construction method, device, equipment and storage medium
Lin et al. Multi-robot adversarial patrolling: Handling sequential attacks
Şeker Use of Artificial Intelligence Techniques/Applications in Cyber Defense
CN109977998B (en) Information processing method and apparatus, storage medium, and electronic apparatus
Shinde et al. Cyber attack intent recognition and active deception using factored interactive pomdps
Marius et al. Combining scripted behavior with game tree search for stronger, more robust game AI
WO2022252039A1 (en) Method and apparatus for adversarial attacking in deep reinforcement learning
Moskal et al. Simulating attack behaviors in enterprise networks
Foley et al. Inroads into Autonomous Network Defence using Explained Reinforcement Learning
Bera et al. Deterring Adversarial Learning in Penetration Testing by Exploiting Domain Adaptation Theory
Zhang et al. Multiple domain cyberspace attack and defense game based on reward randomization reinforcement learning
Drew et al. Testing deception with a commercial tool simulating cyberspace
Wang et al. DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination