CN113810406B - Network space security defense method based on dynamic defense graph and reinforcement learning - Google Patents


Info

Publication number: CN113810406B
Application number: CN202111078688.1A
Authority: CN (China)
Prior art keywords: defense, graph, information, reinforcement learning, vulnerability
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)
Other languages: Chinese (zh)
Other versions: CN113810406A (en)
Inventors: 陈晋音, 李晓豪, 李玮峰, 贾澄钰
Current assignee: Zhejiang University of Technology ZJUT (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Zhejiang University of Technology ZJUT
Application filed by Zhejiang University of Technology ZJUT; priority claimed from application CN202111078688.1A; published as CN113810406A, granted as CN113810406B

Classifications

    • H04L63/1441: Countermeasures against malicious traffic (network architectures or protocols for network security)
    • H04L63/1416: Event detection, e.g. attack signature detection
    • H04L63/1483: Countermeasures against service impersonation, e.g. phishing, pharming or web spoofing
    • H04L63/1491: Countermeasures using deception, e.g. honeypots, honeynets, decoys or entrapment
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/084: Learning methods; backpropagation, e.g. using gradient descent
    • Y02D30/50: Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention discloses a network security defense method based on a dynamic defense graph and reinforcement learning. The method scans target network information with a network vulnerability scanner such as Nmap, and takes the network topology structure information and vulnerability information from the scan as the input of the dynamic defense graph to generate the dynamic defense graph. Deep reinforcement learning is then trained over all penetration paths in the whole attack graph to obtain the optimal defense path, and a corresponding honeypot or intrusion detection system is deployed. Finally, the defense graph is dynamically updated again according to the deployment information of the intrusion detection system and the honeypot, and the optimal defense path is obtained again with deep reinforcement learning. The method can improve the efficiency and accuracy of network security defense and can save its defense cost.

Description

Network space security defense method based on dynamic defense graph and reinforcement learning
Technical Field
The invention belongs to the field of network security protection oriented to dynamic defense graphs and reinforcement learning, and particularly relates to a network security protection method based on deep reinforcement learning and the dynamic construction of a network model.
Background
With the rapid development of computer technology, network attack techniques are also developing rapidly, and network attack events emerge endlessly. Networks carry many kinds of sensitive information that inevitably attract attacks from all over the world, such as information disclosure, information theft, data tampering, data addition and deletion, and computer viruses. To ensure the security of cyberspace, the key lies in analyzing the cyberspace topology and its vulnerabilities and determining the optimal defense strategy of the network, so as to prevent attackers from exploiting these vulnerabilities for illegal penetration. Unlike traditional manual network defense, deep reinforcement learning and dynamic defense graph technology can derive the optimal defense path in advance and defend dynamically.
The defense graph is a model-based network security assessment technology. From the defender's perspective, on the basis of comprehensively analyzing the various network configurations and vulnerability information, it finds all possible defense paths and provides a visualization of the attack-process scenario, helping network security administrators intuitively understand the relationships among the vulnerabilities in a target network, the relationships between the vulnerabilities and the network security configuration, and the potential threats the vulnerabilities create. Defense-graph-based network security assessment then performs in-depth security assessment modeling and analysis on top of the defense graph.
Reinforcement learning is generally a sequential decision-making process whose basic idea is to learn the optimal strategy by maximizing the cumulative reward the agent receives from the environment. Deep reinforcement learning uses a neural network as the parametric structure and optimizes the strategy by combining the perception capability of deep learning with the decision-making capability of reinforcement learning, so that the agent can continuously learn from its environment over time.
However, when a defense graph is used for network evaluation in the traditional way, only static network data is considered, and static analysis can only determine the a priori risks of network components. A dynamic defense graph, by contrast, can update these risks based on evidence that a network component may be compromised, for example from security information and event management (SIEM) and intrusion detection systems (IDS). Dynamic analysis also allows analyzing the attacker's path to determine which nodes are more likely to be attacked next, enabling an administrator to assess the security risk of valuable resources in the network.
However, whether static or dynamic, most defense graphs do not take the attacker's abilities into account, and therefore not the likelihood that a particular attack is actually carried out. Without these considerations, threats and their effects are easily misjudged, causing significant cost loss.
At present, defense graph technology still has problems such as static network data input, tedious defense paths, and the inability to respond dynamically to network environment changes. To achieve a dual defense effect, both dynamic detection of the network environment and data and optimal judgment of the defense path are needed.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a network security defense method based on a dynamic defense graph and deep reinforcement learning.
In order to achieve the purpose, the technical scheme of the invention is as follows: a network security defense method based on dynamic defense graphs and reinforcement learning specifically comprises the following steps:
(1) Scanning and detecting a host, a port and a vulnerability of a target network, and storing and classifying the obtained scanning information; defining a scanning information data set, and analyzing the connectivity relation between hosts;
(2) Respectively generating nodes and edges of a defense graph by using the scanning information data set acquired in the step (1) to construct the defense graph;
(3) Building a deep reinforcement learning simulation attacker environment, taking the nodes and edges of the defense graph built in the step (2) as the input of the deep reinforcement learning, obtaining the most easily penetrated path of the defense graph through the deep reinforcement learning, arranging a corresponding honeypot or intrusion detection system on the most easily penetrated path, and recording real-time information;
(4) And (4) repeatedly scanning and detecting the host, the port and the vulnerability of the target network according to the intrusion information recorded in real time in the step (3), constructing a dynamic defense graph, and iteratively updating the most easily penetrated path.
Further, the step (1) specifically includes the following sub-steps:
(1.1) scanning and detecting a host, a port and a vulnerability of a target network, acquiring vulnerability information and host configuration information of the target network, and storing and classifying the acquired scanning information;
(1.2) defining the scan information dataset as a set X containing N_host hosts,

X = {x_1, x_2, ..., x_{N_host}},

where each host is represented as x_i ∈ R^(V×H) (i = 1, 2, ..., N_host), i.e. x_i is a matrix containing V×H elements, in which V represents the host vulnerabilities and H represents the connectivity relationships between hosts.
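As a minimal sketch of the dataset X just defined (all host counts, vulnerability flags and connectivity values below are illustrative assumptions, not data from the patent):

```python
# Hypothetical sketch of the scan-information dataset X described above.
# Each host x_i is a V x H matrix: V rows for vulnerabilities found on the
# host, H columns for connectivity to the other hosts.

N_HOST = 3   # number of scanned hosts (assumption for this sketch)
V = 2        # vulnerabilities tracked per host
H = N_HOST   # connectivity columns, one per host

def make_host(vuln_flags, connectivity):
    """Build one host matrix x_i with V rows and H columns."""
    return [[v & c for c in connectivity] for v in vuln_flags]

# Host 0 carries both vulnerabilities and can reach hosts 1 and 2.
X = [make_host(vuln_flags=[1, 1], connectivity=[0, 1, 1]),
     make_host([1, 0], [1, 0, 1]),
     make_host([0, 0], [1, 1, 0])]

assert len(X) == N_HOST
assert all(len(x) == V and len(x[0]) == H for x in X)
```

A row of x_i is nonzero only where the host both carries the vulnerability and can reach the peer, which is one plausible encoding of the V×H coupling described above.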
Further, the step (2) specifically includes the following sub-steps:
(2.1) generating the nodes of the defense graph: according to the scan information dataset acquired in step (1), the vulnerability of a host is taken as a CVE node N_CVE in the defense graph, the precondition of the vulnerability as a pre-node N_pre, the postcondition of the vulnerability as a post-node N_post, and a node satisfying several preconditions of a vulnerability as a joint node N_f.
(2.2) according to the network topology relationships, performing connectivity analysis on the nodes of the defense graph and connecting them directionally to form the edges of the defense graph, where E_f denotes a connecting edge between a joint node and the several pre-nodes around it, and E_n denotes a connecting edge between a CVE node N_CVE and its pre-node N_pre and post-node N_post;
(2.3) constructing the defense graph with the nodes and edges obtained in step (2.1) and step (2.2), where the defense graph is represented as DefendGraph = {E_f, E_n, N_pre, N_post, N_CVE, N_f}.
Further, the step (3) specifically includes the following sub-steps:
(3.1) according to the number N of all nodes of the defense graph, constructing an N×N model map, writing the connectivity relationships among the nodes into the attacker agent's action set, and writing all the nodes in sequence into the attacker agent's state set, to build the deep reinforcement learning simulated attacker environment;
(3.2) pre-training an attacker agent Attacker based on the deep Q-network algorithm (DQN) in reinforcement learning to obtain a target strategy π_t, building the deep reinforcement learning simulated attacker environment and constructing a reinforcement learning training model;
(3.3) inputting the N×N model map obtained in step (3.1) into the reinforcement learning training model obtained in step (3.2), generating the attacker's attack-sequence state-action pairs at T moments according to the target strategy π_t of the pre-trained deep reinforcement learning model, and connecting all of the attacker's attack-sequence state-action pairs to obtain the most easily penetrated path;
and (3.4) according to the most easily penetrated path obtained in the step (3.3), arranging the honeypot system and the intrusion detection system in the nodes and the host in the most easily penetrated path, and recording real-time information.
Further, the step (4) specifically includes the following sub-steps:
(4.1) when the intrusion detection system or the honeypot system records attacker information, i.e. the attacker carries out attack activity on a node or path, scanning and detecting the hosts, ports and vulnerabilities of the target network again, and adding the newly exposed vulnerabilities of the honeypot system to the input information of the defense graph.
(4.2) when constructing the dynamic defense graph, performing incremental deletion on the basis of the initial defense graph: the information scanned in step (4.1) is compared with the information scanned in step (1.1) to obtain the differences in topology and the differences between host vulnerabilities under the same topology, and nodes and edges are deleted from the initial defense graph according to these differences.
And (4.3) repeating the steps (3.2) to (3.4), and iterating the updated most easily penetrated path to construct a dynamic defense graph.
The technical conception of the invention is as follows: in deep reinforcement learning training that simulates attackers attacking a target network, an attacker can penetrate the target network according to its vulnerability information, topology information and so on, and carry out attacks such as information extraction and virus planting on the hosts of the target network, so that the target network loses its security. On this basis, the dynamic defense graph and reinforcement learning are used to judge the optimal path for network protection, and a honeypot system and an intrusion detection system are deployed according to the training result, achieving the goal of network security protection. First, the network configuration information and vulnerability information of the target network are acquired with vulnerability scanning tools such as Nmap, then classified and sorted. Second, the classified information is taken as the input of the defense graph, and the nodes and edges are constructed with the defense graph algorithm to generate a complete defense graph. Then, the node and edge information in the defense graph is input as the state and action of deep reinforcement learning, and the most easily penetrated path of the defense graph is acquired with a deep Q-network (DQN). Next, a honeypot system and an intrusion detection system are deployed on that path, interacting in real time to obtain attacker information. Finally, the real-time honeypot and intrusion detection information serves as the trigger signal of the dynamic defense graph: the target network is scanned again with scanners such as Nmap, the defense graph is dynamically constructed again, the most easily penetrated path is recomputed with deep reinforcement learning, and honeypots and the intrusion detection system are deployed again, thereby protecting the target network and improving the efficiency and accuracy of network security defense.
The invention has the following beneficial effects: 1) the method uses defense graph technology to visually display the model structure of the target network; 2) the dynamic defense graph update technology reduces the efficiency cost of regenerating a pure attack graph; 3) deep reinforcement learning trains the most easily attacked path of the target network, saving the defense cost of network security defense; 4) the honeypot technology and the intrusion detection system serve as the defense method, and their signals serve as the trigger signals of the dynamic defense graph, so that the second round of training is carried out more automatically to achieve network security defense.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a dynamic defense graph of the method of the present invention;
FIG. 3 is a schematic diagram of the algorithm structure of DQN in reinforcement learning in the method of the present invention.
Detailed Description
The following detailed description of specific embodiments of the present invention is provided in conjunction with the accompanying drawings.
Referring to FIGS. 1 to 3, a network security defense method based on a dynamic defense graph and reinforcement learning includes the following steps:
(1) Network information data generation, comprising the following substeps:
(1.1) Scanning target network information: the hosts, ports and vulnerabilities of the target network are scanned and detected with the open-source network vulnerability scanner Nmap, and the scan information is stored and classified. Since Nmap is well open-sourced, the vulnerability information and configuration information of the target network can easily be obtained with Nmap's integrated capabilities such as vulnerability scanning and route tracing.
(1.2) Defining the scan information dataset as a set X containing N_host samples,

X = {x_1, x_2, ..., x_{N_host}},

where each sample x_i ∈ R^(V×H) (i = 1, 2, ..., N_host), i.e. x_i is a matrix containing V×H elements, in which V represents the host vulnerabilities and H the connectivity relationships between hosts. According to the serial numbers of the host vulnerabilities and the Common Vulnerability Scoring System (CVSS) scores from the National Vulnerability Database (NVD), a set S of N vulnerability CVSS scores is created, S = {s_1, s_2, ..., s_N}, where each sample s_i ∈ R^2 (i = 1, 2, ..., N), i.e. s_i is a matrix containing 2 elements: the base score of the vulnerability and the exploitability score of the vulnerability.
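An illustrative sketch of building the scoring set S follows. The CVE identifiers and score values are made-up placeholders, not real NVD data; in practice each pair would be looked up from the NVD by CVE id:

```python
# Illustrative construction of the CVSS scoring set S described above.
cvss_lookup = {                 # hypothetical CVE -> (base, exploitability)
    "CVE-2021-0001": (9.8, 3.9),
    "CVE-2021-0002": (7.5, 2.2),
}

def score_vector(cve_id):
    """Return s_i in R^2: [base score, exploitability score]."""
    base, exploitability = cvss_lookup[cve_id]
    return [base, exploitability]

S = [score_vector(cve) for cve in sorted(cvss_lookup)]
assert all(len(s) == 2 for s in S)   # each s_i has exactly 2 elements
```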
(2) Respectively generating nodes and edges of a defense graph by using the scanning information data set acquired in the step (1) to construct the defense graph; the method specifically comprises the following substeps:
(2.1) Generation of the defense graph nodes: according to the scan information dataset acquired in step (1), the different information is divided into different nodes. Taking the host as the unit, a vulnerability of a host is taken as a CVE node N_CVE in the defense graph; the CVE nodes correspond to the vulnerabilities of the hosts, and each host may have multiple vulnerabilities, i.e. one host node may have multiple CVE nodes. At the same time, corresponding pre-nodes N_pre and post-nodes N_post are established: the precondition of a vulnerability is taken as a pre-node N_pre in the defense graph, representing the prerequisites an attacker needs to exploit the vulnerability; the postcondition of a vulnerability is taken as a post-node N_post; and a node satisfying several preconditions of a vulnerability is taken as a joint node N_f, whose preconditions can be the postconditions of other vulnerabilities.
(2.2) Generation of the defense graph edges: after creating the different nodes in step (2.1), the nodes need to be connected by edges. Still taking the host as the unit, the vulnerability pre-node, the vulnerability node and the vulnerability post-node of each host are connected directionally in sequence. However, different hosts have different topological relationships, so connectivity analysis and defense-graph rule derivation must be performed according to those relationships, directionally connecting a host's post-node N_post with a pre-node N_pre. The edges are therefore represented as: E_f, a connecting edge between a joint node and the several pre-nodes around it; and E_n, a connecting edge between a CVE node N_CVE and its pre-node N_pre and post-node N_post.
(2.3) Constructing the defense graph with the nodes and edges obtained in step (2.1) and step (2.2), and visually displaying the model structure of the target network with the Graphviz tool; the defense graph is represented as DefendGraph = {E_f, E_n, N_pre, N_post, N_CVE, N_f}, where DefendGraph is the general term for the defense graph formed by the target network.
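The DefendGraph tuple and its export to Graphviz can be sketched as follows. This is a minimal, hedged sketch using only the standard library; the node names are illustrative, and the DOT text it emits would be rendered by an external Graphviz installation rather than by this code:

```python
# Minimal sketch of the DefendGraph structure from step (2.3) and of emitting
# it as Graphviz DOT text for visualization (node names are illustrative).

defend_graph = {
    "N_pre":  ["pre:net_access"],
    "N_CVE":  ["cve:CVE-2021-0001"],
    "N_post": ["post:root_shell"],
    "N_f":    [],
    "E_n":    [("pre:net_access", "cve:CVE-2021-0001"),
               ("cve:CVE-2021-0001", "post:root_shell")],
    "E_f":    [],
}

def to_dot(graph):
    """Serialize the defense graph edges as a DOT digraph."""
    lines = ["digraph DefendGraph {"]
    for src, dst in graph["E_n"] + graph["E_f"]:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(defend_graph)
assert dot.startswith("digraph DefendGraph {")
```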
(3) Building a deep reinforcement learning simulation attacker environment, taking the nodes and edges of the defense graph built in the step (2) as the input of the deep reinforcement learning, obtaining the most easily penetrated path of the defense graph through the deep reinforcement learning, and arranging a corresponding honeypot or intrusion detection system on the most easily penetrated path; the method specifically comprises the following substeps:
(3.1) Building the deep reinforcement learning simulated attacker environment: according to the number N of all nodes of the defense graph, an N×N model map is constructed; the connectivity relationships among the nodes are written into the action set of the attacker agent Attacker, and all the nodes are written in sequence into the Attacker's state set, yielding the deep reinforcement learning simulated attacker environment.
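Step (3.1) can be sketched as follows. The nodes and edges below are illustrative placeholders; the pattern is only one plausible reading of how the N×N model map, state set and action set fit together:

```python
# Sketch of step (3.1): turn N defense-graph nodes into an N x N adjacency
# "model map", a state set, and per-state action sets for the attacker agent.

nodes = ["pre:net_access", "cve:CVE-2021-0001", "post:root_shell"]
edges = [("pre:net_access", "cve:CVE-2021-0001"),
         ("cve:CVE-2021-0001", "post:root_shell")]

N = len(nodes)
index = {name: i for i, name in enumerate(nodes)}

# N x N model map: model_map[i][j] = 1 iff node i connects to node j.
model_map = [[0] * N for _ in range(N)]
for src, dst in edges:
    model_map[index[src]][index[dst]] = 1

states = list(range(N))                                   # attacker state set
actions = {s: [t for t in range(N) if model_map[s][t]]    # legal moves per state
           for s in states}
assert actions[0] == [1] and actions[1] == [2] and actions[2] == []
```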
(3.2) Pre-training the attacker agent Attacker in the deep reinforcement learning simulated attacker environment built in step (3.1) to obtain the target strategy π_t: the attacker agent is trained based on the deep Q-network algorithm (DQN) in reinforcement learning, the attacker's goal being to penetrate the target host as fast as possible. DQN combines Q-learning with a convolutional neural network to construct the reinforcement learning training model; the algorithm steps are as follows, as shown in FIG. 3:
(3.2.1) By combining a deep neural network with the Q-learning algorithm of reinforcement learning, DQN not only solves the problem that the state space is too large to maintain; thanks to the strong feature-extraction capability of the neural network, it also has potential far beyond handcrafted feature representations. Q-learning in reinforcement learning iteratively updates the state-action value function Q through the Bellman equation in a temporal-difference manner:

Q_{i+1}(s_t, a_t) = Q_i(s_t, a_t) + α(y_i − Q_i(s_t, a_t))

where y_i = r_t + γ·max_{a_{t+1}} Q_i(s_{t+1}, a_{t+1}) is the target Q value, s_{t+1} is the next state after action a_t, and a_{t+1} is a possible action in state s_{t+1}; α is the learning rate and γ is the discount factor. According to the theory of the Bellman optimality equation, by continuously and iteratively applying the update above, the Q function can approximate the true value Q*, finally yielding the optimal strategy:

π* = argmax_a Q*(s, a)
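As a self-contained sketch of the temporal-difference update above (states, actions and the reward value are illustrative, and a tabular Q is used in place of the neural network for clarity):

```python
# One tabular Q-learning update matching the Bellman temporal-difference rule.
alpha, gamma = 0.1, 0.9           # learning rate and discount factor
Q = {(0, 1): 0.0, (1, 2): 0.0}    # Q table over (state, action) pairs

def q_update(s_t, a_t, r_t, s_next, legal_next):
    """Q_{i+1}(s_t,a_t) = Q_i(s_t,a_t) + alpha * (y_i - Q_i(s_t,a_t))."""
    best_next = max((Q[(s_next, a)] for a in legal_next), default=0.0)
    y = r_t + gamma * best_next                 # target Q value y_i
    Q[(s_t, a_t)] += alpha * (y - Q[(s_t, a_t)])

q_update(s_t=0, a_t=1, r_t=1.0, s_next=1, legal_next=[2])
assert abs(Q[(0, 1)] - 0.1) < 1e-9   # 0 + 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```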
(3.2.2) DQN also uses a target network mechanism: alongside the current Q_θ network, a target network Q_θ' with the same structure is built, and together they form the overall DQN model framework. During training, the predicted Q value output by the current Q_θ network is used to select the action a, while the target Q_θ' network is used to compute the target Q value. The loss function is defined as the mean squared error between the predicted Q value and the target Q value:

L(θ) = E[(y − Q_θ(s_t, a_t))²]

where y = r_t + γ·max_{a_{t+1}} Q_θ'(s_{t+1}, a_{t+1}) is the target Q value; the parameters θ of the current Q_θ network are updated by back-propagating the gradient through the neural network.
(3.2.3) During training, DQN adopts an experience replay mechanism: each state transition (state s_i, action a_i, reward r_i, next state s_i') is stored in the experience replay buffer Buff as a training dataset for the network model, and batch learning proceeds by random sampling.
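The experience replay mechanism of step (3.2.3) can be sketched with the standard library alone; the buffer capacity, batch size and dummy transitions are illustrative assumptions:

```python
# Sketch of experience replay: transitions (s_i, a_i, r_i, s_i') go into a
# bounded buffer and are sampled in random mini-batches for training.
import random
from collections import deque

buff = deque(maxlen=1000)                 # experience replay buffer Buff

def store(s, a, r, s_next):
    buff.append((s, a, r, s_next))

def sample_batch(batch_size):
    """Random sampling breaks the temporal correlation between samples."""
    return random.sample(buff, min(batch_size, len(buff)))

for step in range(50):                    # fill with dummy transitions
    store(s=step, a=step % 3, r=1.0, s_next=step + 1)

batch = sample_batch(8)
assert len(batch) == 8 and all(len(t) == 4 for t in batch)
```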
(3.2.4) N training samples are drawn from the experience replay buffer Buff, and the network parameters of the current Q_θ network are updated by minimizing the loss function; the parameters of the target Q_θ' network need not be updated iteratively, but are instead copied from the current Q_θ network at fixed intervals, after which the next round of learning is carried out.
(3.3) Inputting the N×N model map obtained in step (3.1) into the reinforcement learning training model obtained in step (3.2), generating the attacker's attack-sequence state-action pairs at T moments according to the target strategy π_t of the pre-trained deep reinforcement learning model, and connecting all of the attacker's attack-sequence state-action pairs to obtain the (state, action) sequence {(s_1, a_1), ..., (s_T, a_T)} as the most easily penetrated path.
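Extracting the path in step (3.3) amounts to rolling out the learned strategy greedily. A hedged sketch with a hand-filled Q table standing in for the trained model (all values and the graph layout are illustrative):

```python
# Sketch of step (3.3): roll out the (here: hand-filled) Q table greedily
# from the entry node to obtain {(s_1, a_1), ..., (s_T, a_T)}.

Q = {(0, 1): 0.9, (0, 2): 0.2, (1, 3): 0.8, (2, 3): 0.1}  # illustrative values
actions = {0: [1, 2], 1: [3], 2: [3], 3: []}               # legal moves

def penetration_path(start, max_steps=10):
    path, s = [], start
    for _ in range(max_steps):
        if not actions[s]:
            break                                     # terminal node reached
        a = max(actions[s], key=lambda a: Q[(s, a)])  # greedy action choice
        path.append((s, a))
        s = a                                         # moving to node a = new state
    return path

assert penetration_path(0) == [(0, 1), (1, 3)]
```

The returned state-action pairs are exactly the nodes on which step (3.4) would deploy honeypots or intrusion detection.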
(3.4) According to the most easily penetrated path obtained in step (3.3), a honeypot system and an intrusion detection system are deployed on the nodes and hosts along that path, serving as protection for network security defense, and information is recorded in real time.
(4) And (4) repeatedly scanning and detecting the host, the port and the vulnerability of the target network according to the intrusion information recorded in real time in the step (3), constructing a dynamic defense graph, and iteratively updating the most easily penetrated path.
(4.1) Because frequent network scanning consumes a large amount of time, the recorded information of the intrusion detection system (IDS) and the honeypot system is used as the trigger signal for dynamic defense graph scanning: when attacker information appears in the IDS or the honeypot, i.e. the attacker carries out attack activity on a node or path, an Nmap scan is performed again, and the newly exposed vulnerabilities of the honeypot system are added to the defense graph input information.
(4.2) Time cost must be carefully considered in the construction of the defense graph. Therefore, when constructing the dynamic defense graph, incremental deletion is carried out on the basis of the initial defense graph, avoiding the cost of reconstructing the original nodes. The information scanned in step (4.1) is compared with the information scanned in step (1.1) to obtain the differences in topology and the differences between host vulnerabilities under the same topology, and nodes are deleted from the initial defense graph according to these differences, achieving the effect of dynamic updating.
And (4.3) repeating the steps (3.2) to (3.4), iterating the updated most easily penetrated path, and constructing a dynamic defense graph, thereby achieving the effect of target network defense.
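The incremental comparison in step (4.2) can be sketched as a diff between two scan snapshots; host and CVE names below are illustrative, and the snapshots are hand-written rather than produced by a real Nmap run:

```python
# Sketch of step (4.2): diff the new scan against the initial one and delete
# the vanished nodes/edges instead of rebuilding the whole defense graph.

initial_scan = {"hostA": {"CVE-2021-0001", "CVE-2021-0002"},
                "hostB": {"CVE-2021-0003"}}
new_scan     = {"hostA": {"CVE-2021-0001"},          # one vuln patched
                "hostB": {"CVE-2021-0003"}}

def diff_scans(old, new):
    """Return hosts and per-host vulnerabilities that disappeared."""
    removed_hosts = set(old) - set(new)
    removed_vulns = {h: old[h] - new.get(h, set()) for h in old}
    return removed_hosts, {h: v for h, v in removed_vulns.items() if v}

removed_hosts, removed_vulns = diff_scans(initial_scan, new_scan)
assert removed_hosts == set()
assert removed_vulns == {"hostA": {"CVE-2021-0002"}}
```

Only the nodes and edges corresponding to `removed_vulns` and `removed_hosts` need to be deleted from the initial defense graph, which is the incremental update the patent describes.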
In conclusion, the invention uses defense graph technology to visually display the model structure of the target network, reduces the efficiency cost of generating a pure attack graph, and saves the defense cost of network security defense. The invention uses honeypot technology and the intrusion detection system to record information in real time, thereby realizing network security defense more automatically.
The embodiments described in this specification are merely illustrative of the implementation forms of the inventive concept, and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also equivalent technical means that can be conceived by one skilled in the art based on the inventive concept.

Claims (4)

1. A network security defense method based on dynamic defense graphs and reinforcement learning is characterized by comprising the following steps:
(1) Scanning and detecting a host, a port and a vulnerability of a target network, and storing and classifying the obtained scanning information; defining a scanning information data set, and analyzing the connectivity relation between hosts;
(2) Respectively generating nodes and edges of a defense graph by using the scanning information data set acquired in the step (1) to construct the defense graph;
(3) Building a deep reinforcement learning simulation attacker environment, taking the nodes and edges of the defense graph built in the step (2) as the input of the deep reinforcement learning, obtaining the most easily penetrated path of the defense graph through the deep reinforcement learning, arranging a corresponding honeypot or intrusion detection system on the most easily penetrated path, and recording real-time information;
the step (3) specifically comprises the following substeps:
(3.1) according to the number N of all nodes of the defense graph, constructing an N×N model map, writing the connectivity relationships among the nodes into the action set of the attacker agent, and writing all the nodes in turn into the attacker agent's state set, so as to build the deep reinforcement learning simulated attacker environment;
(3.2) pre-training an attacker agent on the basis of the deep Q network algorithm in reinforcement learning to obtain a target strategy π_t, building the deep reinforcement learning simulated attacker environment, and constructing a reinforcement learning training model;
(3.3) inputting the N×N model map obtained in step (3.1) into the reinforcement learning training model obtained in step (3.2), generating, according to the strategy π_t of the deep reinforcement learning pre-trained model, the attacker's attack-sequence state-action pairs at T moments, and connecting all the attack-sequence state-action pairs to obtain the most easily penetrated path;
(3.4) according to the most easily penetrated path obtained in the step (3.3), arranging a honeypot system and an intrusion detection system in the nodes and the host in the most easily penetrated path, and recording real-time information;
(4) Repeatedly scanning and probing the hosts, ports and vulnerabilities of the target network according to the intrusion information recorded in real time in step (3), constructing a dynamic defense graph, and iteratively updating the most easily penetrated path.
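Purely as an illustration of steps (3.1)-(3.3) of claim 1 (the graph, reward shaping and the tabular Q-learning stand-in for the deep Q network are assumptions of this sketch, not features disclosed in the patent), the most easily penetrated path can be read off as the greedy rollout of a trained policy over the defense-graph connectivity:

```python
import random

def train_q(adj, goal, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning stand-in for the deep Q network: learn action
    values over the connectivity `adj` ({node: [reachable nodes]}),
    rewarding only the transition into the attack goal."""
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in adj[s]} for s in adj if adj[s]}
    for _ in range(episodes):
        s = rng.choice(list(q))
        for _ in range(50):
            if s == goal or s not in q:
                break
            a = (rng.choice(list(q[s])) if rng.random() < eps
                 else max(q[s], key=q[s].get))          # ε-greedy action
            r = 1.0 if a == goal else 0.0
            nxt = max(q[a].values()) if a in q and q[a] else 0.0
            q[s][a] += alpha * (r + gamma * nxt - q[s][a])
            s = a
    return q

def easiest_path(q, start, goal, limit=20):
    """Greedy rollout of the learned policy π_t: the chain of
    state-action pairs forms the most easily penetrated path."""
    path = [start]
    while path[-1] != goal and len(path) < limit and path[-1] in q:
        path.append(max(q[path[-1]], key=q[path[-1]].get))
    return path
```

Honeypots and intrusion detection would then be deployed along the nodes of the returned path, as in step (3.4).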
2. The dynamic defense graph and reinforcement learning-based network security defense method according to claim 1, wherein the step (1) specifically comprises the following sub-steps:
(1.1) scanning and detecting a host, a port and a vulnerability of a target network, acquiring vulnerability information and host configuration information of the target network, and storing and classifying the acquired scanning information;
(1.2) defining the scan information data set as a set X containing N_host hosts,
X = {x_1, x_2, ..., x_{N_host}},
where each host x_i ∈ R^(V×H) (i = 1, 2, ..., N_host), i.e., x_i is a matrix containing V×H elements, where V represents the host vulnerabilities and H represents the connectivity relationships between hosts.
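The host descriptor of claim 2 — a set X of N_host matrices x_i ∈ R^(V×H) — can be sketched as follows; the concrete sizes and the 0/1 element encoding are assumptions of this illustration, not limitations of the claim:

```python
def host_matrix(vuln_flags, conn_flags):
    """One host descriptor x_i as a V x H 0/1 matrix: row v, column h
    is 1 when the host exposes vulnerability v toward connected host h
    (one plausible element semantics; the patent does not fix it)."""
    return [[v * h for h in conn_flags] for v in vuln_flags]

# Scan information data set X = {x_1, ..., x_{N_host}} for
# N_host = 3 hosts, V = 2 vulnerability flags, H = 3 connectivity
# flags (all sizes assumed for the example).
X = [
    host_matrix([1, 0], [1, 1, 0]),  # host 1
    host_matrix([1, 1], [0, 1, 1]),  # host 2
    host_matrix([0, 0], [1, 0, 1]),  # host 3
]
```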
3. The dynamic defense graph and reinforcement learning-based network security defense method according to claim 1, wherein the step (2) specifically comprises the following sub-steps:
(2.1) generating the nodes of the defense graph: according to the scan information data set acquired in step (1), the vulnerabilities of a host are taken as CVE nodes N_CVE in the defense graph, the precondition of a vulnerability as a pre-node N_pre in the defense graph, the postcondition of a vulnerability as a post-node N_post in the defense graph, and a node satisfying several preconditions of a vulnerability as a joint node N_f in the defense graph;
(2.2) according to the network topological relationship, performing connectivity analysis on the nodes of the defense graph and connecting them directionally to form the edges of the defense graph, wherein E_f represents a connecting edge between a joint node and the several pre-nodes around it, and E_n represents a connecting edge between a CVE node N_CVE and its pre-node N_pre or post-node N_post;
(2.3) constructing the defense graph from the nodes and edges obtained in steps (2.1) and (2.2), the defense graph being represented as DeffendGraph = {E_f, E_n, N_pre, N_post, N_CVE, N_f}.
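As a non-limiting sketch of the six-component structure DeffendGraph = {E_f, E_n, N_pre, N_post, N_CVE, N_f} of claim 3 (the class name, the AND-node naming scheme and the `add_cve` helper are assumptions of this illustration):

```python
from dataclasses import dataclass, field

@dataclass
class DefendGraph:
    """Container mirroring the claimed set {E_f, E_n, N_pre, N_post,
    N_CVE, N_f}."""
    N_pre: set = field(default_factory=set)   # precondition nodes
    N_post: set = field(default_factory=set)  # postcondition nodes
    N_CVE: set = field(default_factory=set)   # CVE (vulnerability) nodes
    N_f: set = field(default_factory=set)     # joint nodes
    E_n: set = field(default_factory=set)     # CVE <-> pre/post edges
    E_f: set = field(default_factory=set)     # joint-node edges

    def add_cve(self, cve, pres, post):
        """Wire one vulnerability into the graph, inserting a joint
        node when several preconditions must hold simultaneously."""
        self.N_CVE.add(cve)
        self.N_post.add(post)
        self.N_pre.update(pres)
        if len(pres) > 1:  # joint node for the AND of the preconditions
            joint = "AND(" + ",".join(sorted(pres)) + ")"
            self.N_f.add(joint)
            self.E_f.update((p, joint) for p in pres)
            self.E_n.add((joint, cve))
        else:
            self.E_n.update((p, cve) for p in pres)
        self.E_n.add((cve, post))
```

Chaining `add_cve` calls over the scan data set yields the directed defense graph whose nodes and edges feed the reinforcement learning stage.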
4. The dynamic defense graph and reinforcement learning-based network security defense method according to claim 1, wherein the step (4) specifically comprises the following sub-steps:
(4.1) when the intrusion detection system or the honeypot system records attacker information, namely the attacker carries out attack activity on a node or path, scanning and probing the hosts, ports and vulnerabilities of the target network again, and adding the newly discovered vulnerabilities from the honeypot system into the input information of the defense graph;
(4.2) when the dynamic defense graph is constructed, carrying out incremental deletion on the basis of the initial defense graph: the information scanned in step (4.1) is compared with the information scanned in step (1.1) to obtain the differences in topological structure and the differences between host vulnerabilities under the same topology, and for these differences, nodes and edges are deleted on the basis of the initial defense graph;
and (4.3) repeating the steps (3.2) to (3.4), and iterating the updated most easily penetrated path to construct a dynamic defense graph.
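The event-driven loop of claim 4 can be sketched as follows; every callable here (`scan`, `build_graph`, `find_path`, `deploy`) is an assumed interface standing in for steps (1)-(3), not an API disclosed in the patent, and the full rebuild shown stands in for the patent's incremental update:

```python
def dynamic_defense_cycle(scan, build_graph, find_path, deploy, alerts):
    """Rescan only when the IDS or a honeypot reports attacker
    activity (a truthy alert), then rebuild the defense graph and
    redeploy honeypots/IDS along the new easiest path."""
    baseline = scan()                  # step (1): initial scan
    graph = build_graph(baseline)      # step (2): defense graph
    deploy(find_path(graph))           # step (3): easiest path + traps
    trace = []
    for alert in alerts:               # step (4): real-time records
        if not alert:
            continue
        snapshot = scan()
        graph = build_graph(snapshot)
        path = find_path(graph)
        deploy(path)
        trace.append(path)
    return trace
```

With stub callables this makes the trigger condition explicit: deployment happens once at start-up and once per positive alert, never on quiet intervals.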
CN202111078688.1A 2021-09-15 2021-09-15 Network space security defense method based on dynamic defense graph and reinforcement learning Active CN113810406B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111078688.1A CN113810406B (en) 2021-09-15 2021-09-15 Network space security defense method based on dynamic defense graph and reinforcement learning


Publications (2)

Publication Number Publication Date
CN113810406A CN113810406A (en) 2021-12-17
CN113810406B (en) 2023-04-07

Family

ID=78940905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111078688.1A Active CN113810406B (en) 2021-09-15 2021-09-15 Network space security defense method based on dynamic defense graph and reinforcement learning

Country Status (1)

Country Link
CN (1) CN113810406B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114301647A (en) * 2021-12-20 2022-04-08 上海纽盾科技股份有限公司 Prediction defense method, device and system for vulnerability information in situation awareness
CN114338203B (en) * 2021-12-31 2023-10-03 河南信大网御科技有限公司 Intranet detection system and method based on mimicry honeypot
CN116866084B (en) * 2023-08-30 2023-11-21 国网山东省电力公司信息通信公司 Intrusion response decision-making method and system based on reinforcement learning

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108494810A (en) * 2018-06-11 2018-09-04 中国人民解放军战略支援部队信息工程大学 Network security situation prediction method, apparatus and system towards attack

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
CN101282332B (en) * 2008-05-22 2011-05-11 上海交通大学 System for generating assaulting chart facing network safety alarm incident
CN103139220A (en) * 2013-03-07 2013-06-05 南京理工大学常熟研究院有限公司 Network security attack defense method using state attack and defense graph model
CN107948137A (en) * 2017-11-01 2018-04-20 北京理工大学 A kind of optimal attack paths planning method based on improved Q study
US11347867B2 (en) * 2018-05-18 2022-05-31 Ns Holdings Llc Methods and apparatuses to evaluate cyber security risk by establishing a probability of a cyber-attack being successful
CN110874470A (en) * 2018-12-29 2020-03-10 北京安天网络安全技术有限公司 Method and device for predicting network space security based on network attack
CN110166428B (en) * 2019-04-12 2021-05-07 中国人民解放军战略支援部队信息工程大学 Intelligent defense decision-making method and device based on reinforcement learning and attack and defense game
CN110138764B (en) * 2019-05-10 2021-04-09 中北大学 Attack path analysis method based on hierarchical attack graph
CN112491818B (en) * 2020-11-12 2023-02-03 南京邮电大学 Power grid transmission line defense method based on multi-agent deep reinforcement learning
CN113037777B (en) * 2021-04-09 2021-12-03 广州锦行网络科技有限公司 Honeypot bait distribution method and device, storage medium and electronic equipment

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN108494810A (en) * 2018-06-11 2018-09-04 中国人民解放军战略支援部队信息工程大学 Network security situation prediction method, apparatus and system towards attack


Similar Documents

Publication Publication Date Title
CN113810406B (en) Network space security defense method based on dynamic defense graph and reinforcement learning
US10185832B2 (en) Methods and systems for defending cyber attack in real-time
De Vries et al. Systems for detecting advanced persistent threats: A development roadmap using intelligent data analysis
CN115296924B (en) Network attack prediction method and device based on knowledge graph
CN104809404A (en) Data layer system of information security attack-defense platform
CN106534195A (en) Network attacker behavior analyzing method based on attack graph
Keshk et al. An explainable deep learning-enabled intrusion detection framework in IoT networks
Derbyshire et al. “Talking a different Language”: Anticipating adversary attack cost for cyber risk assessment
Zhu Attack pattern discovery in forensic investigation of network attacks
CN116566674A (en) Automated penetration test method, system, electronic equipment and storage medium
Giacobe Measuring the effectiveness of visual analytics and data fusion techniques on situation awareness in cyber-security
CN115580430A (en) Attack tree-pot deployment defense method and device based on deep reinforcement learning
Yamin et al. Use of cyber attack and defense agents in cyber ranges: A case study
CN114386042A (en) Method suitable for deduction of power enterprise network war chess
Chelliah et al. Similarity-based optimised and adaptive adversarial attack on image classification using neural network
Şeker Use of Artificial Intelligence Techniques/Applications in Cyber Defense
CN115361215A (en) Network attack behavior detection method based on causal graph
CN114978595A (en) Threat model construction method and device and computer equipment
Chen et al. State-based attack detection for cloud
Aly et al. Navigating the Deception Stack: In-Depth Analysis and Application of Comprehensive Cyber Defense Solutions
CN113837398A (en) Graph classification task poisoning attack method based on federal learning
Sweet et al. Synthetic intrusion alert generation through generative adversarial networks
Grant et al. Identifying tools and technologies for professional offensive cyber operations
Al-Saraireh Enhancing the Penetration Testing Approach and Detecting Advanced Persistent Threat Using Machine Learning
Gill et al. A Systematic Review on Game-Theoretic Models and Different Types of Security Requirements in Cloud Environment: Challenges and Opportunities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant