CN113783881B

CN113783881B - Network honeypot deployment method facing penetration attack

Info

Publication number: CN113783881B
Application number: CN202111078546.5A
Authority: CN
Inventors: 陈晋音; 李玮峰; 李晓豪; 贾澄钰
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2021-09-15
Filing date: 2021-09-15
Publication date: 2023-04-07
Anticipated expiration: 2041-09-15
Also published as: CN113783881A

Abstract

The invention discloses a network honeypot deployment method facing penetration attack. The defender scans the structure of its own network, generates an attribute attack graph, converts it into a Bayesian attack graph containing penetration success rate information, and records and stores the penetration success rate of each network node. A reward function is then established based on the node penetration success probability: when a honeypot captures an attacker, the defender obtains a reward related to the penetration success probability, and the reward value is higher when the success rate is higher. Each honeypot deployment carries a basic negative reward value, meaning that unlimited deployment of honeypots does not yield the maximum benefit. This effectively avoids the high maintenance cost caused by deploying honeypots on every path. The node and path information in the attack graph is then taken as the input of reinforcement learning, learning is carried out with the SARSA learning mechanism, and a honeypot deployment scheme with the maximum profit for the given environment is provided.

Description

Network honeypot deployment method facing penetration attack

The technical field is as follows:

The invention relates to the field of network security protection based on attack graph technology and reinforcement learning, and in particular to a network honeypot deployment method facing penetration attack.

Technical background:

As more and more people take part in internet life and enjoy the convenience it brings, network security problems are becoming increasingly prominent. Information security threats posed by hackers keep evolving in today's cyberspace. As a result, networks are exposed to higher levels of interference, which makes them more vulnerable. The diversity of devices also makes maintaining them (e.g., patching bugs) a more challenging management problem.

A deceptive defense technique called the honeypot was developed in response. After 20 years of development, honeypots have been continuously updated and iterated to cope with emerging threats. From "data packets found on the internet" in 1993 to the targeting and capture of internet of things attacks, the development of honeypots has been a cyclical process. Retrospective analysis of the malware captured by honeypots points to new directions for national network security defense and for subsequent honeypot development.

Modern computer networks are highly connected and heterogeneous to provide more sophisticated services and to accommodate ever-increasing and rapidly changing demands. For example, these networks connect computers of different operating systems and protocols. Furthermore, more and more devices are added to the network each day. For example, deployment of wireless devices, as well as internet of things, robotics, sensors, makes networks larger and denser.

The contribution of honeypots to security is considered a reactive process. The value of honeypot deployment comes from the captured dataset. The longer the attack interaction can be maintained, the larger the data set and subsequent analysis. Global honeypot projects track emerging threats. Virtual technologies provide honeypot operators with a means to abstract deployments from the production network and bare metal infrastructure. In response to the prevalence of honeypots, honeypot detection tools have been developed and incorporate detection techniques into malware deployments.

The attack graph is a model-based network vulnerability assessment method. Attack graph technology correlates the vulnerabilities of all hosts in the network for in-depth analysis, discovers the attack paths that threaten network security, and displays them as a graph. Using the attack graph, a security manager can visually observe the relationships between the vulnerabilities in the network and select the lowest-cost measures to remedy them.

Reinforcement learning is a class of algorithms that lets a computer learn from its mistakes through continuous trial and error, eventually discovering the underlying rules and thereby learning how to achieve its goal. Reinforcement learning is currently applied in a variety of scenarios where actions or decisions need to be made.

Internet of things network architectures are also deployed in battlefield environments, where they are referred to as the battlefield internet of things. In a broader sense, the internet of things network also covers devices used in military combat that may communicate over tactical networks other than the internet. It is therefore crucial to protect the resiliency and robustness of the network's critical nodes on the battlefield. In the information collection phase (also called the reconnaissance phase), the attacker collects internal information about the target network using a series of tools and scanning techniques. Attackers typically map the target network with a software scanning tool (e.g., nmap) or infer the network structure through traffic analysis. A network administrator (defender), on the other hand, can protect its own network effectively at this early reconnaissance stage by deceiving the attacker and manipulating the network interface to mask the true state of the network. Existing honeypot deployment schemes achieve automatic deployment, but they do not address problems such as high technical requirements and the cost increase that too many devices may cause; moreover, because deployment is automated with scripts, the schemes are not flexible enough to cope with complex network attacks. The invention obtains the network information by generating a Bayesian attack graph and deploys the honeypot system through reinforcement learning based on the penetration success probability, thereby protecting the system and overcoming the problem of network redundancy.

At present, existing honeypot systems suffer from redundant deployment: honeypots are deployed mechanically at certain nodes. Although this raises the attacker's cost, the underlying problems are that the deployment is poorly targeted, the server scale grows, and the maintenance cost of the server system increases. Moreover, once bypassed, such honeypots provide no protection at all.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a network honeypot deployment method facing penetration attack.

In order to achieve the purpose, the technical scheme of the invention is as follows: a network honeypot deployment method facing penetration attack comprises the following steps:

(1) Scanning and detecting a target network to obtain scanning information of the target network, and storing and classifying the scanning information; performing connectivity analysis according to the network topology relationship and the host vulnerability relationship to generate an attribute attack graph; calculating the exploitability E of each vulnerability using the Common Vulnerability Scoring System, calculating the conditional probability P between the attribute nodes and the probability P(obj) of the target node being attacked, and generating a Bayesian attack graph from the attribute attack graph;

(2) Enumerating each possible attack path, allocating a set of k honeypots along the attacker's possible paths according to the probability of being attacked P(obj), defining the cost of deploying a honeypot, setting a reward value Cap, calculating a reward function, and optimizing the honeypot deployment;

(3) Optimizing the honeypots deployed in step (2) by using the SARSA reinforcement learning algorithm in combination with the Bayesian attack graph generated in step (1), to obtain an optimal honeypot deployment path.

Further, the step (1) includes the sub-steps of:

(1.1) scanning and detecting the hosts, ports and vulnerabilities of the target network to obtain scanning information of the target network, and storing and classifying the scanning information, such as the port information;

(1.2) defining the data set as a set X containing $N_{host}$ hosts, where each host is represented as $x_i \in R^{Q \times H}$ ($i = 1, 2, \ldots, N_{host}$), i.e., $x_i$ is a matrix containing Q × H elements, wherein Q represents host vulnerability and H represents the connectivity relation between hosts;

(1.3) generating an attribute attack graph: performing connectivity analysis according to the network topology relationship and the host vulnerability relationship by using the scanning information of the target network acquired in the step (1.1), and performing directional connection to form an edge of an attribute attack graph; dividing the scanning information of different target networks into different nodes according to the data set of the target scanning information in the step (1.1); connecting edges and nodes to generate an attribute attack graph;

(1.4) calculating the exploitability E of the vulnerability based on the Common Vulnerability Scoring System, with the formula:

$$E = 20 \times V \times C \times U, \quad 0 \le E \le 10$$

where V is the access vector, C is the access complexity, and U is the authentication;

(1.5) calculating the difficulty of the atomic attack according to the exploitability E of the vulnerability obtained in step (1.4), where a larger value indicates greater attack difficulty; the calculation formula is as follows:

wherein D represents the difficulty of the corresponding atomic attack;

the difficulty of atomic attack represents the conditional probability P between attribute nodes, and the formula is as follows:

if the target attribute to be obtained by the attacker is obj, all direct and indirect father nodes of obj on the attack path are Pre (obj), and the direct father node is DPre (obj), then the probability of the target node being attacked is P (obj):

$$P(obj) = P(obj \mid Pre(obj)) \, P(Pre(obj));$$

(1.6) obtaining the possible paths of an attacker according to the probability P(obj) that the target node is attacked obtained in step (1.5), and converting the attribute attack graph generated in step (1.3) into a Bayesian attack graph; eliminating loops and analyzing the success probability of each penetration path.

Further, the step (2) includes the sub-steps of:

(2.1) the defender enumerating each possible attack path, distributing a set of k honeypots along the attacker's possible paths, which can deviate the attacker from the true target node;

(2.2) weighting the defender's reward by the probability P (obj) that the protected or attacked node is breached; the defender placing a new honeypot at the edge of the network incurs a fixed cost, assuming that the cost of placing this honeypot is P;

(2.3) for a simple target network: if the honeypot does not capture the attacker, the reward value Cap is set to 0; if the honeypot captures the attacker, the reward value Cap is set to a user-defined value, and the reward function R for deploying a single honeypot is defined as:

$$R = -\left(P - Cap \cdot (1 - E_p)\right)$$

where $E_p = P(obj)$ is the probability of the protected or attacked node being attacked;

for a complex target network: if the honeypot does not capture the attacker, the reward value Cap is set to 0; if the honeypot captures the attacker, the reward value Cap is set to a user-defined value, and honeypots are deployed according to the probability of being attacked P(obj) calculated in step (1.5); the reward function R is defined as:

$$R = -\left(P - Cap \cdot (1 - \max(P(obj)))\right);$$

a simple target network has no more than 2 network layers and fewer than 4 hosts; otherwise, the network is regarded as a complex network.

Further, the step (3) includes the sub-steps of:

(3.1) establishing a Q-table that stores the states S, all actions A that may be taken, and the values Q(S, A), and taking all the actions A and values Q(S, A) as the training data set of the network model;

(3.2) creating an Agent, wherein the Agent comprises a learning algorithm and an action space; the Agent is trained using the SARSA reinforcement learning algorithm; in the training, the first network path state is initialized randomly; then, for each step in a training round, an action a is selected from the Q-table using ε-greedy based on the current network state S, the action a is executed to obtain a new network path state S' and the current reward r, the action a' in state S' is obtained using ε-greedy, and the value of Q(S, A) in the table is updated using the action a' before training continues;

(3.3) repeating step (3.2) until the Q-table is no longer updated, and generating an optimal strategy π, wherein the update formula is:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left(R + \gamma Q(S', A') - Q(S, A)\right)$$

where α represents the learning rate and γ is the reward discount;

(3.4) obtaining, according to the optimal strategy π obtained in step (3.3), the action a to be executed in state s; the action a is one of two actions, deploying a honeypot or doing nothing, and the state s is related only to the attack success rate of the attack path in the Bayesian attack graph; finally, deploying the honeypots according to the optimal strategy π.

The beneficial effects of the invention are as follows: 1) The Bayesian attack graph gives a basic understanding of the security of the defender's own network structure, and the honeypot system is deployed on that basis. 2) The honeypot system is deployed using a reward value mechanism: each deployment increases the deployment cost, but successfully capturing an attacker earns a reward based on the penetration success rate of the node. 3) An importance mechanism is introduced for network structures with a large network space, and the honeypot system is deployed selectively according to the importance of nodes, which reduces honeypot deployment and maintenance costs. 4) The honeypot system is deployed by combining reinforcement learning with the Bayesian attack graph, which improves network security.

Drawings

FIG. 1 is a diagram of the network environment architecture of the experiment of the present invention;

FIG. 2 is a flow chart of a method of the present invention;

FIG. 3 is a schematic diagram of a honeypot application;

FIG. 4 is a schematic diagram illustrating the generation principle of an attack graph;

FIG. 5 is a Bayesian attack graph of the present experimental environment;

FIG. 6 is a schematic diagram of reinforcement learning.

The specific implementation scheme is as follows:

Embodiments of the invention are described in detail below with reference to the accompanying drawings. Referring to FIG. 1 to FIG. 6, a reinforcement learning honeypot deployment method based on an attack graph is described.

The technical conception of the invention is as follows. An attacker who targets the defended network can use certain tools to learn its structure and generate an attack graph for attacking it. The network defender can likewise use Nmap to scan its own network structure, generate an attribute attack graph, and convert it into a Bayesian attack graph containing penetration success rate information. Once the Bayesian attack graph exists, the penetration success probability of each network node is recorded and stored. A reward function is then established based on the probability of successful penetration of the nodes: when a honeypot captures an attacker, the defender obtains a reward related to the penetration success probability, and the reward value is higher when the success rate is higher. Each honeypot deployment carries a basic negative reward value, which means that unlimited deployment of honeypots does not yield the maximum benefit. This effectively avoids the high maintenance cost caused by deploying honeypots on every path. The node and path information in the attack graph is then used as the state input of reinforcement learning, and the decision whether to deploy a honeypot is used as the action input. Learning is then performed using the SARSA learning mechanism. Finally, reinforcement learning provides a honeypot deployment scheme with the largest return for the environment. Because honeypot deployment changes the network space, the attacker's attack graph may become invalid, which increases the attacker's workload; the attacker may even fail to attack and leave behind more traceable information, so the purpose of protecting the network space is achieved.

The invention discloses a reinforcement learning honeypot deployment method based on an attack graph, and FIG. 2 is a flow chart of the method, which specifically comprises the following substeps:

(1) Scanning and detecting the target network to obtain scanning information of the target network, and storing and classifying the scanning information; performing connectivity analysis according to the network topology relationship and the host vulnerability relationship to generate an attribute attack graph; calculating the exploitability E of each vulnerability using the Common Vulnerability Scoring System, calculating the conditional probability P between the attribute nodes and the probability P(obj) of the target node being attacked, and generating a Bayesian attack graph from the attribute attack graph:

(1.1) The experimental network structure is shown in FIG. 1: device A is a host providing web services; node (B) is a firewall, which is connected to device C, providing FTP, SSH, and RSH services, and to device B, providing FTP and RSH services; node (a) is the access entry. The experimental network is first scanned with an open-source scanning tool such as Nmap to detect the hosts, ports, and vulnerabilities of the target network, and the scanning information is stored and classified.

(1.2) The data set is defined as a set X containing $N_{host}$ samples, where each sample is represented as $x_i \in R^{Q \times H}$ ($i = 1, 2, \ldots, N_{host}$), i.e., $x_i$ is a matrix containing Q × H elements, wherein Q represents host vulnerability and H represents the connectivity relation between hosts.

(1.3) Generating the attribute attack graph. Using the scanning information of the target network acquired in step (1.1), different information is divided into different nodes, host by host: the vulnerabilities, preconditions, postconditions and joint nodes owned by each host form the host vulnerability relationship. After the nodes are created, they need to be connected by edges. The edges of the attack graph are determined by connectivity analysis based on information such as the network topology relationship and the host vulnerability relationship: still host by host, the vulnerability precondition nodes, vulnerability nodes and vulnerability postcondition nodes of each host are connected in sequence. Because different hosts have different topological relations, the postcondition nodes and precondition nodes of the hosts must also be connected according to those relations. These connections form the edges of the attribute attack graph, and the nodes and edges together constitute the attribute attack graph.
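The node-and-edge construction of step (1.3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the record layout (`name`, `pre`, `post`) and the vulnerability names `ftp_rhosts(0,1)` and `rsh(0,1)` are hypothetical, while the attribute names follow the attack paths listed in this description. Cross-host connections arise when the postcondition of one vulnerability is the precondition of another.

```python
from collections import defaultdict

# Hypothetical scan records: each vulnerability lists its precondition
# and postcondition attribute nodes (names are illustrative only).
VULNS = [
    {"name": "ftp_rhosts(0,1)", "pre": ["ftp(0,1)", "user(0)"], "post": ["trust(1,0)"]},
    {"name": "rsh(0,1)", "pre": ["trust(1,0)"], "post": ["user(1)"]},
]


def build_attribute_attack_graph(vulns):
    """Connect precondition -> vulnerability -> postcondition for every record."""
    edges = defaultdict(set)
    for vuln in vulns:
        for pre in vuln["pre"]:
            edges[pre].add(vuln["name"])    # precondition enables the vulnerability
        for post in vuln["post"]:
            edges[vuln["name"]].add(post)   # exploiting it yields the postcondition
    # Sorted lists make the adjacency structure deterministic.
    return {node: sorted(targets) for node, targets in edges.items()}


graph = build_attribute_attack_graph(VULNS)
for node in sorted(graph):
    print(node, "->", graph[node])
```

Representing the graph as a plain adjacency dictionary keeps the sketch dependency-free; a graph library could be substituted without changing the idea.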

(1.4) In the Common Vulnerability Scoring System (CVSS), the exploitability index E of a vulnerability is defined as:

$$E = 20 \times V \times C \times U, \quad 0 \le E \le 10$$

where V is the Access Vector (AV), C is the Access Complexity (AC), and U is the Authentication (AU); these 3 parameters together describe the exploitability of a vulnerability. The smaller the value of E, the more difficult the atomic attack. After querying the vulnerability data, the prior probabilities of the 5 nodes ftp(0,1), user(0), ftp(0,2), ftp(1,2) and sshd(0,1) are 0.6, 0.3, 0.4, 0.7 and 0.5, respectively.
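As a worked illustration of the formula E = 20 × V × C × U, the sketch below evaluates it with the metric weights published in the CVSS version 2 specification. The choice of CVSS v2 weights and the function name `exploitability` are assumptions for this example; the description does not state which numeric values were used in the experiment.

```python
# CVSS v2 base-metric weights (from the public CVSS v2 specification).
ACCESS_VECTOR = {"local": 0.395, "adjacent": 0.646, "network": 1.0}
ACCESS_COMPLEXITY = {"high": 0.35, "medium": 0.61, "low": 0.71}
AUTHENTICATION = {"multiple": 0.45, "single": 0.56, "none": 0.704}


def exploitability(av, ac, au):
    """Return E = 20 * V * C * U, which lies in the range 0..10."""
    v = ACCESS_VECTOR[av]
    c = ACCESS_COMPLEXITY[ac]
    u = AUTHENTICATION[au]
    return 20 * v * c * u


# A remotely reachable, low-complexity, unauthenticated vulnerability
# scores close to the maximum; a local, high-complexity one scores low.
print(round(exploitability("network", "low", "none"), 4))
print(round(exploitability("local", "high", "multiple"), 4))
```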

(1.5) The exploitability E of a vulnerability is inversely proportional to the attack difficulty, so the difficulty of the atomic attack is calculated from the exploitability E obtained in step (1.4); a larger value indicates a more difficult atomic attack. The calculation formula is as follows:

wherein D represents the difficulty of the corresponding atomic attack;

Edges between attribute nodes in the Bayesian network represent the process of exploiting a vulnerability: the greater the attack difficulty, the lower the probability that the vulnerability is exploited, i.e., the two are inversely proportional. Thus, the difficulty of an atomic attack determines the conditional probability P between attribute nodes. If the target attribute to be obtained by the attacker is obj, and all direct and indirect parent nodes of obj on the attack path are Pre(obj), then the probability of the target node being attacked is P(obj):

$$P(obj) = P(obj \mid Pre(obj)) \, P(Pre(obj));$$
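Applied repeatedly along one attack path, the relation P(obj) = P(obj | Pre(obj)) P(Pre(obj)) reduces to a product of the conditional probabilities of the successive steps. The sketch below shows this under simplifying assumptions: each step has a single conditional probability, and the numeric values are hypothetical (they are not the experiment's values, which are not listed in the description).

```python
def path_success_probability(conditionals):
    """Multiply the conditional probabilities of each step along one path.

    conditionals[i] is P(step i succeeds | all earlier steps succeeded).
    """
    probability = 1.0
    for p in conditionals:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        probability *= p
    return probability


# Hypothetical three-step path.
steps = [0.6, 0.5, 0.4]
print(path_success_probability(steps))

# Adding evidence (a step known to have succeeded has probability 1)
# can only keep or raise the overall success probability.
with_evidence = [1.0, 0.5, 0.4]
print(path_success_probability(with_evidence))
```

The second call illustrates why observing evidence for an attribute increases the success probability of every path that contains it.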

(1.6) According to the probability P(obj) that the target node is attacked obtained in step (1.5), the attribute attack graph generated in step (1.3) is converted into a Bayesian attack graph using MulVAL; the generation principle of the attack graph is illustrated schematically in FIG. 4. During traversal, when a child node is already on the stack, a loop exists and is stored in the stack; the difficulty of each atomic attack in the loop is calculated, the atomic attack node with the greatest difficulty is found, and its edge is deleted, which eliminates the loop. The success probability of each penetration path is then analyzed. As can be seen from FIG. 5, the attacker has three attack paths:

① Path_1: [ftp(0,1) and user(0)] → [trust(1,0)] → [user(1) and ftp(1,2)] → [trust(2,1)] → [user(2)] → [root(2)];

② Path_2: [user(0) and sshd(0,1)] → [temp] → [user(1) and ftp(1,2)] → [trust(2,1)] → [user(2)] → [root(2)];

③ Path_3: [user(0) and ftp(0,2)] → [trust(2,0)] → [user(2)] → [root(2)].

According to FIG. 5, the target network has three attack paths. The penetration success probability of each of the three paths is obtained by calculating the probability P(obj) that the target node is attacked. Comparing the penetration success probabilities shows that, given the evidence P(ftp_rhost(0,1)) = 1, P(sshd(0,1)) = 1 and P(ftp_rhost(0,2)) = 1, the success probabilities of the corresponding attack paths Path_1, Path_2 and Path_3 increase, respectively. Since user(0) occurs in all 3 attack paths, the success probabilities of Path_1, Path_2 and Path_3 all increase given the evidence P(user(0)) = 1.

The calculated penetration success probabilities of the three paths are 0.0215, 0.0328 and 0.0258, respectively. In practice, the more attributes an attacker obtains, that is, the more evidence there is, the higher the probability that the attack succeeds. The analysis above shows that adding evidence increases the attack success probability, so the experimental data are consistent with the actual situation.
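The loop-elimination procedure described in step (1.6) can be sketched as a depth-first search that keeps the current path on a stack: when a child node is already on the stack, the stack segment forms a loop, and the edge with the greatest atomic-attack difficulty in that loop is deleted. The graph, the difficulty values and the function names below are hypothetical; MulVAL's own processing is not reproduced here.

```python
def find_cycle(graph):
    """Return the edges of one loop found by depth-first search, or None."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    state = {n: 0 for n in nodes}   # 0 = unvisited, 1 = on stack, 2 = finished
    stack = []

    def dfs(u):
        state[u] = 1
        stack.append(u)
        for v in graph.get(u, ()):
            if state[v] == 1:       # child already on the stack: loop found
                loop = stack[stack.index(v):] + [v]
                return list(zip(loop, loop[1:]))
            if state[v] == 0:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        state[u] = 2
        return None

    for n in sorted(nodes):
        if state[n] == 0:
            found = dfs(n)
            if found:
                return found
    return None


def eliminate_loops(graph, difficulty):
    """Repeatedly delete the most difficult edge of each loop until none remain."""
    graph = {u: list(vs) for u, vs in graph.items()}   # work on a copy
    while True:
        cycle = find_cycle(graph)
        if cycle is None:
            return graph
        u, v = max(cycle, key=lambda edge: difficulty[edge])
        graph[u].remove(v)


# Hypothetical graph with one loop a -> b -> c -> a.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a", "d"]}
DIFFICULTY = {("a", "b"): 0.2, ("b", "c"): 0.9, ("c", "a"): 0.5, ("c", "d"): 0.1}
print(eliminate_loops(GRAPH, DIFFICULTY))
```

Removing the hardest edge discards the step an attacker is least likely to take, so the remaining acyclic graph keeps the most probable paths.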

(2) Allocating a set of k honeypots along the attacker's possible paths according to the probability of being attacked P(obj), defining the cost of deploying a honeypot, setting a reward value Cap, calculating a reward function, and optimizing the honeypot deployment:

The defender does not know the attacker's specific intention, nor much about the attacker's attack graph, but can infer from the Bayesian attack graph which node the attacker will attack next (the next target host is one connected to the current node), and can therefore place honeypots on that path to increase the attacker's attack cost.

In a simple target network, a defender attempts to defend the next potentially intruded node starting from the entry point node. In a complex target network, a defender will defend a node located "jumping off the entry node".

Since an attacker uses a path inside the network, the allocated honeypots also need to cover a path in the network. Otherwise, randomly allocating honeypots will not guarantee the security of the set of nodes we consider in this honeypot deployment model.

(2.1) The defender enumerates each possible attack path and allocates a set of k honeypots along the attacker's possible paths to deceive the attacker, keep him from reaching his target, and mislead his actions. Such honeypots can divert the attacker from the true target node. For a complex target network environment with N paths, the honeypots can be placed according to the attack success probability P(obj) provided by the attack graph;

(2.2) The defender's reward is weighted by the probability P(obj) that the protected or attacked node is breached; placing a new honeypot at the edge of the network incurs a fixed cost for the defender, and the cost of placing this honeypot is assumed to be P; the total deployment cost $R_t$ of the honeypots is calculated as follows:

wherein, l is the total node size of the network system, a is the action matrix, and h is the number of deployed honeypots.

(2.3) For a simple target network: if the honeypot does not capture the attacker, the reward value Cap is set to 0; if the honeypot captures the attacker, the reward value Cap is set to a user-defined value, and the reward function for deploying a single honeypot is defined as:

$$R = -\left(P - Cap \cdot (1 - E_p)\right)$$

where $E_p = P(obj)$ is the probability of the protected or attacked node being attacked. The reward function R accounts for the base cost P of deploying a honeypot: whenever the defender deploys a honeypot at a node, a reward value of −P is obtained, but if the action pays off (the honeypot successfully captures the attacker), a weighted reward value of $Cap \cdot (1 - E_p)$ is also obtained. This offsets the cost P of honeypot deployment to a certain extent.

Three possible attack paths are obtained in the experimental process, so honeypots are deployed in the three possible attack paths;

Eventually the attacker falls into the honeypot on path 2. The total deployment cost obtained by this calculation method is:

$$R_t = -3P + 0.9672\,Cap$$
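The total above can be checked numerically: three honeypots each cost P, and only the honeypot on path 2 (penetration success probability 0.0328) captures the attacker, contributing Cap × (1 − 0.0328) = 0.9672 Cap. The sketch below sums the single-honeypot reward over the three paths; the function names and the sample values P = 1 and Cap = 10 are assumptions chosen only to make the arithmetic concrete.

```python
PATH_PROBABILITIES = [0.0215, 0.0328, 0.0258]   # Path_1, Path_2, Path_3


def single_honeypot_reward(cost, cap, e_p, captured):
    """R = -(P - Cap * (1 - E_p)), with Cap treated as 0 when nothing is captured."""
    effective_cap = cap if captured else 0.0
    return -(cost - effective_cap * (1.0 - e_p))


def total_reward(cost, cap, path_probabilities, captured_index):
    """Sum the rewards of one honeypot per path; only one path captures the attacker."""
    return sum(
        single_honeypot_reward(cost, cap, e_p, index == captured_index)
        for index, e_p in enumerate(path_probabilities)
    )


# Hypothetical values P = 1 and Cap = 10; the attacker is caught on Path_2 (index 1).
print(round(total_reward(1.0, 10.0, PATH_PROBABILITIES, captured_index=1), 4))
```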

For a complex target network: honeypots are deployed according to the probability of being attacked P(obj) calculated in step (1.5). The reward function R is then defined as:

$$R = -\left(P - Cap \cdot (1 - \max(P(obj)))\right)$$

With this new calculation method, the total deployment cost becomes:

$$R_t = -P + 0.9672\,Cap$$

The deployment cost is therefore directly reduced by the cost of two honeypots (2P), the complexity of the protected network space is reduced, and honeypot deployment is optimized.
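The complex-network variant deploys a single honeypot on the path with the highest probability of being attacked, so only one cost P is paid. The sketch below selects that path and evaluates R = −(P − Cap (1 − max P(obj))); it assumes the honeypot does capture the attacker, and the function names and the values P = 1 and Cap = 10 are hypothetical.

```python
PATH_PROBABILITIES = [0.0215, 0.0328, 0.0258]   # Path_1, Path_2, Path_3


def most_likely_path(path_probabilities):
    """Return the zero-based index of the path with the highest P(obj)."""
    return max(range(len(path_probabilities)), key=lambda i: path_probabilities[i])


def complex_network_reward(cost, cap, path_probabilities):
    """R = -(P - Cap * (1 - max(P(obj)))) for one honeypot on the likeliest path."""
    return -(cost - cap * (1.0 - max(path_probabilities)))


index = most_likely_path(PATH_PROBABILITIES)
print("deploy on Path_%d" % (index + 1))
print(round(complex_network_reward(1.0, 10.0, PATH_PROBABILITIES), 4))
```

Compared with placing one honeypot on each of the three paths, the result differs by exactly 2P, which is the saving stated in the text.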

(3) Optimizing the honeypots deployed in the step (2) by using an SARSA reinforcement learning algorithm to obtain an optimal honeypot deployment path, and comprising the following substeps:

(3.1) Establish a Q-table to store the states s, all actions a that may be taken, and the values Q(s, a). The actions are stored in the action space of the Q-table, which serves as the training data set of the network model;

(3.2) Create an Agent, which comprises a learning algorithm and an action space; the learning algorithm determines how the Agent selects a strategy. In this experiment, the Agent's learning algorithm is the SARSA reinforcement learning algorithm, and a schematic diagram of reinforcement learning is shown in FIG. 6. The first network path state is initialized randomly. For each step in a round, based on the current network state s, an action a is selected from the Q-table using ε-greedy (if the state is not yet in the Q-table on the first run, the Q-table creates an s-a action space for it, initialized to all 0). The action a is executed to obtain a new network path state s' and the current reward r; ε-greedy is then used to obtain the action a' in the next state s', and a' is used directly to update the value of Q(s, a) in the table;

(3.3) Step (3.2) is repeated until the Q-table is no longer updated, at which point the optimal strategy π is generated. The update formula is:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left(R + \gamma Q(S', A') - Q(S, A)\right)$$

where α is the learning rate and γ is the reward discount, forming a discounted cumulative reward mechanism. The Agent records the reward values of all strategies, and after the Agent has traversed all possible attack situations in the experimental network, it generates the strategy that maximizes the reward value r, namely the optimal strategy.
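Steps (3.1) to (3.3) can be sketched as a small tabular SARSA loop. Everything about the environment here is an assumption made for illustration: the three attack paths are visited in order as states 0 to 2, the attacker is taken to use Path_2 (as in the example above), doing nothing earns a reward of 0, and the values P = 1, Cap = 2, α = 0.1, γ = 0.9 and ε = 0.1 are hypothetical. Only the path probabilities, the reward formula and the update rule come from the description.

```python
import random

PATH_PROB = [0.0215, 0.0328, 0.0258]   # penetration success probabilities from the text
ATTACKED_PATH = 1    # assumption: the attacker uses Path_2 (index 1)
COST_P = 1.0         # hypothetical honeypot deployment cost P
CAP = 2.0            # hypothetical capture reward Cap
ACTIONS = (0, 1)     # 0 = do nothing, 1 = deploy a honeypot


def reward(state, action):
    """R = -(P - Cap * (1 - E_p)); Cap counts only when the attacker is captured."""
    if action == 0:
        return 0.0   # assumption: doing nothing neither costs nor earns anything
    cap = CAP if state == ATTACKED_PATH else 0.0
    return -(COST_P - cap * (1.0 - PATH_PROB[state]))


def epsilon_greedy(q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise take the best known action."""
    row = q.setdefault(state, [0.0] * len(ACTIONS))   # unseen states start at all 0
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: row[a])


def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {}
    last = len(PATH_PROB) - 1
    for _ in range(episodes):
        s = rng.randrange(len(PATH_PROB))         # random initial path state
        a = epsilon_greedy(q, s, epsilon, rng)
        while True:
            r = reward(s, a)
            if s == last:                          # terminal state: no successor value
                q[s][a] += alpha * (r - q[s][a])
                break
            s2 = s + 1
            a2 = epsilon_greedy(q, s2, epsilon, rng)   # on-policy choice of a'
            q[s][a] += alpha * (r + gamma * q[s2][a2] - q[s][a])
            s, a = s2, a2
    return q


q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[s][a]) for s in sorted(q_table)}
print(policy)   # action 1 means "deploy a honeypot on this path"
```

Because the update uses the action a' actually selected in s' rather than the best action in s', this is the on-policy SARSA rule given in the formula, not Q-learning.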

(3.4) The Agent's task is to learn a "policy" π by repeated trial in the network environment; according to this policy, the action to be performed in state x is known:

$$a = \pi(x)$$

The quality of a policy depends on the cumulative reward obtained after executing it over the long term. In the invention, the action a is one of two actions, deploying a honeypot or doing nothing, and the state s is related only to the attack success rate of the attack path in the Bayesian attack graph. As established in the previous section, path 2 has the highest penetration success rate, so the Agent finally chooses to place the honeypot system on path 2 to increase the attacker's workload. Deploying honeypots changes the structure of the network space, so the attacker's original attack graph may become invalid: if the attacker still uses the old attack path, the attack may fail because of the added honeypots, and if the attacker falls into a honeypot, personal information may be left behind, making it easier for the defender to collect evidence and pursue subsequent sanctions against the attacker.
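Once a Q-table has converged, the policy a = π(x) can be read off by taking, for each state, the action with the largest Q value. The sketch below shows this lookup; the Q values and state names are hypothetical placeholders, not experimental results.

```python
ACTION_NAMES = ("do nothing", "deploy honeypot")

# Hypothetical converged Q-table: state -> [Q(state, do nothing), Q(state, deploy)].
Q_TABLE = {
    "path_1": [0.7, -0.3],
    "path_2": [0.0, 0.9],
    "path_3": [0.0, -1.0],
}


def greedy_policy(q_table):
    """Return pi(x): the index of the highest-valued action for every state x."""
    return {
        state: max(range(len(values)), key=lambda a: values[a])
        for state, values in q_table.items()
    }


policy = greedy_policy(Q_TABLE)
for state in sorted(policy):
    print(state, "->", ACTION_NAMES[policy[state]])
```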

The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. A network honeypot deployment method facing penetration attack is characterized by comprising the following steps:

(1) Scanning and detecting the target network to obtain scanning information of the target network, and storing and classifying the scanning information; performing connectivity analysis according to the network topology relationship and the host vulnerability relationship to generate an attribute attack graph; calculating the exploitability E of each vulnerability using the Common Vulnerability Scoring System, calculating the conditional probability $P_s$ between the attribute nodes and the probability P(obj) of the target node being attacked, and generating a Bayesian attack graph from the attribute attack graph;

(2) Enumerating each possible attack path, allocating a set of k honeypots along the attacker's possible paths according to the probability of being attacked P(obj), defining the cost of deploying a honeypot, setting a reward value Cap, calculating a reward function, and optimizing honeypot deployment;

(3) Optimizing the honeypots deployed in the step (2) by using an SARSA reinforcement learning algorithm in combination with the Bayesian attack graph generated in the step (1) to obtain an optimal honeypot deployment path;

the step (3) includes the substeps of:

(3.1) establishing a Q-table that stores the states S and all actions A that can be taken, denoted Q(S, A); taking all actions A and Q(S, A) as the training data set of the network model;

(3.2) creating an Agent, wherein the Agent comprises a learning algorithm and an action space; the Agent is trained with the SARSA reinforcement learning algorithm; during training, an initial network path state is initialized randomly; then, at each step of a training episode, an action a is selected from the Q-table using the ε-greedy policy based on the current network state S; the action a is executed to obtain a new network path state S' and the current reward r; an action a' in S' is selected using the ε-greedy policy; and the value of Q(S, A) in the table is updated using the action a' as follows, after which training continues;

Q(S, A) = Q(S, A) + α(R + γQ(S', A') - Q(S, A))

where α represents the learning rate and γ is the reward discount factor;

(3.4) obtaining the action a to be executed in state s according to the optimal policy π obtained in step (3.3); the action a is one of two actions, deploying a honeypot or not deploying one, and the state s depends only on the attack success rate of the attack path in the Bayesian attack graph; and finally deploying the honeypots according to the optimal policy π.
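
For illustration only, and not as part of the claim, the SARSA training described in step (3) can be sketched as follows. This is a minimal sketch under stated assumptions: the state is taken to be the index of the attack path under consideration, the next state is drawn uniformly at random (a toy transition), P in the reward formula is read as the deployment cost, and a deployed honeypot is assumed to always capture the attacker. The names reward, epsilon_greedy, train_sarsa and greedy_policy, and all numeric values, are hypothetical.

```python
import random

ACTIONS = (0, 1)  # 0 = do not deploy, 1 = deploy a honeypot on the current path


def reward(action, success_rate, cost=1.0, cap=5.0):
    """R = -(P - Cap * (1 - E_p)) for a deployment; 0 when nothing is deployed.

    Assumption: P is read as the deployment cost (`cost`), and a deployed
    honeypot always captures the attacker, so Cap is never reset to 0 here.
    """
    if action == 0:
        return 0.0
    return -(cost - cap * (1.0 - success_rate))


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train_sarsa(success_rates, episodes=500, steps=20,
                alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA; the state is the index of the attack path considered."""
    rng = random.Random(seed)
    n_paths = len(success_rates)
    q = {(s, a): 0.0 for s in range(n_paths) for a in ACTIONS}
    for _episode in range(episodes):
        state = rng.randrange(n_paths)           # random initial path state
        action = epsilon_greedy(q, state, epsilon, rng)
        for _step in range(steps):
            r = reward(action, success_rates[state])
            next_state = rng.randrange(n_paths)  # toy transition (assumption)
            next_action = epsilon_greedy(q, next_state, epsilon, rng)
            # Q(S, A) = Q(S, A) + alpha * (R + gamma * Q(S', A') - Q(S, A))
            td_target = r + gamma * q[(next_state, next_action)]
            q[(state, action)] += alpha * (td_target - q[(state, action)])
            state, action = next_state, next_action
    return q


def greedy_policy(q, n_paths):
    """Read the deployment decision for each path off the learned Q-table."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(n_paths)}


if __name__ == "__main__":
    rates = [0.3, 0.8, 0.5]  # hypothetical path success rates
    q_table = train_sarsa(rates)
    print(greedy_policy(q_table, len(rates)))
```

The update uses the action a' that the ε-greedy policy actually selects in S', which is what makes the procedure on-policy SARSA rather than Q-learning.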

2. The network honeypot deployment method facing penetration attack according to claim 1, wherein step (1) comprises the following sub-steps:

(1.1) scanning and probing the hosts, ports and vulnerabilities of the target network to obtain scanning information of the target network, and storing and classifying the port information;

(1.2) defining a data set X containing N_host hosts, each host being represented as x_i ∈ R^(Q×H), i = 1, 2, ..., N_host; that is, x_i is a matrix of Q × H elements, wherein Q represents host vulnerabilities and H represents the connectivity relations between hosts;

E = 20 × V × C × U, 0 ≤ E ≤ 10

wherein D represents the difficulty of the corresponding atomic attack;

the difficulty of an atomic attack represents the conditional probability P_s between attribute nodes; the formula is as follows:

P(obj) = P(obj | Pre(obj)) × P(Pre(obj));

(1.6) obtaining the attacker's possible paths according to the probability P(obj) that the target node is attacked, obtained in the preceding step, and converting the attribute attack graph generated in step (1.3) into a Bayesian attack graph; and eliminating loops and analyzing the success probability of each penetration path.
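
For illustration only, the two formulas of this claim that survive in the text above can be sketched as follows. The factors V, C and U are treated as opaque inputs because their definitions are not reproduced in this section; the CVSS v2 exploitability subscore has the same 20 × ... form, which suggests but does not confirm their meaning. How P_s is derived from E is likewise not reproduced here, so the conditional probabilities are supplied directly. All function names and numeric values are hypothetical.

```python
def exploitability(v, c, u):
    """E = 20 * V * C * U, required to satisfy 0 <= E <= 10."""
    e = 20.0 * v * c * u
    if not 0.0 <= e <= 10.0:
        raise ValueError(f"exploitability {e} is outside [0, 10]")
    return e


def node_probability(p_given_pre, p_pre):
    """P(obj) = P(obj | Pre(obj)) * P(Pre(obj))."""
    return p_given_pre * p_pre


def path_success_probability(conditional_probs, p_entry=1.0):
    """Apply the formula node by node along one loop-free attack path.

    `conditional_probs` lists P_s for each step, entry side first;
    `p_entry` is the assumed probability that the entry node is reached.
    """
    p = p_entry
    for p_s in conditional_probs:
        p = node_probability(p_s, p)
    return p


if __name__ == "__main__":
    print(exploitability(1.0, 0.71, 0.704))        # hypothetical factors
    print(path_success_probability([0.8, 0.5]))    # hypothetical two-step path
```

Because each node on a loop-free path has a single predecessor on that path, the path probability reduces to the product of the conditional probabilities along it.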

3. The network honeypot deployment method facing penetration attack according to claim 1, wherein step (2) comprises the following sub-steps:

(2.1) enumerating each possible attack path by the defender, and distributing a group of k honeypots along the possible paths of the attacker to make the attacker deviate from a real target node;

(2.3) for a simple target network, specifically: if the new honeypot does not capture the attacker, the reward value Cap is set to 0; if the new honeypot captures the attacker, the reward value Cap is user-defined, and the reward function R for deploying a single honeypot is defined as follows:

R = -(P - Cap × (1 - E_p))

wherein E_p = P(obj) is the probability that the protected node is attacked;

for a complex target network, specifically: if the new honeypot does not capture the attacker, the reward value Cap is set to 0; if the new honeypot captures the attacker, the reward value Cap is user-defined, and honeypots are deployed according to the probability of being attacked P(obj) calculated in step (1.5); the formula of the reward function is as follows: