CN115913731A - Strategic honeypot deployment defense method based on intelligent penetration test - Google Patents

Strategic honeypot deployment defense method based on intelligent penetration test

Info

Publication number
CN115913731A
CN115913731A
Authority
CN
China
Prior art keywords
host
penetration
attacker
honeypot
attack
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211508855.6A
Other languages
Chinese (zh)
Inventor
郑海斌
刘欣然
陈晋音
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211508855.6A priority Critical patent/CN115913731A/en
Publication of CN115913731A publication Critical patent/CN115913731A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A reward defense method based on intelligent penetration testing comprises the following steps. Step 1: gradually penetrate each subnet of a target network, locate the sensitive host, and attack it; generate state-action values using a deep neural network as a function approximator; carry out the penetration test through operations such as vulnerability exploitation, scanning, and privilege escalation against the target sensitive host. Step 2: protect the sensitive host by deploying honeypots in the target network, obtain the optimal number of honeypots with the reinforcement-learning QL algorithm while accounting for the gains and costs of both defender and attacker, and verify the defense effect of the method. Step 3: place firewalls with different reward values in the target network to which honeypot hosts have been added; a firewall with a high reward value in the honeypot-host subnet lures the attacker into the honeypot host, while a firewall with a low reward value in the sensitive-host subnet protects the sensitive host from attack, achieving a better defense effect.

Description

Strategic honeypot deployment defense method based on intelligent penetration test
Technical Field
The invention belongs to the field of agent reinforcement-learning penetration testing and honeypot-deployment defense, and in particular relates to a strategic honeypot deployment defense method based on intelligent penetration testing.
Background
In recent years, with the development and wide application of Internet technology, attacks launched over the Internet have raised numerous security problems. Penetration testing (PT, or PenTesting) is an active, authorized simulated network attack aimed at assessing network security and discovering hidden vulnerabilities. Penetration testing currently plays a crucial role in hardening computer systems against cyber attacks, because digital assets are exposed to hackers' persistent, varied, and increasingly complex threats more than ever before. By simulating an attacker intruding on a sensitive host, a penetration test can discover the vulnerable links and hidden risks of a system and support security evaluation. However, penetration testing illegally used by malicious actors threatens network security, so corresponding defenses against penetration attacks are required.
Network spoofing defense is a method in which the defender interferes with and misleads the attacker's cognition and decision-making, so that the attacker takes actions unfavorable to penetration success; the defender can thereby detect, delay, or interrupt the attack process and enhance the security of the target network. Facing the attacker's behavior, the defender focuses on providing special protection for the attack target.
Deploying honeypots in a network contributes greatly to maintaining its security: honeypots serve as decoys that mislead network attackers and protect real assets, thereby defending against penetration attacks. Setting the reward value is a key link in training a reinforcement-learning algorithm, since reward is the direct source of experience from which the agent continuously improves and autonomously reaches its goal. The agent judges its own performance by receiving rewards from the environment, and is thereby driven toward the target state by preferentially selecting high-profit behaviors. Changing the reward value can influence the penetration strategy learned by the reinforcement-learning agent, achieving the purpose of defending against penetration attacks.
In summary, to prevent reinforcement-learning-based intelligent penetration testing from being maliciously exploited, it is of great significance to explore a strategic honeypot deployment method against intelligent penetration testing that reduces its success rate, and to use firewalls with different reward values to lure attackers into the deployed honeypots more easily, thereby improving the defense effect.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and, based on the principle of reinforcement-learning intelligent penetration testing, provides a strategic honeypot deployment defense method for intelligent penetration testing, so as to maintain network security.
Unlike other defense methods, the present invention performs penetration defense by deploying honeypots based on reinforcement learning, which can determine the optimal number and locations of the honeypots needed by a target network. By setting different types of firewalls and modifying the reward values, the penetration test is defended against and the attacker is lured into a honeypot, yielding a better defense effect.
The invention relates to a reward defense method based on intelligent penetration testing, which comprises the following steps:
Step 1: penetration attack. The penetration test gradually penetrates each subnet of the target network, locates the sensitive host, and attacks it. The results are finally compiled into a report for the testers, who can use it to harden and defend the system and thereby improve network security. Based on the QL algorithm, a deep neural network is used as a function approximator to generate state-action values. The penetration test is carried out by performing operations such as vulnerability exploitation, scanning, and privilege escalation against the target sensitive host.
Step 2: honeypot deployment. Sensitive hosts are protected by deploying honeypots in the target network; the honeypots are deployed strategically using the reinforcement-learning QL algorithm to obtain the optimal number of honeypots while accounting for the gains and costs of both defender and attacker. The defense effect is then verified on this basis.
Step 3: firewall defense. Firewalls with different reward values are placed in the target network to which honeypot hosts have been added. A firewall with a high reward value is set in the honeypot-host subnet to lure the attacker into the honeypot host, and a firewall with a low reward value is set in the sensitive-host subnet to protect the sensitive host from attack, achieving a better defense effect. This step still adopts the QL algorithm framework, and defense is performed by modifying firewalls with different reward values, ultimately further reducing the attacker's penetration success rate.
The above steps are iterated until the algorithm model fully converges, completing the training of the QL-based model. After training, a well-performing reward defense method based on intelligent penetration testing is obtained.
The technical conception of the invention is as follows. 1. Network penetration testing is an effective method for evaluating the security of a network system: penetration testers can mine vulnerabilities in a network and its hosts and carry out security evaluation by simulating hacker behavior. 2. Compared with manual penetration testing and automated penetration attacks, reinforcement-learning-based penetration attacks are stronger, so developing defenses against them is of great significance. 3. Preventing the illegal use of reinforcement-learning-based penetration testing is one means of safeguarding network security. Based on the QL algorithm, the target network is penetrated and honeypots are strategically deployed to defend against attackers; both the penetration process and the honeypot-deployment process can be modeled as a Markov decision process (MDP). An MDP can be expressed as a quadruple <S, A, R, P>, where S is the state-space set, A is the action-space set, P is the state-transition matrix, i.e., the probability of moving from the state s_t at the current time t, after taking action a_t, to the next state is P(s_{t+1} | s_t, a_t), and R is the reward function. The agent's goal is to maximize the cumulative discounted return:
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$
where γ ∈ (0, 1) is a discount factor measuring the importance of current rewards relative to future ones. The main idea of the QL algorithm is to store the agent's states and actions at different times in a Q-table of Q(s, a) values, and then to select the most profitable action according to the Q values in the table. The update mechanism is as follows:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]$$
where s denotes the attacker's current state, a the current action, and s' and a' the next state reached after the attacker takes action a and the actions available in that next state; r denotes the immediate reward, while α and γ denote the learning rate and the discount factor, respectively.
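As a concrete illustration of the update mechanism above, the following is a minimal tabular Q-learning sketch in Python. The state encoding, hyperparameter values, and epsilon-greedy exploration scheme are illustrative assumptions rather than details fixed by the invention:

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate alpha in the update rule
GAMMA = 0.9    # discount factor gamma in (0, 1)
EPSILON = 0.1  # exploration rate for epsilon-greedy selection

Q = defaultdict(float)  # Q-table mapping (state, action) -> Q(s, a)

def choose_action(state, actions):
    """Epsilon-greedy selection: usually the highest-Q action, sometimes random."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, next_actions):
    """One step of the update: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in next_actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```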
Compared with the prior art, the beneficial effects of the invention are at least the following. The reward defense method for intelligent penetration testing provided by the invention takes the QL algorithm as its network foundation. Adding the optimal number of honeypots to the target network for reinforcement-learning-based intelligent defense reduces the cost of defending against penetration attacks and improves the defense effect. Firewalls are treated as barriers in the topology of the target network; by adding firewalls with different reward values to the subnets containing the honeypot network, an attacker is made to fall into the honeypot hosts designed for him and is prevented from penetrating the sensitive hosts in the target network, improving defense efficiency and success rate.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of a reward defense model of the present invention.
Fig. 2 is a general flow chart of the present invention.
Fig. 3 is a practical application diagram of the scheme in the constructed network.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a reward defense method based on intelligent penetration testing, which takes the reward data of the reinforcement-learning agent as the entry point and performs reward defense against intelligent penetration testing while an attacker carries out a penetration attack. The target network has 4 subnets and 8 hosts, as shown in Fig. 3. The goal is to penetrate the sensitive host (4, 0) in subnet 4. Following the flow chart of Fig. 2, the method comprises the following steps:
s1, carrying out penetration attack on a target network.
S11, in carrying out the intelligent penetration attack based on the Q-learning algorithm, the attacker continuously optimizes the automatically generated attack path and finally obtains the optimal attack path. The agent is treated as the penetration attacker, and S = {s_1, s_2, ..., s_i} denotes the attacker's state set, where s_i is the information about the hosts that the attacker obtains from the extranet with a scanning tool at a certain moment i.
S12, based on the acquired state information, the attacker penetrates and intrudes on the target host. The attacker's action set can be represented as A = {a_1, a_2, ..., a_i}, where a_i is the action taken in the state s_i obtained from the attacker's interaction with the environment. The actions taken during the penetration attack are shown in the table below.
Name                        Type                   Cost  Probability  Access rights
SSH                         Exploitation           3     0.9          User
FTP                         Exploitation           1     0.6          User
HTTP                        Exploitation           2     0.9          User
Tomcat                      Privilege escalation   1     1            Root
Daclsvc                     Privilege escalation   1     1            Root
Service scanning            Scanning               1     1            /
Operating system scanning   Scanning               1     1            /
Subnet scanning             Scanning               1     1            /
Process scanning            Scanning               1     1            /
S13, according to whether the attacker's action a_i successfully penetrates the target host, the attacker receives reward feedback r_i, and R = {r_1, r_2, ..., r_n} denotes the set of rewards earned by the attacker. The reward value is calculated as follows:
[Formula: calculation of the reward value r_i, rendered as an image in the original publication]
S14, during penetration, the attacker's goal is to maximize the accumulated reward value, i.e., to penetrate the most valuable sensitive host with as few operations as possible, as shown in the following formula:
$$\max_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$$
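To make steps S11-S14 concrete, the sketch below models the attacker's action set from the table above together with a toy environment step; the reward model (value gained on a successful exploit minus the action's cost), the host value, and the state encoding are assumptions for illustration only:

```python
import random

ACTIONS = {
    # name: (type, cost, probability) -- values from the table above
    "ssh_exploit":     ("exploit", 3, 0.9),
    "ftp_exploit":     ("exploit", 1, 0.6),
    "http_exploit":    ("exploit", 2, 0.9),
    "tomcat_privesc":  ("privesc", 1, 1.0),
    "daclsvc_privesc": ("privesc", 1, 1.0),
    "service_scan":    ("scan", 1, 1.0),
    "os_scan":         ("scan", 1, 1.0),
    "subnet_scan":     ("scan", 1, 1.0),
    "process_scan":    ("scan", 1, 1.0),
}

HOST_VALUE = 100.0  # assumed value of the sensitive host

def step(state, action):
    """Hypothetical environment step: success is Bernoulli with the action's
    probability; the reward is the value gained minus the action's cost."""
    kind, cost, p = ACTIONS[action]
    success = random.random() < p
    reward = (HOST_VALUE if success and kind == "exploit" else 0.0) - cost
    next_state = (state, action, success)  # placeholder state encoding
    return next_state, reward, success

s1, r1, ok = step("external", "ftp_exploit")  # one attacker action
```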
and S2, after the penetration attack is carried out on the target network according to the method in the step S1, defense is carried out by arranging a honeypot host in the target network.
S21, the honeypot deployment problem is regarded as a security game between two adversarial players. N is the total number of connected stations that can serve as honeypots or access points, and the proportion of these N stations used as honeypots is denoted by θ. s_{D,i} ∈ {-1, 1} represents the defender's policy: when a network attack is directed at a real access point or at a honeypot, s_{D,i} equals -1 or 1, respectively, and s_{A,i} represents the attacker's policy. The defender's utility function U_D[t] over the time interval t can be expressed as:
[Formula: the defender's utility U_D[t] over the time interval t, rendered as an image in the original publication]
where δ_1 denotes the defender's return for each attack that hits a honeypot, δ_2 the return for each attack detected without using honeypots, and δ_3 the cost of each attack the defender fails to detect in time. I_{r,i} = 1 when an attack is detected and I_{r,i} = 0 when there is no attack; s_{D,i} = 1 means the attacked device is a honeypot; and C is associated with the cost incurred by using honeypots.
S22, for the honeypots to achieve a better defense effect, the defender's optimal strategy is to distribute them randomly, so that the attacker cannot recognize their existence. U_D denotes the defender's utility function within the time interval t; since the defender cannot know the attack times a priori, the goal is to optimize the expected value of U_D:
[Formula: the expected value of U_D, rendered as an image in the original publication]
The probability φ that each connected device is attacked is related to the honeypot proportion θ of the corresponding assets as follows:
[Formula: the attack probability φ as a function of θ and P_r, rendered as an image in the original publication]
P_r is the probability of detecting an attack in the absence of honeypots; when P_r = 1, the honeypots provide no profit to the defender. To achieve the best defense effect, an appropriate value of θ is chosen to maximize the utility function U_D[t].
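The θ selection of S21-S22 can be sketched as a simple grid search over the defender's expected utility. The utility model below, including the values of δ_1, δ_2, δ_3, the honeypot cost, and P_r, is an assumed stand-in for the patent's utility function, which appears only as an image:

```python
N = 20               # connected stations usable as honeypots or access points
DELTA1 = 10.0        # defender's return per attack that hits a honeypot
DELTA2 = 6.0         # return per attack detected without honeypots
DELTA3 = 8.0         # cost of an attack that is not detected in time
HONEYPOT_COST = 1.0  # assumed cost C per deployed honeypot
P_R = 0.3            # detection probability without honeypots

def expected_utility(theta, attacks=10):
    """Assumed model: an attack lands on a honeypot with probability theta;
    otherwise it is detected with probability P_R."""
    per_attack = (theta * DELTA1
                  + (1 - theta) * (P_R * DELTA2 - (1 - P_R) * DELTA3))
    return attacks * per_attack - HONEYPOT_COST * theta * N

best_theta = max((i / 100 for i in range(101)), key=expected_utility)
print(f"theta* = {best_theta:.2f}, E[U_D] = {expected_utility(best_theta):.2f}")
```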
S23, the number of honeypots that can be deployed in the next round represents a possible action a, and the current number of honeypots represents the current state s. The most suitable number of honeypots for the target network is obtained through the QL-algorithm update mechanism.
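A minimal sketch of S23 follows: the honeypot count is treated as the state, the count chosen for the next round as the action, and the same QL update converges on the most suitable number. The reward function here is a placeholder standing in for the defender's utility measured from simulated penetration episodes:

```python
import random
from collections import defaultdict

MAX_HONEYPOTS = 8
ALPHA, GAMMA = 0.1, 0.9
Q = defaultdict(float)

def deployment_reward(count):
    # placeholder: utility peaks at an interior count, each honeypot costs upkeep
    return -2.0 * abs(count - 3) - 0.5 * count

for episode in range(2000):
    s = random.randint(0, MAX_HONEYPOTS)  # current honeypot count (state)
    a = random.randint(0, MAX_HONEYPOTS)  # count chosen for next round (action)
    r = deployment_reward(a)
    best_next = max(Q[(a, a2)] for a2 in range(MAX_HONEYPOTS + 1))
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

best_count = max(range(MAX_HONEYPOTS + 1), key=lambda a: Q[(0, a)])
print("most suitable honeypot count:", best_count)
```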
S24, a penetration attack is launched against the sensitive host of the target network using the method of step S1; if, within the specified number of rounds, the attacker falls into a honeypot and cannot successfully penetrate the sensitive host, the method is shown to have a good defense effect.
S3, by setting firewalls with different reward values, the attacker is made to fall more easily into the honeypots designed for him, and the sensitive hosts are better protected.
S31, the reward is shaped with a term k to incentivize realistic attack activity, so that the reward of state s after taking action a becomes:
R(s,a) = R(s,a) + k(s)    (7)
S32, the term k is used to reduce the reward so as to encourage strong firewall protection. Its values are as follows:
$$k(s) = \begin{cases} 0.8\,w, & \text{if the firewall of } s \text{ is FTP-based} \\ 0.2\,w, & \text{if the firewall of } s \text{ is SSH-based} \end{cases}$$
where w ≤ 0 is a parameter for adjusting the incentive strength. The change in reward can be varied according to the security of the communication protocol. Since FTP has a k-multiplier of 0.8 and SSH a k-multiplier of 0.2, it can be harder for an attacker to penetrate a host behind an FTP-based firewall than one behind SSH.
S33, based on this principle, the attacker is more inclined to attack and penetrate the subnet containing SSH. Using the honeypot network of step S2, the SSH firewall is set in the subnet holding honeypot hosts, while the FTP firewall is set in the sensitive-host subnet, making it easier for the attacker to fall into a honeypot host and harder to penetrate the sensitive host.
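The reward shaping of S31-S33 can be sketched as below, using the k-multipliers named in the text (0.8 for FTP, 0.2 for SSH) scaled by a parameter w ≤ 0; the base reward and the exact form of k(s) are assumptions, since the original formula appears only as an image:

```python
W = -5.0  # w <= 0 adjusts the incentive strength

K_MULTIPLIER = {"FTP": 0.8, "SSH": 0.2}

def k(firewall_type: str) -> float:
    """Protocol-dependent reward penalty for the firewall guarding a state."""
    return W * K_MULTIPLIER.get(firewall_type, 0.0)

def shaped_reward(base_reward: float, firewall_type: str) -> float:
    """R(s,a) + k(s): FTP-guarded (sensitive) subnets look costly to the
    attacker, SSH-guarded (honeypot) subnets look comparatively attractive."""
    return base_reward + k(firewall_type)

print(shaped_reward(10.0, "FTP"))  # 10 - 4.0 = 6.0 -> attacker deterred
print(shaped_reward(10.0, "SSH"))  # 10 - 1.0 = 9.0 -> attacker lured
```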
S34, a penetration attack is launched against the sensitive host of the target network based on the QL algorithm; if more rounds are required to penetrate the sensitive host than in step S2, the model is shown to have a better defense effect.
This embodiment provides a reward defense method based on intelligent penetration testing for defending against reinforcement-learning-based network penetration attacks. The attacked network model is defined by the hosts, connections, and configurations on the network, expressed as the tuple {subnet, topology, host, service, firewall}. To interact with the network model, the interaction between the attacker agent and the environment is modeled as a reinforcement-learning Markov decision process based on the QL algorithm, realizing the interaction between the network and the penetration environment. In application, different penetration environments are obtained by modifying the target network model, and the practicality and defense effect of the method can be verified through agent reinforcement learning.
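As an illustration, the tuple {subnet, topology, host, service, firewall} can be represented as below; the field names and the example values (mirroring the 4-subnet network of Fig. 3 with the sensitive host at (4, 0)) are assumptions chosen for demonstration:

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    address: tuple            # (subnet id, host id), e.g. (4, 0)
    services: list            # e.g. ["ssh", "ftp", "http"]
    sensitive: bool = False
    honeypot: bool = False

@dataclass
class NetworkModel:
    subnets: list                                  # subnet ids
    topology: dict                                 # subnet -> reachable subnets
    hosts: list = field(default_factory=list)
    firewalls: dict = field(default_factory=dict)  # (src, dst) -> firewall type

net = NetworkModel(
    subnets=[1, 2, 3, 4],
    topology={1: [2], 2: [3, 4], 3: [4], 4: []},
    hosts=[Host((4, 0), ["ssh"], sensitive=True),
           Host((3, 0), ["ssh"], honeypot=True)],
    firewalls={(2, 3): "SSH", (2, 4): "FTP"},
)
```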
The concrete defense model is shown in Fig. 1. The target-network reward data of the attacker agent during training is captured and modified by the defender, and all firewalls in the network are initially set to the SSH type. Penetration defense is conducted by embedding honeypot hosts in the subnets according to step S2, and the defense effect is verified by launching a penetration attack against the sensitive host: if the attacker cannot obtain usage authority over the sensitive host by the end of the training rounds, the defense is successful. The firewall connected to the sensitive host is then set to the FTP type while the other firewalls remain SSH, so that the attacker is steered away from the sensitive host and more easily falls into the deployed honeypots; the same penetration attack is launched against the sensitive host, and verification shows that the model achieves a better penetration-defense effect.
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments and are not intended to limit the invention; any modifications, additions, or equivalents made within the scope of the principles of the present invention shall be included in the protection scope of the invention.

Claims (4)

1. A reward defense method based on intelligent penetration testing, comprising the following steps:
step 1, penetration attack: the penetration test locates the position of a sensitive host and attacks it by gradually penetrating each subnet of a target network; the results are finally compiled into a report provided to the testers, who can harden and defend the system through the report to improve network security; based on the QL algorithm, a deep neural network is used as a function approximator to generate state-action values; the penetration test is carried out by performing operations such as vulnerability exploitation, scanning, and privilege escalation against the target sensitive host;
step 2, honeypot deployment: sensitive hosts are protected by deploying honeypots in the target network, and the reinforcement-learning QL algorithm is used to deploy honeypots strategically, obtaining the optimal number of honeypots while accounting for the gains and costs of defenders and attackers; the defense effect is verified on this basis;
step 3, firewall defense: firewalls with different reward values are respectively set in the target network to which honeypot hosts have been added; an attacker is lured into the honeypot host by setting a firewall with a high reward value in the honeypot-host subnet, and a firewall with a low reward value is set in the sensitive-host subnet to protect the sensitive host from attack, achieving a better defense effect; this step still adopts the QL algorithm framework, and defense is performed by modifying firewalls with different reward values, further reducing the attacker's penetration success rate.
2. The reward defense method based on intelligent penetration testing of claim 1, wherein step 1 specifically comprises:
s11, in the process of realizing intelligent penetration attack based on the Q-learning algorithm, an attacker can continuously optimize an attack path automatically generated and finally obtain an optimal attack path; the agent is treated as a penetration attacker, S = { S = 1 ,s 2 ,...,s i Represents the set of states of the attacker, where s i The attacker acquires information about the host from the extranet by using a scanning tool at a specific moment i;
s12, according to the acquired state information, an attacker infiltrates and invades the target host; the set of actions of an attacker can be represented as a = { a = { 1 ,a 2 ,...,a i In which a is i Is the state s obtained by the attacker from his interaction with the environment i The actions taken, the actions taken during the penetration attack are shown in the table below;
name (R) Type (B) Cost consumption Probability of occurrence Access rights SSH Penetration of 3 0.9 User FTP Penetration of 1 0.6 User HTTP Penetration of 2 0.9 User Tomcat Weight raising 1 1 Root Daclsvc Right to be increased 1 1 Root Service scanning Scanning 1 1 / Operating system tracing Scanning 1 1 / Subnet scanning Scanning 1 1 / Process scanning Scanning 1 1 /
S13, according to whether the attacker's action a_i successfully penetrates the target host, the attacker receives reward feedback r_i, and R = {r_1, r_2, ..., r_n} denotes the set of rewards earned by the attacker; the reward value is calculated as follows:
[Formula: calculation of the reward value r_i, rendered as an image in the original publication]
S14, during penetration, the attacker's goal is to maximize the accumulated reward value, i.e., to penetrate the most valuable sensitive host with as few operations as possible, as shown in the following formula:
$$\max_{\pi}\ \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right]$$
3. The reward defense method based on intelligent penetration testing of claim 1, wherein step 2 specifically comprises:
s21, regarding the problem of honeypot deployment as the safety countermeasure of two enemy players; n is the total number of connected stations that can be used as honeypots or access points; the ratio of the utilized N of the honeypots is represented by theta; s is D,i E { -1,1} represents the policy of the defender: when a network attack is directed to a real access point or honeypot, s D,i Equal to-1 and 1,S, respectively A,i Utility function U representing aggressor policy, guardian over time interval t D [t]Can be expressed as:
[Formula: the defender's utility U_D[t] over the time interval t, rendered as an image in the original publication]
wherein δ_1 denotes the defender's return for each attack that hits a honeypot, δ_2 the return for each attack detected without using honeypots, and δ_3 the cost of each attack the defender fails to detect in time; I_{r,i} = 1 when an attack is detected and I_{r,i} = 0 when there is no attack; s_{D,i} = 1 means the attacked device is a honeypot; C is associated with the cost incurred by using honeypots;
s22, in order to enable the honeypots to have a better defense effect, the optimal strategy of a defense party is to randomly distribute the honeypots, so that an attacker cannot identify the existence of the honeypots; wherein U is D Representing the utility function of the defender within the time interval t, the goal being to optimize U, since the defender cannot know the attack times a priori D Expected value of
[Formula: the expected value of U_D, rendered as an image in the original publication]
and the probability φ that each connected device is attacked is related to the honeypot proportion θ of the corresponding assets as follows:
[Formula: the attack probability φ as a function of θ and P_r, rendered as an image in the original publication]
P_r is the probability of detecting an attack in the absence of honeypots; when P_r = 1, the honeypots provide no profit to the defender; to achieve the best defense effect, an appropriate value of θ is selected to maximize the utility function U_D[t];
S23, the number of honeypots that can be deployed in the next round represents a possible action a, and the current number of honeypots represents the current state s; the most suitable number of honeypots for the target network is obtained based on the QL-algorithm update mechanism;
S24, a penetration attack is launched against the sensitive host of the target network based on the method of step S1; if, within the specified number of rounds, the attacker falls into a honeypot and cannot successfully penetrate the sensitive host, the method is proved to have a good defense effect.
4. The reward defense method based on intelligent penetration testing of claim 1, wherein step 3 comprises:
S31, the reward is shaped with a term k to incentivize realistic attack activity, so that the reward of state s after taking action a becomes:
R(s,a) = R(s,a) + k(s)    (7)
S32, the term k is used to reduce the reward so as to encourage strong firewall protection; its values are as follows:
$$k(s) = \begin{cases} 0.8\,w, & \text{if the firewall of } s \text{ is FTP-based} \\ 0.2\,w, & \text{if the firewall of } s \text{ is SSH-based} \end{cases}$$
wherein w ≤ 0 is a parameter for adjusting the incentive strength; the change in reward can be varied according to the security of the communication protocol; since FTP has a k-multiplier of 0.8 and SSH a k-multiplier of 0.2, it can be harder for an attacker to penetrate a host behind an FTP-based firewall than one behind SSH;
S33, based on this principle, the attacker is more inclined to attack and penetrate the subnet containing SSH; using the honeypot network of step S2, the SSH firewall is set in the subnet holding honeypot hosts, while the FTP firewall is set in the sensitive-host subnet, making it easier for the attacker to fall into a honeypot host and harder to penetrate the sensitive host;
and S34, a penetration attack is launched against the sensitive host of the target network based on the QL algorithm; if more rounds are required to penetrate the sensitive host than in step S2, the model is proved to have a better defense effect.
CN202211508855.6A 2022-11-29 2022-11-29 Strategic honeypot deployment defense method based on intelligent penetration test Pending CN115913731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211508855.6A CN115913731A (en) 2022-11-29 2022-11-29 Strategic honeypot deployment defense method based on intelligent penetration test

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211508855.6A CN115913731A (en) 2022-11-29 2022-11-29 Strategic honeypot deployment defense method based on intelligent penetration test

Publications (1)

Publication Number Publication Date
CN115913731A 2023-04-04

Family

ID=86477799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211508855.6A Pending CN115913731A (en) 2022-11-29 2022-11-29 Strategic honeypot deployment defense method based on intelligent penetration test

Country Status (1)

Country Link
CN (1) CN115913731A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117061191A (en) * 2023-08-25 2023-11-14 哈尔滨工程大学 Bait file deployment method, equipment and medium based on incomplete information game
CN117061191B (en) * 2023-08-25 2024-05-10 哈尔滨工程大学 Bait file deployment method, equipment and medium based on incomplete information game
CN117081855A (en) * 2023-10-13 2023-11-17 深圳市前海新型互联网交换中心有限公司 Honeypot optimization method, honeypot protection method and honeypot optimization system
CN117081855B (en) * 2023-10-13 2024-02-02 深圳市前海新型互联网交换中心有限公司 Honeypot optimization method, honeypot protection method and honeypot optimization system

Similar Documents

Publication Publication Date Title
CN111371758B (en) Network spoofing efficiency evaluation method based on dynamic Bayesian attack graph
CN115913731A (en) Strategic honeypot deployment defense method based on intelligent penetration test
Chung et al. Game theory with learning for cyber security monitoring
Stiawan et al. Cyber-Attack Penetration Test and Vulnerability Analysis.
CN108092948A (en) A kind of recognition methods of network attack mode and device
CN111245787A (en) Method and device for equipment defect identification and equipment defect degree evaluation
Fugate et al. Artificial intelligence and game theory models for defending critical networks with cyber deception
Wan et al. Foureye: Defensive deception against advanced persistent threats via hypergame theory
Lu et al. Cyber deception for computer and network security: Survey and challenges
CN116319060A (en) Intelligent self-evolution generation method for network threat treatment strategy based on DRL model
KR20170091989A (en) System and method for managing and evaluating security in industry control network
Farsole et al. Ethical hacking
Lin et al. Effective proactive and reactive defense strategies against malicious attacks in a virtualized honeynet
Ma et al. Game theory approaches for evaluating the deception-based moving target defense
Gutierrez et al. Adapting honeypot configurations to detect evolving exploits
Pham et al. A quantitative framework to model reconnaissance by stealthy attackers and support deception-based defenses
Rowe et al. Deception in cyber attacks
CN115473677A (en) Penetration attack defense method and device based on reinforcement learning and electronic equipment
Gao et al. A cyber deception defense method based on signal game to deal with network intrusion
EP3252645B1 (en) System and method of detecting malicious computer systems
Djamaluddin et al. Web deception towards moving target defense
Major et al. Creating cyber deception games
Nassar et al. Game theoretical model for cybersecurity risk assessment of industrial control systems
CN116074114B (en) Network target range defense efficiency evaluation method, device, equipment and storage medium
Fan et al. HoneyDecoy: A Comprehensive Web-Based Parasitic Honeypot System for Enhanced Cybersecurity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination