CN115473706A - Deep reinforcement learning intelligent penetration test method and device based on imitation learning - Google Patents

Deep reinforcement learning intelligent penetration test method and device based on imitation learning

Info

Publication number
CN115473706A
CN115473706A (application CN202211046763.0A)
Authority
CN
China
Prior art keywords
network
training
action
agent
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211046763.0A
Other languages
Chinese (zh)
Inventor
陈晋音
胡书隆
李晓豪
李玮峰
赵云波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211046763.0A priority Critical patent/CN115473706A/en
Publication of CN115473706A publication Critical patent/CN115473706A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433Vulnerability analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0813Configuration setting characterised by the conditions triggering a change of settings
    • H04L41/082Configuration setting characterised by the conditions triggering a change of settings the condition being updates or upgrades of network functionality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/04Processing captured monitoring data, e.g. for logfile generation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a deep reinforcement learning intelligent penetration test method and device based on imitation learning, wherein the method comprises the following steps: (1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds; (2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test; (3) during the training of the agent, putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network to train the discriminator network; (4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function; (5) repeating steps (2)-(4) until the training rounds are finished; (6) deploying the trained agent in a network environment that requires penetration testing to perform the penetration test.

Description

Deep reinforcement learning intelligent penetration test method and device based on imitation learning
Technical Field
The invention belongs to the technical field of cyberspace security defense and deep reinforcement learning, and particularly relates to a deep reinforcement learning intelligent penetration testing method and device based on imitation learning.
Background
With the continuous development of artificial intelligence and internet technologies, network attack techniques are also constantly evolving. Penetration testing, as a network security testing and evaluation method, probes the potential security hazards of a target network by simulating the real attack behavior of a hacker, so as to eliminate those hazards and improve system security. In red-versus-blue military confrontation scenarios, penetration testing is widely used: the penetrating side, acting as the blue army, performs penetration evaluation of the vulnerabilities present in the military combat network by simulating malicious hacker attacks, thereby achieving the goal of defending against malicious network attacks. The penetration testing process includes an active analysis of all vulnerabilities, technical deficiencies and leaks of the network system from the position where an attacker might be located, and from this position a security hole is conditionally and actively penetrated. A complete penetration test mainly comprises seven steps: pre-engagement interaction, information gathering, threat modeling, vulnerability analysis, penetration attack, post-penetration attack and report generation. In summary, penetration testing carries out controlled attacks on a computer system to assess its security, and it is currently one of the key methods adopted by international network security organizations to strengthen defenses against network threats.
However, network penetration testing requires a great deal of training and time to obtain good results, while skilled network security professionals are in increasingly short supply, so making penetration testing intelligent and saving labor cost is very important. Penetration testing discovers the security holes an attacker may exploit through authorized, controlled attacks on the network system. This approach is very effective for assessing system security because it essentially simulates the behavior of a real-world attacker in a real-world scenario. The major drawback behind this effectiveness, however, is the high cost in time and skill required to perform the penetration. As network systems grow in size, complexity and number, this cost problem becomes increasingly hard to ignore, which places higher demands on security professionals, demands that currently cannot be met quickly enough. One approach to this problem is to apply Artificial Intelligence (AI) techniques to the field of network security in order to automate the penetration testing process and make it intelligent. Current automated penetration testing methods rely on model-based approaches and their penetration efficiency is generally not high; moreover, the network security situation changes rapidly with the development of new software and attack vectors, which makes producing and maintaining new models a challenge.
Disclosure of Invention
In view of the problems in the prior art, the embodiments of the present application aim to provide a deep reinforcement learning intelligent penetration testing method and device based on imitation learning, so as to improve the efficiency of automated penetration testing.
According to a first aspect of the embodiments of the present application, there is provided a deep reinforcement learning intelligent penetration testing method based on imitation learning, including:
(1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
(2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test;
(3) putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
(4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
(5) repeating steps (2)-(4) until the training rounds are finished;
(6) deploying the trained agent in a network environment that requires penetration testing so that the agent performs the penetration test.
Further, training the agent by using the A3C algorithm, including:
(1.1) framing the penetration test as a Markov decision process;
(1.2) training all sub-threads of the agent using the AC algorithm, wherein the training process of each sub-thread comprises the following steps:
(1.2.1) inputting the state of the current moment to obtain the corresponding policy;
(1.2.2) constructing an advantage function from the difference between the reward function and the value function to evaluate the policy, the advantage function being:
A(s,t) = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n·V(s') − V(s) = R(t) − V(s)
where γ is the discount factor with value range (0, 1), R(·) is the reward function, and V(·) is the value function;
(1.2.3) updating the parameters of the operator network and the critical network in the child thread by using the strategy gradient:
θ_i ← θ_i + ∇_{θ_i} log π(a|s; θ_i) · A(s; μ_i)
μ_i ← μ_i − ∂A(s; μ_i)² / ∂μ_i
where θ_i and μ_i are the parameters of the actor network and the critic network in the i-th sub-thread respectively, π(a|s; θ_i) is the policy function based on the actor network parameter θ_i, and A(s; μ_i) is the advantage function based on the critic network parameter μ_i;
(1.3) after all the sub-threads have completed one round of updating, updating the parameters θ and μ of the actor network and the critic network in the agent's main network according to the actor network and critic network parameters updated by all the sub-threads:
θ ← θ + Σ_{i=1}^{n} α_i·(θ_i − θ)
μ ← μ + Σ_{i=1}^{n} β_i·(μ_i − μ)
where n is the number of sub-threads, and α_i and β_i are respectively the learning rates for updating the i-th sub-thread's parameters θ_i and μ_i;
(1.4) repeating steps (1.2) and (1.3) until the training round is over.
Further, in the training process of the agent, the state includes a vulnerability name, a port service, a service version number, a penetration module and a penetration target, the action is a load output by the penetration framework metasploit, and the reward is set according to whether penetration is successful and the type of the output load.
Further, the step (3) comprises:
(3.1) initializing the discriminator network D and the actor network π;
(3.2) according to the preset numbers of first state-action pairs and second state-action pairs, attaching labels 0 and 1 to the first state-action pairs and the second state-action pairs respectively, and putting the labelled first and second state-action pairs into the GAIL discriminator network;
(3.3) training the discriminator network with the first state-action pairs: calculating the JS divergence, performing gradient back-propagation according to the JS divergence to update the parameters of the discriminator network, and adjusting the distributions of the first state-action pairs and the second state-action pairs;
(3.4) repeating step (3.3) until JS divergence is minimized:
E_{τ_i}[log D_w(s,a)] + E_{τ_E}[log(1 − D_w(s,a))]
where τ_i denotes the set of first state-action pairs, τ_E denotes the set of second (expert) state-action pairs, (s,a) is any state-action pair from the corresponding set, log D_w(s,a) represents the loss term on the actor network output, and log(1 − D_w(s,a)) represents the loss term on the expert sample pairs judged by the discriminator;
(3.5) after training of the discriminator network is finished, the state of the next moment is input into the actor network in GAIL to obtain the probabilities of all actions, the action corresponding to the maximum probability is selected as the final action, the critic network in GAIL simultaneously evaluates the action and outputs the corresponding value, and the discriminator network outputs a discount reward value for the action.
Further, the step (4) comprises:
constructing a new advantage function π(θ) according to the discount reward value to guide the training of the actor network, and updating the parameter θ of the actor network by the policy gradient method:
θ ← θ + α_h·∇_θ π(θ)
where α_h ∈ R+ denotes the learning rate of the h-th step, and h is the number of training steps set for each training round in step (2).
According to a second aspect of the embodiments of the present application, there is provided a deep reinforcement learning intelligent penetration testing device based on imitation learning, including:
the acquisition module, used for acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
the A3C training module, used for training the agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in the penetration test;
the GAIL training module, used for putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
the updating module, used for constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
the agent training module is used for repeating the process from the A3C training module to the updating module, and training the agent until the training round is finished;
and the penetration testing module is used for setting the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent can perform the penetration testing.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
according to the embodiment, the application provides a novel depth reinforcement learning intelligent penetration test method based on simulation learning, and the A3C algorithm and the GAIL algorithm are combined, so that the leak penetration efficiency is improved; in the training process of the GAIL network, expert sample data is obtained first, and then a model is imported for training. The operator network in the model generates a state action pair on line, and then the state action generated on line and the state action of the expert are placed in the discriminator network for training, so that the action generated by the operator network and the action of the expert are infinitely close.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a method for deep reinforcement learning intelligent penetration testing based on imitation learning according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating the structure of the A3C algorithm, according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a GAIL algorithm architecture according to an exemplary embodiment.
FIG. 4 is a block diagram illustrating a deep reinforcement learning intelligent penetration testing apparatus based on imitation learning according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating a deep reinforcement learning intelligent penetration testing apparatus based on imitation learning according to an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if," as used herein, may be interpreted as "when," "upon," or "in response to determining," depending on the context.
FIG. 1 is a flowchart illustrating a method for deep reinforcement learning intelligent penetration testing based on imitation learning according to an exemplary embodiment, which, as shown in FIG. 1, may include the following steps:
(1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
(2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test;
(3) putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
(4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
(5) repeating steps (2)-(4) until the training rounds are finished;
(6) deploying the trained agent in a network environment that requires penetration testing so that the agent performs the penetration test.
According to the above embodiments, the present application provides a novel deep reinforcement learning intelligent penetration testing method based on imitation learning; combining the A3C algorithm with the GAIL algorithm improves the efficiency of vulnerability penetration. In the training process of the GAIL network, expert sample data is obtained first and then imported into the model for training. The actor network in the model generates state-action pairs online, and these online-generated state-action pairs and the expert state-action pairs are then put into the discriminator network for training, so that the actions generated by the actor network become arbitrarily close to the expert actions.
Specifically, the network inputs of the actor and the critic are both the currently observed state, while the output of the former is the corresponding action and the output of the latter is the value of the current state. The inputs to the discriminator network are, respectively, the agent's state-action pairs and the expert-sample state-action pairs used only for training, and its output is the discount reward value.
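To make these inputs and outputs concrete, the following is a minimal sketch of the three networks, assuming the 5-dimensional state and the 593-entry payload list used in the embodiment below; the layer sizes and module names are illustrative assumptions, not the architecture claimed by the patent.

```python
# Minimal sketch of the three networks (assumptions: 5-dim state, 593 discrete
# payload actions as in the embodiment below; layer sizes are illustrative).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 5, 593

class Actor(nn.Module):                      # outputs pi(a|s)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, ACTION_DIM))

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

class Critic(nn.Module):                     # outputs V(s)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)

class Discriminator(nn.Module):              # outputs D_w(s, a) in (0, 1)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, sa):                   # sa: spliced state / one-hot action vector
        return torch.sigmoid(self.net(sa)).squeeze(-1)
```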
In the implementation of step (1), the expert sample data refers to the state-action pairs corresponding to higher reward values. The penetration test scenario differs from the traditional Gym game scenarios of reinforcement learning in that its expert sample data must be obtained manually. In the penetration test scenario, the highest reward value is obtained after penetration succeeds, so the state-action pairs recorded when penetration succeeds are defined as the expert sample data:
τ_E = {(s_1, a_1), (s_2, a_2), …, (s_N, a_N)}
In this embodiment, based on a Metasploit penetration test scenario, during a penetration test with Metasploit the port information of the target machine obtained by an nmap scan is extracted as the state input, and the penetration module and payload matching that port information are then selected in turn by calling the MSF penetration framework until penetration succeeds. In the code, once post-penetration succeeds, a "BINGO" prompt is given and the current state and action are stored, thereby collecting the expert sample. The following describes the scenario in which the expert sample data is generated.
in the metasploit penetration test scene, the operation types include 16 types in total, namely windows, unix and linux; port services comprise ssh, telnet, apache and the like, and the number of the port services is 37 in total; the number of the permeation modules can be selected to be 1417. As can be seen, the successful state of this post-infiltration is [0.875, -0.35135135135135135137, 3.3, -0.131968948270996, 0], and the action is 472. The 1 st, 2 nd, and 4 th dimensions in the five-dimensional state information respectively refer to normalized operation data performed after the serial numbers corresponding to the operating system, the port service name, and the pervasive module, and belong to key information, and the output action 472 corresponds to the 472 th payload in the payload lists 0 to 592.
After repeated collection and storage, 37167 pairs of expert sample data were successfully acquired.
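A hedged sketch of this collection loop is given below; the `scan` and `exploit` callables stand in for the nmap scan and the Metasploit exploitation step and are hypothetical, injected dependencies rather than real library APIs.

```python
# Hedged sketch of the expert-sample collection loop described above. `scan`
# and `exploit` are hypothetical callables supplied by the caller (wrapping
# nmap and the MSF framework); they are not real library APIs.
import json

def collect_expert_samples(targets, scan, exploit, n_payloads=593,
                           out_path="expert_samples.jsonl"):
    """scan(host) -> 5-dim state vector; exploit(host, payload_idx) -> bool."""
    samples = []
    for host in targets:
        state = scan(host)                         # port scan result as state input
        for payload_idx in range(n_payloads):      # try matching payloads in turn
            if exploit(host, payload_idx):         # "BINGO": post-penetration succeeded
                samples.append({"state": state, "action": payload_idx})
                break
    with open(out_path, "w") as f:                 # persist expert (state, action) pairs
        for pair in samples:
            f.write(json.dumps(pair) + "\n")
    return samples
```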
In the specific implementation of step (2), a simulated attacker in the network is trained based on the Asynchronous Advantage Actor-Critic (A3C) algorithm in deep reinforcement learning. The goal of the simulated attacker is to carry out simulated penetration attacks on the vulnerabilities existing in the network so as to evaluate network security. The A3C algorithm adopts a multithreading approach in which AC learners are trained in multiple threads in parallel; each sub-thread interacts with the environment independently to obtain experience data, and the threads run independently without interfering with each other. The main network does not need to be trained and is only used to store the parameters of the AC network structure. Training the agent using the A3C algorithm includes:
(1.1) framing the penetration test as a Markov decision process;
in this embodiment, in the training process of the agent, the state includes a vulnerability name, a port service, a service version number, a penetration module, and a penetration target, the action is a load (payload) output by the penetration framework metasploit, and the reward is set according to whether penetration is successful and the type of the output load.
(1.2) training all sub-threads of the agent by respectively adopting an AC algorithm;
specifically, the Actor-Critic Algorithm (AC) combines a strategy gradient and a value function, approximates the function to estimate a return value using a Critic network to evaluate the strategy of the Actor, and takes charge of generating an action using an Actor network.
Wherein the training process of each sub-thread comprises the following steps:
(1.2.1) inputting the state of the current moment to obtain a corresponding strategy;
specifically, in the main network and the sub-thread network, A3C adopts an AC network structure, namely, the AC network structure is divided into an Actor network pi θ (as) and Critic network V μ (s) of the reaction mixture. In the Actor network, the corresponding strategy pi (a | s; theta) is obtained by inputting the state data of the current time, and the pi (a | s; theta) represents the probability of selecting the action a under the condition of the state s of the current time and the parameter theta.
(1.2.2) constructing an advantage function from the difference between the reward function and the value function to evaluate the policy, the advantage function being:
A(s,t) = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n·V(s') − V(s) = R(t) − V(s)
where γ is the discount factor with value range (0, 1), R(·) is the reward function, and V(·) is the value function;
specifically, the discount factor γ is set to 0.9, and the above-mentioned merit function is obtained by sampling n training steps, where n can be set according to actual conditions.
(1.2.3) updating the parameters of the actor network and the critic network in the sub-thread by the policy gradient:
θ_i ← θ_i + ∇_{θ_i} log π(a|s; θ_i) · A(s; μ_i)
μ_i ← μ_i − ∂A(s; μ_i)² / ∂μ_i
where θ_i and μ_i are the parameters of the actor network and the critic network in the i-th sub-thread respectively, π(a|s; θ_i) is the policy function based on the actor network parameter θ_i, and A(s; μ_i) is the advantage function based on the critic network parameter μ_i;
specifically, the updating of the operator network parameter theta is performed by first taking the logarithm of the policy function and then comparing theta i And (4) derivation, differentiation of the parameters at the moment, and finally superposition of the two parameters. Updating the critic network parameter mu, namely firstly updating the mu in the merit function i And obtaining the derivative by superposition with the differential of the derivative.
(1.3) after all the sub-threads have completed one round of updating, updating the parameters θ and μ of the actor network and the critic network in the agent's main network according to the updated actor network and critic network parameters of all the sub-threads:
θ ← θ + α·Σ_{i=1}^{n} (θ_i − θ)
μ ← μ + β·Σ_{i=1}^{n} (μ_i − μ)
wherein n is the number of the sub-threads, and alpha and beta respectively represent the learning rate of updating the parameters theta and mu;
specifically, the learning rates α and β are dynamically adjusted according to the training effect at this time, and α and β are set to 0.00025 in the initial stage of training and to 0.0001 in the later stage of training.
(1.4) repeating steps (1.2) and (1.3) until the training round is over.
Specifically, through the interaction of multiple sub-threads with the environment, the different sub-networks jointly update the main network, and the sub-networks are synchronized with the main network after every update of the main network, so that the next time a sub-network returns its gradient parameters they are compared against the latest main network parameters, ensuring that the main network keeps training in a good direction.
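One possible realisation of this round-end synchronisation is sketched below; moving the main parameters toward each sub-thread's updated parameters with learning rates α and β is an assumption about the update form, since the patent only gives the combination equations, and pushing the main parameters back to every sub-thread follows the synchronisation described above.

```python
# One possible realisation of the round-end synchronisation described above.
# The combination rule (moving the main parameters toward each sub-thread's
# parameters) is an assumption.
import torch

def sync_main_network(main_actor, main_critic, workers, alpha=0.00025, beta=0.00025):
    """workers: list of (actor, critic) module pairs, one pair per sub-thread."""
    with torch.no_grad():
        for w_actor, w_critic in workers:
            for p_main, p_w in zip(main_actor.parameters(), w_actor.parameters()):
                p_main += alpha * (p_w - p_main)       # accumulate actor updates
            for p_main, p_w in zip(main_critic.parameters(), w_critic.parameters()):
                p_main += beta * (p_w - p_main)        # accumulate critic updates
        for w_actor, w_critic in workers:              # re-synchronise the sub-threads
            w_actor.load_state_dict(main_actor.state_dict())
            w_critic.load_state_dict(main_critic.state_dict())
```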
In the implementation of step (3), Generative Adversarial Imitation Learning (GAIL) is similar to the Generative Adversarial Network (GAN): GAIL introduces an imitation learning mechanism on the basis of GAN, and in addition to the actor network and the critic network, GAIL uses a discriminator network in the training process. During training, the state-action pairs generated by the actor network in the A3C algorithm and the expert sample data (i.e., the state-action pairs recorded when post-penetration succeeds) are put into the GAIL discriminator network for training. After the discriminator network is trained, it can guide the generation of the corresponding actions so that the action distribution approaches the expert sample and the model outputs higher-reward actions with higher probability, thereby improving penetration performance. Step (3) may include the following sub-steps:
(3.1) initializing the discriminator network D and the actor network π;
Specifically, the discriminator network and the generator (actor) network in GAIL need to be initialized at the early stage of training to prevent overfitting later in training.
(3.2) according to the preset numbers of first state-action pairs and second state-action pairs, attaching labels 0 and 1 to the first state-action pairs and the second state-action pairs respectively, and putting the labelled first and second state-action pairs into the GAIL discriminator network;
Specifically, the states and actions of the expert sample data collected in advance are spliced, and label 1 is attached to the spliced second state-action pairs; at the same time, the first state-action pairs generated by the actor network are also spliced, and label 0 is attached to them. Labelling the data according to the different state-action pairs allows the online-generated pairs and the expert pairs to be distinguished, so that the discriminator network can be trained as a two-class classifier on the two types of data, which accelerates training.
(3.3) training the discriminator network with the first state-action pairs: calculating the JS divergence, performing gradient back-propagation according to the JS divergence to update the parameters of the discriminator network, and adjusting the distributions of the first state-action pairs and the second state-action pairs;
Specifically, the two types of spliced data are input into the discriminator network in GAIL for training, and the loss on the real (expert) policy trajectory data and the loss on the currently generated trajectory data are continuously optimized until training of the discriminator network is finished.
(3.4) repeating step (3.3) until JS divergence is minimized:
E_{τ_i}[log D_w(s,a)] + E_{τ_E}[log(1 − D_w(s,a))]
where τ_i denotes the set of first state-action pairs, τ_E denotes the set of second (expert) state-action pairs, (s,a) is any state-action pair from the corresponding set, log D_w(s,a) represents the loss term on the actor network output, and log(1 − D_w(s,a)) represents the loss term on the expert sample pairs judged by the discriminator;
through such a maximum and minimum game process, the Actor network and the resolver network required by training can be optimized in a circulating and alternating mode.
(3.5) after training of the discriminator network is finished, the state of the next moment is input into the actor network in GAIL to obtain the probabilities of all actions, the action corresponding to the maximum probability is selected as the final action, the critic network in GAIL simultaneously evaluates the action and outputs the corresponding value, and the discriminator network outputs a discount reward value for the action.
Thus, the actor network (generator) is updated and trained with the advantage function constructed from the outputs of the critic network and the discriminator network, so that the three networks in A3C-GAIL, namely the actor network, the critic network and the discriminator network, are closely linked. The critic network evaluates the value of the current state and thereby provides an intuitive assessment of the quality of the action generated in that state, which drives the agent to prefer actions corresponding to states of higher value.
In the specific implementation of step (4), the value output by the critic network is subtracted from the discount reward output by the discriminator network to construct a new advantage function π(θ). The training of the actor network is guided according to the advantage function π(θ), and the parameter θ of the actor network is updated by the policy gradient method:
θ ← θ + α_h·∇_θ π(θ)
where α_h ∈ R+ denotes the learning rate of the h-th step, and h is the number of training steps set for each training round in step (2);
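The following sketch shows one way to realise this step, with the discount reward and the critic value passed in as scalars and the actor parameters moved along the policy gradient with the step learning rate α_h; the concrete form is an assumption consistent with the formula above.

```python
# Sketch of the step (4) actor update: the advantage is the discriminator's
# discount reward minus the critic's value, and theta is moved along the
# policy gradient with the step learning rate alpha_h (an assumption
# consistent with the formula above).
import torch

def gail_actor_update(actor, state, action, discount_reward, value, alpha_h):
    """discount_reward: discriminator output for (state, action); value: V(state)."""
    s = torch.as_tensor(state, dtype=torch.float32)
    advantage = discount_reward - value                # new advantage of step (4)
    loss = -torch.log(actor(s)[action] + 1e-8) * advantage
    actor.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in actor.parameters():
            if p.grad is not None:
                p -= alpha_h * p.grad                  # theta <- theta + alpha_h * grad pi(theta)
```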
in the implementation of step (5), as shown in FIG. 4, the network inputs of the operator and the critic are both currently observed states, while the output of the former is the corresponding action and the output of the latter is the value of the current state. The inputs to the resolver network are the agent's state-action pairs and the expert sample's state-action pairs used only for training, respectively, the outputs of which are the discount reward values. A new merit function is constructed by awarding discount of the critic network output value and the disarminator network output so as to continuously update the training of the operator network, and therefore the action generated by the operator network is continuously close to the action of the expert. When the number of times of training reaches the preset number of training rounds, the step is ended. In an embodiment, the number of training rounds set is 20 ten thousand rounds, the number of training steps in each round is 1 ten thousand steps, and the setting is a conventional setting in the art, and can be set by itself according to actual situations, which is not described herein.
In the concrete implementation of the step (6), after the training of the intelligent agent is finished through the process of the step (5), the trained intelligent agent is arranged in the network environment which needs to be subjected to the penetration test, so that the intelligent penetration test is carried out.
The technical conception of the invention is as follows: Reinforcement Learning (RL) is an artificial intelligence optimization technique whose key advantage is that no environment model is needed to generate an attack strategy; the optimal strategy is learned through interaction with the environment. Deep reinforcement learning uses a neural network as the parametric structure and combines the perception capability of deep learning with the decision-making capability of reinforcement learning to optimize the strategy. The present application therefore introduces a reinforcement learning agent as the simulated attacker in penetration testing. In the training process of the intelligent penetration test based on A3C-GAIL, the penetration test is framed as a Markov Decision Process (MDP), and information such as the operating system type, product name, product version, the selected exploit module and the target index (the target type of the attack) is input into the model as state data. The actions performed include a single scan operation and exploit for each service and each computer on the network, i.e., a full scan with Nmap, which returns the services running on each port of a given server and the version of each service. A corresponding action is then selected according to the current state, i.e., an attack payload is chosen from the payload list corresponding to the current exploit module, to obtain the next state (the target index may change randomly). Vulnerability penetration of the target server is carried out through Metasploit, and the penetration result is fed back to set the reward. In each sub-thread, the networks are trained independently: the generated vulnerability penetration scheme is input into Metasploit to perform module matching against the server, and Metasploit selects the corresponding exploit module according to the corresponding vulnerability information and selects the payload. The reward value, obtained by detecting the type of payload, is then fed back to the deep reinforcement learning model to guide policy generation. Payloads are output through the mutual transfer of parameters and gradients between the main network and each branch network, thereby guiding decision training and achieving the goals of identifying and penetrating vulnerabilities. The A3C-GAIL structure uses expert sample data to guide the generation of a better action distribution, which accelerates the detection of network vulnerabilities, improves the penetration efficiency for network vulnerabilities, and, using the Metasploit penetration framework in the single target machine scenario Metasploitable2, improves the agent's penetration efficiency and reduces the penetration cost.
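To illustrate the interaction loop described above, the sketch below wraps it as a small environment; `scan` and `exploit` are hypothetical callables standing in for the Nmap scan and the Metasploit module matching, and the reward values are illustrative only.

```python
# Sketch of the interaction loop described above, wrapped as a small
# environment. `scan` and `exploit` are hypothetical callables standing in
# for the Nmap scan and Metasploit module matching; reward values are
# illustrative only.
class PenTestEnv:
    def __init__(self, host, scan, exploit):
        self.host, self.scan, self.exploit = host, scan, exploit

    def reset(self):
        self.state = self.scan(self.host)              # full port scan -> state vector
        return self.state

    def step(self, payload_index):
        success, new_state = self.exploit(self.host, payload_index)
        reward = 100.0 if success else -1.0            # illustrative reward shaping
        self.state = new_state
        return self.state, reward, success             # episode ends when penetration succeeds
```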
The invention has the following beneficial effects: 1) a new intelligent penetration testing method is provided, and the combination of the A3C algorithm and the GAIL algorithm improves the efficiency of vulnerability penetration; 2) in the process of acquiring expert sample data, the successful state and action of each penetration are recorded, combining human experience with the machine algorithm and achieving human-machine collaboration; 3) in the training process of the GAIL network, once the sample data has been expanded to a certain scale, it is imported into the model for training; the actor network in the model generates corresponding actions for states sampled from the sample data, and the generated actions and the expert actions in the sample data are then put into the discriminator network for training until the actions generated by the policy network are infinitely close to the expert actions; 4) the method has been experimentally verified in the real single-machine target scenario Metasploitable2, and its penetration effect is better than that of the original DeepExploit model in terms of reward values, the number of successful penetration paths within a limited time, and other aspects.
Corresponding to the foregoing embodiments of the deep reinforcement learning intelligent penetration testing method based on imitation learning, the present application also provides embodiments of a deep reinforcement learning intelligent penetration testing apparatus based on imitation learning.
FIG. 5 is a block diagram illustrating an apparatus for deep reinforcement learning intelligent penetration testing based on imitation learning according to an exemplary embodiment. Referring to fig. 5, the apparatus may include:
the acquisition module 21, configured to acquire expert sample data, where the expert sample data is a state-action pair recorded when post-penetration succeeds;
an A3C training module 22, configured to train an agent using the A3C algorithm, where the agent serves as a simulated attacker in a penetration test;
the GAIL training module 23, configured to put a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and to train the discriminator network;
the updating module 24, configured to construct an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and to update the actor network in the A3C algorithm with the advantage function;
an agent training module 25, configured to repeat the process from the A3C training module to the update module, train the agent until the training round is completed;
and the penetration testing module 26 is used for arranging the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent can perform the penetration testing.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present application further provides an electronic device, comprising: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deep reinforcement learning intelligent penetration testing method based on imitation learning as described above. FIG. 6 shows a hardware structure diagram of a device with data processing capability on which the imitation-learning-based deep reinforcement learning intelligent penetration testing method according to the embodiment of the present invention runs. In addition to the processor, memory and network interface shown in fig. 6, the device with data processing capability on which the apparatus is located may further include other hardware according to the actual function of that device, which is not described again.
Accordingly, the present application also provides a computer-readable storage medium having computer instructions stored thereon which, when executed by a processor, implement the imitation-learning-based deep reinforcement learning intelligent penetration testing method as described above. The computer-readable storage medium may be an internal storage unit, such as a hard disk or memory, of any device with data processing capability described in any of the foregoing embodiments. The computer-readable storage medium may also be an external storage device of that device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a Flash Card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit of a device with data processing capability and an external storage device. The computer-readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof.

Claims (8)

1. A deep reinforcement learning intelligent penetration test method based on imitation learning is characterized by comprising the following steps:
(1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
(2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test;
(3) putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
(4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
(5) Repeating the steps (2) - (4) until the training round is finished;
(6) And setting the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent performs penetration testing.
2. The method of claim 1, wherein training the agent using the A3C algorithm comprises:
(1.1) framing the penetration test as a Markov decision process;
(1.2) training all sub-threads of the agent by respectively adopting an AC algorithm, wherein the training process of each sub-thread comprises the following steps:
(1.2.1) inputting the state of the current moment to obtain a corresponding strategy;
(1.2.2) constructing an advantage function from the difference between the reward function and the value function to evaluate the policy, the advantage function being:
A(s,t) = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n·V(s') − V(s) = R(t) − V(s)
where γ is the discount factor with value range (0, 1), R(·) is the reward function, and V(·) is the value function;
(1.2.3) updating the parameters of the actor network and the critic network in the sub-thread by the policy gradient:
θ_i ← θ_i + ∇_{θ_i} log π(a|s; θ_i) · A(s; μ_i)
μ_i ← μ_i − ∂A(s; μ_i)² / ∂μ_i
where θ_i and μ_i are the parameters of the actor network and the critic network in the i-th sub-thread respectively, π(a|s; θ_i) is the policy function based on the actor network parameter θ_i, and A(s; μ_i) is the advantage function based on the critic network parameter μ_i;
(1.3) after all the sub-threads have completed one round of updating, updating the parameters θ and μ of the actor network and the critic network in the agent's main network according to the actor network and critic network parameters updated by all the sub-threads:
θ ← θ + Σ_{i=1}^{n} α_i·(θ_i − θ)
μ ← μ + Σ_{i=1}^{n} β_i·(μ_i − μ)
where n is the number of sub-threads, and α_i and β_i are respectively the learning rates for updating the i-th sub-thread's parameters θ_i and μ_i;
(1.4) repeating steps (1.2) and (1.3) until the training round is over.
3. The method of claim 1, wherein during the training of the agent, the state includes the vulnerability name, port service, service version number, penetration module and penetration target, the action is the load output by the penetration framework metasploit, and the reward is set according to whether penetration is successful and the type of the output load.
4. The method of claim 1, wherein step (3) comprises:
(3.1) initializing the discriminator network D and the actor network π;
(3.2) according to the preset numbers of first state-action pairs and second state-action pairs, attaching labels 0 and 1 to the first state-action pairs and the second state-action pairs respectively, and putting the labelled first and second state-action pairs into the GAIL discriminator network;
(3.3) training the discriminator network with the first state-action pairs: calculating the JS divergence, performing gradient back-propagation according to the JS divergence to update the parameters of the discriminator network, and adjusting the distributions of the first state-action pairs and the second state-action pairs;
(3.4) repeating step (3.3) until JS divergence is minimized:
E_{τ_i}[log D_w(s,a)] + E_{τ_E}[log(1 − D_w(s,a))]
where τ_i denotes the set of first state-action pairs, τ_E denotes the set of second (expert) state-action pairs, (s,a) is any state-action pair from the corresponding set, log D_w(s,a) represents the loss term on the actor network output, and log(1 − D_w(s,a)) represents the loss term on the expert sample pairs judged by the discriminator;
(3.5) after training of the discriminator network is finished, the state of the next moment is input into the actor network in GAIL to obtain the probabilities of all actions, the action corresponding to the maximum probability is selected as the final action, the critic network in GAIL simultaneously evaluates the action and outputs the corresponding value, and the discriminator network outputs a discount reward value for the action.
5. The method of claim 1, wherein step (4) comprises:
constructing a new advantage function π(θ) according to the discount reward value to guide the training of the actor network, and updating the parameter θ of the actor network by the policy gradient method:
θ ← θ + α_h·∇_θ π(θ)
where α_h ∈ R+ denotes the learning rate of the h-th step, and h is the number of training steps set for each training round in step (2).
6. A deep reinforcement learning intelligent penetration testing device based on simulation learning is characterized by comprising:
the acquisition module, used for acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
the A3C training module, used for training the agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in the penetration test;
the GAIL training module, used for putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
the updating module, used for constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
the agent training module is used for repeating the process from the A3C training module to the updating module, and training the agent until the training round is finished;
and the penetration testing module is used for setting the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent can perform the penetration testing.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the method of any one of claims 1-5.
8. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, perform the steps of the method according to any one of claims 1-5.
CN202211046763.0A 2022-08-30 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on imitation learning Pending CN115473706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211046763.0A CN202211046763.0A CN (en) 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on imitation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211046763.0A CN (en) 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on imitation learning

Publications (1)

Publication Number Publication Date
CN115473706A true CN115473706A (en) 2022-12-13

Family

ID=84368651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211046763.0A Pending CN115473706A (en) 2022-08-30 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on simulation learning

Country Status (1)

Country Link
CN (1) CN115473706A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116415687A (en) * 2022-12-29 2023-07-11 江苏东蓝信息技术有限公司 Artificial intelligent network optimization training system and method based on deep learning
CN116415687B (en) * 2022-12-29 2023-11-21 江苏东蓝信息技术有限公司 Artificial intelligent network optimization training system and method based on deep learning

Similar Documents

Publication Publication Date Title
Yamin et al. Cyber ranges and security testbeds: Scenarios, functions, tools and architecture
US20200410399A1 (en) Method and system for determining policies, rules, and agent characteristics, for automating agents, and protection
CN111310802B (en) Anti-attack defense training method based on generation of anti-network
Liu et al. Performing co-membership attacks against deep generative models
CN110503207A (en) Federation's study credit management method, device, equipment and readable storage medium storing program for executing
CN111783105B (en) Penetration test method, device, equipment and storage medium
Yamin et al. Modeling and executing cyber security exercise scenarios in cyber ranges
CN109446808A (en) Android countermeasure sample generation method and system based on DCGAN
CN113810406B (en) Network space security defense method based on dynamic defense graph and reinforcement learning
CN115473706A (en) Deep reinforcement learning intelligent penetration test method and device based on simulation learning
CN110807291B (en) On-site situation future guiding technology based on mimicry countermeasure learning mechanism
CN115102705B (en) Automatic network security detection method based on deep reinforcement learning
CN115580430A (en) Attack tree-pot deployment defense method and device based on deep reinforcement learning
Yamin et al. Use of cyber attack and defense agents in cyber ranges: A case study
Jin et al. Backdoor attack is a devil in federated gan-based medical image synthesis
CN110365625B (en) Internet of things security detection method and device and storage medium
CN113360917A (en) Deep reinforcement learning model security reinforcement method and device based on differential privacy
CN112685291A (en) System joint test method and related device
Huang et al. Attack detection and data generation for wireless cyber-physical systems based on self-training powered generative adversarial networks
Chen et al. GAIL-PT: a generic intelligent penetration testing framework with generative adversarial imitation learning
CN117610026B (en) Honey point vulnerability generation method based on large language model
CN115277153B (en) Smart grid 5G network risk assessment system and assessment method
CN113946758B (en) Data identification method, device, equipment and readable storage medium
Smith III Genetic program based data mining for fuzzy decision trees
CN114726622B (en) Back door attack influence evaluation method for power system data driving algorithm, system thereof and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination