CN115473706A - Deep reinforcement learning intelligent penetration test method and device based on imitation learning - Google Patents

Deep reinforcement learning intelligent penetration test method and device based on imitation learning

Info

Publication number
CN115473706A
CN115473706A (application CN202211046763.0A)
Authority
CN
China
Prior art keywords
network
training
action
agent
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211046763.0A
Other languages
Chinese (zh)
Inventor
陈晋音
胡书隆
李晓豪
李玮峰
赵云波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202211046763.0A priority Critical patent/CN115473706A/en
Publication of CN115473706A publication Critical patent/CN115473706A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433Vulnerability analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0813Configuration setting characterised by the conditions triggering a change of settings
    • H04L41/082Configuration setting characterised by the conditions triggering a change of settings the condition being updates or upgrades of network functionality
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/04Processing captured monitoring data, e.g. for logfile generation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a deep reinforcement learning intelligent penetration test method and device based on imitation learning, wherein the method comprises the following steps: (1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds; (2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test; (3) during the training of the agent, putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network to train the discriminator network; (4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function; (5) repeating steps (2)-(4) until the training rounds are finished; (6) deploying the trained agent in a network environment that requires penetration testing to perform the penetration test.

Description

Deep reinforcement learning intelligent penetration test method and device based on imitation learning
Technical Field
The invention belongs to the technical field of cyberspace security defense and deep reinforcement learning, and particularly relates to a deep reinforcement learning intelligent penetration testing method and device based on imitation learning.
Background
With the continuous development of artificial intelligence and internet technologies, network attack techniques are also constantly evolving. Penetration testing, as a network security testing and evaluation method, probes the potential security hazards of a target network by simulating the real attack behavior of a hacker, so as to eliminate those hazards and improve system security. In red-versus-blue military confrontation scenarios, penetration testing is widely used: the penetrating side, acting as the blue army, performs penetration evaluation of the vulnerabilities present in the military combat network by simulating malicious hacker attacks, thereby achieving the goal of defending against malicious network attacks. The penetration testing process includes an active analysis of all vulnerabilities, technical deficiencies and leaks of the network system from the position where an attacker might be located, and from this position a security hole is conditionally and actively penetrated. A complete penetration test mainly comprises seven steps: pre-engagement interaction, information gathering, threat modeling, vulnerability analysis, penetration attack, post-penetration attack and report generation. In summary, penetration testing carries out controlled attacks on a computer system to assess its security, and it is currently one of the key methods adopted by international network security organizations to strengthen defenses against network threats.
However, network penetration testing requires a great deal of training and time to obtain good results, while skilled network security professionals are in increasingly short supply, so making penetration testing intelligent and saving labor cost is very important. Penetration testing discovers the security holes an attacker may exploit through authorized, controlled attacks on the network system. This approach is very effective for assessing system security because it essentially simulates the behavior of a real-world attacker in a real-world scenario. The major drawback behind this effectiveness, however, is the high cost in time and skill required to perform the penetration. As network systems grow in size, complexity and number, this cost problem becomes increasingly hard to ignore, which places higher demands on security professionals, demands that currently cannot be met quickly enough. One approach to this problem is to apply Artificial Intelligence (AI) techniques to the field of network security in order to automate the penetration testing process and make it intelligent. Current automated penetration testing methods rely on model-based approaches and their penetration efficiency is generally not high; moreover, the network security situation changes rapidly with the development of new software and attack vectors, which makes producing and maintaining new models a challenge.
Disclosure of Invention
In view of the problems in the prior art, the embodiments of the present application aim to provide a deep reinforcement learning intelligent penetration testing method and device based on imitation learning, so as to improve the efficiency of automated penetration testing.
According to a first aspect of the embodiments of the present application, there is provided a deep reinforcement learning intelligent penetration testing method based on imitation learning, including:
(1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
(2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test;
(3) putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
(4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
(5) repeating steps (2)-(4) until the training rounds are finished;
(6) deploying the trained agent in a network environment that requires penetration testing so that the agent performs the penetration test.
Further, training the agent by using the A3C algorithm, including:
(1.1) framing the penetration test as a Markov decision process;
(1.2) training all sub-threads of the agent using the AC algorithm, wherein the training process of each sub-thread comprises the following steps:
(1.2.1) inputting the state of the current moment to obtain the corresponding policy;
(1.2.2) constructing an advantage function from the difference between the reward function and the value function to evaluate the policy, the advantage function being:
A(s,t) = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n·V(s') − V(s) = R(t) − V(s)
where γ is the discount factor with value range (0, 1), R(·) is the reward function, and V(·) is the value function;
(1.2.3) updating the parameters of the operator network and the critical network in the child thread by using the strategy gradient:
θ_i ← θ_i + ∇_{θ_i} log π(a|s; θ_i) · A(s; μ_i)
μ_i ← μ_i − ∂A(s; μ_i)² / ∂μ_i
where θ_i and μ_i are the parameters of the actor network and the critic network in the i-th sub-thread respectively, π(a|s; θ_i) is the policy function based on the actor network parameter θ_i, and A(s; μ_i) is the advantage function based on the critic network parameter μ_i;
(1.3) after all the sub-threads have completed one round of updating, updating the parameters θ and μ of the actor network and the critic network in the agent's main network according to the actor network and critic network parameters updated by all the sub-threads:
θ ← θ + Σ_{i=1}^{n} α_i·(θ_i − θ)
μ ← μ + Σ_{i=1}^{n} β_i·(μ_i − μ)
where n is the number of sub-threads, and α_i and β_i are respectively the learning rates for updating the i-th sub-thread's parameters θ_i and μ_i;
(1.4) repeating steps (1.2) and (1.3) until the training round is over.
Further, in the training process of the agent, the state includes a vulnerability name, a port service, a service version number, a penetration module and a penetration target, the action is a load output by the penetration framework metasploit, and the reward is set according to whether penetration is successful and the type of the output load.
Further, the step (3) comprises:
(3.1) initializing the discriminator network D and the actor network π;
(3.2) according to the preset numbers of first state-action pairs and second state-action pairs, attaching labels 0 and 1 to the first state-action pairs and the second state-action pairs respectively, and putting the labelled first and second state-action pairs into the GAIL discriminator network;
(3.3) training the discriminator network with the first state-action pairs: calculating the JS divergence, performing gradient back-propagation according to the JS divergence to update the parameters of the discriminator network, and adjusting the distributions of the first state-action pairs and the second state-action pairs;
(3.4) repeating step (3.3) until JS divergence is minimized:
E_{τ_i}[log D_w(s,a)] + E_{τ_E}[log(1 − D_w(s,a))]
where τ_i denotes the set of first state-action pairs, τ_E denotes the set of second (expert) state-action pairs, (s,a) is any state-action pair from the corresponding set, log D_w(s,a) represents the loss term on the actor network output, and log(1 − D_w(s,a)) represents the loss term on the expert sample pairs judged by the discriminator;
(3.5) after training of the discriminator network is finished, the state of the next moment is input into the actor network in GAIL to obtain the probabilities of all actions, the action corresponding to the maximum probability is selected as the final action, the critic network in GAIL simultaneously evaluates the action and outputs the corresponding value, and the discriminator network outputs a discount reward value for the action.
Further, the step (4) comprises:
constructing a new advantage function π(θ) according to the discount reward value to guide the training of the actor network, and updating the parameter θ of the actor network by the policy gradient method:
θ ← θ + α_h·∇_θ π(θ)
where α_h ∈ R+ denotes the learning rate of the h-th step, and h is the number of training steps set for each training round in step (2).
According to a second aspect of the embodiments of the present application, there is provided a deep reinforcement learning intelligent penetration testing device based on imitation learning, including:
the acquisition module, used for acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
the A3C training module, used for training the agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in the penetration test;
the GAIL training module, used for putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
the updating module, used for constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
the agent training module is used for repeating the process from the A3C training module to the updating module, and training the agent until the training round is finished;
and the penetration testing module is used for setting the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent can perform the penetration testing.
According to a third aspect of embodiments of the present application, there is provided an electronic apparatus, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
according to the embodiment, the application provides a novel depth reinforcement learning intelligent penetration test method based on simulation learning, and the A3C algorithm and the GAIL algorithm are combined, so that the leak penetration efficiency is improved; in the training process of the GAIL network, expert sample data is obtained first, and then a model is imported for training. The operator network in the model generates a state action pair on line, and then the state action generated on line and the state action of the expert are placed in the discriminator network for training, so that the action generated by the operator network and the action of the expert are infinitely close.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow diagram illustrating a method for deep reinforcement learning intelligent penetration testing based on imitation learning according to an exemplary embodiment.
FIG. 2 is a schematic diagram illustrating the structure of the A3C algorithm, according to an exemplary embodiment.
Fig. 3 is a schematic diagram illustrating a GAIL algorithm architecture according to an exemplary embodiment.
FIG. 4 is a block diagram illustrating a deep reinforcement learning intelligent penetration testing apparatus based on imitation learning according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating a deep reinforcement learning intelligent penetration testing apparatus based on imitation learning according to an exemplary embodiment.
FIG. 6 is a schematic diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if," as used herein, may be interpreted as "when," "upon," or "in response to determining," depending on the context.
FIG. 1 is a flowchart illustrating a method for deep reinforcement learning intelligent penetration testing based on imitation learning according to an exemplary embodiment, which, as shown in FIG. 1, may include the following steps:
(1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
(2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test;
(3) putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
(4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
(5) repeating steps (2)-(4) until the training rounds are finished;
(6) deploying the trained agent in a network environment that requires penetration testing so that the agent performs the penetration test.
According to the above embodiments, the present application provides a novel deep reinforcement learning intelligent penetration testing method based on imitation learning; combining the A3C algorithm with the GAIL algorithm improves the efficiency of vulnerability penetration. In the training process of the GAIL network, expert sample data is obtained first and then imported into the model for training. The actor network in the model generates state-action pairs online, and these online-generated state-action pairs and the expert state-action pairs are then put into the discriminator network for training, so that the actions generated by the actor network become arbitrarily close to the expert actions.
Specifically, the network inputs of the actor and the critic are both the currently observed state, while the output of the former is the corresponding action and the output of the latter is the value of the current state. The inputs to the discriminator network are, respectively, the agent's state-action pairs and the expert-sample state-action pairs used only for training, and its output is the discount reward value.
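To make these inputs and outputs concrete, the following is a minimal sketch of the three networks, assuming the 5-dimensional state and the 593-entry payload list used in the embodiment below; the layer sizes and module names are illustrative assumptions, not the architecture claimed by the patent.

```python
# Minimal sketch of the three networks (assumptions: 5-dim state, 593 discrete
# payload actions as in the embodiment below; layer sizes are illustrative).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 5, 593

class Actor(nn.Module):                      # outputs pi(a|s)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, ACTION_DIM))

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

class Critic(nn.Module):                     # outputs V(s)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)

class Discriminator(nn.Module):              # outputs D_w(s, a) in (0, 1)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, sa):                   # sa: spliced state / one-hot action vector
        return torch.sigmoid(self.net(sa)).squeeze(-1)
```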
In the implementation of step (1), the expert sample data refers to the state-action pairs corresponding to higher reward values. The penetration test scenario differs from the traditional Gym game scenarios of reinforcement learning in that its expert sample data must be obtained manually. In the penetration test scenario, the highest reward value is obtained after penetration succeeds, so the state-action pairs recorded when penetration succeeds are defined as the expert sample data:
τ_E = {(s_1, a_1), (s_2, a_2), …, (s_N, a_N)}
In this embodiment, based on a Metasploit penetration test scenario, during a penetration test with Metasploit the port information of the target machine obtained by an nmap scan is extracted as the state input, and the penetration module and payload matching that port information are then selected in turn by calling the MSF penetration framework until penetration succeeds. In the code, once post-penetration succeeds, a "BINGO" prompt is given and the current state and action are stored, thereby collecting the expert sample. The following describes the scenario in which the expert sample data is generated.
in the metasploit penetration test scene, the operation types include 16 types in total, namely windows, unix and linux; port services comprise ssh, telnet, apache and the like, and the number of the port services is 37 in total; the number of the permeation modules can be selected to be 1417. As can be seen, the successful state of this post-infiltration is [0.875, -0.35135135135135135137, 3.3, -0.131968948270996, 0], and the action is 472. The 1 st, 2 nd, and 4 th dimensions in the five-dimensional state information respectively refer to normalized operation data performed after the serial numbers corresponding to the operating system, the port service name, and the pervasive module, and belong to key information, and the output action 472 corresponds to the 472 th payload in the payload lists 0 to 592.
After repeated collection and storage, 37167 pairs of expert sample data were successfully acquired.
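A hedged sketch of this collection loop is given below; the `scan` and `exploit` callables stand in for the nmap scan and the Metasploit exploitation step and are hypothetical, injected dependencies rather than real library APIs.

```python
# Hedged sketch of the expert-sample collection loop described above. `scan`
# and `exploit` are hypothetical callables supplied by the caller (wrapping
# nmap and the MSF framework); they are not real library APIs.
import json

def collect_expert_samples(targets, scan, exploit, n_payloads=593,
                           out_path="expert_samples.jsonl"):
    """scan(host) -> 5-dim state vector; exploit(host, payload_idx) -> bool."""
    samples = []
    for host in targets:
        state = scan(host)                         # port scan result as state input
        for payload_idx in range(n_payloads):      # try matching payloads in turn
            if exploit(host, payload_idx):         # "BINGO": post-penetration succeeded
                samples.append({"state": state, "action": payload_idx})
                break
    with open(out_path, "w") as f:                 # persist expert (state, action) pairs
        for pair in samples:
            f.write(json.dumps(pair) + "\n")
    return samples
```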
In the specific implementation of step (2), a simulated attacker in the network is trained based on the Asynchronous Advantage Actor-Critic (A3C) algorithm in deep reinforcement learning. The goal of the simulated attacker is to carry out simulated penetration attacks on the vulnerabilities existing in the network so as to evaluate network security. The A3C algorithm adopts a multithreading approach in which AC learners are trained in multiple threads in parallel; each sub-thread interacts with the environment independently to obtain experience data, and the threads run independently without interfering with each other. The main network does not need to be trained and is only used to store the parameters of the AC network structure. Training the agent using the A3C algorithm includes:
(1.1) framing the penetration test as a Markov decision process;
in this embodiment, in the training process of the agent, the state includes a vulnerability name, a port service, a service version number, a penetration module, and a penetration target, the action is a load (payload) output by the penetration framework metasploit, and the reward is set according to whether penetration is successful and the type of the output load.
(1.2) training all sub-threads of the agent by respectively adopting an AC algorithm;
specifically, the Actor-Critic Algorithm (AC) combines a strategy gradient and a value function, approximates the function to estimate a return value using a Critic network to evaluate the strategy of the Actor, and takes charge of generating an action using an Actor network.
Wherein the training process of each sub-thread comprises the following steps:
(1.2.1) inputting the state of the current moment to obtain a corresponding strategy;
specifically, in the main network and the sub-thread network, A3C adopts an AC network structure, namely, the AC network structure is divided into an Actor network pi θ (as) and Critic network V μ (s) of the reaction mixture. In the Actor network, the corresponding strategy pi (a | s; theta) is obtained by inputting the state data of the current time, and the pi (a | s; theta) represents the probability of selecting the action a under the condition of the state s of the current time and the parameter theta.
(1.2.2) constructing an advantage function from the difference between the reward function and the value function to evaluate the policy, the advantage function being:
A(s,t) = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n·V(s') − V(s) = R(t) − V(s)
where γ is the discount factor with value range (0, 1), R(·) is the reward function, and V(·) is the value function;
specifically, the discount factor γ is set to 0.9, and the above-mentioned merit function is obtained by sampling n training steps, where n can be set according to actual conditions.
(1.2.3) updating the parameters of the actor network and the critic network in the sub-thread by the policy gradient:
θ_i ← θ_i + ∇_{θ_i} log π(a|s; θ_i) · A(s; μ_i)
μ_i ← μ_i − ∂A(s; μ_i)² / ∂μ_i
where θ_i and μ_i are the parameters of the actor network and the critic network in the i-th sub-thread respectively, π(a|s; θ_i) is the policy function based on the actor network parameter θ_i, and A(s; μ_i) is the advantage function based on the critic network parameter μ_i;
specifically, the updating of the operator network parameter theta is performed by first taking the logarithm of the policy function and then comparing theta i And (4) derivation, differentiation of the parameters at the moment, and finally superposition of the two parameters. Updating the critic network parameter mu, namely firstly updating the mu in the merit function i And obtaining the derivative by superposition with the differential of the derivative.
(1.3) after all the sub-threads have completed one round of updating, updating the parameters θ and μ of the actor network and the critic network in the agent's main network according to the updated actor network and critic network parameters of all the sub-threads:
θ ← θ + α·Σ_{i=1}^{n} (θ_i − θ)
μ ← μ + β·Σ_{i=1}^{n} (μ_i − μ)
wherein n is the number of the sub-threads, and alpha and beta respectively represent the learning rate of updating the parameters theta and mu;
specifically, the learning rates α and β are dynamically adjusted according to the training effect at this time, and α and β are set to 0.00025 in the initial stage of training and to 0.0001 in the later stage of training.
(1.4) repeating steps (1.2) and (1.3) until the training round is over.
Specifically, through the interaction of multiple sub-threads with the environment, the different sub-networks jointly update the main network, and the sub-networks are synchronized with the main network after every update of the main network, so that the next time a sub-network returns its gradient parameters they are compared against the latest main network parameters, ensuring that the main network keeps training in a good direction.
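One possible realisation of this round-end synchronisation is sketched below; moving the main parameters toward each sub-thread's updated parameters with learning rates α and β is an assumption about the update form, since the patent only gives the combination equations, and pushing the main parameters back to every sub-thread follows the synchronisation described above.

```python
# One possible realisation of the round-end synchronisation described above.
# The combination rule (moving the main parameters toward each sub-thread's
# parameters) is an assumption.
import torch

def sync_main_network(main_actor, main_critic, workers, alpha=0.00025, beta=0.00025):
    """workers: list of (actor, critic) module pairs, one pair per sub-thread."""
    with torch.no_grad():
        for w_actor, w_critic in workers:
            for p_main, p_w in zip(main_actor.parameters(), w_actor.parameters()):
                p_main += alpha * (p_w - p_main)       # accumulate actor updates
            for p_main, p_w in zip(main_critic.parameters(), w_critic.parameters()):
                p_main += beta * (p_w - p_main)        # accumulate critic updates
        for w_actor, w_critic in workers:              # re-synchronise the sub-threads
            w_actor.load_state_dict(main_actor.state_dict())
            w_critic.load_state_dict(main_critic.state_dict())
```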
In the implementation of step (3), Generative Adversarial Imitation Learning (GAIL) is similar to the Generative Adversarial Network (GAN): GAIL introduces an imitation learning mechanism on the basis of GAN, and in addition to the actor network and the critic network, GAIL uses a discriminator network in the training process. During training, the state-action pairs generated by the actor network in the A3C algorithm and the expert sample data (i.e., the state-action pairs recorded when post-penetration succeeds) are put into the GAIL discriminator network for training. After the discriminator network is trained, it can guide the generation of the corresponding actions so that the action distribution approaches the expert sample and the model outputs higher-reward actions with higher probability, thereby improving penetration performance. Step (3) may include the following sub-steps:
(3.1) initializing the discriminator network D and the actor network π;
Specifically, the discriminator network and the generator (actor) network in GAIL need to be initialized at the early stage of training to prevent overfitting later in training.
(3.2) according to the preset numbers of first state-action pairs and second state-action pairs, attaching labels 0 and 1 to the first state-action pairs and the second state-action pairs respectively, and putting the labelled first and second state-action pairs into the GAIL discriminator network;
Specifically, the states and actions of the expert sample data collected in advance are spliced, and label 1 is attached to the spliced second state-action pairs; at the same time, the first state-action pairs generated by the actor network are also spliced, and label 0 is attached to them. Labelling the data according to the different state-action pairs allows the online-generated pairs and the expert pairs to be distinguished, so that the discriminator network can be trained as a two-class classifier on the two types of data, which accelerates training.
(3.3) training the discriminator network with the first state-action pairs: calculating the JS divergence, performing gradient back-propagation according to the JS divergence to update the parameters of the discriminator network, and adjusting the distributions of the first state-action pairs and the second state-action pairs;
Specifically, the two types of spliced data are input into the discriminator network in GAIL for training, and the loss on the real (expert) policy trajectory data and the loss on the currently generated trajectory data are continuously optimized until training of the discriminator network is finished.
(3.4) repeating step (3.3) until JS divergence is minimized:
E_{τ_i}[log D_w(s,a)] + E_{τ_E}[log(1 − D_w(s,a))]
where τ_i denotes the set of first state-action pairs, τ_E denotes the set of second (expert) state-action pairs, (s,a) is any state-action pair from the corresponding set, log D_w(s,a) represents the loss term on the actor network output, and log(1 − D_w(s,a)) represents the loss term on the expert sample pairs judged by the discriminator;
through such a maximum and minimum game process, the Actor network and the resolver network required by training can be optimized in a circulating and alternating mode.
(3.5) after training of the discriminator network is finished, the state of the next moment is input into the actor network in GAIL to obtain the probabilities of all actions, the action corresponding to the maximum probability is selected as the final action, the critic network in GAIL simultaneously evaluates the action and outputs the corresponding value, and the discriminator network outputs a discount reward value for the action.
Thus, the actor network (generator) is updated and trained with the advantage function constructed from the outputs of the critic network and the discriminator network, so that the three networks in A3C-GAIL, namely the actor network, the critic network and the discriminator network, are closely linked. The critic network evaluates the value of the current state and thereby provides an intuitive assessment of the quality of the action generated in that state, which drives the agent to prefer actions corresponding to states of higher value.
In the specific implementation of step (4), the value output by the critic network is subtracted from the discount reward output by the discriminator network to construct a new advantage function π(θ). The training of the actor network is guided according to the advantage function π(θ), and the parameter θ of the actor network is updated by the policy gradient method:
θ ← θ + α_h·∇_θ π(θ)
where α_h ∈ R+ denotes the learning rate of the h-th step, and h is the number of training steps set for each training round in step (2);
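The following sketch shows one way to realise this step, with the discount reward and the critic value passed in as scalars and the actor parameters moved along the policy gradient with the step learning rate α_h; the concrete form is an assumption consistent with the formula above.

```python
# Sketch of the step (4) actor update: the advantage is the discriminator's
# discount reward minus the critic's value, and theta is moved along the
# policy gradient with the step learning rate alpha_h (an assumption
# consistent with the formula above).
import torch

def gail_actor_update(actor, state, action, discount_reward, value, alpha_h):
    """discount_reward: discriminator output for (state, action); value: V(state)."""
    s = torch.as_tensor(state, dtype=torch.float32)
    advantage = discount_reward - value                # new advantage of step (4)
    loss = -torch.log(actor(s)[action] + 1e-8) * advantage
    actor.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in actor.parameters():
            if p.grad is not None:
                p -= alpha_h * p.grad                  # theta <- theta + alpha_h * grad pi(theta)
```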
in the implementation of step (5), as shown in FIG. 4, the network inputs of the operator and the critic are both currently observed states, while the output of the former is the corresponding action and the output of the latter is the value of the current state. The inputs to the resolver network are the agent's state-action pairs and the expert sample's state-action pairs used only for training, respectively, the outputs of which are the discount reward values. A new merit function is constructed by awarding discount of the critic network output value and the disarminator network output so as to continuously update the training of the operator network, and therefore the action generated by the operator network is continuously close to the action of the expert. When the number of times of training reaches the preset number of training rounds, the step is ended. In an embodiment, the number of training rounds set is 20 ten thousand rounds, the number of training steps in each round is 1 ten thousand steps, and the setting is a conventional setting in the art, and can be set by itself according to actual situations, which is not described herein.
In the concrete implementation of the step (6), after the training of the intelligent agent is finished through the process of the step (5), the trained intelligent agent is arranged in the network environment which needs to be subjected to the penetration test, so that the intelligent penetration test is carried out.
The technical conception of the invention is as follows: Reinforcement Learning (RL) is an artificial intelligence optimization technique whose key advantage is that no environment model is needed to generate an attack strategy; the optimal strategy is learned through interaction with the environment. Deep reinforcement learning uses a neural network as the parametric structure and combines the perception capability of deep learning with the decision-making capability of reinforcement learning to optimize the strategy. The present application therefore introduces a reinforcement learning agent as the simulated attacker in penetration testing. In the training process of the intelligent penetration test based on A3C-GAIL, the penetration test is framed as a Markov Decision Process (MDP), and information such as the operating system type, product name, product version, the selected exploit module and the target index (the target type of the attack) is input into the model as state data. The actions performed include a single scan operation and exploit for each service and each computer on the network, i.e., a full scan with Nmap, which returns the services running on each port of a given server and the version of each service. A corresponding action is then selected according to the current state, i.e., an attack payload is chosen from the payload list corresponding to the current exploit module, to obtain the next state (the target index may change randomly). Vulnerability penetration of the target server is carried out through Metasploit, and the penetration result is fed back to set the reward. In each sub-thread, the networks are trained independently: the generated vulnerability penetration scheme is input into Metasploit to perform module matching against the server, and Metasploit selects the corresponding exploit module according to the corresponding vulnerability information and selects the payload. The reward value, obtained by detecting the type of payload, is then fed back to the deep reinforcement learning model to guide policy generation. Payloads are output through the mutual transfer of parameters and gradients between the main network and each branch network, thereby guiding decision training and achieving the goals of identifying and penetrating vulnerabilities. The A3C-GAIL structure uses expert sample data to guide the generation of a better action distribution, which accelerates the detection of network vulnerabilities, improves the penetration efficiency for network vulnerabilities, and, using the Metasploit penetration framework in the single target machine scenario Metasploitable2, improves the agent's penetration efficiency and reduces the penetration cost.
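To illustrate the interaction loop described above, the sketch below wraps it as a small environment; `scan` and `exploit` are hypothetical callables standing in for the Nmap scan and the Metasploit module matching, and the reward values are illustrative only.

```python
# Sketch of the interaction loop described above, wrapped as a small
# environment. `scan` and `exploit` are hypothetical callables standing in
# for the Nmap scan and Metasploit module matching; reward values are
# illustrative only.
class PenTestEnv:
    def __init__(self, host, scan, exploit):
        self.host, self.scan, self.exploit = host, scan, exploit

    def reset(self):
        self.state = self.scan(self.host)              # full port scan -> state vector
        return self.state

    def step(self, payload_index):
        success, new_state = self.exploit(self.host, payload_index)
        reward = 100.0 if success else -1.0            # illustrative reward shaping
        self.state = new_state
        return self.state, reward, success             # episode ends when penetration succeeds
```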
The invention has the following beneficial effects: 1) a new intelligent penetration testing method is provided, and the combination of the A3C algorithm and the GAIL algorithm improves the efficiency of vulnerability penetration; 2) in the process of acquiring expert sample data, the successful state and action of each penetration are recorded, combining human experience with the machine algorithm and achieving human-machine collaboration; 3) in the training process of the GAIL network, once the sample data has been expanded to a certain scale, it is imported into the model for training; the actor network in the model generates corresponding actions for states sampled from the sample data, and the generated actions and the expert actions in the sample data are then put into the discriminator network for training until the actions generated by the policy network are infinitely close to the expert actions; 4) the method has been experimentally verified in the real single-machine target scenario Metasploitable2, and its penetration effect is better than that of the original DeepExploit model in terms of reward values, the number of successful penetration paths within a limited time, and other aspects.
Corresponding to the foregoing embodiments of the deep reinforcement learning intelligent penetration testing method based on imitation learning, the present application also provides embodiments of a deep reinforcement learning intelligent penetration testing apparatus based on imitation learning.
FIG. 5 is a block diagram illustrating an apparatus for deep reinforcement learning intelligent penetration testing based on imitation learning according to an exemplary embodiment. Referring to fig. 5, the apparatus may include:
the acquisition module 21, configured to acquire expert sample data, where the expert sample data is a state-action pair recorded when post-penetration succeeds;
an A3C training module 22, configured to train an agent using the A3C algorithm, where the agent serves as a simulated attacker in a penetration test;
the GAIL training module 23, configured to put a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and to train the discriminator network;
the updating module 24, configured to construct an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and to update the actor network in the A3C algorithm with the advantage function;
an agent training module 25, configured to repeat the process from the A3C training module to the update module, train the agent until the training round is completed;
and the penetration testing module 26 is used for arranging the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent can perform the penetration testing.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement it without inventive effort.
Correspondingly, the present application further provides an electronic device, comprising: one or more processors; a memory for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the deep reinforcement learning intelligent penetration testing method based on imitation learning as described above. FIG. 6 shows a hardware structure diagram of a device with data processing capability on which the imitation-learning-based deep reinforcement learning intelligent penetration testing method according to the embodiment of the present invention runs. In addition to the processor, memory and network interface shown in fig. 6, the device with data processing capability on which the apparatus is located may further include other hardware according to the actual function of that device, which is not described again.
Accordingly, the present application also provides a computer-readable storage medium having computer instructions stored thereon which, when executed by a processor, implement the imitation-learning-based deep reinforcement learning intelligent penetration testing method as described above. The computer-readable storage medium may be an internal storage unit, such as a hard disk or memory, of any device with data processing capability described in any of the foregoing embodiments. The computer-readable storage medium may also be an external storage device of that device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a Flash Card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit of a device with data processing capability and an external storage device. The computer-readable storage medium is used for storing the computer program and the other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof.

Claims (8)

1. A deep reinforcement learning intelligent penetration test method based on imitation learning is characterized by comprising the following steps:
(1) acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
(2) training an agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in a penetration test;
(3) putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
(4) constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
(5) Repeating the steps (2) - (4) until the training round is finished;
(6) And setting the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent performs penetration testing.
2. The method of claim 1, wherein training the agent using the A3C algorithm comprises:
(1.1) framing the penetration test as a Markov decision process;
(1.2) training all sub-threads of the agent by respectively adopting an AC algorithm, wherein the training process of each sub-thread comprises the following steps:
(1.2.1) inputting the state of the current moment to obtain a corresponding strategy;
(1.2.2) constructing an advantage function from the difference between the reward function and the value function to evaluate the policy, the advantage function being:
A(s,t) = r_t + γr_{t+1} + ... + γ^{n-1}r_{t+n-1} + γ^n·V(s') − V(s) = R(t) − V(s)
where γ is the discount factor with value range (0, 1), R(·) is the reward function, and V(·) is the value function;
(1.2.3) updating the parameters of the actor network and the critic network in the sub-thread by the policy gradient:
θ_i ← θ_i + ∇_{θ_i} log π(a|s; θ_i) · A(s; μ_i)
μ_i ← μ_i − ∂A(s; μ_i)² / ∂μ_i
where θ_i and μ_i are the parameters of the actor network and the critic network in the i-th sub-thread respectively, π(a|s; θ_i) is the policy function based on the actor network parameter θ_i, and A(s; μ_i) is the advantage function based on the critic network parameter μ_i;
(1.3) after all the sub-threads have completed one round of updating, updating the parameters θ and μ of the actor network and the critic network in the agent's main network according to the actor network and critic network parameters updated by all the sub-threads:
θ ← θ + Σ_{i=1}^{n} α_i·(θ_i − θ)
μ ← μ + Σ_{i=1}^{n} β_i·(μ_i − μ)
where n is the number of sub-threads, and α_i and β_i are respectively the learning rates for updating the i-th sub-thread's parameters θ_i and μ_i;
(1.4) repeating steps (1.2) and (1.3) until the training round is over.
3. The method of claim 1, wherein during the training of the agent, the state includes the vulnerability name, port service, service version number, penetration module and penetration target, the action is the load output by the penetration framework metasploit, and the reward is set according to whether penetration is successful and the type of the output load.
4. The method of claim 1, wherein step (3) comprises:
(3.1) initializing the discriminator network D and the actor network π;
(3.2) according to the preset numbers of first state-action pairs and second state-action pairs, attaching labels 0 and 1 to the first state-action pairs and the second state-action pairs respectively, and putting the labelled first and second state-action pairs into the GAIL discriminator network;
(3.3) training the discriminator network with the first state-action pairs: calculating the JS divergence, performing gradient back-propagation according to the JS divergence to update the parameters of the discriminator network, and adjusting the distributions of the first state-action pairs and the second state-action pairs;
(3.4) repeating step (3.3) until JS divergence is minimized:
E_{τ_i}[log D_w(s,a)] + E_{τ_E}[log(1 − D_w(s,a))]
where τ_i denotes the set of first state-action pairs, τ_E denotes the set of second (expert) state-action pairs, (s,a) is any state-action pair from the corresponding set, log D_w(s,a) represents the loss term on the actor network output, and log(1 − D_w(s,a)) represents the loss term on the expert sample pairs judged by the discriminator;
(3.5) after training of the discriminator network is finished, the state of the next moment is input into the actor network in GAIL to obtain the probabilities of all actions, the action corresponding to the maximum probability is selected as the final action, the critic network in GAIL simultaneously evaluates the action and outputs the corresponding value, and the discriminator network outputs a discount reward value for the action.
5. The method of claim 1, wherein step (4) comprises:
constructing a new advantage function π(θ) according to the discount reward value to guide the training of the actor network, and updating the parameter θ of the actor network by the policy gradient method:
θ ← θ + α_h·∇_θ π(θ)
where α_h ∈ R+ denotes the learning rate of the h-th step, and h is the number of training steps set for each training round in step (2).
6. A deep reinforcement learning intelligent penetration testing device based on simulation learning is characterized by comprising:
the acquisition module, used for acquiring expert sample data, wherein the expert sample data is a state-action pair recorded when post-penetration succeeds;
the A3C training module, used for training the agent by using the A3C algorithm, wherein the agent serves as a simulated attacker in the penetration test;
the GAIL training module, used for putting a first state-action pair generated by the actor network in the A3C algorithm and a second state-action pair from the expert sample data into the GAIL discriminator network during the training of the agent, and training the discriminator network;
the updating module, used for constructing an advantage function from the discount reward output by the trained discriminator network and the value output by the critic network, and updating the actor network in the A3C algorithm with the advantage function;
the agent training module is used for repeating the process from the A3C training module to the updating module, and training the agent until the training round is finished;
and the penetration testing module is used for setting the trained intelligent agent in a network environment needing penetration testing so that the intelligent agent can perform the penetration testing.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the method of any one of claims 1-5.
8. A computer-readable storage medium having stored thereon computer instructions, which when executed by a processor, perform the steps of the method according to any one of claims 1-5.
CN202211046763.0A 2022-08-30 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on imitation learning Pending CN115473706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211046763.0A CN202211046763.0A CN (en) 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on imitation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211046763.0A CN (en) 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on imitation learning

Publications (1)

Publication Number Publication Date
CN115473706A true CN115473706A (en) 2022-12-13

Family

ID=84368651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211046763.0A Pending CN115473706A (en) 2022-08-30 2022-08-30 Deep reinforcement learning intelligent penetration test method and device based on simulation learning

Country Status (1)

Country Link
CN (1) CN115473706A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116415687A (en) * 2022-12-29 2023-07-11 江苏东蓝信息技术有限公司 Artificial intelligent network optimization training system and method based on deep learning
CN116415687B (en) * 2022-12-29 2023-11-21 江苏东蓝信息技术有限公司 Artificial intelligent network optimization training system and method based on deep learning

Similar Documents

Publication Publication Date Title
Yamin et al. Cyber ranges and security testbeds: Scenarios, functions, tools and architecture
US20200410399A1 (en) Method and system for determining policies, rules, and agent characteristics, for automating agents, and protection
CN111310802B (en) Anti-attack defense training method based on generation of anti-network
Liu et al. Performing co-membership attacks against deep generative models
CN110503207A (en) Federation's study credit management method, device, equipment and readable storage medium storing program for executing
CN111783105B (en) Penetration test method, device, equipment and storage medium
Yamin et al. Modeling and executing cyber security exercise scenarios in cyber ranges
CN109446808A (en) Android countermeasure sample generation method and system based on DCGAN
CN113810406B (en) Network space security defense method based on dynamic defense graph and reinforcement learning
CN115473706A (en) Deep reinforcement learning intelligent penetration test method and device based on simulation learning
CN110807291B (en) On-site situation future guiding technology based on mimicry countermeasure learning mechanism
CN115102705B (en) Automatic network security detection method based on deep reinforcement learning
CN115580430A (en) Attack tree-pot deployment defense method and device based on deep reinforcement learning
Yamin et al. Use of cyber attack and defense agents in cyber ranges: A case study
Jin et al. Backdoor attack is a devil in federated gan-based medical image synthesis
CN110365625B (en) Internet of things security detection method and device and storage medium
CN113360917A (en) Deep reinforcement learning model security reinforcement method and device based on differential privacy
CN112685291A (en) System joint test method and related device
Huang et al. Attack detection and data generation for wireless cyber-physical systems based on self-training powered generative adversarial networks
Chen et al. GAIL-PT: a generic intelligent penetration testing framework with generative adversarial imitation learning
CN117610026B (en) Honey point vulnerability generation method based on large language model
CN115277153B (en) Smart grid 5G network risk assessment system and assessment method
CN113946758B (en) Data identification method, device, equipment and readable storage medium
Smith III Genetic program based data mining for fuzzy decision trees
CN114726622B (en) Back door attack influence evaluation method for power system data driving algorithm, system thereof and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination