CN116545657A - Automatic permeation deduction system and method - Google Patents

Automatic permeation deduction system and method

Info

Publication number
CN116545657A
CN116545657A (Application CN202310396049.2A)
Authority
CN
China
Prior art keywords
state
action
agent
network
attack
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310396049.2A
Other languages
Chinese (zh)
Inventor
傅涛
潘志松
詹达之
张磊
谢艺菲
王海洋
郑轶
余鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bozhi Safety Technology Co ltd
Original Assignee
Bozhi Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bozhi Safety Technology Co ltd filed Critical Bozhi Safety Technology Co ltd
Priority to CN202310396049.2A priority Critical patent/CN116545657A/en
Publication of CN116545657A publication Critical patent/CN116545657A/en
Withdrawn legal-status Critical Current

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 - Network architectures or network communication protocols for network security
    • H04L 63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1433 - Vulnerability analysis
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 - Arrangements for monitoring or testing data switching networks
    • H04L 43/06 - Generation of reports
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 63/00 - Network architectures or network communication protocols for network security
    • H04L 63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L 63/1441 - Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application discloses an automated permeation deduction system and method. The system comprises: a problem definition module for establishing a reinforcement learning model for the shortest hidden attack path discovery problem; an element module for defining the actions, states, rewards, and policies used in reinforcement learning; and a reinforcement learning module for selecting corresponding actions in the network environment using a DDPG algorithm that introduces multi-domain action selection, and for continuing to learn from the experience observed in the environment so as to find the shortest attack sequence and thereby locate the weakest place in the network. The method and system can automatically analyze the network environment of the target system, discover and verify potential vulnerability points and weaknesses of the target system, and reduce the cost of penetration testing.

Description

Automatic permeation deduction system and method
Technical Field
The application relates to a network security assessment system and method, belongs to the technical field of network security, and particularly relates to an automatic permeation deduction system and method.
Background
Penetration testing is an important assessment tool and means in network security. By assessing the vulnerability of existing network equipment and the effectiveness and integrity of network security tools, it achieves a comprehensive and detailed assessment of the risk factors that threaten the existing network's security. In a penetration test, the target system is penetrated from a hacker's point of view: attacks are simulated with hacking techniques to mine and detect vulnerabilities in the target network system and to verify the security of the system, so as to find the weaknesses in the target network one step ahead of the hacker and thereby formulate an effective security policy for precaution. Penetration testing is therefore also an important part of attack surface management in an active network security defense system.
Traditional penetration testing techniques mainly have the following shortcomings. On the one hand, traditional testing relies mainly on manual operation by a penetration tester: during the test, the tester must make experience-based judgments with the help of penetration testing tools, use various methods to obtain information about the target system, explore and identify weak points, carry out exploitation and post-exploitation testing, and finally describe the whole penetration testing process in a report document that analyzes the risk points in the system and provides remediation suggestions. It is easy to see that traditional testing depends heavily on the experience level of the testers and places high demands on their knowledge; at the same time, penetration testing is complex and cumbersome and involves a large number of repeated operations, so it requires substantial time and labor costs. On the other hand, a traditional penetration test can only be regarded as a snapshot of the system security posture at a certain moment: the environment may be updated many times after the test, and new potential vulnerabilities and configuration errors that did not exist at test time may be introduced in the process, so that many penetration test reports are outdated before delivery and their timeliness is low.
Disclosure of Invention
According to one aspect of the application, an automated permeation deduction system is provided. The system can automatically analyze the network environment in which the target system is located, discover and verify potential vulnerability points and weaknesses of the target system, achieve immediate delivery of the test report, and reduce the time and labor cost of penetration testing. The automated permeation deduction system comprises a problem definition module, an element module, and a reinforcement learning module, wherein:
the problem definition module is used for establishing a reinforcement learning model for the shortest hidden attack path discovery problem in the multi-domain network space;
the element module is used for defining the state and the action of the agent and setting rewards according to the action result;
the reinforcement learning module is used for performing multi-domain action selection based on constraint relation between actions and states of the agents, performing reinforcement learning according to the reinforcement learning model, finding out the shortest attack sequence step length, and then finding out the weakest place in the network.
Preferably, the problem definition module is configured to treat the shortest hidden attack path discovery problem as a Markov decision process (MDP):
M = (s, a, p, r, γ)
where s ∈ S is the current state of the network space, a ∈ A is the currently available attack action, p is the probability of transition between states, r is the reward value after the agent takes an action and reaches the next state, and γ is the discount rate;
in the initial state s_0, the network space environment is configured and the agent is trained as an attacker; the terminal state s_t is defined as an attack in which the attacker succeeds, or fails within a limited number of steps;
in each attack step sequence, the agent takes one action to complete one attack step; at each step t, the agent starts from state s_t, takes an action a_t, reaches a new state s_{t+1}, and receives a reward r_t from the network environment;
where s_t denotes the state at time t, including: the location of the agent, the running state of the computer, the running state of the service, and the access state of the service; s_{t+1} denotes the state at the next time, containing at least one of the following pieces of information: new location information of the agent, new service information the agent may acquire, and permission information the agent may acquire after accessing a certain service.
Preferably, the element module defines the state, actions, and rewards of the agent as follows:
the state is the set of possible states in the multi-domain network space;
the actions are the set of actions the agent can take, which can change the state of the network space;
the reward is the reward the agent obtains for taking an action in a given state.
Preferably, the reinforcement learning module adopts a modified DDPG algorithm, so that the agent can select different actions under different states.
Preferably, the reinforcement learning module is configured to perform the following steps:
storing the sequence (s_t, a_t, r_t, s_{t+1}) through the online policy network, where the sequence represents: executing action a_t in state s_t yields the reward value r_t and transitions to the next state s_{t+1};
when the policy network selects an action a_t that is not executable in the current state, it is mapped to a feasible action a_t' using a linear transformation, and the related sequence is defined as (s_t, a_t, -∞, s_{t+1}), which represents: after executing action a_t in state s_t the state remains s_t, and the reward is a very large negative value, ensuring that this action is not selected during training.
Preferably, the reinforcement learning module comprises a memory replay unit and four networks, wherein:
the memory replay unit is used to store the state transition tuples (s_t, a_t, r_t, s_{t+1}); for mini-batch sampling, corresponding samples are extracted to train the corresponding neural networks, so as to avoid strong correlation between samples; the four networks comprise an online policy network, a target policy network, an online Q network, and a target Q network; the policy network is used to simulate the behavior of an attacker, taking the current state as input and outputting the action the agent takes in that state; the Q network is used to estimate the expected total reward finally obtained if the policy continues to be executed after the current action is executed at a certain moment; its input is the current state and the current action, and its output is the Q value.
According to yet another aspect of the present application, there is provided an automated permeation deduction method comprising:
constructing a reinforcement learning model according to the shortest hidden attack path discovery problem in the multi-domain network space;
defining the state and action of the agent, and setting rewards according to the result of the action;
and selecting multi-domain actions based on constraint relations between actions and states of the agents, performing reinforcement learning according to the reinforcement learning model, finding out the shortest attack sequence step length, and then finding out the weakest place in the network.
Preferably, the constructing the reinforcement learning model according to the shortest hidden attack path discovery problem in the multi-domain network space includes:
treating the shortest hidden attack path discovery problem as a Markov decision process (MDP): M = (s, a, p, r, γ), where s ∈ S is the current state of the network space and S is the state space; a ∈ A is the currently available attack action, and A is the action space, representing the set of valid actions the agent can take in state s_t; p is the probability of transition between states; r is the reward value after the agent takes an action and reaches the next state; γ is the discount rate;
in the initial state s_0, the network space environment is configured and the agent is trained as an attacker; the terminal state s_t corresponds to an attack in which the attacker succeeds, or fails within a limited number of steps; in each attack step sequence, the agent takes one action to complete one attack step; at each step t, the agent starts from state s_t, takes an action a_t, reaches a new state s_{t+1}, and receives a reward r_t from the network environment; s_t is defined as the state at time t, including the location of the agent, the computer running state, the service running state, and the service access state; s_{t+1} is the next state and represents the update of the state, which includes: new location information about the agent, new service information the agent may acquire, and permission information the agent may acquire after accessing a certain service.
Preferably, the state of the agent is defined as: the set of possible states in the multi-domain network space;
the actions are defined as: the set of actions the agent can take, which can change the state of the network space;
the reward is defined as: the reward the agent obtains for taking an action in a given state.
Preferably, the searching of the shortest attack sequence step length adopts a modified DDPG algorithm, and uses a agent to represent an attacker; in the process of finding an attack path, the agent obtains a certain reward or negative reward after selecting actions in the current state;
finding the corresponding policy mapping function R(s) → A that maximizes the long-term reward of the agent, outputting the multi-domain actions to be executed, and finding the shortest attack sequence step length in the network space;
when the selected action a_t is not executable, it is mapped to a feasible action a_t' using a linear transformation, and the action sequence at this time is defined as (s_t, a_t, -∞, s_{t+1}), which represents: after executing action a_t in state s_t the state remains s_t, and the reward is a very large negative value, ensuring that this action is not selected during training.
The beneficial effects that this application can produce include:
(1) The method and system can automatically analyze the network environment of the target system, discover and verify its potential vulnerability points and weaknesses, free network security specialists from complex and repetitive labor, and reduce the cost of penetration testing;
(2) The method uses reinforcement learning to find the penetration path. It improves the existing DDPG algorithm by introducing multi-domain action selection, which solves the problem that the selectable actions of reinforcement learning differ across states; it can therefore better find the optimal attack path hidden in the multi-domain network space and locate the weakest point of the network space so that it can be effectively repaired;
(3) The system and method are suitable for daily penetration testing: the penetration test can be executed automatically every day, or even after each change, to detect configuration changes that could be exploited. This avoids the lag inherent in existing penetration testing techniques and can meet the requirement of delivering test reports immediately.
Drawings
FIG. 1 is a schematic diagram of an automated permeation deduction system framework in one embodiment of the present application.
Detailed Description
The present application is described in detail below with reference to examples, but the present application is not limited to these examples.
As shown in fig. 1, the automated permeation deduction system described herein includes a problem definition module, an element module, and a reinforcement learning module.
The problem definition module is used for establishing a reinforcement learning model for the shortest hidden attack path discovery problem in the multi-domain network space.
In one embodiment, the problem definition module treats the shortest hidden attack path discovery problem as a Markov decision process (MDP):
M = (s, a, p, r, γ)
where s ∈ S is the current state of the network space, a ∈ A is the currently available attack action, p is the probability of transition between states, r is the reward value after the agent takes an action and reaches the next state, and γ is the discount rate;
in the initial state s_0, the network space environment is configured and the agent is trained as an attacker; the terminal state s_t is defined as an attack in which the attacker succeeds, or fails within a limited number of steps;
in each attack step sequence, the agent takes one action to complete one attack step; at each step t, the agent starts from state s_t, takes an action a_t, reaches a new state s_{t+1}, and receives a reward r_t from the network environment;
where s_t denotes the state at time t, including: the location of the agent, the running state of the computer, the running state of the service, and the access state of the service; s_{t+1} denotes the state at the next time, containing at least one of the following pieces of information: new location information of the agent, new service information the agent may acquire, and permission information the agent may acquire after accessing a certain service.
The element module is used for defining the state and the action of the agent and setting rewards according to the result of the action.
In one embodiment, the status of the agent is defined as: a set of possible states in a multi-domain network space.
In one embodiment, the actions of the agent are defined as: the set of actions the agent may take, which can change the state of the network space.
In one embodiment, the reward refers to: the reward the agent obtains for taking an action in a given state.
The reinforcement learning module is used for performing multi-domain action selection based on constraint relation between actions and states of the agents, performing reinforcement learning according to the reinforcement learning model, finding out the shortest attack sequence step length, and then finding out the weakest place in the network.
In one embodiment, the reinforcement learning module employs a modified DDPG algorithm, which enables agents to select different actions in different states.
In one embodiment, the reinforcement learning module specifically performs the following steps:
storing the sequence (s_t, a_t, r_t, s_{t+1}) through the online policy network, where the sequence represents: executing action a_t in state s_t yields the reward value r_t and transitions to the next state s_{t+1};
when the policy network selects an action a_t that is not executable in the current state, it is mapped to a feasible action a_t' using a linear transformation, and the related sequence is defined as (s_t, a_t, -∞, s_{t+1}), which represents: after executing action a_t in state s_t the state remains s_t, and the reward is a very large negative value, ensuring that this action is not selected during training.
In one embodiment, the reinforcement learning module includes a memory replay unit and four networks, wherein:
the memory replay unit is used to store the state transition tuples (s_t, a_t, r_t, s_{t+1}); for mini-batch sampling, corresponding samples are extracted to train the corresponding neural networks, so as to avoid strong correlation between samples;
the four networks comprise an online policy network, a target policy network, an online Q network, and a target Q network; the policy network is used to simulate the behavior of an attacker, taking the current state as input and outputting the action the agent takes in that state; the Q network is used to estimate the expected total reward finally obtained if the policy continues to be executed after the current action is executed at a certain moment; its input is the current state and the current action, and its output is the Q value.
The automated permeation deduction method comprises the following steps: defining the states and actions of the agent, i.e., the states the agent observes in the network environment and the corresponding actions it selects according to those states; setting the reward according to the result of the action; and applying the DDPG algorithm that incorporates multi-domain action selection, i.e., selecting the corresponding action in the network environment and then continuing to learn from the experience observed in the environment.
In one embodiment, the specific flow of the method is as follows:
step 1: problem definition.
The goal of this step is to find the shortest hidden penetration path, i.e. the shortest hidden attack path, so that the network administrator can find the weakest link in the network to take measures to strengthen the network space security.
Taking the example shown in fig. 1, in a given network space environment, the server S2 stores sensitive data thereon. If the attacker can access the server S2 and obtain sensitive data, this indicates that the attack was successful. Therefore, in the present embodiment, an attacker agent can be trained to access the server S2 and acquire its sensitive data using reinforcement learning.
Defining the shortest hidden attack path: an attacker can access and acquire sensitive data from the server S2 and the attack sequence step size is the shortest.
The discovery problem of the shortest hidden attack path is regarded as a Markov decision process (MDP), namely:
M = (s, a, p, r, γ)
where s ∈ S is the current state of the network space, a ∈ A is the currently available attack action, p is the probability of transition between states, r is the reward value after the agent takes an action and reaches the next state, and γ is the discount rate.
In this setting, the network space environment is configured in the initial state s_0 (including operations such as establishing an access control list), and then the agent continuously penetrates the network as an attacker for training. Meanwhile, matrix completion is performed according to the partially observed environment state, i.e., the state of the network space environment is updated according to the new information acquired by the agent. The terminal state s_t corresponds to an attack in which the attacker succeeds, or fails within a limited number of steps. In each attack step sequence, the agent takes one action to complete one attack step. At each step t, the agent starts from state s_t, takes an action a_t, reaches a new state s_{t+1}, and receives a reward r_t from the network environment. s_t is defined as the state at time t, including the location of the agent, the running state of the computer, the running state of the service, the access state of the service, and so on. s_{t+1} represents the update of the state, including new location information about the agent, new service information the agent may acquire, and permission information the agent may acquire after accessing a certain service. The action space A contains the set of actions the agent can take in state s_t. The goal of this work is to find the shortest attack path, so in the reward setting below, the reward is set to a fixed value divided by the length (number of steps) of the penetration attack sequence.
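For illustration only, a minimal Python sketch of an environment with this structure is given below. It is not part of the claimed embodiment: the class and parameter names (PenetrationEnv, REWARD_CONSTANT, MAX_STEPS) and the feature encoding are assumptions introduced here, and the transition and goal tests are placeholders. The sketch only shows how the terminal condition and the reward "fixed value divided by the attack sequence length" fit together.

```python
# Illustrative sketch only; names and dynamics are assumptions, not the embodiment.
import numpy as np

REWARD_CONSTANT = 100.0   # fixed value divided by the attack-sequence length
MAX_STEPS = 50            # attacks that do not succeed within this budget fail

class PenetrationEnv:
    """Toy multi-domain network environment following M = (s, a, p, r, gamma)."""

    def __init__(self, num_state_features=16, num_actions=8, gamma=0.99):
        self.num_state_features = num_state_features
        self.num_actions = num_actions
        self.gamma = gamma
        self.reset()

    def reset(self):
        # s_0: configured network space (agent location, host/service/access states)
        self.state = np.zeros(self.num_state_features, dtype=np.float32)
        self.steps = 0
        return self.state

    def step(self, action):
        self.steps += 1
        next_state = self._transition(self.state, action)  # transition with probability p
        success = self._goal_reached(next_state)            # e.g. sensitive data on S2 read
        done = success or self.steps >= MAX_STEPS
        # shortest-path objective: reward = fixed value / attack sequence length
        reward = REWARD_CONSTANT / self.steps if success else 0.0
        self.state = next_state
        return next_state, reward, done, {}

    def _transition(self, state, action):
        # placeholder dynamics; a real environment would update the agent location,
        # host/service running states and access permissions here
        new_state = state.copy()
        new_state[action % self.num_state_features] = 1.0
        return new_state

    def _goal_reached(self, state):
        # placeholder goal test, e.g. "sensitive data on server S2 acquired"
        return bool(state[-1] > 0.5)
```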
Step 2: and strengthening and learning each element.
The elements to be reinforcement-learned mainly comprise states, actions and rewards of the agent.
State: the set of possible states in the multi-domain network space, including the location of the agent, the device being operated, and the device permissions owned. The core permissions of the multi-domain network space are those an attacker can obtain through a series of actions.
Action: the set of actions the agent may take, which can change the state of the network space. For example: entering a room, operating or controlling a terminal computer, or accessing a service port through a terminal computer port.
Reward: the reward the agent obtains for taking an action in a given state. In this embodiment, the agent's goal is to obtain the sensitive data on server S2 and the key or password needed to decrypt that information. In this case, the primary goal of the attacker is to obtain the device usage and device control rights of server S2 and a password to decrypt the sensitive data; obtaining the read permission for the sensitive data indicates that the attack was successful. Because of the multi-domain security rules, an attacker cannot directly access server S2. In order to access server S2, the attacker needs to access the firewall to modify the access control list so that access to servers S1 and S2 through the terminal is allowed; the attacker can then access server S2, and if the read permission for the confidential information is successfully obtained, the attacker accesses the firewall again and restores the firewall state to its original configuration.
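As a concrete illustration of these three elements for the server S2 scenario, the hedged sketch below encodes a state, an action set, and a reward function in Python. The names (Action, AgentState, compute_reward) and the specific action list are illustrative assumptions, not the literal data structures of the embodiment.

```python
# Hedged sketch of the three reinforcement-learning elements for the S2 scenario;
# all names and the concrete action list are illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum, auto

class Action(Enum):
    ENTER_ROOM = auto()
    CONTROL_TERMINAL = auto()
    MODIFY_FIREWALL_ACL = auto()      # allow terminal -> S1/S2
    ACCESS_SERVICE_PORT = auto()
    READ_SENSITIVE_DATA = auto()      # on server S2
    RESTORE_FIREWALL = auto()
    STAY = auto()

@dataclass
class AgentState:
    location: str = "outermost_space"
    controlled_devices: set = field(default_factory=set)
    permissions: set = field(default_factory=set)          # e.g. {"S2:read", "S2:key"}
    firewall_acl_modified: bool = False

def compute_reward(state: AgentState, steps_taken: int) -> float:
    """Reward is only granted when the goal is fully reached: sensitive data on S2
    read (with the decryption key) and the firewall restored to its original state."""
    goal = ("S2:read" in state.permissions
            and "S2:key" in state.permissions
            and not state.firewall_acl_modified)
    return 100.0 / steps_taken if goal else 0.0
```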
Step 3: the DDPG algorithm incorporating multi-domain action selection.
An attack path hidden under a given network space configuration is discovered by the agent. With the DDPG algorithm, an agent is used to represent an attacker. When discovering an attack path, the agent first selects an action in the current state, which can change the state of the environment and of the agent; at the same time the agent receives a certain positive or negative reward. In addition, the agent's changed state enables it to perform other actions to obtain more rewards. Thus, the agent is trained by repeated trials in this network space environment.
The purpose of the above procedure is to find the shortest attack sequence, selecting the corresponding action according to the current state; this amounts to finding the corresponding policy mapping function R(s) → A that maximizes the long-term reward of the agent. In this process, policies can be divided into two categories, deterministic policies and stochastic policies. A deterministic policy selects the corresponding action from the state-action values; in general, deterministic policy algorithms are more efficient but suffer from insufficient exploration and improvement capability. A stochastic policy, by contrast, adds a corresponding random term and therefore has a certain exploration capability. For the proposed model, since the multi-domain action selection module is introduced, there are few selectable actions in the same state, and a deterministic policy is adopted to ensure better model performance.
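The difference between the two policy types can be summarized in the short sketch below, assuming a PyTorch policy network. The exploration-noise term in the deterministic case is standard DDPG practice and an assumption here, not something the embodiment prescribes.

```python
# Sketch contrasting the two policy types discussed above (PyTorch assumed).
import torch

def deterministic_action(policy_net, state, noise_std=0.0):
    """DDPG-style selection: a = mu(s), optionally perturbed for exploration."""
    with torch.no_grad():
        action = policy_net(state)
    if noise_std > 0.0:
        action = action + noise_std * torch.randn_like(action)
    return action

def stochastic_action(policy_net, state):
    """Stochastic alternative: sample a ~ pi(.|s) from the network's output logits."""
    with torch.no_grad():
        logits = policy_net(state)
    return torch.distributions.Categorical(logits=logits).sample()
```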
The model is a standard reinforcement learning model in which the agent takes actions and obtains rewards in the network space; the goal is to maximize the reward the agent obtains, and the agent is then trained further. In this application, the DDPG algorithm is adopted and its action selection part is improved.
Compared with the standard DDPG algorithm, the model of the present application comprises four networks and one memory replay unit. The memory replay unit is mainly responsible for storing the state transition tuples (s_t, a_t, r_t, s_{t+1}); for mini-batch sampling, the corresponding samples are extracted to train the corresponding neural networks, thereby avoiding strong correlation between samples. Among the four networks there are two policy networks and two Q networks, namely the online and target policy networks and the online and target Q networks. The policy network mainly simulates the behavior of an attacker: the neural network takes the current state as input, and its output is the action taken by the agent in that state. The Q network is mainly used to estimate the expected total reward finally obtained if the policy continues to be executed after the current action is executed at a certain moment; its input is the current state and the current action, and its output is the Q value. If only a single neural network is used to approximate the policy or the Q value, the learning process is unstable, so in the DDPG algorithm a copy of each network is created for the policy network and the Q network respectively: two online networks and two target networks. The online networks are trained in real time, and their model parameters are copied to the target networks after a period of training, which ensures the stability of the training process and makes convergence easier.
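A minimal sketch of this architecture, assuming PyTorch, is given below. The network widths, buffer capacity, and soft-update coefficient tau are illustrative assumptions; only the overall structure (one memory replay unit, online/target policy networks, online/target Q networks, periodic copying of online parameters into the target networks) follows the description above.

```python
# Minimal sketch of the four networks and the memory replay unit (PyTorch).
import copy
import random
from collections import deque
import torch
import torch.nn as nn

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # mini-batch sampling breaks the strong correlation between consecutive samples
        return random.sample(self.buffer, batch_size)

def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim = 16, 8
online_policy = mlp(state_dim, action_dim)       # simulates the attacker: s -> a
online_q      = mlp(state_dim + action_dim, 1)   # (s, a) -> Q value
target_policy = copy.deepcopy(online_policy)     # stabilising copies
target_q      = copy.deepcopy(online_q)

def soft_update(target, online, tau=0.005):
    """Copy online parameters into the target network gradually, for training stability."""
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * o_param.data)
```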
The application improves the standard DDPG algorithm; compared with the standard DDPG algorithm, there are mainly the following two differences:
first, in DDPG algorithms for multi-domain action selection, reinforcement learning modules are introduced. In a standard DDPG algorithm, the selection of actions that an agent can choose to perform is the same in each state. In such a cyberspace environment, however, when an attacker selects an attack path in the cyberspace, it has different selectable actions in each state. For example, if an attacker is in the outermost space, it may choose only the action "enter room" or "remain stationary", but in the standard DDPG algorithm, all actions in the network environment should be selectable. Obviously, the agent in the outermost space cannot have an action "control computer terminal". Therefore, in order to enable the DDPG algorithm to select different actions under different states, a reinforcement learning module is added. The input of the module is the output of the online policy network, the output action a changes linearly in this state, and the actual action a is executed t ' action a is actually performed t ' enter multi-domain action execution ModuleBlock, obtain corresponding rewards r t '. Finally, the corresponding action a is actually executed t ' and corresponding rewards r t ' return to the online policy network. By this method, a reasonable selection of relevant actions in different states can be achieved.
Second, the input to the experience replay mechanism is different. To ensure that the multi-domain action selection satisfies the constraints between actions and states, the input of the experience replay mechanism is extended. It does not only store the sequence (s_t, a_t, r_t, s_{t+1}) through the online policy network, i.e., executing action a_t in state s_t yields the reward value r_t and transitions to the next state s_{t+1}. In addition, because of this correspondence, the current state and the agent's action need to be considered during selection, to avoid the policy network selecting an action that is not feasible in that state. Thus, when the policy network selects an action a_t that is not executable in the current state, the multi-domain action selection module not only maps it to a feasible action a_t' using a linear transformation, but the related sequence is also written in the form (s_t, a_t, -∞, s_{t+1}), which represents: after executing action a_t in state s_t the state remains s_t, and the reward is a very large negative value, thereby guaranteeing that this action is not selected during training. In the network structure, the two policy networks have the same architecture; the input is the network state, and the output is the action to be selected by the agent. Structurally, the policy network mainly consists of 5 layers: the first layer is the input layer, the second layer is a hidden layer, the third and fourth layers are fully connected layers, and the fifth layer is the output layer, which finally outputs a multi-dimensional vector a, where a represents the multi-domain actions to be executed.
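A sketch of this 5-layer policy network, assuming PyTorch, follows. The layer widths and activation function are illustrative assumptions; the layer count and the input/output roles follow the description above.

```python
# Sketch of the 5-layer policy network; widths and activations are assumptions.
import copy
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.input_layer  = nn.Linear(state_dim, hidden)    # layer 1: input
        self.hidden_layer = nn.Linear(hidden, hidden)        # layer 2: hidden
        self.fc3          = nn.Linear(hidden, hidden)        # layer 3: fully connected
        self.fc4          = nn.Linear(hidden, hidden)        # layer 4: fully connected
        self.output_layer = nn.Linear(hidden, action_dim)    # layer 5: output
        self.act = nn.ReLU()

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = self.act(self.input_layer(state))
        x = self.act(self.hidden_layer(x))
        x = self.act(self.fc3(x))
        x = self.act(self.fc4(x))
        # multi-dimensional vector a representing the multi-domain actions to execute
        return self.output_layer(x)

# The online and target policy networks share this structure:
online_policy = PolicyNetwork(state_dim=16, action_dim=8)
target_policy = copy.deepcopy(online_policy)
```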
The foregoing description is merely a few examples of the present application and is not intended to limit the present application in any way. Although the present application is disclosed above by way of preferred examples, they are not intended to limit it; any person skilled in the art may make changes or modifications using the disclosed technical content without departing from the scope of the technical solution of the present application, and such equivalent embodiments fall within the scope of the technical solution.

Claims (10)

1. An automatic permeation deduction system, characterized by comprising a problem definition module, an element module, and a reinforcement learning module, wherein:
the problem definition module is used for finding problems according to the shortest hidden attack path in the multi-domain network space and constructing a reinforcement learning model;
the element module is used for defining the state and the action of the agent and setting rewards according to the action result;
the reinforcement learning module is used for performing multi-domain action selection based on constraint relation between actions and states of the agents, performing reinforcement learning according to the reinforcement learning model, finding out the shortest attack sequence step length, and then finding out the weakest place in the network.
2. The automated penetration deduction system according to claim 1, wherein the problem definition module is configured to treat the shortest hidden attack path discovery problem as a Markov decision process (MDP):
M = (s, a, p, r, γ)
where s ∈ S is the current state of the network space, a ∈ A is the currently available attack action, p is the probability of transition between states, r is the reward value after the agent takes an action and reaches the next state, and γ is the discount rate;
in the initial state s_0, the network space environment is configured and the agent is trained as an attacker; the terminal state s_t is defined as an attack in which the attacker succeeds, or fails within a limited number of steps;
in each attack step sequence, the agent takes one action to complete one attack step; at each step t, the agent starts from state s_t, takes an action a_t, reaches a new state s_{t+1}, and receives a reward r_t from the network environment;
where s_t denotes the state at time t, including: the location of the agent, the computer running state, the service running state, and the service access state; s_{t+1} denotes the state at the next time, containing at least one of the following pieces of information: new location information of the agent, new service information the agent may acquire, and permission information the agent may acquire after accessing a certain service.
3. The automated permeation deduction system according to claim 1, wherein the element module defines the status, actions, rewards for the agent as follows:
the state is a set of possible states in a multi-domain network space;
the actions are action sets which can be taken by the agent and can change the state of the network space;
the reward is the reward the agent obtains for taking an action in a given state.
4. The automated permeation deduction system of claim 1, wherein the reinforcement learning module employs a modified DDPG algorithm for enabling agents to select different actions in different states.
5. The automated permeation deduction system according to claim 4, wherein the reinforcement learning module is configured to perform the steps of:
storing the sequence (s_t, a_t, r_t, s_{t+1}) through the online policy network, where the sequence represents: executing action a_t in state s_t yields the reward value r_t and transitions to the next state s_{t+1};
when the policy network selects an action a_t that is not executable in the current state, it is mapped to a feasible action a_t' using a linear transformation, and the related sequence is defined as (s_t, a_t, -∞, s_{t+1}), which represents: after executing action a_t in state s_t the state remains s_t, and the reward is a very large negative value, ensuring that this action is not selected during training.
6. The automated permeation deduction system according to claim 4, wherein the reinforcement learning module comprises a memory replay unit and four networks, wherein:
the memory replay unit is used to store the state transition tuples (s_t, a_t, r_t, s_{t+1}); for mini-batch sampling, corresponding samples are extracted to train the corresponding neural networks, so as to avoid strong correlation between samples; the four networks comprise an online policy network, a target policy network, an online Q network, and a target Q network; the policy network is used to simulate the behavior of an attacker, taking the current state as input and outputting the action the agent takes in that state; the Q network is used to estimate the expected total reward finally obtained if the policy continues to be executed after the current action is executed at a certain moment; its input is the current state and the current action, and its output is the Q value.
7. An automated permeation deduction method, comprising:
constructing a reinforcement learning model according to the shortest hidden attack path discovery problem in the multi-domain network space;
defining the state and action of the agent, and setting rewards according to the result of the action;
and selecting multi-domain actions based on constraint relations between actions and states of the agents, performing reinforcement learning according to the reinforcement learning model, finding out the shortest attack sequence step length, and then finding out the weakest place in the network.
8. The automated penetration deduction method according to claim 7, wherein constructing the reinforcement learning model according to the shortest hidden attack path discovery problem in the multi-domain network space comprises:
treating the shortest hidden attack path discovery problem as a Markov decision process (MDP): M = (s, a, p, r, γ), where s ∈ S is the current state of the network space, a ∈ A is the currently available attack action, p is the probability of transition between states, r is the reward value after the agent takes an action and reaches the next state, and γ is the discount rate;
in the initial state s_0, the network space environment is configured and the agent is trained as an attacker; the terminal state s_t corresponds to an attack in which the attacker succeeds, or fails within a limited number of steps; in each attack step sequence, the agent takes one action to complete one attack step; at each step t, the agent starts from state s_t, takes an action a_t, reaches a new state s_{t+1}, and receives a reward r_t from the network environment; s_t is defined as the state at time t, including the location of the agent, the computer running state, the service running state, and the service access state; s_{t+1} is the state at the next time and represents the update of the state, which includes: new location information about the agent, new service information the agent may acquire, and permission information the agent may acquire after accessing a certain service.
9. The automated permeation deduction method according to claim 7, wherein the status of the agent is defined as: a set of possible states in a multi-domain network space;
the action is defined as: the agent can take action sets and can change the state of the network space;
the reward is defined as: the reward the agent obtains for taking an action in a given state.
10. The automated penetration deduction method according to claim 7, wherein the shortest attack sequence step is found using an improved DDPG algorithm, using agents to represent an attacker; in the process of finding an attack path, the agent obtains a certain reward or negative reward after selecting actions in the current state;
finding the corresponding policy mapping function R(s) → A that maximizes the long-term reward of the agent, outputting the multi-domain actions to be executed, and finding the shortest attack sequence step length in the network space;
when the selected action a_t is not executable, it is mapped to a feasible action a_t' using a linear transformation, and the action sequence at this time is defined as (s_t, a_t, -∞, s_{t+1}), which represents: after executing action a_t in state s_t the state remains s_t, and the reward is a very large negative value, ensuring that this action is not selected during training.
CN202310396049.2A 2023-04-13 2023-04-13 Automatic permeation deduction system and method Withdrawn CN116545657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310396049.2A CN116545657A (en) 2023-04-13 2023-04-13 Automatic permeation deduction system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310396049.2A CN116545657A (en) 2023-04-13 2023-04-13 Automatic permeation deduction system and method

Publications (1)

Publication Number Publication Date
CN116545657A true CN116545657A (en) 2023-08-04

Family

ID=87444404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310396049.2A Withdrawn CN116545657A (en) 2023-04-13 2023-04-13 Automatic permeation deduction system and method

Country Status (1)

Country Link
CN (1) CN116545657A (en)

Similar Documents

Publication Publication Date Title
Erdődi et al. Simulating SQL injection vulnerability exploitation using Q-learning reinforcement learning agents
Münkemüller et al. From diversity indices to community assembly processes: a test with simulated data
Maeda et al. Automating post-exploitation with deep reinforcement learning
García‐Carreras et al. An empirical link between the spectral colour of climate and the spectral colour of field populations in the context of climate change
Zennaro et al. Modelling penetration testing with reinforcement learning using capture‐the‐flag challenges: Trade‐offs between model‐free learning and a priori knowledge
CN108809979A (en) Automatic intrusion response decision-making technique based on Q-learning
US11765196B2 (en) Attack scenario simulation device, attack scenario generation system, and attack scenario generation method
Alkussayer et al. A scenario-based framework for the security evaluation of software architecture
CN115102705B (en) Automatic network security detection method based on deep reinforcement learning
Cody A layered reference model for penetration testing with reinforcement learning and attack graphs
Holm Lore a red team emulation tool
Cody et al. Towards continuous cyber testing with reinforcement learning for whole campaign emulation
Wolk et al. Beyond cage: Investigating generalization of learned autonomous network defense policies
Whitaker Generating cyberattack model components from an attack pattern database
CN116545657A (en) Automatic permeation deduction system and method
Li et al. INNES: An intelligent network penetration testing model based on deep reinforcement learning
CN115333806A (en) Penetration test attack path planning method and device, electronic equipment and storage medium
Xu et al. Autoattacker: A large language model guided system to implement automatic cyber-attacks
Gjerstad Generating labelled network datasets of APT with the MITRE CALDERA framework
CN114444086A (en) Automatic Windows domain penetration method based on reinforcement learning
Ghanem Towards an efficient automation of network penetration testing using model-based reinforcement learning
Cantrell Verification and validation methods for extended petri nets modeling cyberattacks
Ragsdale et al. On Designing Low-Risk Honeypots Using Generative Pre-Trained Transformer Models With Curated Inputs
Dugarte‐Peña et al. Using system dynamics to teach about dependencies, correlation and systemic thinking on the software process workflows
Jiang et al. An exploitability analysis technique for binary vulnerability based on automatic exception suppression

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230804

WW01 Invention patent application withdrawn after publication