CN115174221B - Industrial control OT network multi-target penetration test method and system - Google Patents

Industrial control OT network multi-target penetration test method and system

Info

Publication number
CN115174221B
CN115174221B CN202210789167.5A
Authority
CN
China
Prior art keywords
network
attack
state
penetration test
target
Prior art date
Legal status
Active
Application number
CN202210789167.5A
Other languages
Chinese (zh)
Other versions
CN115174221A (en)
Inventor
王凯
吴贤生
王子博
张耀方
王佰玲
Current Assignee
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Original Assignee
Weihai Tianzhiwei Network Space Safety Technology Co ltd
Harbin Institute of Technology Weihai
Priority date
Filing date
Publication date
Application filed by Weihai Tianzhiwei Network Space Safety Technology Co ltd, Harbin Institute of Technology Weihai filed Critical Weihai Tianzhiwei Network Space Safety Technology Co ltd
Priority to CN202210789167.5A priority Critical patent/CN115174221B/en
Publication of CN115174221A publication Critical patent/CN115174221A/en
Application granted granted Critical
Publication of CN115174221B publication Critical patent/CN115174221B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H04L63/1433: Vulnerability analysis (H Electricity › H04 Electric communication technique › H04L Transmission of digital information, e.g. telegraphic communication › H04L63/00 Network architectures or network communication protocols for network security › H04L63/14 Detecting or protecting against malicious traffic)
    • H04L43/50: Testing arrangements (H04L43/00 Arrangements for monitoring or testing data switching networks)
    • H04L63/02: Separating internal from external traffic, e.g. firewalls
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS] (Y02P Climate change mitigation technologies in the production or processing of goods; Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An industrial control OT network multi-target penetration test method and system: network information about the system under test is collected and an attack graph is generated; a Markov model is abstracted from the attack graph and state transition rewards are assigned; a reinforcement learning algorithm interacts with the Markov model to obtain an optimal attack strategy; and a penetration test tool is invoked to verify the optimal attack strategy. The method and system solve the technical problems of single test targets and cumbersome procedures in existing penetration test methods, and can still give guidance after a penetration test engineer changes some attack steps based on personal experience, offering considerable flexibility. They can be widely applied in the field of big data processing.

Description

Industrial control OT network multi-target penetration test method and system
Technical Field
The invention relates to the field of big data processing, in particular to an industrial control OT network multi-target penetration test method and system.
Background
An industrial control OT (Operational Technology) network manages industrial infrastructure and connects control equipment with controlled equipment, for example in industrial control systems (ICS) and supervisory control and data acquisition (SCADA) systems. Traditional industrial control systems were standalone systems isolated from the Internet. In recent years, as industrial control systems have become networked, their closed and proprietary nature has been broken, and numerous attack cases show that network attacks can penetrate from the IT network into the OT network.
Penetration testing is a typical analysis technique that assists with vulnerability remediation, security hardening, and the like from an attacker's perspective. Current automated penetration tests of a network output an optimal attack path, helping defenders prioritize fixing the vulnerabilities along that path. Penetration testing requires considerable expertise, and traditional tests must be executed by experts; because of the high labor cost, a system can only be assessed periodically, making it difficult to track its state through frequent testing. The industry is therefore investing substantial resources in developing automated penetration test tools that assist testing and reduce the experts' workload.
Automated penetration testing based on reinforcement learning is well suited to penetration test scenarios and widely used, because reinforcement learning accumulates experience while interacting with its environment. Most existing work targets IT network penetration testing and lets reinforcement learning interact with the real environment, which requires trying a large number of attack modules during training and is no more efficient than exhaustive execution. Some prior art uses the MulVAL tool to obtain an attack graph and then extracts part of its nodes as the Markov model for reinforcement learning training. A penetration test, however, may have multiple possible final objectives, and evaluating the merit of each attack step must consider the influence of all of them.
Existing methods define too many target-independent positive rewards, so the optimal path is weakly goal-directed and tends to detour to collect positive rewards. They also output only an optimal attack path rather than an attack strategy, so interference during the attack is not considered: when assisting an expert in a penetration test, the expert must follow the system's optimal path strictly, and adjustments based on personal experience and preference cannot be accommodated.
Disclosure of Invention
The invention aims to provide an industrial control OT network multi-target penetration test method and system that solve the technical problems of single test targets and cumbersome procedures in traditional penetration test methods.
A first aspect of an embodiment of the present application provides an industrial control OT network multi-target penetration test method, which includes:
collecting the network information to be tested, and generating an attack graph;
abstracting a Markov model from the attack graph and giving state transition rewards;
adopting a reinforcement learning algorithm to interact with the Markov model to obtain an optimal attack strategy;
and calling a penetration test tool to verify the optimal attack strategy.
Preferably, the network information under test is collected in the following manner:
scanning information of the network system to be tested;
setting an industrial control network penetration test target;
and collecting and establishing a vulnerability data set for data storage.
Preferably, the attack graph is generated, specifically by the following way:
The relations among vulnerabilities are deduced from the connectivity between the hosts on the network and the pre- and post-conditions of each vulnerability to form an attack graph; the attack graph is obtained by feeding the collected host configuration information to the MulVAL tool.
Preferably, a Markov model is abstracted from the attack graph and state transition rewards are assigned, specifically as follows:
All nodes of the attack graph serve as states of the Markov process; the nodes carry different rewards, representing the reward obtained when the corresponding state is entered in the Markov process.
Preferably, the reinforcement learning algorithm is adopted to interact with the Markov model, and the method is realized in the following way:
each screen starts, an arbitrary initial state s is selected, Q values corresponding to all actions under the s are calculated by a predicted value network, an action a corresponding to the largest Q value is selected and applied to an environment MDP model, the specific process is to inquire a Markov model diagram, if the states are connected by directed edges, the next state s 'is successfully returned, the number of the next state s' is the same as that of the state a, and an rewarding matrix is inquired to obtain rewards r; if no directed edges are connected, returning to the next state s ', the number of which is the same as s, inquiring the bonus matrix to obtain the bonus r, and placing the experience (s, a, r, s') in a playback buffer;
after each interaction with the environment, the neural network is trained several times. The replay buffer is a fixed-size storage area for experiences; during training, experiences are sampled randomly from the buffer. The state s and action a are passed to the prediction value network, which outputs Q(s, a, w) under its current parameters w; the buffer passes s' to the target value network, which outputs the maximum Q value max Q(s', a', w'); and the buffer passes r directly to the loss function. r + max Q(s', a', w') - Q(s, a, w) is the error used to train the prediction value network, and its gradient guides the parameter updates. After a number of interactions, the prediction network's parameters are copied to the target network. An episode ends when the target node of the Markov model is entered or the maximum step count is exceeded, and the trained network parameters are obtained when training finishes.
Preferably, the optimal attack strategy is obtained, in particular by:
inputting a state to the neural network obtained after reinforcement learning training, outputting values of all actions in the state, namely Q values, and searching actions corresponding to the maximum Q value by an intelligent body to obtain the optimal next action in the current state; starting from the initial state, if the method is always carried out according to an optimal strategy, an optimal path can be obtained; if the optimal path is deviated, the attack strategy provides guidance for the next attack.
Preferably, a penetration test tool is invoked to verify the optimal attack strategy, specifically by the following means:
according to the optimal path obtained by the intelligent agent, the information provided by the attack graph is used for guiding the calling of the penetration testing tool to carry out the actual penetration test.
Preferably, the penetration test tool is Metasploit, Burp Suite, or W3af.
A second aspect of the present application provides an industrial control OT network multi-target penetration test system, comprising:
Attack graph generation module: used to collect network information under test and generate an attack graph;
Attack graph conversion module: used to abstract a Markov model from the attack graph and assign state transition rewards;
Interaction module: used to interact with the Markov model via a reinforcement learning algorithm to obtain an optimal attack strategy;
Verification module: used to invoke a penetration test tool and verify the optimal attack strategy.
In this application, information is collected to obtain an attack graph, a Markov model is abstracted from the attack graph, a reinforcement learning method derives the optimal strategy for the Markov model, and finally an attack tool is invoked for attack verification. An application mode of the attack strategy is provided, and the cost-based reward setting ensures that the optimal path is goal-directed. The method suits multi-target penetration testing of industrial control networks, helping penetration test engineers cross the IT network into the OT network. Guided by the targets, it searches for the optimal attack strategy that minimizes attack cost and gives the path that reaches a target most easily; it can also give renewed guidance after a penetration test engineer changes some attack steps based on personal experience, offering great flexibility.
Drawings
Fig. 1 is a schematic flow chart of a multi-objective penetration test method for an industrial control OT network according to an embodiment of the present application;
fig. 2 is a topology diagram of an industrial control OT network multi-objective penetration test method according to an embodiment of the present application;
FIG. 3 is a diagram of an example network system under test according to one embodiment of the present application;
FIG. 4 is a diagram of an attack example provided in an embodiment of the present application;
FIG. 5 is a block flow diagram of a reinforcement learning algorithm according to an embodiment of the present application;
fig. 6 is a state transition diagram of an optimization strategy according to an embodiment of the present application.
Detailed Description
To make the technical problems, technical solutions, and beneficial effects addressed by the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustration only and are not intended to limit the present application.
It should be noted that positional or orientational relationships indicated by terms such as "upper", "lower", "inner", "outer", "top", and "bottom" are based on the orientations shown in the drawings, are used merely for convenience and simplicity of description, and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
Referring to fig. 1, a flow chart of the industrial control OT network multi-target penetration test method provided in an embodiment of the present application is shown; for convenience of explanation, only the parts relevant to this embodiment are shown. The details are as follows:
in one embodiment, as shown in fig. 2, which is a topological diagram of the method for testing the OT network multi-objective penetration test, for the network to be tested, an attack graph is generated by MulVAL, then the attack graph is converted into a markov process (MDP) and a state conversion reward is given, an agent interacts with the MDP by adopting a reinforcement learning algorithm, trains to obtain an optimal attack strategy, and finally the agent invokes an attack tool to perform penetration test to verify the optimal attack strategy. The method comprises the following specific steps:
s101, collecting measured network information and generating an attack graph.
Specifically, the network system under test is scanned. The system under test is the IT+OT network shown in fig. 3: the IT network comprises enterprise servers, office computers, engineer stations, and the like, and connects to the external Ethernet and the internal OT network; the OT network connects controllers, monitors, controlled devices, sensors, and the like. The penetration test method reaches the OT network through the IT network.
This embodiment is intended for use by security experts who are authorized to collect information fully. The expert scans within each subnet and checks the configuration of each host and firewall to obtain inter-host connectivity information, expressed in the hacl syntax required by MulVAL; for example, hacl(1, 2, tcp, 80) means that host 1 can access port 80 of host 2 over TCP.
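As a minimal illustration of this step, the sketch below emits MulVAL-style Datalog input facts from collected scan data. The hosts, ports, service names, and property terms are hypothetical examples, and the exact predicate arities should be checked against the MulVAL documentation:

```python
# Sketch: emit MulVAL-style Datalog input facts from collected scan data.
# Hosts, ports, service names, and property terms below are hypothetical.

connectivity = [
    (1, 2, "tcp", 80),       # host 1 can reach port 80 of host 2 over TCP
    (2, 3, "tcp", 1433),
]
vulnerabilities = [
    # (host, CVE id, vulnerable service, access vector, consequence)
    (2, "CVE-2020-14882", "httpService", "remoteExploit", "privEscalation"),
]

def mulval_facts(connectivity, vulnerabilities):
    lines = []
    for src, dst, proto, port in connectivity:
        lines.append(f"hacl({src}, {dst}, {proto}, {port}).")
    for host, cve, service, vector, effect in vulnerabilities:
        lines.append(f"vulExists({host}, '{cve}', {service}).")
        lines.append(f"vulProperty('{cve}', {vector}, {effect}).")
    return "\n".join(lines)

print(mulval_facts(connectivity, vulnerabilities))
```

The resulting text would be written to the MulVAL input file alongside the host configuration facts.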
Host system information, service information, and vulnerability information are collected with open source tools such as Nmap and Nessus, and vulnerability information is verified with on-host vulnerability search tools such as Windows Exploit Suggester and Linux Exploit Suggester. In this embodiment, the network information of server 1 of the system shown in fig. 3 is listed in Table 1.
Table 1 host information collection example
The industrial control network penetration test targets are set as follows. Once an industrial control device can be accessed, industrial control protocol messages can be forged to control it. Many industrial control networks currently rely on boundary-based protection, with firewalls placed where the boundary meets the IT network to restrict access; therefore any host that can directly access industrial control devices can serve as a target of the industrial control network penetration test. Target hosts can be determined by running scanning software on the hosts, checking network configuration information, and so on.
A vulnerability data set is collected and stored: to later generate the attack graph and assign costs to attack steps, a vulnerability data set must be built. Several public vulnerability databases, such as the NVD, provide CVE numbers and CVSS3 scoring data, including the base score and exploitability score. The vulnerability information collected in this embodiment is shown in Table 2.
Table 2 vulnerability information example
CVE number       Base score   Exploitability   Exploitation mode   Effect
CVE-2020-14882   9.8          Low              Remote              Privilege escalation
This embodiment uses the MulVAL tool to generate the attack graph. MulVAL is an open source attack graph generation tool based on the XSB logic engine; it deduces the relations among vulnerabilities from the connectivity between hosts and the pre- and post-conditions of each vulnerability to form an attack graph. The collected host configuration information, including each host's services, connectivity, and vulnerabilities, is input to MulVAL to obtain the attack graph, which is output as a directed graph of the form shown in FIG. 4.
As shown in FIG. 4, boxes in the attack graph are leaf nodes representing initial conditions, ellipses are inference rules, and diamonds are inference results. Input information is converted into leaf nodes, inference results are generated through the inference rules, and multiple inference steps connect many nodes to deduce the final result; the attack graph thus logically shows the possibilities for multi-step attacks. The inference rules consist of a series of RULEs: RULE1 is the local vulnerability exploitation rule, RULE2 the remote vulnerability exploitation rule, and RULE5 the multi-hop network access rule. The attack graph node descriptions of this embodiment are given in Table 3.
Table 3 attack graph example node illustration
S102, abstracting a Markov model from the attack graph and giving state transition rewards.
Specifically, all nodes of the attack graph serve as states of the Markov process. When attack graph node 1 has a directed edge pointing to node 2, the Markov process likewise has a state transition from state 1 to state 2. The action set consists of actions that attempt to transition to a given state, each action numbered after its target state. For example, taking action 2 in state 1 attempts a transition to state 2; since the attack graph has a directed edge from 1 to 2, action 2 succeeds and state 2 is entered. Taking action 3 in state 1 attempts a transition to state 3; since the attack graph has no edge from 1 to 3, the transition fails and the process remains in state 1.
Nodes of the attack graph carry different rewards, representing the reward obtained when the corresponding state is entered in the Markov process. The final target is arbitrary code execution on a target host: hosts that can directly access the industrial control network are set as target hosts, and the target nodes are all attack graph nodes containing "execCode" together with a target host name. Successfully entering such a node yields a reward of 100.
RULE1 (local exploit) and RULE2 (remote exploit) nodes are directly connected to "vulExists" nodes and represent vulnerability exploitation. For them, the CVSS rating and the rank of the corresponding Metasploit exploit module are queried according to the vulnerability information provided by the "vulExists" node, and a reward value is computed; it is negative, representing the cost to be paid at the node. The formula is r = -cost_complexity × cost_exploit, where cost_complexity comes from the attack complexity rating in the CVSS score (1 when the rating is Low, 2 when High), and cost_exploit reflects the quality of the exploit script: Metasploit is searched for modules matching the vulnerability, and the factor is assigned from the highest module rank found, according to Table 4.
Table 4 cost_exploit calculation table
RULE5 (multi-hop access) nodes are assigned a reward of -0.5, representing the cost of establishing proxy access; other nodes have reward 0, representing no cost. If a state transition fails and the process returns to the original state, a reward of -0.1 is given, representing the cost of an invalid action.
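The reward scheme above can be condensed into a small function. The exploit-rank factors below are illustrative stand-ins for Table 4 (which is not reproduced in the text), and the node encoding is hypothetical:

```python
# Sketch of the reward scheme: exploit nodes cost
# -cost_complexity * cost_exploit, multi-hop nodes cost -0.5,
# target (execCode) nodes yield +100, all other nodes 0.

COMPLEXITY = {"Low": 1.0, "High": 2.0}      # CVSS attack complexity factor
EXPLOIT_RANK = {"excellent": 1.0, "great": 1.5, "good": 2.0, "none": 3.0}
# illustrative factors for the best Metasploit module rank found

def node_reward(node):
    if node["type"] == "target":            # execCode node on a target host
        return 100.0
    if node["type"] == "exploit":           # RULE1 / RULE2 exploitation node
        return -COMPLEXITY[node["cvss_ac"]] * EXPLOIT_RANK[node["msf_rank"]]
    if node["type"] == "multihop":          # RULE5 multi-hop access node
        return -0.5
    return 0.0                              # all other nodes are free

print(node_reward({"type": "exploit", "cvss_ac": "Low", "msf_rank": "good"}))  # -2.0
```

Filling the Markov model's reward matrix then amounts to evaluating node_reward for every attack graph node.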
It should be noted that this specific reward/cost setting is only one embodiment; other reward settings are possible, and fine-tuning the values should still fall within the protection scope of the present invention.
S103, interacting with the Markov model by adopting a reinforcement learning algorithm to obtain an optimal attack strategy.
Specifically, the agent trains with a reinforcement learning algorithm, obtains a neural network mapping states to state-action values (Q values), and uses the network to make decisions.
Reinforcement learning is a learning method that interacts with an environment and maps environment states to action selections; it seeks a policy of actions for each environment state that maximizes the final cumulative reward. The DQN algorithm is a reinforcement learning algorithm based on state-action values, using a neural network to fit and store them.
The algorithm trains over multiple episodes, an episode being one run from an initial state to a terminal state. As shown in fig. 5, at the start of each episode an arbitrary initial state s is selected. The prediction value network computes the Q values of all actions in s, and the action a with the largest Q value is selected and applied to the environment (the MDP model). Concretely, the Markov model graph is queried: if a directed edge connects the states, the transition succeeds and the next state s', numbered the same as action a, is returned, and the reward matrix is queried for the reward r; if no directed edge exists, the returned next state s' equals s, the reward matrix is queried for the reward r, and the experience (s, a, r, s') is placed in the replay buffer;
after each interaction with the environment, the neural network is trained several times. The replay buffer is a fixed-size storage area for experiences; during training, experiences are sampled randomly from the buffer. The state s and action a are passed to the prediction value network, which outputs Q(s, a, w) under its current parameters w; the buffer passes s' to the target value network, which outputs the maximum Q value max Q(s', a', w'); and the buffer passes r directly to the loss function. r + max Q(s', a', w') - Q(s, a, w) is the error used to train the prediction value network, and its gradient guides the parameter updates. After a number of interactions, the prediction network's parameters are copied to the target network. An episode ends when the target node of the Markov model is entered or the maximum step count is exceeded, and the trained network parameters are obtained when training finishes.
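A dependency-free sketch of this training loop is given below. To stay self-contained, the Q "network" is simplified to a weight table (equivalent to a linear network over one-hot states), a standard discount factor is added even though the text's error expression omits it, and the three-node graph and all hyperparameters are hypothetical; the replay buffer, target-network copy, and TD error r + max Q(s', a', w') - Q(s, a, w) follow the description:

```python
import random
from collections import deque

random.seed(0)                      # deterministic for the example

edges = {(1, 2), (2, 3)}            # hypothetical attack-graph edges
rewards = {3: 100.0}                # node 3 plays the execCode target node
STATES, ACTIONS = [1, 2, 3], [1, 2, 3]
GOAL, MAX_STEPS = 3, 20
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.2   # discount, step size, exploration

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # prediction "network" w
Q_target = dict(Q)                                    # target "network" w'
buffer = deque(maxlen=1000)                           # replay buffer

def env_step(s, a):
    """Query the Markov model graph: succeed along a directed edge, else stay."""
    if (s, a) in edges:
        return a, rewards.get(a, 0.0)
    return s, -0.1

for episode in range(300):
    s = random.choice(STATES)
    for _ in range(MAX_STEPS):
        # epsilon-greedy action from the prediction network
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = env_step(s, a)
        buffer.append((s, a, r, s2))
        # train on a random minibatch: TD error r + g*max Q_target(s',a') - Q(s,a)
        for s_, a_, r_, s2_ in random.sample(list(buffer), min(8, len(buffer))):
            td_target = r_ + GAMMA * max(Q_target[(s2_, b)] for b in ACTIONS)
            Q[(s_, a_)] += ALPHA * (td_target - Q[(s_, a_)])
        s = s2
        if s == GOAL:                 # episode ends at the target node
            break
    if episode % 10 == 0:             # periodically copy prediction -> target
        Q_target = dict(Q)

print(max(ACTIONS, key=lambda act: Q[(1, act)]))   # best action in state 1
```

After training, the greedy action from state 1 points along the edge toward the rewarded target node.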
The optimal strategy outputs the next optimal action for an input current state, whereas the optimal path is the action sequence from the initial state to the target node. Obtaining an optimal strategy rather than merely an optimal path, as other methods do, is an advantage of reinforcement learning methods based on state-action values.
A state is input to the neural network obtained from reinforcement learning training, which outputs the value (Q value) of each action in that state; the agent selects the action with the largest Q value to obtain the optimal next action in the current state. Starting from the initial state and always acting according to the optimal strategy yields the optimal path; if the optimal path is deviated from, the attack strategy can still guide the next attack.
In an actual penetration test, testers do not follow the system's optimal path exactly: based on their own experience and preferences they may make adjustments and thus deviate from it. The strategy-based method recommends the next step from the actually observed state according to the optimal strategy, helping the penetration tester continue the test.
In the state transition diagram shown in fig. 6, double circles denote target nodes. Suppose S1 -> S2 -> S3 -> S4 is the optimal path given by the system, but in state S1 the tester, being familiar with the attack step of S5, abandons S2 and selects S5, leaving the optimal path. The system can still assist: given the state S5, it outputs the optimal next step S7.
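The re-planning behavior above amounts to a greedy argmax over the learned Q values from whichever state the tester actually reaches. A toy sketch, with a hypothetical Q table mirroring the fig. 6 scenario:

```python
# Sketch: recommend the next step from any observed state by a greedy
# argmax over learned state-action values. The Q table is hypothetical
# (optimal path via S2; the tester deviates to S5, where S7 is best).

Q = {
    ("S1", "S2"): 9.0, ("S1", "S5"): 7.0,
    ("S2", "S3"): 9.5,
    ("S5", "S7"): 8.0, ("S5", "S6"): 2.0,
}

def recommend(state):
    """Return the highest-valued next action available in `state`."""
    options = {a: q for (s, a), q in Q.items() if s == state}
    return max(options, key=options.get)

print(recommend("S1"))   # on the optimal path, the system suggests S2
print(recommend("S5"))   # after deviating to S5, it suggests S7
```

Because the recommendation is recomputed from the current state, the guidance survives any deviation from the precomputed optimal path.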
S104, calling a penetration test tool to verify the optimal attack strategy.
To verify the attack strategy and automate the penetration test, this embodiment invokes multiple penetration test tools, such as Metasploit, Burp Suite, or W3af, to launch automated attacks; Metasploit is used below to illustrate how the planned state actions are matched to attack tool invocations.
Metasploit (MSF) is a freely downloadable professional penetration test framework shipped with thousands of exploit scripts for known software vulnerabilities. The RULE nodes of the attack graph represent its inference rules and can also guide the actions taken when the penetration test tool is invoked. According to the optimal path obtained by the agent, the information provided by the attack graph guides the invocation of MSF for the actual penetration test; the mapping between RULE nodes and actions in this embodiment is shown in Table 5.
Table 5 mapping relationship between nodes and actions
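One way to realize such a mapping is to render each planned RULE-node action as an msfconsole resource script. The module path and option values below are hypothetical placeholders, not the actual contents of Table 5:

```python
# Sketch: render planned attack-graph actions as an msfconsole resource
# script. Module paths and options are hypothetical placeholders; a real
# run would look them up in a RULE-node-to-module table such as Table 5.

actions = [
    {"rule": "RULE2",                               # remote exploitation step
     "module": "exploit/multi/http/example_rce",    # hypothetical module path
     "options": {"RHOSTS": "10.0.0.2", "RPORT": "80"}},
]

def to_resource_script(actions):
    lines = []
    for act in actions:
        lines.append(f"use {act['module']}")
        for key, val in act["options"].items():
            lines.append(f"set {key} {val}")
        lines.append("run")
    return "\n".join(lines)

print(to_resource_script(actions))
```

The generated text could be saved as plan.rc and replayed with msfconsole's -r option, keeping the tool invocation driven by the planned strategy.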
The second aspect of the application provides an industrial control OT network multi-target penetration test system comprising an attack graph generation module, an attack graph conversion module, an interaction module, and a verification module.
Attack graph generation module: used to collect network information under test and generate an attack graph;
Attack graph conversion module: used to abstract a Markov model from the attack graph and assign state transition rewards;
Interaction module: used to interact with the Markov model via a reinforcement learning algorithm to obtain an optimal attack strategy;
Verification module: used to invoke a penetration test tool and verify the optimal attack strategy.
It should be noted that the industrial control OT network multi-target penetration test system of this embodiment is the system counterpart of the industrial control OT network multi-target penetration test method; for the specific implementation of the software in each module of the system, refer to the embodiments of figs. 1-6, which will not be described in detail again here.
The industrial control OT network multi-target penetration test method and system collect information to obtain an attack graph, abstract a Markov model from the attack graph, derive the Markov model's optimal strategy via reinforcement learning, and finally guide invocation of attack tools for attack verification; compared with existing methods suited to single-target penetration testing, this is closer to the multi-target scenario of an industrial control OT network. An application mode of the attack strategy is provided, and the cost-based reward setting ensures that the optimal path is goal-directed. The method suits multi-target penetration testing of industrial control networks, helps penetration test engineers cross the IT network to reach the OT network, searches, guided by the targets, for the optimal attack strategy that minimizes attack cost, and gives the path that reaches a target most easily; it can also give renewed guidance after the engineer changes some attack steps based on personal experience, offering great flexibility.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (6)

1. An industrial control OT network multi-target penetration test method, characterized by comprising the following steps:
collecting network information to be tested, and generating an attack graph:
deducing the relations among vulnerabilities from the connectivity among the hosts on the network and the precondition/postcondition relations of the vulnerabilities to form an attack graph, the attack graph being generated by the MulVAL tool from the collected host configuration information;
abstracting a Markov model from the attack graph and giving state transition rewards;
adopting a reinforcement learning algorithm to interact with the Markov model to obtain an optimal attack strategy:
at the start of each episode, selecting an arbitrary initial state s, calculating with the prediction value network the Q values of all actions in s, and selecting the action a with the largest Q value and applying it to the environment MDP model; specifically, the Markov model graph is queried: if s and the state numbered a are connected by a directed edge, the action succeeds and the next state s' with the same number as a is returned, and the reward matrix is queried to obtain the reward r; if no directed edge connects them, the next state s' returned is the same as s, and the reward matrix is queried to obtain the reward r; the experience (s, a, r, s') is then placed into the replay buffer;
after each interaction with the environment, the experience is placed into the replay buffer and the neural network is trained several times; the replay buffer is a fixed-size store of experiences; during training, an experience is sampled at random from the replay buffer, (s, a) is fed to the prediction value network, which outputs Q(s, a, w) according to its current parameters w, s' is fed to the target value network, which outputs the maximum Q value maxQ(s', a', w'), and r is passed directly to the loss function; maxQ(s', a', w') + r - Q(s, a, w) is the error term required for training the prediction value network, and its gradient guides the prediction value network to update its parameters; after a number of interactions, the parameters of the prediction value network are copied to the target value network; each episode ends when the Markov model target node is entered or the maximum number of steps is exceeded, after which the trained neural network parameters are obtained;
inputting a state to the neural network obtained after reinforcement learning training, which outputs the value, namely the Q value, of every action in that state; the agent searches for the action corresponding to the maximum Q value to obtain the optimal next action in the current state; starting from the initial state, if the optimal strategy is always followed, the optimal path is obtained; if the path deviates from the optimal path, the attack strategy still provides guidance for the next attack;
and calling a penetration test tool to verify the optimal attack strategy.
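The episode loop and prediction/target network update recited in claim 1 can be sketched as follows. This is an illustrative tabular approximation (one-hot state features make the "networks" simple weight matrices), with a toy graph, hypothetical rewards, and hyperparameters of my choosing — not the patented implementation:

```python
import random
from collections import deque
import numpy as np

# Hypothetical Markov model abstracted from an attack graph: state 0 is
# the foothold, state 3 the target node (reward +10, episode ends).
EDGES = {0: [1, 2], 1: [3], 2: [1], 3: []}
REWARD = {3: 10.0}
N, GAMMA, LR = 4, 0.9, 0.2

def step(s, a):
    """Query the model graph: follow the directed edge if it exists,
    otherwise the returned next state equals the current state."""
    s2 = a if a in EDGES[s] else s
    return s2, REWARD.get(s2, -1.0), s2 == 3

def train(episodes=400, eps=0.3, copy_every=25, seed=1):
    rng = random.Random(seed)
    buf = deque(maxlen=500)          # replay buffer of (s, a, r, s', done)
    w = np.zeros((N, N))             # prediction network parameters Q[a, s]
    w_t = w.copy()                   # target network parameters
    updates = 0
    for _ in range(episodes):
        s = 0
        for _ in range(10):          # cap on steps per episode
            # epsilon-greedy action from the prediction network
            a = rng.randrange(N) if rng.random() < eps else int(np.argmax(w[:, s]))
            s2, r, done = step(s, a)
            buf.append((s, a, r, s2, done))
            # train on a randomly sampled experience from the buffer
            bs, ba, br, bs2, bdone = rng.choice(buf)
            target = br if bdone else br + GAMMA * np.max(w_t[:, bs2])
            w[ba, bs] += LR * (target - w[ba, bs])   # TD-error gradient step
            updates += 1
            if updates % copy_every == 0:
                w_t = w.copy()       # copy prediction params to target net
            s = s2
            if done:
                break
    return w

w = train()
print(int(np.argmax(w[:, 0])))  # best first action from the foothold state
```

After training, reading off argmax Q per state reproduces the greedy-path extraction the claim describes: following the largest Q value from state 0 leads directly to the target node.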
2. The industrial control OT network multi-target penetration test method according to claim 1, wherein the collecting of the network information to be tested is realized by the following steps:
scanning information of the network system to be tested;
setting an industrial control network penetration test target;
and collecting vulnerability information and establishing a vulnerability data set for storage.
3. The industrial control OT network multi-target penetration test method according to claim 1, wherein a markov model is abstracted from the attack graph and a state transition reward is given, specifically by the following method:
all nodes of the attack graph serve as states of the Markov process, and each node carries a different reward, representing the reward obtained on entering that state in the Markov process.
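The abstraction in claim 3 — attack-graph nodes as Markov states, with a reward attached to entering each node — might be set up as follows. Node names and reward values here are hypothetical illustrations:

```python
import numpy as np

# Hypothetical attack-graph fragment: nodes become Markov states and a
# reward matrix R[s, s'] pays the reward of the node being entered.
nodes = ["internet", "hmi", "plc_target"]
edges = [("internet", "hmi"), ("hmi", "plc_target")]
node_reward = {"internet": -1.0, "hmi": -1.0, "plc_target": 10.0}

idx = {n: i for i, n in enumerate(nodes)}   # state numbering
n = len(nodes)

adj = np.zeros((n, n), dtype=bool)          # directed-edge adjacency
for u, v in edges:
    adj[idx[u], idx[v]] = True

R = np.full((n, n), -1.0)                   # no edge: stay put, pay a cost
for i in range(n):
    for j, dst in enumerate(nodes):
        if adj[i, j]:
            R[i, j] = node_reward[dst]      # reward for entering dst

print(R[idx["hmi"], idx["plc_target"]])
```

The default -1.0 entries mirror the failed-action case of claim 1, where an action without a directed edge leaves the state unchanged but still incurs a reward lookup.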
4. The industrial control OT network multi-target penetration test method according to claim 1, wherein the penetration test tool is invoked to verify the optimal attack strategy, specifically by the following means:
according to the optimal path obtained by the agent, the information provided by the attack graph guides the invocation of the penetration test tool to carry out the actual penetration test.
5. The method of claim 4, wherein the penetration test tool is Metasploit, Burp Suite or w3af.
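For the verification step, one common way to drive Metasploit non-interactively is a resource (.rc) script executed with `msfconsole -r`. The sketch below only generates such a script from an attack step; the module name, addresses, and file path are placeholders, and the actual msfconsole invocation is left commented out since it requires a local Metasploit installation:

```python
import subprocess
from pathlib import Path

def write_resource_script(module: str, options: dict, path: str = "verify.rc") -> str:
    """Emit a Metasploit resource script: use module, set options, run."""
    lines = [f"use {module}"]
    lines += [f"set {k} {v}" for k, v in options.items()]
    lines += ["run", "exit"]
    Path(path).write_text("\n".join(lines) + "\n")
    return path

def run_verification(rc_path: str) -> None:
    """Would execute the script non-interactively (needs Metasploit)."""
    # subprocess.run(["msfconsole", "-q", "-r", rc_path], check=True)
    pass

# Placeholder module and RFC 5737 documentation addresses, not values
# from the patent.
rc = write_resource_script(
    "exploit/windows/smb/ms17_010_eternalblue",
    {"RHOSTS": "192.0.2.10", "LHOST": "192.0.2.1"},
)
print(Path(rc).read_text())
```

A script like this lets each step of the agent's optimal path be mapped to a concrete, repeatable tool invocation, which is the guidance role claims 4 and 5 describe.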
6. An industrial control OT network multi-target penetration test system, comprising:
an attack graph generation module: configured to collect the network information to be tested and generate an attack graph:
deducing the relations among vulnerabilities from the connectivity among the hosts on the network and the precondition/postcondition relations of the vulnerabilities to form an attack graph, the attack graph being generated by the MulVAL tool from the collected host configuration information;
an attack graph conversion module: configured to abstract a Markov model from the attack graph and assign state transition rewards;
an interaction module: configured to interact with the Markov model using a reinforcement learning algorithm to obtain an optimal attack strategy:
at the start of each episode, selecting an arbitrary initial state s, calculating with the prediction value network the Q values of all actions in s, and selecting the action a with the largest Q value and applying it to the environment MDP model; specifically, the Markov model graph is queried: if s and the state numbered a are connected by a directed edge, the action succeeds and the next state s' with the same number as a is returned, and the reward matrix is queried to obtain the reward r; if no directed edge connects them, the next state s' returned is the same as s, and the reward matrix is queried to obtain the reward r; the experience (s, a, r, s') is then placed into the replay buffer;
after each interaction with the environment, the experience is placed into the replay buffer and the neural network is trained several times; the replay buffer is a fixed-size store of experiences; during training, an experience is sampled at random from the replay buffer, (s, a) is fed to the prediction value network, which outputs Q(s, a, w) according to its current parameters w, s' is fed to the target value network, which outputs the maximum Q value maxQ(s', a', w'), and r is passed directly to the loss function; maxQ(s', a', w') + r - Q(s, a, w) is the error term required for training the prediction value network, and its gradient guides the prediction value network to update its parameters; after a number of interactions, the parameters of the prediction value network are copied to the target value network; each episode ends when the Markov model target node is entered or the maximum number of steps is exceeded, after which the trained neural network parameters are obtained;
inputting a state to the neural network obtained after reinforcement learning training, which outputs the value, namely the Q value, of every action in that state; the agent searches for the action corresponding to the maximum Q value to obtain the optimal next action in the current state; starting from the initial state, if the optimal strategy is always followed, the optimal path is obtained; if the path deviates from the optimal path, the attack strategy still provides guidance for the next attack;
and a verification module: configured to invoke a penetration test tool to verify the optimal attack strategy.
CN202210789167.5A 2022-07-06 2022-07-06 Industrial control OT network multi-target penetration test method and system Active CN115174221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210789167.5A CN115174221B (en) 2022-07-06 2022-07-06 Industrial control OT network multi-target penetration test method and system

Publications (2)

Publication Number Publication Date
CN115174221A CN115174221A (en) 2022-10-11
CN115174221B true CN115174221B (en) 2023-07-21

Family

ID=83491055


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775147B (en) * 2023-06-08 2024-03-15 北京天融信网络安全技术有限公司 Executable file processing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109639515A (en) * 2019-02-16 2019-04-16 北京工业大学 Ddos attack detection method based on hidden Markov and Q study cooperation
CN111885033A (en) * 2020-07-14 2020-11-03 南京聚铭网络科技有限公司 Machine learning scene detection method and system based on multi-source safety detection framework
CN113114529A (en) * 2021-03-25 2021-07-13 清华大学 KPI (Key Performance indicator) abnormity detection method and device based on condition variation automatic encoder

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11038775B2 (en) * 2018-08-10 2021-06-15 Cisco Technology, Inc. Machine learning-based client selection and testing in a network assurance system
US20210352095A1 (en) * 2020-05-05 2021-11-11 U.S. Army Combat Capabilities Development Command, Army Research Labortary Cybersecurity resilience by integrating adversary and defender actions, deep learning, and graph thinking


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Bailing Wang et al., "Attack Graph-Based Quantitative Assessment for Industrial Control System Security", 2020 Chinese Automation Congress (CAC) *
Mohamed C. Ghanem et al., "Reinforcement Learning for Intelligent Penetration Testing", 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4) *
Zhou Shicheng et al., "Intelligent Penetration Testing Path Discovery Based on Deep Reinforcement Learning", Computer Science *
Cui Ying, Zhang Lijuan, Wu Hao, "Automatic Generation Method of Penetration Test Scheme Based on Attack Graph", Journal of Computer Applications (08) *

Similar Documents

Publication Publication Date Title
Sommestad et al. The cyber security modeling language: A tool for assessing the vulnerability of enterprise system architectures
US8738967B2 (en) System and method for grammar based test planning
CN111475817B (en) Data collection method of automatic penetration test system based on AI
CN111488587B (en) Automatic penetration test system based on AI
CN111475818B (en) Penetration attack method of automatic penetration test system based on AI
CN115174221B (en) Industrial control OT network multi-target penetration test method and system
CN111581645A (en) Iterative attack method of automatic penetration test system based on AI
CN104360938A (en) Fault confirmation method and system thereof
CN114077742B (en) Intelligent software vulnerability mining method and device
CN110704846A (en) Intelligent human-in-loop security vulnerability discovery method
Shmaryahu et al. Simulated penetration testing as contingent planning
CN115102705B (en) Automatic network security detection method based on deep reinforcement learning
Malzahn et al. Automated vulnerability testing via executable attack graphs
CN111488586B (en) Automatic permeation testing system post-permeation method based on AI
CN116582330A (en) Industrial control network automatic defense decision-making method oriented to part of unknown security states
CN117130906A (en) Fuzzy test method and device for network server in embedded equipment
Perumalla et al. Detecting Sensors and Inferring their Relations at Level-0 in Industrial Cyber-Physical Systems
Li et al. INNES: An intelligent network penetration testing model based on deep reinforcement learning
US20220245260A1 (en) Method for checking the security of a technical unit
Al Ghazo A framework for Cybersecurity of Supervisory Control and Data Acquisition (SCADA) Systems and Industrial Control Systems (ICS)
Kotenko et al. Analyzing network security using malefactor action graphs
CN113660241A (en) Automatic penetration testing method based on deep reinforcement learning
CN112860558A (en) Multi-interface automatic testing method and device based on topology discovery
CN113923007A (en) Safety penetration testing method and device, electronic equipment and storage medium
US20220067171A1 (en) Systems and methods for automated attack planning, analysis, and evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant