CN112468440A

CN112468440A - Knowledge graph-based industrial control system attack clue discovery system

Info

Publication number: CN112468440A
Application number: CN202011168061.0A
Authority: CN
Inventors: 赖英旭; 周昆; 刘静
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-10-28
Filing date: 2020-10-28
Publication date: 2021-03-09
Anticipated expiration: 2040-10-28
Also published as: CN112468440B

Abstract

The invention discloses a knowledge-graph-based attack clue discovery system for industrial control systems. Most industrial control systems were designed and developed years ago without adequate security considerations, so they inevitably contain many vulnerabilities that endanger system security and are likely to be exploited by intruders. An industrial control intrusion detection system can only detect attacks and cannot provide clues related to them, even though such clues are important for rapid recovery after an attack. To address this problem, the invention builds a knowledge base of industrial control system vulnerability exploitation and provides attack-related clues from the perspective of vulnerability exploitation. For constructing the knowledge graph, the invention provides a named entity recognition method for attack information based on a conditional random field, an entity alignment framework based on rules and character similarity calculation, and a knowledge inference algorithm based on type restriction and on the potential correct probability of negative triples computed with a pre-trained model. Based on the attack clues input by the user, the invention displays the knowledge graph visually as a force-directed graph, which makes the results more accurate and intuitive.

Description

Knowledge graph-based industrial control system attack clue discovery system

Technical Field

The invention belongs to the field of industrial control system network security, and particularly relates to an industrial control system attack clue discovery system based on a knowledge graph.

Background

An industrial control system consists of various automatic control components and process control components for real-time data acquisition and monitoring, and is used across a very wide range of industrial fields. Most industrial control systems in use today were designed and developed years ago without adequate security considerations, so they inevitably contain many vulnerabilities that endanger system security and may be exploited by intruders to cause security incidents. As the integration of informatization and industrialization advances, mature IT technology has broken the relative isolation of industrial control systems, the security problems and risks they face have become increasingly prominent, and network security incidents in industrial control systems seriously threaten life and property in industrial production. At present, network attacks on an industrial control system are usually discovered with intrusion detection technology. However, intrusion detection can only detect an attack and raise an alarm; it cannot provide related information such as the method used by the attack, the consequences it causes, or handling recommendations. Such information provides decision support for security personnel and plays an important role in quickly restoring the system after an attack and reducing the resulting losses.

The above analysis shows that when an attacker exploits a vulnerability to attack an industrial control system, ensuring system security requires not only detecting the attack but also, urgently, a method that provides clues related to the attack. To address this problem, the invention introduces the knowledge graph into the field of industrial control system security and constructs a knowledge graph of industrial control system vulnerability exploitation for discovering attack clues. The knowledge graph, formally proposed by Google, is a semantic network consisting of entities and relations. It aims to replace keyword-based search: semantic retrieval over the knowledge graph understands the user's input and returns more direct and systematic results. The knowledge graph of the industrial control system can therefore be used for reasoning based on the user's input, and the clue information related to the attack is displayed visually as a force-directed graph, which helps security personnel make correct security decisions.

Disclosure of Invention

When an industrial control system detects a network attack, the intrusion detection system cannot provide information related to the attack, yet security personnel need information such as the attack method, the consequences caused by the attack and handling recommendations in order to restore the system quickly. To solve this problem, the invention provides a knowledge-graph-based industrial control system attack clue discovery system, which gives security personnel more intuitive and accurate attack clues by constructing a knowledge graph of industrial control system vulnerability exploitation. A knowledge graph can be divided into a mode (schema) layer and a data layer: the mode layer stores refined knowledge, and the data layer stores the specific data. Depending on the order in which the two layers are built, knowledge graph construction is either top-down or bottom-up. The invention adopts the top-down approach: the mode layer is extracted from the scenario and the data and is then used to guide the construction of the data layer. Based on the network security state and other related information in a multi-source heterogeneous network environment, security-related elements are classified according to the characteristics of the different data sources into three dimensions: the basic dimension, the threat dimension and the vulnerability dimension. From these three dimensions and the specific industrial control scenario, the concept set of industrial control system attack clue discovery is obtained as C = { Vendor, Device, Vulnerability, Mean, Consequence }, where Vendor is the manufacturer, Device is the industrial control system equipment, Vulnerability is a vulnerability of the equipment, Mean is the attack method that exploits the vulnerability, and Consequence is the abnormal result caused by the attack.

The relation set between concepts is R = { product, have, show, cause, use, kind-of, lead-to }. These denote, respectively, the production relation between a manufacturer and a device, the ownership relation between a device and a vulnerability, the manifestation relation between a device and an attack anomaly, the causal relation between a vulnerability and an attack anomaly, the exploitation relation between an attack method and a vulnerability, the hierarchical relation between entities, and the causal relation between one attack result and another. The device and vulnerability data come from the major vulnerability databases and are extracted with a web crawler. The attack method and attack result information for each vulnerability comes from unstructured text such as device vendor bulletins and vulnerability descriptions.
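The mode layer described above can be sketched as a small typed schema. This is only an illustration: the spellings `Consequence`, `have` and `kind-of` are this sketch's reading of the concept and relation labels, the `Entity` class and `is_valid_triple` helper are hypothetical names, and treating `kind-of` as a relation between two entities of the same concept is an assumption.

```python
from dataclasses import dataclass

# Concept set C of the mode layer.
CONCEPTS = {"Vendor", "Device", "Vulnerability", "Mean", "Consequence"}

# Relation set R: relation -> (head concept, tail concept).
# (None, None) marks the hierarchical relation, handled separately below.
RELATIONS = {
    "product": ("Vendor", "Device"),
    "have": ("Device", "Vulnerability"),
    "show": ("Device", "Consequence"),
    "cause": ("Vulnerability", "Consequence"),
    "use": ("Mean", "Vulnerability"),
    "kind-of": (None, None),
    "lead-to": ("Consequence", "Consequence"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    concept: str


def is_valid_triple(head: Entity, relation: str, tail: Entity) -> bool:
    """Check a data-layer triple against the mode layer."""
    if relation not in RELATIONS:
        return False
    if head.concept not in CONCEPTS or tail.concept not in CONCEPTS:
        return False
    if relation == "kind-of":
        # Assumption: hierarchy links two entities of the same concept.
        return head.concept == tail.concept
    head_type, tail_type = RELATIONS[relation]
    return head.concept == head_type and tail.concept == tail_type
```

Checking every extracted triple against such a schema is one way the mode layer can guide construction of the data layer, since malformed triples are rejected before they reach the graph.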

After the mode layer is determined, attack information is extracted from the unstructured text. Attack methods and abnormal results vary greatly in length and contain many nested entities and aliases. To address this, the invention builds on a linear-chain conditional random field (CRF) model and proposes a feature combination that contains both entity features and context information: word features, part-of-speech features, entity boundary features, features of keywords before and after an entity, and entity high-frequency word features. This improves the completeness of entity recognition.

Since the extracted attack information comes from multiple data sources, "multiple words, one meaning" (different names for the same entity) is inevitable, and entity alignment is needed. When the corpus is limited and contains many complex, long entities whose occurrence frequencies differ greatly, simple string similarity calculation cannot handle abbreviations and synonyms. Analysis of the causes of "multiple words, one meaning" shows that English name variations fall roughly into the following five types:

1) The letters are identical in composition and order, and the names differ only in case and punctuation, such as "email" and "E-mail".

2) Synonym substitution results in a different name, for example "temperature increment" and "temperature built-up".

3) Abbreviations cause the names to differ. English abbreviation rules vary and can be divided into:

a. The initials of each word constitute an abbreviation.

b. Prefixes of words constitute abbreviations.

c. The combination of the prefix and suffix of a word constitutes an abbreviation.

4) Misspellings cause the names to differ.

5) Others.

The invention provides a method that combines rules with aggregated similarity. Rules are first used to determine whether the "multiple words, one meaning" is caused by abbreviation or synonym replacement; if not, similarity calculation is used. The edit distance, Jaro-Winkler, ISUB and Jaccard similarities are calculated separately and aggregated with a sigmoid function into the final similarity; if the final similarity is greater than a threshold, the two names denote the same entity. The method improves the entity alignment effect to a certain extent.
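The three abbreviation rules (initials, word prefix, prefix plus suffix) could be checked roughly as follows. This is a minimal sketch rather than the rule set of the invention: all function names are hypothetical, and the prefix and prefix-plus-suffix tests are deliberately loose, so a real rule base would need further constraints to avoid false matches.

```python
import re


def _words(text: str) -> list:
    """Lower-case alphanumeric tokens of a name."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_initialism(abbr: str, phrase: str) -> bool:
    """Rule a: the initials of each word form the abbreviation."""
    words = _words(phrase)
    letters = "".join(_words(abbr))
    return len(words) > 1 and letters == "".join(w[0] for w in words)


def is_prefix_abbreviation(abbr: str, word: str) -> bool:
    """Rule b: a proper prefix of the word forms the abbreviation."""
    a, w = abbr.lower().rstrip("."), word.lower()
    return 0 < len(a) < len(w) and w.startswith(a)


def is_prefix_suffix_abbreviation(abbr: str, word: str) -> bool:
    """Rule c: a prefix followed by a suffix of the word forms the abbreviation."""
    a, w = abbr.lower().rstrip("."), word.lower()
    if not 1 < len(a) < len(w):
        return False
    return any(w.startswith(a[:i]) and w.endswith(a[i:]) for i in range(1, len(a)))


def is_abbreviation(name_a: str, name_b: str) -> bool:
    """True if the shorter name abbreviates the longer one under any rule."""
    short, long_ = sorted((name_a, name_b), key=len)
    return (
        is_initialism(short, long_)
        or is_prefix_abbreviation(short, long_)
        or is_prefix_suffix_abbreviation(short, long_)
    )
```

For example, `is_abbreviation("DoS", "Denial of Service")` matches rule a and `is_abbreviation("intl", "international")` matches rule c, so neither pair would need to go on to the similarity calculation.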

After entity alignment, a basic knowledge graph has been formed, but some relations are still missing and must be completed by reasoning. The translation model TransE is often used for knowledge reasoning. However, TransE generates negative triples by random replacement, so the negative triples may include positive triples and low-quality negative triples, which degrades the reasoning effect of the model. The invention introduces the concept of the potential correct probability of a negative triple: the negative triples generated by random replacement are scored by their potential correct probability, and negative triples with different scores receive different training weights. This reduces the training weight of the positive triples and low-quality negative triples hidden among the negative triples and thereby improves the reasoning effect of the model. The potential correct probability is calculated with a pre-trained TransE model, and the formula is defined as:

where $f(h', r, t')$ is the score of the negative triple under the pre-trained TransE model. The TransE model is then retrained with the negative triples weighted by their potential correct probability, and the retrained model serves as the final inference model.

After the knowledge graph of industrial control system vulnerability exploitation is constructed, it is combined with the intrusion detection system of the industrial control system. When the intrusion detection system finds an attack, information such as the name of the attacked device and its vendor is used for knowledge reasoning in the knowledge graph to find the clues related to the attack. The clues are displayed visually with ECharts, a visualization framework from Baidu, so that security personnel can directly obtain information such as the vulnerabilities related to the attack, the exploitation method, the attack result and protection suggestions. This is of significant value for rapid recovery of the system.

Drawings

FIG. 1 is a schematic diagram of the general construction of a knowledge graph according to the present invention.

FIG. 2 is a schematic diagram of named entity recognition in accordance with the present invention.

FIG. 3 is a schematic diagram of the alignment of the entities of the present invention.

FIG. 4 is a schematic diagram of the knowledge inference of the present invention.

FIG. 5 is an example of knowledge-graph attack cue discovery constructed based on the present invention.

Detailed Description

The present invention will be described in detail below with reference to specific embodiments shown in the drawings.

Fig. 1 is the overall structure diagram of the knowledge graph construction of the invention. As shown in fig. 1, to obtain device, vulnerability and vendor information, a web crawler fetches and parses the web pages of each vulnerability database, and the device, vulnerability and vendor entities are merged and deduplicated to form the knowledge graph. The entity attributes include the vulnerability name, release date, threat level, CVE number, description, vulnerability patch, vulnerability type and vulnerability references. For unstructured vulnerability bulletins, vulnerability descriptions and similar text, named entity recognition based on a linear-chain conditional random field is used to extract attack methods and attack consequences. Entities extracted from multiple data sources exhibit "multiple words, one meaning"; the reasons why attack method and attack consequence entities exhibit it are analyzed, and an entity alignment framework is provided to align the entities.

Relations among many entities are missing during construction of the knowledge graph, and the purpose of knowledge reasoning is to complete them. The translation model TransE generates negative triples by random replacement, which also produces positive triples and meaningless triples. To address this, the invention builds on TransE and provides a knowledge inference algorithm based on the potential correct probability of negative triples computed with a pre-trained model. Combined with entity type restriction, the algorithm effectively improves the effect of knowledge inference.

Fig. 2 is a schematic diagram of a named entity recognition process based on a CRF model, as shown in fig. 2, including:

Attack information is extracted from the unstructured text with a linear-chain conditional random field (CRF) model. The text is first labeled with the BIOES labeling scheme and then divided into a training set and a test set at a ratio of 7:3. The features used by the invention are word features, part-of-speech features, entity boundary features, features of keywords before and after an entity, and entity high-frequency word features. The keywords before and after an entity are words that frequently appear before or after attack method and attack consequence entities; for example, a noun phrase following "by" or "using" generally denotes an attack method. The entity high-frequency words are words that occur frequently inside entities, and they act as triggers for recognition by the model. The training corpus is preprocessed and its features are extracted to train the CRF model iteratively; the test corpus is likewise preprocessed and its features extracted to verify the CRF model. Accuracy and recall serve as the evaluation criteria, and the best-performing CRF model is selected for named entity recognition.
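The feature combination and BIOES labeling described above might look like the following in code. This is a hedged sketch: it produces the kind of per-token feature dictionaries that many CRF toolkits accept, the trigger list contains "by" and "using" from the text plus a hypothetical "via", the high-frequency word list is invented for illustration, and the function names are not from the source.

```python
# Keywords that tend to precede an attack-method entity ("via" is hypothetical).
TRIGGERS_BEFORE = {"by", "using", "via"}
# Hypothetical examples of words that occur frequently inside entities.
HIGH_FREQ_ENTITY_WORDS = {"overflow", "injection", "denial", "execution"}


def token_features(sentence, i):
    """Features for token i of a sentence given as (word, part-of-speech) pairs."""
    word, pos = sentence[i]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),                      # word feature
        "pos": pos,                                      # part-of-speech feature
        "high_freq_entity_word": word.lower() in HIGH_FREQ_ENTITY_WORDS,
        "after_trigger": i > 0 and sentence[i - 1][0].lower() in TRIGGERS_BEFORE,
    }
    if i > 0:                                            # left context
        feats["-1:word.lower"] = sentence[i - 1][0].lower()
        feats["-1:pos"] = sentence[i - 1][1]
    else:
        feats["BOS"] = True
    if i < len(sentence) - 1:                            # right context
        feats["+1:word.lower"] = sentence[i + 1][0].lower()
        feats["+1:pos"] = sentence[i + 1][1]
    else:
        feats["EOS"] = True
    return feats


def bioes_tags(length, spans):
    """BIOES labels for entity spans given as (start, end_exclusive, label)."""
    tags = ["O"] * length
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for k in range(start + 1, end - 1):
                tags[k] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags
```

The BIOES tags mark entity boundaries explicitly (begin, inside, end, single), which is what lets a long, nested attack-method phrase be recovered as one complete entity.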

FIG. 3 is a flow diagram of the entity alignment framework, as shown in FIG. 3, including:

step 31, constructing rules for English abbreviations;

step 32, judging, according to the rules of step 31, whether one of the two input entities is an abbreviated form of the other;

step 33, if the two entities are not abbreviation variants of each other, normalizing the entities;

step 34, extracting the word stems of the entities;

step 35, removing the stop words contained in the entities;

step 36, replacing the words in the entities with synonyms one by one using WordNet, and judging whether the two entities are a case of "multiple words, one meaning" caused by synonym replacement;

step 37, if the variation is not caused by synonym replacement, calculating the similarities of the two entities, including edit distance, Jaro-Winkler, ISUB and Jaccard;

step 38, aggregating the four similarities into a comprehensive similarity using a sigmoid function;

step 39, judging whether the comprehensive similarity is greater than a threshold; if so, the two entities are the same entity, otherwise they are two different entities.
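The similarity branch of this flow (normalization, similarity calculation, sigmoid aggregation, threshold decision) can be sketched as follows. Several things here are assumptions: only edit-distance and Jaccard similarity are implemented, with Jaro-Winkler and ISUB left out for brevity; the source does not say how the sigmoid combines the scores, so a sigmoid over their mean is used; and the `steepness`, `midpoint` and `threshold` values are hypothetical.

```python
import math
import re


def normalise(name: str) -> str:
    """Lower-case, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the word sets of the two names."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def aggregate(scores, steepness=10.0, midpoint=0.5):
    """Assumed aggregation: a sigmoid applied to the mean of the scores."""
    mean = sum(scores) / len(scores)
    return 1.0 / (1.0 + math.exp(-steepness * (mean - midpoint)))


def same_entity(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    """Decide whether two names denote the same entity."""
    a, b = normalise(name_a), normalise(name_b)
    scores = [edit_similarity(a, b), jaccard_similarity(a, b)]
    return aggregate(scores) > threshold
```

With these settings `same_entity("email", "E-mail")` is true because normalization makes the two names identical, while `same_entity("buffer overflow", "command injection")` is false. A sigmoid sharpens the decision around the midpoint, which is one plausible reason to aggregate this way instead of using a plain average.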

FIG. 4 is a flow chart of knowledge inference of the present invention, as shown in FIG. 4, comprising:

In the process of establishing the knowledge graph, many relations are missing, and knowledge reasoning is needed to complete them. Translation models based on representation learning overcome the low computational efficiency and data sparsity of symbolic triple representations. They are trained with a maximum-margin method, that is, the optimization objective separates positive samples from negative samples. Most translation models generate negative samples randomly, so the negative samples may include positive samples as well as many low-quality negative samples; for example, (Beijing, Lound, Banana) is a very poor negative sample. Based on the idea of the potential correct probability of negative triples, the invention provides a method for calculating this probability with a pre-trained model and combines it with type restriction to correct the model bias caused by such samples. The smaller the difference between a negative triple and a positive triple, the larger the potential correct probability of the negative triple and the smaller its score under the objective function. The score of a negative triple on the pre-trained TransE model can therefore be used as a measure of its potential correct probability.

The score function of the translation model is $f(h, r, t) = \|h + r - t\|_{1/2}$, where $h$, $r$ and $t$ are the vector representations of the head entity, the relation and the tail entity, respectively, and the subscript $1/2$ indicates that either the L1 norm or the L2 norm is used. The score of a positive triple should be close to 0, and the larger the score of a negative triple, the better. The potential correct probability of a negative triple based on the pre-trained model is defined as:

where $T$ is the set of negative samples generated by random replacement, and $f(h', r, t')$ is the score of the negative triple in the pre-trained model. The potential correct probability of the negative samples is added to the calculation, and the objective function is:

where $S$ is the set of positive triples, $S'$ is the set of negative triples, $\delta$ is the potential correct probability of a negative triple, $\lambda$ is a hyper-parameter of the model, and $(h, r, t)$ and $(h', r, t')$ denote a positive triple and a negative triple, respectively.
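For reference, the translation-model score, the L1 or L2 norm of h + r - t, can be computed as in the following minimal sketch. Plain Python lists stand in for learned embeddings, and the function name `transe_score` is an illustrative choice.

```python
import math


def transe_score(h, r, t, norm=1):
    """TransE score ||h + r - t|| under the L1 (norm=1) or L2 (norm=2) norm."""
    diffs = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diffs)
    if norm == 2:
        return math.sqrt(sum(d * d for d in diffs))
    raise ValueError("norm must be 1 or 2")
```

A triple whose tail embedding equals head plus relation scores 0, matching the statement that positive triples should score close to 0, while an implausible triple scores higher.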

The whole process can be summarized as follows. First, negative triples are generated by randomly replacing entities of triples in the knowledge graph and are used to pre-train a TransE model. The pre-trained model is then used to calculate the potential correct probability of the negative triples, and the meaningless negative triples mixed in among them are removed through type restriction. Finally, a TransE model is trained on the data containing the negative triples with their potential correct probabilities, and this model is used as the knowledge inference model.
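The type-restricted replacement step in this process could be sketched as below. This covers only negative-triple generation, not the probability weighting or the training loop. The function name, the 50/50 choice between replacing head and tail, and the extra filter that skips triples already known to be positive are assumptions of this sketch, and the entity names in the usage note are illustrative.

```python
import random


def corrupt_with_type_limit(triple, entity_types, known_triples, rng):
    """Replace the head or tail with another entity of the same type.

    Returns a negative triple, or None if no suitable replacement exists.
    """
    h, r, t = triple
    replace_head = rng.random() < 0.5
    target = h if replace_head else t
    # Type restriction: only entities of the same concept are candidates,
    # which rules out meaningless triples such as a vendor in a device slot.
    candidates = [
        e for e, typ in entity_types.items()
        if typ == entity_types[target] and e != target
    ]
    rng.shuffle(candidates)
    for e in candidates:
        negative = (e, r, t) if replace_head else (h, r, e)
        if negative not in known_triples:  # skip triples known to be positive
            return negative
    return None
```

For a graph with two devices and two vulnerabilities, corrupting `("FL SWITCH", "have", "CVE-A")` can only yield another device as head or another vulnerability as tail, never a mixed-type triple.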

After named entity recognition, entity alignment and knowledge reasoning, a relatively complete knowledge graph of industrial control system vulnerability exploitation is obtained. An attack clue discovery system is then built on it, which uses the knowledge graph to display the clues related to an attack as a force-directed graph according to the user's input. A specific example follows:

Phoenix Contact is a German provider of industrial automation, connectivity and interface solutions whose products are mainly used in critical infrastructure sectors such as communications, critical manufacturing and information technology. The FL SWITCH it produces has several exploitable vulnerabilities. When a denial of service attack on the equipment is detected, the specific equipment is located through the vendor "Phoenix Contact" and the device "FL SWITCH". A depth-first traversal starting from this point finds the vulnerabilities of the equipment, the abnormal results they may cause, and the exploitation methods that lead to those results. The entities and relations on these paths are displayed as a force-directed graph, as in fig. 5, and hovering the mouse over an entity or relation shows its detailed attribute information. The entities are numbered for ease of reference. Path 1->5->12->4 shows that the attacker exploits the buffer overflow vulnerability of the FL SWITCH device (CVE-2018-10728) by crafting the cookie information of a GET request, thereby causing a denial of service attack. Path 1->10->2->12->4 may also cause a device buffer overflow. Different abnormal results may arise at different stages of attack execution; for example, the buffer overflow can lead to remote code execution. Entities 8 and 14 are other vulnerabilities of the device. Path 1->6->7->8->9 shows that the attacker exploits a vulnerability by command injection (CVE-2018-10730), obtains permission to execute system commands in the privilege escalation stage, and can update the firmware of the device in preparation for further expanding the impact of the attack. Path 1->13->14->15 shows that a device vulnerability (CVE-2018-10729) causes information leakage, allowing an attacker to read the configuration file of the device. Once the cause of the attack is known, security personnel can obtain a mitigation and a patch for the vulnerability from its "refer" and "patch" attributes and thereby stop the attack. Displaying the data as a force-directed graph makes it more intuitive, presents clue information in real time when an attack occurs, and provides decision support for security personnel.
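The depth-first traversal used in this example can be sketched on a toy graph. The graph below is a hypothetical fragment loosely modeled on the example and does not reproduce fig. 5; the adjacency-list layout, the relation labels and the function name `find_clue_paths` are assumptions.

```python
def find_clue_paths(graph, start):
    """Depth-first traversal returning every path from start to a dead end.

    graph maps an entity to a list of (relation, next_entity) edges.
    Each path alternates entities and relations: [e0, r0, e1, r1, e2, ...].
    """
    paths = []

    def dfs(node, path, visited):
        edges = [(rel, nxt) for rel, nxt in graph.get(node, []) if nxt not in visited]
        if not edges:  # no unvisited neighbour: the path is complete
            paths.append(path)
            return
        for rel, nxt in edges:
            dfs(nxt, path + [rel, nxt], visited | {nxt})

    dfs(start, [start], {start})
    return paths


# Hypothetical fragment of a vulnerability-exploitation knowledge graph.
example_graph = {
    "Phoenix Contact": [("product", "FL SWITCH")],
    "FL SWITCH": [("have", "CVE-2018-10728"), ("have", "CVE-2018-10729")],
    "CVE-2018-10728": [("cause", "buffer overflow")],
    "buffer overflow": [("lead-to", "denial of service")],
    "CVE-2018-10729": [("cause", "information leakage")],
}

clue_paths = find_clue_paths(example_graph, "Phoenix Contact")
```

Each returned path is one candidate attack clue chain from vendor to consequence, and the nodes and edges on the paths are what a force-directed view would render.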

It should be understood that although this description is organized by embodiments, not every embodiment contains only one independent technical solution. The description is written this way for clarity only; those skilled in the art should treat the description as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention should be included within the scope of the present invention.

Claims

1. A knowledge-graph-based industrial control system attack clue discovery system, characterized in that: attack clues are provided from the perspective of vulnerability exploitation by constructing a knowledge graph of industrial control system vulnerability exploitation. The construction of the knowledge graph of industrial control system vulnerability exploitation comprises the following steps:

1) Constructing the knowledge graph mode layer: the entities to be extracted and the relations between them are determined according to the application scenario, and the mode layer guides the construction of the data layer.

2) Using a web crawler to obtain vulnerability and device data from the NVD, CNVD and CNNVD vulnerability databases, obtaining the related vulnerability descriptions, and obtaining vendor bulletins by following the vendor bulletin links.

3) The vulnerability descriptions and vendor bulletins obtained in step 2) are unstructured text data, from which attack methods and attack result information are extracted using a linear-chain conditional random field.

4) Information extracted from multiple data sources exhibits "multiple words, one meaning", so an entity alignment framework based on rules and similarity calculation is used for entity alignment.

5) The constructed knowledge graph inevitably has missing entity relations, which are completed using a knowledge inference algorithm based on the potential correct probability of negative samples computed with a pre-trained model.

2. The knowledge-graph-based industrial control system attack clue discovery system according to claim 1, wherein: most industrial control systems currently in use were designed and developed years ago without adequate security considerations, so they inevitably contain many vulnerabilities that endanger system security and may be exploited by intruders to cause security incidents. Current intrusion detection technology for industrial control systems cannot provide attack clue information. The invention therefore introduces the knowledge graph into the field of industrial control system attack clue discovery and uses the ability of semantic retrieval over the knowledge graph to understand user input, providing intuitive and accurate attack clues for security personnel.

3. The knowledge-graph-based industrial control system attack clue discovery system according to claim 1, wherein the knowledge graph mode layer constructed in step 1) is as follows: by combining three different dimensions with the specific industrial control scenario, the invention obtains the concept set of industrial control system attack clue discovery C = { Vendor, Device, Vulnerability, Mean, Consequence }, where Vendor is the manufacturer, Device is the industrial control system equipment, Vulnerability is a vulnerability of the equipment, Mean is the attack method that exploits the vulnerability, and Consequence is the abnormal result caused by the attack. The relation set between concepts is R = { product, have, show, cause, use, kind-of, lead-to }, denoting, respectively, the production relation between a manufacturer and a device, the ownership relation between a device and a vulnerability, the manifestation relation between a device and an attack anomaly, the causal relation between a vulnerability and an attack anomaly, the exploitation relation between an attack method and a vulnerability, the hierarchical relation between entities, and the causal relation between one attack result and another.

4. The knowledge-graph-based industrial control system attack clue discovery system according to claim 1, wherein step 3) uses a named entity recognition method for attack methods and attack results based on a conditional random field. Attack methods and attack results vary greatly in length and contain many nested entities and aliases. To ensure the completeness of the extracted entities, the invention introduces the context features of the entities and determines the optimal feature combination:

1) Word features. Each word produced by segmenting the text serves as a feature, which reflects the basic information of the text relatively completely.

2) Part-of-speech features. During word segmentation, each word is tagged with its part of speech; more than 20 part-of-speech tags are used, including verbs, nouns and prepositions.

3) Entity boundary features. The corpus is labeled with the BIOES labeling scheme.

4) Features of keywords before and after an entity. Attack method and attack result entities generally appear before or after certain keywords, which can serve as features for entity recognition.

5) Entity high-frequency word features. Certain words occur with high probability in many attack method and attack result entities, and these words act as triggers for recognition.

5. The knowledge-graph-based industrial control system attack clue discovery system according to claim 1, wherein step 4) uses an entity alignment framework. By analyzing the attack method and attack result entities, the framework performs targeted entity alignment for the "multiple words, one meaning" caused by abbreviation, synonym replacement, spelling errors, symbols and similar reasons. The process comprises the following steps:

step 31, constructing rules for English abbreviations;

step 34, extracting the word stems of the entities;

step 35, removing the stop words contained in the entities;

step 37, if the variation is not caused by synonym replacement, calculating the similarities of the two entities, including edit distance, Jaro-Winkler, ISUB and Jaccard;

step 39, judging whether the comprehensive similarity is greater than a threshold; if so, the two entities are the same entity, otherwise they are two different entities.

6. The knowledge-graph-based industrial control system attack clue discovery system according to claim 1, wherein step 5) uses a knowledge inference method. A translation model generates negative samples by random replacement, so positive samples and meaningless samples may be mixed in among the negative samples. To overcome this defect, the invention uses a knowledge inference algorithm based on type restriction and on the potential correct probability of negative triples computed with a pre-trained model, which improves the inference effect of the model. The specific process is as follows: first, negative triples are generated by randomly replacing entities of triples in the knowledge graph and are used to pre-train a TransE model; the pre-trained model is then used to calculate the potential correct probability of the negative triples, and the meaningless negative triples mixed in among them are removed through type restriction. A TransE model is then trained on the data containing the negative triples with their potential correct probabilities and is used as the knowledge inference model.

The score function of the translation model is $f(h, r, t) = \|h + r - t\|_{1/2}$, where $h$, $r$ and $t$ are the vector representations of the head entity, the relation and the tail entity, respectively, and the subscript $1/2$ indicates that either the L1 norm or the L2 norm is used. The score function is used to define the potential correct probability of a negative triple as:

After each negative triple has been assigned its potential correct probability, the TransE model is trained again with the potential correct probability of the negative samples added to the calculation, and the objective function is:

where $S$ is the set of positive triples, $S'$ is the set of negative triples, $\delta$ is the potential correct probability of a negative triple, and $\lambda$ is a hyper-parameter of the model. The TransE model trained in this way is used as the knowledge inference model and shows a clear improvement over the traditional TransE model.