CN113282759A

CN113282759A - Network security knowledge graph generation method based on threat information

Info

Publication number: CN113282759A
Application number: CN202110439459.1A
Authority: CN
Inventors: 李桐; 刘一涛; 刘刚; 王刚; 赵桐; 周小明; 宋进良; 姚羽; 刘扬; 王磊; 李广翱; 陈得丰; 刘莹; 杨智斌; 耿洪碧; 杨巍; 任帅; 陈剑; 李欢; 张彬
Original assignee: State Grid Liaoning Electric Power Co Ltd; Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Current assignee: State Grid Liaoning Electric Power Co Ltd; Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2021-08-20
Anticipated expiration: 2041-04-23
Also published as: CN113282759B

Abstract

The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence. The method comprises the following steps: efficient distributed threat intelligence data collection; producing a network security threat intelligence data set through a distributed threat intelligence crawling system; improving the data quality of the network security threat intelligence; performing network security entity identification on the produced data set; extracting network security entity relationships; and organizing the data. A large number of experiments verify that the proposed threat intelligence data quality improvement algorithm, and the knowledge graph generated by entity identification and entity relationship extraction from the intelligence text, are of markedly improved quality, and that the method has good local network weakness visualization capability and attack prediction analysis capability.

Description

Network security knowledge graph generation method based on threat information

Technical Field

The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence.

Background

With the rapid development of network technologies, a great number of network technologies are introduced into various industries to improve productivity, which is accompanied by a problem of network security. With the increasing complexity of network security situation, dynamic defense of network security driven by threat intelligence becomes the focus of attention in the industry. The threat intelligence has the characteristics of rich data content, high accuracy and strong real-time performance, and can reflect the attack chain of the whole attack event, so the threat intelligence has extremely high application and analysis values.

The knowledge graph is used as a comprehensive data integration and organization method, attack information can be effectively extracted from massive threat information, and complex behaviors such as reasoning analysis and attack semantic association on the attack information data can be achieved. With the continuous updating of threat information, the knowledge graph network security system based on the threat information can realize dynamic defense, and compared with traditional static defense means such as antivirus software and firewall, the knowledge graph can sense the network security situation more quickly and accurately, so that the overall security of the network is improved, and advanced functions such as attack path prediction, attack tracing, security threat evaluation and the like are realized.

In generating a network security knowledge graph from threat intelligence, the difficult research problems are improving the quality of the collected threat intelligence data, reducing its false positive rate, and performing network security entity identification and security entity relationship extraction on it.

The main problems are as follows:

1. Open source threat intelligence on the network generally suffers from low data quality, a high false positive rate, and missing or erroneous attributes of data entities. Low-quality threat intelligence data inevitably yields a low-quality network security knowledge graph, so that the network security situation cannot be perceived correctly and current network attack behavior may be predicted wrongly. Existing data quality improvement algorithms rely mainly on truth discovery algorithms, which are mostly designed for single-truth problems; they cannot handle the case where an entity in network security threat intelligence data has multiple true values, nor the strongly time-varying nature of such data.

2. Existing entity identification and entity relationship extraction methods are mainly based on traditional rule matching, machine learning and the recently popular deep learning; they need a large number of labeled text samples and place high demands on data quality. Although these methods are widely used in other fields such as natural language processing, they are difficult to apply to entity identification and entity relationship extraction in the network security field, because large-scale high-quality labeled security entity data is lacking, multiple entity types are mixed in the data, and entity type labels are inconsistent across the full text.

At present, no network security entity identification and entity relationship extraction method with good effect exists in the field of network security.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a network security knowledge graph generation method based on threat intelligence, which aims to provide a basic model for utilizing and analyzing massive threat intelligence data and to predict the attack means and attack targets of attackers.

The technical scheme adopted by the invention for realizing the purpose is as follows:

A network security knowledge graph generation method based on threat intelligence comprises the following steps:

step 1, efficient distributed threat intelligence data collection, wherein a distributed threat intelligence data crawling system is built on the Scrapy framework, and a crawler program scheduled by scrapy-redis extracts the data, structures it, and stores it in Redis and MongoDB databases;

step 2, producing a network security threat intelligence data set through the distributed threat intelligence crawling system;

step 3, improving the data quality of the network security threat intelligence;

step 4, performing network security entity identification on the produced network security threat intelligence data set;

step 5, extracting network security entity relationships;

step 6, organizing the data.

Further, the high efficiency distributed threat intelligence data collection comprises: distributed crawler system architecture, crawler strategy, crawler implementation and data storage.

Further, the distributed crawler system architecture comprises: the threat intelligence collection system architecture consists of a distributed crawler system and the deployment of the underlying environment; the distributed crawler system is obtained by adapting the conventional Scrapy crawler framework and adding a Redis database; the underlying environment is a multi-node distributed system running a Docker container cluster, with Kubernetes as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends, wherein the Master end deploys a Redis database that stores and schedules the requests to be crawled, the Slave ends deploy the main crawler program that crawls web pages and parses out the extracted data, and every Slave end stores the parsed web page data in the same MongoDB database.

Further, the crawler strategy includes: for the Master end, an initial link is stored in Redis, wherein the Key is the next page to be crawled in the scheduling queue and the URL is generally the link of a particular page; the crawler is then started, obtains the starting URL from Redis, and downloads the web page corresponding to the URL; the response is parsed according to predefined rules to obtain either page data or a detail page link; page data is parsed directly according to the page format, whereas for a detail page link the crawler is started again with the link changed to the detail page link in order to obtain the final detail data; the crawler program keeps taking URLs from the scheduling queue and crawling the next URL; if no URL is available, it enters a waiting state; for the Slave end, a downloader executes the download task and parses the extracted fields; the crawler program obtains a URL from the scheduling queue under the Redis Key and downloads the corresponding web page; the response is parsed according to the predefined field rules, and the resulting fields are stored in the MongoDB database after passing through the text deduplication module; this continues until the Key value is empty.

Further, the crawler implementation includes: the scheduler module is responsible for task scheduling of the whole system and has the following main functions: receiving requests sent by the engine; returning URLs to the download module; and deduplicating URLs and storing them in the Redis database; each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in a Redis queue; on receiving a request from the engine, the scheduler returns a URL to the downloader; the crawl-and-download module combines the functions of a spider and a downloader, wherein the spider processes the web page information returned by the downloader, extracts the directory URLs and detail page URLs it contains, and extracts the key fields, which are then stored in the MongoDB database; the downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider; the module is responsible for crawling the corresponding websites: it first takes the starting URL, extracts further URLs after crawling, and returns them to the deduplication module, after which the scheduling module distributes URLs from Redis to the Slave nodes;

the data storage includes: the storage module implements two functions: URLs are stored in Redis, which is deployed on the Master node; and the parsed web page content is stored in a MongoDB database, which is also deployed on the Master node; extracting the stored web page content information is the final goal of the system, and the distributed crawler crawls the web page content so that a data processing program can extract the required information from it.
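The scheduling and deduplication flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory `deque` and `set` stand in for the Redis queue and the Redis deduplication set, a plain list stands in for the MongoDB collection, and the class name `DedupScheduler`, the function `crawl`, and the `example.test` URLs are all hypothetical.

```python
from collections import deque
import hashlib


class DedupScheduler:
    """In-memory stand-in for the Master-side Redis queue and URL fingerprint set."""

    def __init__(self):
        self.queue = deque()   # plays the role of the Redis scheduling queue
        self.seen = set()      # plays the role of the Redis deduplication set

    @staticmethod
    def fingerprint(url):
        # Hash a lightly normalised URL so that a trailing slash does not
        # create a second entry for the same page.
        return hashlib.sha1(url.strip().rstrip("/").encode("utf-8")).hexdigest()

    def push(self, url):
        """Enqueue a URL unless an equivalent one was already scheduled."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def crawl(scheduler, fetch, store):
    """Slave-side loop: pop a URL, parse it, store fields, push discovered links."""
    while (url := scheduler.pop()) is not None:
        fields, links = fetch(url)
        if fields is not None:
            store.append(fields)          # stands in for the MongoDB insert
        for link in links:
            scheduler.push(link)          # detail page links go back to the queue


# Hypothetical site: one list page linking to two detail pages, one link repeated.
SITE = {
    "http://example.test/list": (None, ["http://example.test/d/1",
                                        "http://example.test/d/2",
                                        "http://example.test/d/1/"]),
    "http://example.test/d/1": ({"id": 1}, []),
    "http://example.test/d/2": ({"id": 2}, []),
}

scheduler = DedupScheduler()
scheduler.push("http://example.test/list")
stored = []
crawl(scheduler, lambda u: SITE[u], stored)
print(stored)
```

In a real deployment the worker would block and wait when the queue is empty rather than return, which matches the waiting state described for the crawler.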

Further, producing the network security threat intelligence data set through the distributed threat intelligence crawling system comprises collecting the following data:

(1) vulnerability data: collected from the main vulnerability publishing platforms; the data types include the type of system in which the vulnerability occurs, the system version and the exploitation method;

(2) APT attack chain data: collected from the APTnodes platform, comprising a total of 528 APT reports from the last 10 years;

(3) malware text data: including the name, category, common functions, hash and targeted system platform of the malware in threat intelligence; this data is collected from the threat intelligence source AlienVault;

(4) security community discussion data: collected from the StackExchange website, consisting of text discussing recent security events;

(5) security RSS subscription data: collected from major network security RSS feeds, consisting of recent network security news.

Further, the method for improving the quality of the network security threat information data comprises the following steps:

step (1), FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated; its value is (1 - specificity), and it obeys a Beta distribution with hyper-parameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, wherein $\alpha_{0,1}$ is the prior count of false positive samples per source and $\alpha_{0,0}$ is the prior count of true negative samples per source:

$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$

from the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node, and the time-varying characteristic of the truth finding model is calibrated by using second-order Markov;

step (2), Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated, obeying a Beta distribution with hyper-parameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, wherein $\alpha_{1,1}$ is the prior count of true positive samples per source and $\alpha_{1,0}$ is the prior count of false negative samples per source:

$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$

from the second time node onward, $\phi_k^1$ is replaced by the $\phi_k^1$ of the previous time node;

step (3), Att fact (attack label): for each entity attribute $f \in F$, where $F$ is the set of observed values of all attributes under the entity, a prior truth probability $\theta_f$ is generated, obeying a Beta distribution with hyper-parameter $\beta = (\beta_1, \beta_0)$, wherein $\beta_1$ is the prior count of correct entity attribute samples and $\beta_0$ is the prior count of erroneous entity attribute samples:

$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$

from the second time node onward, $\theta_f$ is replaced by the $\theta_f$ of the previous time node, and the time-varying characteristic of the truth finding model is calibrated by using second-order Markov;

step (4), Truth label: the attribute truth label indicates, for each entity attribute, whether the observed value is correct; $t_f$ is the attribute truth label, a binary Boolean variable obeying a Bernoulli distribution with parameter $\theta_f$, where the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:

$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$

step (5), Observation: the entity attribute observation label; each observation $c \in C_f$ of an entity attribute has a source denoted $s_c$; the observation label $o_c$ of $c$ is generated from a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:

$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$

wherein, if $t_f = 0$, $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^{0}$, the false positive rate of source $s_c$;

if $t_f = 1$, $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^{1}$, the sensitivity of source $s_c$.
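The five generation steps can be read as a sampling procedure. The following minimal sketch draws one synthetic data set from that procedure using only the standard library; it assumes that every source reports on every attribute, which the text does not require, and the function names and the hyper-parameter values in the usage lines are hypothetical.

```python
import random


def sample_source_quality(rng, alpha0, alpha1):
    """Draw one source's false positive rate and sensitivity from Beta priors.

    alpha0 = (a01, a00): prior false-positive and true-negative counts.
    alpha1 = (a11, a10): prior true-positive and false-negative counts.
    """
    fpr = rng.betavariate(alpha0[0], alpha0[1])
    sensitivity = rng.betavariate(alpha1[0], alpha1[1])
    return fpr, sensitivity


def generate(rng, n_sources, n_facts, alpha0, alpha1, beta):
    """Run the generative story once; beta = (b1, b0) as in the text."""
    quality = [sample_source_quality(rng, alpha0, alpha1) for _ in range(n_sources)]
    truths, observations = [], []
    for f in range(n_facts):
        theta = rng.betavariate(beta[0], beta[1])      # prior truth probability
        t = 1 if rng.random() < theta else 0           # truth label t_f
        truths.append(t)
        for k, (fpr, sensitivity) in enumerate(quality):
            p_assert = sensitivity if t == 1 else fpr  # phi of source k for t_f
            o = 1 if rng.random() < p_assert else 0    # observation label o_c
            observations.append((f, k, o))
    return quality, truths, observations


rng = random.Random(0)
quality, truths, observations = generate(
    rng, n_sources=3, n_facts=5, alpha0=(1, 9), alpha1=(9, 1), beta=(5, 5))
print(len(observations), "observations for", len(truths), "attributes")
```

With the prior counts chosen here, a source asserts a false attribute about one time in ten on average and confirms a true attribute about nine times in ten.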

The model is solved as follows:

the conditional probability of the observation $o_c$ of each entity attribute under the model is:

$$p(o_c \mid \theta_f, \phi_{s_c}^{0}, \phi_{s_c}^{1}) = p(o_c \mid \phi_{s_c}^{0})\,(1 - \theta_f) + p(o_c \mid \phi_{s_c}^{1})\,\theta_f$$

in the above formula, $p$ is the conditional probability of the observation $o_c$ given the prior truth probability $\theta_f$, the source false positive rate $\phi_{s_c}^{0}$ and the source sensitivity $\phi_{s_c}^{1}$; $c$ is the observation, $f$ is the attack label, and $s_c$ is the source that made observation $c$;

the full likelihood function containing all variables and hyper-parameters is written as:

$$p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta) = \prod_{s \in S} p(\phi_s^0 \mid \alpha_0)\, p(\phi_s^1 \mid \alpha_1) \times \prod_{f \in F} \Big( p(\theta_f \mid \beta)\, \theta_f^{t_f} (1 - \theta_f)^{1 - t_f} \prod_{c \in C_f} p(o_c \mid \phi_{s_c}^{t_f}) \Big)$$

in the above formula, $p$ is the conditional probability of the observations $o$, the sources $s$, the truth labels $t$, the prior probability parameter set $\theta$ and the source quality parameter sets $\phi^0$, $\phi^1$, given the false positive rate and sensitivity hyper-parameters $\alpha_0$, $\alpha_1$ and the prior truth probability hyper-parameter $\beta$; $S$ is the set of all sources, $F$ is the attack label set, $f$ is an attack label element of $F$, $\theta_f$ is the prior truth probability of $f$, $t_f$ is the truth value of $f$, $C_f$ is the set of observations of $f$, and $c$ is an observation element of $C_f$;

given the observation data of the attributes, the likelihood function is solved with the Gibbs sampling algorithm, one of the MCMC algorithms:

$$t_{MAP} = \arg\max_{t} \iiint p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta)\, d\theta\, d\phi^0\, d\phi^1$$

wherein $t_{MAP}$ is the maximum a posteriori estimate, and the remaining symbols have the same meanings as the symbols of the same names above;

each sampling step evaluates the following formula:

$$p(t_f = i \mid t_{-f}, o, s) \propto \beta_i \prod_{c \in C_f} \frac{\alpha_{i,o_c} + n^{-f}_{s_c,i,o_c}}{\alpha_{i,1} + \alpha_{i,0} + n^{-f}_{s_c,i,1} + n^{-f}_{s_c,i,0}}$$

wherein $p$ is the conditional probability that the truth value $t_f$ of $f$ equals $i$ given $t_{-f}$, the observations $o$ and the sources $s$; $i \in \{0, 1\}$ is the value of the truth label of attack label $f$; $t_{-f}$ is the set of truth values of all elements of $F$ except $f$; and

$$n^{-f}_{s_c,i,j} = \big|\{\, c' \in C_{-f} : s_{c'} = s_c,\ o_{c'} = j,\ t_{f_{c'}} = i \,\}\big|$$

is the number of observations with value $j$ made by source $s_c$ on attack labels other than $f$ whose truth label is $i$; $C_{-f}$ is the set of observations of attack labels other than $f$, $c'$ is an element of $C_{-f}$, and $t_{f_{c'}}$ is the truth value of the attack label $f_{c'}$ to which $c'$ refers; the remaining symbols have the same meanings as the symbols of the same names above;

once $p(t_f = i \mid t_{-f}, o, s)$ has been obtained, the FPR (false positive rate) and the Sensitivity of each source at the next moment are estimated as:

$$\phi_s^0 = \frac{E[n_{s,0,1}] + \alpha_{0,1}}{E[n_{s,0,0}] + E[n_{s,0,1}] + \alpha_{0,0} + \alpha_{0,1}}, \qquad \phi_s^1 = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,1,0}] + E[n_{s,1,1}] + \alpha_{1,0} + \alpha_{1,1}}$$

wherein

$$E[n_{s,i,j}] = \sum_{f \in F} \ \sum_{c \in C_f,\ s_c = s,\ o_c = j} p(t_f = i)$$

is the sum, over the elements $c$ of the observation sets $C_f$ of all attack labels $f$ that were made by source $s_c = s$ with observation value $j$, of the probability that the truth label takes the value $i$; $i \in \{0, 1\}$, $j \in \{0, 1, \ldots, |F|\}$, and $|F|$ is the number of elements of the attack label set $F$; the remaining symbols have the same meanings as the symbols of the same names above; finally, the accuracy of each source can be estimated in the same way:

$$\mathrm{precision}_s = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,0,1}] + E[n_{s,1,1}] + \alpha_{0,1} + \alpha_{1,1}}$$

where precision represents the accuracy of each source.
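As an illustration of the Gibbs sampling solution, the sketch below implements a collapsed sampler for binary truth labels and then derives per-source false positive rate, sensitivity and precision from expected counts. It assumes the standard latent-truth-model form of the sampling update, assumes each source reports at most once per attribute, and uses hypothetical names (`gibbs_truth`, `source_quality`) and toy data; it is not the patent's reference implementation.

```python
import random
from collections import defaultdict


def gibbs_truth(observations, n_facts, alpha, beta, n_iter=200, burn_in=50, seed=0):
    """Collapsed Gibbs sampling; returns the posterior mean of t_f per attribute.

    observations: list of (fact, source, o) with o in {0, 1}.
    alpha[i][j]:  prior count for truth label i and observation j.
    beta[i]:      prior count for truth label i (note: indexed by i, not (b1, b0)).
    """
    rng = random.Random(seed)
    by_fact = defaultdict(list)
    for f, s, o in observations:
        by_fact[f].append((s, o))
    t = [rng.randint(0, 1) for _ in range(n_facts)]
    n = defaultdict(int)                       # n[(source, truth, observation)]
    for f, s, o in observations:
        n[(s, t[f], o)] += 1
    totals = [0] * n_facts
    for it in range(n_iter):
        for f in range(n_facts):
            for s, o in by_fact[f]:            # remove f to get the "-f" counts
                n[(s, t[f], o)] -= 1
            weights = []
            for i in (0, 1):
                w = beta[i]
                for s, o in by_fact[f]:
                    w *= (n[(s, i, o)] + alpha[i][o]) / (
                        n[(s, i, 0)] + n[(s, i, 1)] + alpha[i][0] + alpha[i][1])
                weights.append(w)
            t[f] = 1 if rng.random() * (weights[0] + weights[1]) < weights[1] else 0
            for s, o in by_fact[f]:            # put f back with its new label
                n[(s, t[f], o)] += 1
            if it >= burn_in:
                totals[f] += t[f]
    return [x / (n_iter - burn_in) for x in totals]


def source_quality(observations, p_true, alpha):
    """Estimate FPR, sensitivity and precision per source from expected counts."""
    expected = defaultdict(float)
    for f, s, o in observations:
        expected[(s, 1, o)] += p_true[f]
        expected[(s, 0, o)] += 1.0 - p_true[f]
    result = {}
    for s in sorted({src for _, src, _ in observations}):
        def e(i, j, s=s):
            return expected[(s, i, j)] + alpha[i][j]
        result[s] = {
            "fpr": e(0, 1) / (e(0, 0) + e(0, 1)),
            "sensitivity": e(1, 1) / (e(1, 0) + e(1, 1)),
            "precision": e(1, 1) / (e(0, 1) + e(1, 1)),
        }
    return result


# Toy data: sources 0-2 are reliable, source 3 asserts every attribute.
TRUE_FACTS = {0, 1, 2}
observations = []
for f in range(6):
    for s in range(3):
        observations.append((f, s, 1 if f in TRUE_FACTS else 0))
    observations.append((f, 3, 1))
ALPHA = ((9, 1), (1, 9))   # alpha[0] = (a00, a01), alpha[1] = (a10, a11)
BETA = (5, 5)              # beta[0], beta[1]
p_true = gibbs_truth(observations, 6, ALPHA, BETA)
quality = source_quality(observations, p_true, ALPHA)
print(p_true)
print(quality[0], quality[3])
```

The asymmetric prior counts in `ALPHA` are what keep the sampler from settling on the mirrored labelling in which every truth label is flipped.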

Further, network security entity identification is carried out on the produced network security threat intelligence data set; the BIO labeling method is applied to the APT reports, and a sentence X in an APT report document is written as $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, wherein $x_i$ is the i-th character in sentence X; under the BIO labeling method, identifying the network security entities in sentence X is equivalent to producing a label sequence $L_X = [l]^N$;

Model training is performed on the labeled APT report documents using a BiLSTM-CRF model; the forward pass of the BiLSTM extracts the features of the words before the i-th character and the backward pass extracts the features of the words after it; the CRF model gives the conditional probability distribution of one set of output random variables given a set of input random variables;

the CRF model is: given an input sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, let $S$ be the $N \times K$ output score matrix of the BiLSTM network, where $K$ is the number of label categories and $S_{i,j}$ is the score of the j-th label for the i-th word; the score $Z$ of a predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is defined as:

$$Z(X, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} S_{i, y_i}$$

where $T$ is a probability transition matrix of dimension $K + 2$, the two additional labels $y_0$ and $y_{N+1}$ marking the start and the end of the sentence; the probability of the generated tag sequence $y$ is:

$$p(y \mid X) = \frac{e^{Z(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{Z(X, \tilde{y})}}$$

where $Y_X$ is the set of all possible label sequences for $X$; then the log-likelihood of the correct labeling is solved by maximum likelihood estimation:

$$\log p(y \mid X) = Z(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{Z(X, \tilde{y})}$$
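To make the CRF scoring concrete, the sketch below computes the sequence score and the log-likelihood of a label sequence from a score matrix S and a transition matrix T. It assumes that indices K and K+1 of T hold the start and end labels, and it computes the normalizer by brute-force enumeration, which is only feasible for tiny examples; a practical implementation would use the forward algorithm instead. The function names and numbers are hypothetical.

```python
import itertools
import math


def sequence_score(S, T, y):
    """Z(X, y): transition scores plus emission scores for one label sequence.

    S: N x K emission scores; T: (K+2) x (K+2) transition scores, where index K
    is the start label and index K+1 is the end label (an assumed layout).
    """
    K = len(S[0])
    path = [K] + list(y) + [K + 1]
    transitions = sum(T[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(S[i][label] for i, label in enumerate(y))
    return transitions + emissions


def log_likelihood(S, T, y):
    """log p(y | X), with the partition function enumerated over all sequences."""
    N, K = len(S), len(S[0])
    scores = [sequence_score(S, T, cand)
              for cand in itertools.product(range(K), repeat=N)]
    m = max(scores)                                   # log-sum-exp for stability
    log_z = m + math.log(sum(math.exp(v - m) for v in scores))
    return sequence_score(S, T, y) - log_z


S = [[2.0, 0.1], [0.2, 1.5]]                 # two words, two label categories
T = [[0.0] * 4 for _ in range(4)]            # K + 2 = 4; all transitions neutral
candidates = list(itertools.product(range(2), repeat=2))
total = sum(math.exp(log_likelihood(S, T, y)) for y in candidates)
best = max(candidates, key=lambda y: log_likelihood(S, T, y))
print(best, total)
```

Training would maximize `log_likelihood` for the gold label sequence of each sentence with respect to the BiLSTM weights and T.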

Further, extracting the network security entity relationships comprises:

extracting the network security entity relationships with an attention-based BiLSTM (Att-BiLSTM) model, which comprises an input layer, a word embedding layer, a BiLSTM layer, an Attention layer and an output layer;

wherein the word embedding layer represents a sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$ of an APT report as a matrix, so that words with similar meanings lie close together in the embedding space and relationships between words can be expressed;

wherein the Attention layer introduces a weighting scheme that highlights the most significant parts of the output;

wherein, given the output of the BiLSTM layer $B = [b]^T = [b_1, \ldots, b_j, \ldots, b_T]$, the parameter matrix $W$ satisfies the following formulas:

$$S = \tanh(B)$$

$$\alpha = \mathrm{softmax}(W^{T} S)$$

$$r = B \alpha^{T}$$

$\alpha$ is the attention weight coefficient and $r$ is the weighted sum of the BiLSTM output $B$; finally a representation vector $B^{*} = \tanh(r)$ is generated through a nonlinear function, and $B^{*}$ is fed into a fully connected neural network, mapped to a label vector, and the predicted label is obtained through a softmax function.
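The attention pooling step can be written out directly from the three formulas. The following sketch uses plain Python lists so that it runs without any numerical library; it assumes W is a single parameter vector of the same length as each BiLSTM output vector, and the function name `attention_pool` and the sample numbers are hypothetical.

```python
import math


def attention_pool(B, w):
    """Weight the BiLSTM outputs and pool them into one sentence vector.

    B: list of T vectors b_j (the columns of the BiLSTM output), each of length d.
    w: parameter vector of length d.
    Returns (alpha, b_star), with alpha the attention weights and b_star = tanh(r).
    """
    # Score of position j: w^T tanh(b_j).
    scores = [sum(wk * math.tanh(bk) for wk, bk in zip(w, b)) for b in B]
    m = max(scores)                                   # stabilised softmax
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    alpha = [v / total for v in exps]
    # r = B alpha^T: attention-weighted sum of the columns of B.
    r = [sum(alpha[j] * B[j][k] for j in range(len(B))) for k in range(len(w))]
    return alpha, [math.tanh(v) for v in r]


B = [[0.0, 0.0], [1.0, 2.0], [3.0, -1.0]]    # T = 3 positions, d = 2
alpha, b_star = attention_pool(B, [1.0, 0.5])
uniform, _ = attention_pool(B, [0.0, 0.0])
print(alpha, b_star)
```

With an all-zero parameter vector every position receives the same weight, so the pooled vector reduces to the tanh of the plain average of the BiLSTM outputs.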

Furthermore, the data is organized and stored in a non-relational database, namely a MongoDB database, with all data stored as key-value pairs.
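One possible key-value layout for the extracted entities and relationships is sketched below. Every field name, identifier and value is hypothetical and chosen only for illustration; the text does not specify a schema, and plain dictionaries serialized to JSON stand in for MongoDB documents.

```python
import json


def make_entity(entity_id, entity_type, name, **attributes):
    """Build one entity document; all field names are illustrative."""
    return {"_id": entity_id, "type": entity_type, "name": name,
            "attributes": attributes}


def make_relation(head_id, relation, tail_id, source):
    """Build one relationship document that links two entity documents."""
    return {"head": head_id, "relation": relation, "tail": tail_id,
            "source": source}


entities = [
    make_entity("sw:windows-10", "Software", "Windows 10", vendor="Microsoft"),
    make_entity("vuln:example-0001", "Vulnerability", "Example vulnerability",
                severity="high"),
]
relations = [
    make_relation("vuln:example-0001", "affects", "sw:windows-10",
                  source="example-feed"),
]
document = json.dumps({"entities": entities, "relations": relations}, indent=2)
print(document)
```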

The invention has the following beneficial effects and advantages:

The invention provides a basic model for utilizing and analyzing massive threat intelligence data. Its key point is to adapt existing data quality improvement algorithms to network security threat intelligence data, thereby improving the quality of the collected data and reducing its false positive rate. The invention also adapts existing entity identification and entity relationship extraction methods to the characteristics of threat intelligence data, improving the accuracy and efficiency of network security entity identification and security entity relationship extraction, and generates a threat intelligence network security knowledge graph of higher data quality. In addition, drawing on the data reasoning capability of the network security knowledge graph, the invention studies an attack graph visualization method that combines the knowledge graph with the local network topology.

The method first improves the quality of threat intelligence data in view of its characteristics, reducing the false positive rate and raising the overall quality of the data; it then adapts existing entity identification and entity relationship extraction methods to threat intelligence so as to generate a high-quality threat intelligence knowledge graph; next, it performs correlation analysis of local network vulnerabilities using recent threat intelligence together with local network topology data, and visualizes the security-weak nodes in the local network topology; finally, it provides an attack prediction method based on the combination of the network security knowledge graph and the flow analysis of the observation building, which predicts the attack means and attack targets of the attacker. A large number of experiments verify that the proposed threat intelligence data quality improvement algorithm, and the knowledge graph generated by entity identification and entity relationship extraction from the intelligence text, are of higher quality than those of existing methods, and that the method has good local network weakness visualization capability and attack prediction analysis capability.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a process diagram of a threat intelligence-based network security knowledge-graph generation method of the present invention;

FIG. 2 is a diagram of a distributed crawler architecture for threat intelligence data collection in the present invention;

FIG. 3 is a probabilistic graphical model diagram of the threat intelligence data quality improvement algorithm in the present invention;

FIG. 4 is a schematic diagram of atomic attack entities and their relationships defined in the present invention;

FIG. 5 is a schematic structural diagram of the BiLSTM-CRF model for network security entity identification in the present invention;

FIG. 6 is a schematic structural diagram of the Att-BiLSTM model for network security entity relationship extraction in the present invention;

FIG. 7 is a data collection time chart for a distributed crawler system for threat intelligence data collection as developed in the present invention;

FIG. 8 is a graph comparing the effectiveness of a distributed crawler system and a stand-alone crawler system for threat intelligence data collection as developed in the present invention;

FIG. 9 is a diagram of an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.

The solution of some embodiments of the invention is described below with reference to fig. 1-9.

Example 1

The invention relates to a network security knowledge graph generation method based on threat intelligence; FIG. 1 is a process diagram of the method. The generation process of the network security knowledge graph comprises: efficient distributed threat intelligence data collection, network security data set production, network security threat intelligence data quality improvement, network security entity identification, network security entity relationship extraction, and data organization. The steps are described in detail below:

step 1, efficient distributed threat intelligence data collection.

Generating the network security knowledge graph requires a large amount of network security threat intelligence data. To collect open source threat intelligence from the network efficiently and in real time, the following distributed crawler system is implemented. The distributed threat intelligence data crawling system is built on the Scrapy framework; a crawler program scheduled by scrapy-redis extracts the data, structures it, and stores it in Redis and MongoDB databases.

(1) Distributed crawler system architecture: the threat intelligence collection system architecture consists of a distributed crawler system and the deployment of the underlying environment. The distributed crawler system is obtained by adapting the conventional Scrapy crawler framework and adding a Redis database, which remedies the framework's original lack of support for distributed operation. The underlying environment is a multi-node distributed system running a Docker container cluster, with the mature Kubernetes as the cluster management tool. The distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends: the Master end deploys a Redis database that stores and schedules the requests to be crawled, the Slave ends deploy the main crawler program that crawls web pages and parses out the extracted data, and every Slave end then stores the parsed web page data in the same MongoDB database. FIG. 2 shows the distributed crawler architecture for threat intelligence data collection. Each threat intelligence data item to be crawled is stored in the Redis database, and the Scrapy engine schedules the items through the scheduler; when an item is scheduled, the corresponding crawler program (spider) and its middleware are started to download the threat intelligence data.

(2) Crawler strategy: for the Master end, an initial link is stored in Redis; the Key is the next page to be crawled in the scheduling queue, and the URL is generally the link of a particular page. The crawler is then started, obtains the starting URL from Redis, and downloads the web page corresponding to the URL. The response is parsed according to predefined rules to obtain either page data or a detail page link: page data is parsed directly according to the page format, whereas for a detail page link the crawler is started again with the link changed to the detail page link in order to obtain the final detail data. The crawler keeps taking URLs from the scheduling queue and crawling the next URL; if no URL is available, it enters a waiting state. For the Slave end, the downloader executes the download task and parses the extracted fields. The crawler program obtains a URL from the scheduling queue under the Redis Key and downloads the corresponding web page. The response is parsed according to the predefined field rules, and the resulting fields are stored in the MongoDB database after passing through the text deduplication module. This continues until the Key value is empty.

(3) Crawler implementation: the scheduler module is responsible for task scheduling of the whole system and has the following main functions: receiving requests sent by the engine; returning URLs to the download module; and deduplicating URLs and storing them in the Redis database. Each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in a Redis queue. On receiving a request from the engine, the scheduler returns a URL to the downloader. The crawl-and-download module combines the functions of a spider and a downloader: the spider processes the web page information returned by the downloader and extracts the directory URLs and detail page URLs it contains. The key fields in the web page information are extracted and then stored in the MongoDB database. The downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider. The module is responsible for crawling the corresponding websites: it first takes the starting URL, extracts further URLs after crawling, and returns them to the deduplication module, after which the scheduling module distributes URLs from Redis to the Slave nodes.

(4) Data storage: the storage module only needs to implement two functions. First, URLs are stored in Redis, which is deployed on the Master node. Second, the parsed web page content is stored in a MongoDB database, which is also deployed on the Master node. Extracting the stored web page content information is the final goal of the system; the distributed crawler crawls the web page content and hands it to a data processing program that extracts the required information.
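The text deduplication module that sits in front of the MongoDB store is not specified in detail. One simple way to realize it is to fingerprint the normalized text of each record and skip records whose fingerprint has already been stored. The sketch below assumes exact-match deduplication after whitespace and case normalization; the class `DedupStore`, the `text` field and the sample records are hypothetical, and a dictionary stands in for the database collection.

```python
import hashlib
import re


def content_fingerprint(text):
    """Hash text after collapsing whitespace and case so trivial variants match."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class DedupStore:
    """Dictionary-backed stand-in for the MongoDB collection behind the crawler."""

    def __init__(self):
        self.documents = {}

    def save(self, record):
        """Store the record unless an equivalent text was already saved."""
        key = content_fingerprint(record["text"])
        if key in self.documents:
            return False
        self.documents[key] = record
        return True


store = DedupStore()
results = [
    store.save({"url": "http://example.test/a", "text": "Remote code execution in  Foo"}),
    store.save({"url": "http://example.test/b", "text": "remote code execution in foo\n"}),
    store.save({"url": "http://example.test/c", "text": "Privilege escalation in Bar"}),
]
print(results, len(store.documents))
```

Near-duplicate detection, for example with shingling or SimHash, would catch reworded copies that this exact-match approach misses.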

And 2, making a network security threat intelligence data set.

The network security data is obtained by collecting the following 5 kinds of threat intelligence data by using the distributed threat intelligence crawling system in step 1. The method comprises the following steps:

(1) Vulnerability data: collected from the main vulnerability publishing platforms, such as CVE and NVD. The data includes the type of system affected by the vulnerability, the system version, the exploitation method, and the like.

(2) APT (advanced persistent threat) attack chain data: collected from the APTnodes platform, comprising 528 APT reports from the last 10 years. 50 reports are labeled manually using the BIO labeling method; 40 of them are used to train the deep learning models for entity recognition and entity relation extraction, and the remaining 10 are used to test the models.

(3) Malware text data: includes the name, category, common functions, hash, targeted system platform, and the like of the malware in the threat intelligence. This data is collected from the threat intelligence source AlienVault.

(4) Security community discussion data: collected from the StackExchange website; the data is mainly text in which security researchers discuss recent security events.

(5) Security RSS subscription data: collected from major network security RSS feeds; the data is mainly recent network security news.
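The division of the 50 manually labeled APT reports in item (2) into 40 training reports and 10 test reports can be sketched as follows. The text does not say how the 40 reports are chosen; a seeded random shuffle is assumed here, and the report identifiers are hypothetical.

```python
import random


def split_reports(report_ids, n_train=40, seed=0):
    """Shuffle the labeled reports reproducibly and split them into train and test."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)   # assumption: random assignment
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers for the 50 manually labeled reports.
labeled_reports = [f"apt_report_{i:02d}" for i in range(50)]
train_ids, test_ids = split_reports(labeled_reports)
```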

Step 3: improving the data quality of the network security threat intelligence.

After the network security threat intelligence data set is generated, the quality of the threat intelligence data needs to be improved and its false positive rate reduced, so that a high-quality network security knowledge graph can be generated in the subsequent steps.

To handle the time-varying nature of threat intelligence, the invention improves an existing truth discovery algorithm by introducing the Markov property, so that the algorithm suits the time-varying characteristics of threat intelligence. As shown in fig. 3, fig. 3 is a probabilistic graphical model diagram of the threat intelligence data quality improvement algorithm of the invention. In the figure, $M_i$ represents the set of model parameters at the i-th time instant, and $c_i$ represents the prior parameters of the model $M_i$ at the i-th time instant, where $i = 1, 2, \ldots, N$; the remaining parameters have the same meanings as given in the text.

The invention provides a threat intelligence data quality improvement algorithm model, which comprises the following steps:

Step (1) FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated. Its value is (1 - specificity), and it obeys a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:

$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$

From the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node; in this way second-order Markov is used to calibrate the time-varying characteristics of the truth discovery model.

Step (2) Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated. It obeys a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior true positive sample count of each source and $\alpha_{1,0}$ is the prior false negative sample count of each source:

$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$

Similar to the FPR, from the second time node onward $\phi_k^1$ is replaced by the $\phi_k^1$ of the previous time node.

Step (3) Att fact (attack tag): for each attribute $f \in F$ belonging to an entity, where $F$ is the set of observations (i.e., the set of collected values) for all attributes under that entity, a prior true probability $\theta_f$ is generated. It obeys a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct entity attribute samples and $\beta_0$ is the prior count of incorrect entity attribute samples:

$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$

Similar to the FPR and sensitivity above, from the second time node onward $\theta_f$ is replaced by the $\theta_f$ of the previous time node; in this way second-order Markov is used to calibrate the time-varying characteristics of the truth discovery model.

Step (4) Truth label: an attribute truth label is generated for each entity attribute, indicating whether the observed value is correct. $t_f$ is the attribute truth label, a binary Boolean variable that obeys a Bernoulli distribution with parameter $\theta_f$, where the prior probability $\theta_f$ represents the probability that the attribute label $t_f$ is correct:

$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$

If $t_f = 0$, the observation $o_c$ obeys a Bernoulli distribution whose parameter is $\phi^0_{s_c}$, the false positive rate of the source $s_c$:

$$o_c \sim \mathrm{Bernoulli}(\phi^0_{s_c})$$

If $t_f = 1$, the observation $o_c$ obeys a Bernoulli distribution whose parameter is $\phi^1_{s_c}$, the sensitivity of the source $s_c$:

$$o_c \sim \mathrm{Bernoulli}(\phi^1_{s_c})$$
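The generative steps above (a false positive rate and a sensitivity per source, a prior true probability and a truth label per attribute, and one Bernoulli observation per reporting source) can be simulated with a minimal sketch using only the Python standard library. The default hyperparameter values, the function name `sample_observations`, and the data layout are illustrative assumptions, not values taken from the invention; an observation is drawn with the source's sensitivity when the truth label is 1 and with its false positive rate when the truth label is 0.

```python
import random


def sample_observations(sources, claims, alpha0=(1.0, 9.0), alpha1=(9.0, 1.0),
                        beta=(1.0, 1.0), seed=0):
    """Draw one sample from the generative model.

    sources: iterable of source names.
    claims:  dict mapping each attribute f to the list of sources reporting on f.
    alpha0 = (alpha_01, alpha_00), alpha1 = (alpha_11, alpha_10), beta = (beta_1, beta_0).
    """
    rng = random.Random(seed)
    # Per-source false positive rate and sensitivity, each Beta distributed.
    fpr = {k: rng.betavariate(alpha0[0], alpha0[1]) for k in sources}
    sens = {k: rng.betavariate(alpha1[0], alpha1[1]) for k in sources}

    truth, observations = {}, {}
    for f, reporting in claims.items():
        theta_f = rng.betavariate(beta[0], beta[1])   # prior true probability
        t_f = 1 if rng.random() < theta_f else 0       # truth label ~ Bernoulli(theta_f)
        truth[f] = t_f
        # Each source asserts the attribute with probability equal to its
        # sensitivity (t_f = 1) or its false positive rate (t_f = 0).
        observations[f] = {
            s: int(rng.random() < (sens[s] if t_f == 1 else fpr[s]))
            for s in reporting
        }
    return fpr, sens, truth, observations


fpr, sens, truth, obs = sample_observations(
    ["source_a", "source_b"],
    {"attr_1": ["source_a", "source_b"], "attr_2": ["source_a"]},
)
```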

Model solution: from the above description, given the observed value $c$ of each entity attribute, the conditional probabilities of the model are as follows:

and

which give the conditional probability when the observed value of the entity $o$ is $c$. Here $c$ is the observation, $f$ is the attack tag, and $s_c$ is the source in which the observation $c$ occurs;

the full likelihood function containing all variables and hyperparameters can be written as:

In the above formula, $p$ represents the conditional probability of the entity $o$, the source $s$, the truth label $t$, the prior probability parameter set $\theta$, and the false positive rate and sensitivity parameter sets $\phi^0$, $\phi^1$, given the false positive rate and sensitivity hyperparameters $\alpha_0$, $\alpha_1$ and the prior true probability hyperparameter $\beta$. $S$ represents the set of all sources, $F$ represents the attack tag set, $f$ represents each attack tag element belonging to $F$, $\theta_f$ denotes the prior probability of $f$, $t_f$ denotes the truth value of $f$, $C_f$ represents the set of observations of $f$, and $c$ represents each observation element in that set.

Given the observed attribute data, the likelihood function can be solved using Gibbs sampling, an MCMC (Markov chain Monte Carlo) algorithm:

$t_{MAP}$ denotes the maximum a posteriori estimate given by the above formula; the remaining parameters have the same meanings as the parameters of the same names above.

The following formula can be solved:

where $p$ represents the conditional probability that the truth value $t_f$ of $f$ takes the value $i$, given the parameter $t_{-f}$, the entity $o$, and the source $s$; $i$ represents the attack tag value of $f$, with value range $\{0, 1\}$; $t_{-f}$ is the set of all truth values in $F$ except that of $f$; the count term in the formula represents the number of observations from the source $s_c$ whose observed value is $j$, whose attack tag is not $f$, and whose truth label is $i$; $C_{-f}$ represents the set without the attack tag $f$, and $c'$ is each element of $C_{-f}$; the truth-value term indexed by $c'$ is the truth value of the attack tag to which $c'$ belongs; and the remaining parameters have the same meanings as the parameters of the same names above.

Once $p(t_f = i \mid t_{-f}, o, s)$ has been obtained, the FPR (false positive rate) and the sensitivity at the next time can be estimated; they are solved as:

wherein

where precision represents the accuracy of each source, and the remaining parameters have the same meanings as the parameters of the same names above.
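One way to estimate the false positive rate and sensitivity of each source from inferred truth labels is sketched below. It is not the update formula of the invention, which is not reproduced in this text; it assumes the standard posterior mean of a Beta prior combined with Bernoulli counts. The function name and the data layout are hypothetical.

```python
def estimate_source_quality(observations, truth, alpha0=(1.0, 9.0), alpha1=(9.0, 1.0)):
    """Posterior-mean estimates of FPR and sensitivity per source (an assumption).

    observations: list of (source, attribute, o) with o in {0, 1}.
    truth:        dict mapping attribute -> inferred truth label in {0, 1}.
    alpha0 = (alpha_01, alpha_00): prior false positive / true negative counts.
    alpha1 = (alpha_11, alpha_10): prior true positive / false negative counts.
    """
    counts = {}
    for source, attribute, o in observations:
        c = counts.setdefault(source, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if truth[attribute] == 1:
            c["tp" if o == 1 else "fn"] += 1
        else:
            c["fp" if o == 1 else "tn"] += 1

    quality = {}
    for source, c in counts.items():
        # Prior counts are added to the observed counts (Beta-Bernoulli conjugacy).
        fpr = (alpha0[0] + c["fp"]) / (alpha0[0] + alpha0[1] + c["fp"] + c["tn"])
        sens = (alpha1[0] + c["tp"]) / (alpha1[0] + alpha1[1] + c["tp"] + c["fn"])
        quality[source] = {"fpr": fpr, "sensitivity": sens}
    return quality
```

Under this assumption, the estimates of one time node can serve as the prior for the next time node, which matches the idea of replacing the parameters with those of the previous time node.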

Entities and relationships are defined as follows:

First, the concepts of network security entities and of the relationships between entities are defined. A knowledge graph reflects specific pieces of information and the associations between them, and entities are an abstract expression of concepts and of the relationships between concepts, so a good entity definition helps to express clearly the information and relationships contained in the knowledge graph. Here, an atomic attack is used to describe a network security entity; an atomic attack is the smallest attack unit in a single attack and can be understood as the smallest step in the attack.

As shown in fig. 4, fig. 4 is a schematic diagram of the atomic attack entities and their relationships defined in the present invention. In the atomic attack graph, an atomic attack is represented by a vertex, which in practice represents a single exploitation of a vulnerability. Vulnerabilities are tied to software and hardware, and carrying out an attack depends on the attack condition, the attack mode, the attack effect, and the like. The invention therefore designs 4 entities for atomic attacks: software, hardware, vulnerabilities, and attacks, where an attack has 3 attributes: attack condition, attack mode, and attack effect. Two relationships between entities are defined: "existence" and "utilization".
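The four entity kinds, the three attack attributes, and the two relationships described above can be written down as a small data structure. This is only a sketch: the class names, the field names, the relation direction (software or hardware "existence" vulnerability, attack "utilization" vulnerability), and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass

ENTITY_KINDS = {"software", "hardware", "vulnerability", "attack"}
RELATION_LABELS = {"existence", "utilization"}


@dataclass(frozen=True)
class Entity:
    kind: str
    name: str
    # Only attack entities are expected to fill in these three attributes.
    condition: str = ""
    mode: str = ""
    effect: str = ""

    def __post_init__(self):
        if self.kind not in ENTITY_KINDS:
            raise ValueError(f"unknown entity kind: {self.kind}")


@dataclass(frozen=True)
class Relation:
    head: Entity
    label: str
    tail: Entity

    def __post_init__(self):
        if self.label not in RELATION_LABELS:
            raise ValueError(f"unknown relation label: {self.label}")


# Hypothetical example values.
win10 = Entity("software", "Windows 10")
rce = Entity("vulnerability", "remote desktop service remote code execution")
attack = Entity("attack", "example attack", condition="network access",
                mode="remote", effect="code execution")
graph = [Relation(win10, "existence", rce), Relation(attack, "utilization", rce)]
```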

Step 4: performing network security entity identification on the constructed network security threat intelligence data set.

As described above, the APT reports in step 2 are annotated using the BIO labeling method. A sentence in an APT report document is written $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the i-th character in sentence $X$. Under the BIO labeling method, identifying the network security entities in sentence $X$ is equivalent to producing a label sequence $L_X = [l]^N$.
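A BIO label sequence for a sentence can be produced from annotated entity spans as in the following sketch. It works on word tokens rather than characters for readability, and the entity type name `SOFTWARE`, the sentence, and the spans are made-up examples.

```python
def bio_tags(tokens, spans):
    """Convert entity spans into a BIO label sequence.

    spans: list of (start, end_exclusive, entity_type) over token indices.
    """
    tags = ["O"] * len(tokens)               # O: outside any entity
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"     # B: first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"     # I: inside the entity
    return tags


tokens = ["The", "attack", "exploits", "Remote", "Desktop", "Services",
          "on", "Windows", "10"]
spans = [(3, 6, "SOFTWARE"), (7, 9, "SOFTWARE")]
labels = bio_tags(tokens, spans)
```

The resulting `labels` list has one label per token, which is the form a sequence labeling model is trained on.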

The invention uses a BiLSTM-CRF (bidirectional long short-term memory network with a conditional random field) model to train on the labeled APT report documents. As shown in figure 5, figure 5 is a structural schematic diagram of the BiLSTM-CRF model used for network security entity identification in the invention. In the figure, CRF represents a conditional random field; bi represents the output of the i-th backward network; fi denotes the output of the i-th forward network; ci represents the i-th text vector; B-LOC, E-LOC, and O in the CRF layer represent start, end, and outside, respectively. The model extracts the features of the words before the i-th character through the forward pass and the features of the words after it through the backward pass at the same time, which improves its ability to learn word features. A CRF (conditional random field) model is used to obtain the conditional probability distribution of one set of output random variables given a set of input random variables.

The CRF model is as follows: given an input sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, let $S$ be the output score matrix of the BiLSTM (bidirectional long short-term memory) network, with dimension $N \times K$, where $K$ is the number of label types and $S_{i,j}$ is the score of the j-th label for the i-th word. The score $Z$ of a predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is defined as:
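A score of this kind can be sketched as follows. Because the formula itself is not reproduced in this text, the sketch assumes the common linear-chain form used with BiLSTM-CRF models: the sum of the emission scores $S_{i, y_i}$ plus the sum of transition scores between consecutive labels. The transition matrix `transitions` is an assumption that the text does not define, and the numbers are made up.

```python
def path_score(emissions, transitions, labels):
    """Score of one label sequence under an assumed linear-chain CRF.

    emissions:   N x K matrix, emissions[i][j] is the score of label j at position i.
    transitions: K x K matrix, transitions[a][b] is the score of moving from a to b.
    labels:      list of N label indices.
    """
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]   # N = 3 positions, K = 2 labels
transitions = [[0.1, 0.4], [0.6, 0.2]]
score = path_score(emissions, transitions, [0, 1, 1])
```

For this example the emission part is 1.0 + 0.8 + 0.5 and the transition part is 0.4 + 0.2.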

Step 5: extracting network security entity relationships.

Network security entity relationship extraction uses an attention-based bidirectional long short-term memory network (Att-BiLSTM) model. The model has 5 layers: an input layer, a word embedding layer, a BiLSTM layer, an Attention layer, and an output layer (that is, the CRF layer of the BiLSTM-CRF model is replaced by the Attention layer, and the output layer becomes a softmax layer). As shown in FIG. 6, FIG. 6 is a schematic structural diagram of the Att-BiLSTM model for network security entity relationship extraction in the present invention. In the figure, Si represents the i-th text vector; O, B-A and I-A in the output layer represent outside, beginning of A, and inside of A, respectively.

The word embedding layer represents a sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$ in the APT report as a matrix, in which words with similar meanings lie close together in the embedding space, indicating that they may be related.

The Attention layer introduces a weighting scheme to highlight the important parts of the output. Let the output of the BiLSTM layer be $B = [b]^T = [b_1, \ldots, b_j, \ldots, b_T]$ and let $W$ be the parameter matrix; then the following formulas hold:

$$S = \tanh(B)$$

$$\alpha = \mathrm{softmax}(W^{\top} S)$$

$$r = B \alpha^{\top}$$

Here $\alpha$ is the attention weight coefficient and $r$ is the weighted sum of the BiLSTM output $B$. Finally, the representation vector $B^* = \tanh(r)$ is generated by a nonlinear function. $B^*$ is then fed into a fully connected neural network, which maps it to a label vector, and the predicted label is obtained through a softmax function.
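The attention computation above can be traced with a small pure-Python sketch. It treats the BiLSTM output as a list of T column vectors and, as a simplifying assumption, takes the parameter W to be a single vector of the same dimension; the function name and the numbers are illustrative only.

```python
import math


def attention_pool(columns, w):
    """Attention pooling over BiLSTM outputs.

    columns: list of T vectors b_j, each of length d (the columns of B).
    w:       parameter vector of length d (assumed shape of W).
    Returns (alpha, b_star) with alpha the attention weights and b_star = tanh(r).
    """
    # Scores w^T tanh(b_j), one per time step.
    scores = [sum(wi * math.tanh(bi) for wi, bi in zip(w, b)) for b in columns]
    # Numerically stable softmax over the T scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # r = B alpha^T: weighted sum of the columns.
    r = [sum(a * b[k] for a, b in zip(alpha, columns)) for k in range(len(w))]
    return alpha, [math.tanh(x) for x in r]


alpha, b_star = attention_pool([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.3, -0.2])
```

In a trained model W is learned together with the BiLSTM weights; here it is fixed only to make the arithmetic visible.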

Step 6: data organization.

Because threat intelligence data is multi-source and heterogeneous, the method uses a non-relational database, MongoDB, to organize and store the data, with all data stored as key-value pairs. MongoDB offers high performance and flexible data storage, and is suitable for storing threat intelligence and the generated network security knowledge graph model.

In the implementation of the invention, the software environment is Windows 10, the implementation language is Python 3, the deep learning framework is PyTorch, and the database is the non-relational database MongoDB.

Example 2

The embodiment provides a network security knowledge graph generating method based on threat intelligence, and the method is used for testing a distributed threat intelligence crawling system.

The invention compares the developed distributed threat intelligence crawling system with a single-machine threat intelligence collection system to verify that the distributed system is more efficient. Taking a common open-source threat intelligence source as an example, the distributed crawler system is configured with 1 master node and 2 slave nodes; after 5 days of continuous operation, a total of 11 thousand pieces of web page data had been stored in the database. The number of pages crawled at each point in time is shown in fig. 7, where fig. 7 is a time chart of data collection by the distributed crawler system for threat intelligence data collection developed in the present invention.

In the experiment, the total number of pages crawled by the 2 Slave nodes in a given time is far higher than the number crawled by the single machine, which demonstrates that the distributed system does improve operating efficiency. The distributed crawler system and a crawler running in a single-machine environment were tested side by side, and the number of pages crawled by each was recorded. The distributed crawler project was deployed in a Docker container cluster and in a virtual machine cluster, with the following configuration: 1 Master and 2 Slaves, all running Ubuntu 16.04 and Python 2.7, each with 8 GB of memory. The comparison of operating efficiency over time is shown in fig. 8, where fig. 8 is a graph comparing the effectiveness of the distributed crawler system for threat intelligence data collection developed in the present invention with a stand-alone crawler system. From the number of pages captured by the crawlers at each time point, the distributed crawler system is clearly superior to the single-machine crawler system.

Example 3

The embodiment provides a network security knowledge graph generating method based on threat intelligence, and comparison is carried out aiming at the effect of a threat intelligence data quality improvement algorithm.

The algorithm proposed by the invention is compared with other truth discovery algorithms on how well they improve the quality of entity attributes in threat intelligence data. The test criteria are the accuracy, recall, and F1 values commonly used for truth discovery models. The truth discovery algorithms compared are 3-Estimates, Voting, and LTM. The comparison is shown in table 1. It can be seen that the quality improvement algorithm of the invention improves the quality of threat intelligence data more than the existing algorithms.

Table 1 is a table of comparison results of different data quality improvement algorithm effects in the embodiment of the present invention.

Algorithm	Accuracy	Recall	F1 value
Proposal	0.935	0.960	0.987
3-Estimates	0.874	0.903	0.927
Voting	0.840	0.867	0.913
LTM	0.924	0.865	0.966

In the table: Proposal represents the algorithm proposed by the invention, 3-Estimates represents the 3-Estimates parameter estimation algorithm, Voting represents the voting algorithm, and LTM represents the latent truth model algorithm.
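For reference, evaluation metrics of this kind are conventionally computed from binary truth labels as in the sketch below. The text does not state how the values in Table 1 were computed, so the conventional definitions of precision, recall, and F1 are assumed, and the label lists are made up.

```python
def precision_recall_f1(predicted, actual):
    """Conventional precision, recall and F1 for binary labels (1 = true)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up predictions and ground truth for five attribute values.
p, r, f1 = precision_recall_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
```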

Example 4

The embodiment provides a network security knowledge graph generating method based on threat intelligence, and the network security knowledge graph generating method is used for comparing the network security entity identification effects in the threat intelligence.

The invention tests the network security entity identification model against existing entity identification models on the remaining 10 labeled APT report documents. The test criteria are the accuracy, precision, recall, and F1 values commonly used in entity identification. The entity recognition models compared are CRF, LSTM and LSTM-CRF. The comparison is shown in table 2. It can be seen that the network security entity identification model provided by the invention identifies network security entities in threat intelligence better than the existing models.

Table 2 is a table comparing the effects of different network security entity identification models in the embodiment of the present invention.

In the table: CRF represents a conditional random field algorithm, LSTM represents a long short-term memory network algorithm, BiLSTM represents a bidirectional long short-term memory network algorithm, and BiLSTM-CRF represents a bidirectional long short-term memory network with conditional random field algorithm.

Example 5

The embodiment provides a network security knowledge graph generating method based on threat intelligence, and the network security knowledge graph generating method is used for comparing network security entity relation extraction effects in the threat intelligence.

The invention tests the network security entity relationship extraction model against existing entity relationship extraction models on the remaining 10 APT report documents. The test criteria are the precision, recall, and F1 values commonly used in entity relationship extraction. The entity relationship extraction models compared are CRF, LSTM, BiLSTM and BiLSTM-CRF. The comparison is shown in table 3. It can be seen that the network security entity relationship extraction model provided by the invention extracts network security entity relationships in threat intelligence better than the existing models.

Table 3 is a table comparing the effects of the different network security entity relationship extraction models in the embodiment of the present invention.

Model	Accuracy	Precision	Recall	F1 value
CRF	0.9041	0.8084	0.7963	0.7892
LSTM	0.9163	0.8162	0.8046	0.8018
BiLSTM	0.9265	0.8339	0.8262	0.8491
BiLSTM-CRF	0.9374	0.8674	0.8344	0.8411
BiLSTM-CRF-Attention	0.9405	0.8652	0.8748	0.8751

In the table: BiLSTM-CRF-Attention represents a bidirectional long short-term memory network with conditional random field and attention mechanism algorithm.

Example 6

The embodiment provides a network security knowledge graph generating method based on threat intelligence, and relates to a network security knowledge graph example based on the threat intelligence.

As shown in fig. 9, fig. 9 is a diagram of an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.

After network security entity identification and relationship extraction are performed on the various kinds of threat intelligence data, the threat-intelligence-based network security knowledge graph can effectively organize the entity data and relationships in the threat intelligence and perform association analysis on the data. The associated data stored in MongoDB is visualized in FIG. 9 using the graphviz module in Python 3. The figure shows that the Win10 system in the Windows family has remote desktop service remote code execution vulnerabilities, which can be exploited through four vulnerabilities, namely CVE-2019-. CVE stands for Common Vulnerabilities and Exposures.
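A visualization of this kind starts from entity-relation triples. The sketch below renders triples as Graphviz DOT text using only the standard library, so it does not depend on any particular Python binding; the function name, the graph name, and the triples are hypothetical, and the vulnerability identifiers are deliberate placeholders rather than real CVE numbers.

```python
def to_dot(triples, graph_name="threat_kg"):
    """Render (head, relation, tail) triples as Graphviz DOT text."""
    lines = [f"digraph {graph_name} {{"]
    for head, relation, tail in triples:
        lines.append(f'  "{head}" -> "{tail}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


# Placeholder data: the identifiers below are not real CVE numbers.
triples = [
    ("Windows 10", "existence", "VULN-PLACEHOLDER-1"),
    ("Windows 10", "existence", "VULN-PLACEHOLDER-2"),
    ("example attack", "utilization", "VULN-PLACEHOLDER-1"),
]
dot_text = to_dot(triples)
```

The resulting text can be saved to a `.dot` file and rendered with the Graphviz `dot` command-line tool.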

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

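1. A network security knowledge graph generation method based on threat intelligence, characterized in that the method comprises the following steps:

step 1, efficient distributed threat intelligence data collection;

step 2, making a network security threat intelligence data set through a distributed threat intelligence crawling system;

step 3, improving the data quality of the network security threat intelligence;

step 4, carrying out network security entity identification on the made network security threat intelligence data set;

step 5, extracting the network security entity relationship;

step 6, organizing data.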

2. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the high efficiency distributed threat intelligence data collection comprises: distributed crawler system architecture, crawler strategy, crawler implementation and data storage.

3. The method of claim 2, wherein the distributed crawler system architecture comprises: the threat intelligence collection system framework consists of a distributed crawler system and the deployment of the underlying environment; the distributed crawler system is built by restructuring the traditional Scrapy crawler framework and adding a Redis database; the underlying environment is a multi-node distributed system using a Docker container cluster, with Kubernetes as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends; the Master end deploys a Redis database to store and schedule the requests to be crawled; the Slave ends deploy the crawler main program to crawl web pages and to parse and extract data; and each Slave end stores the parsed web page data in the same MongoDB database.

4. The method of claim 2, wherein the method comprises: the crawler strategy comprises: for a Master terminal, storing an initial link in Redis, wherein Key is a next crawled page in a scheduling queue, and URL is a link of a certain page generally; then, a crawler is started, a starting URL is obtained from the Redis, and data of a webpage corresponding to the URL are downloaded; analyzing according to a defined relevant rule from response to obtain page data or a detail page link, analyzing according to a page format for the condition of directly being the page data, restarting a crawler for the condition of the detail page link, modifying the link into the detail page link, and acquiring final detail data; the crawler program continuously acquires the URL from the scheduling queue and crawls the next URL; if no URL exists, entering a waiting state; for the Slave end, a downloader executes a downloading task and analyzes an extracted field; the crawler program acquires URL from a scheduling queue of Key of Redis, and then downloads a corresponding webpage; and resolving the response according to the well-defined field rule, and storing the corresponding field into the MongoDB database after the corresponding field is processed by the text duplication removal module until the Key value is null.

5. The method of claim 2, wherein the crawler implementation comprises: the scheduler module is responsible for scheduling the tasks of the whole system and mainly has the following functions: receiving requests sent by the engine; returning URLs to the downloader module; and deduplicating URLs and storing them in the Redis database; each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in a Redis queue; on receiving a request from the engine, the scheduler returns a URL to the downloader; the crawling downloader module integrates the functions of a spider and a downloader; the spider processes the web page information returned by the downloader and extracts the directory URLs and detail page URLs in it; the key fields in the web page information are extracted and then stored in the MongoDB database; the downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider; the system is responsible for crawling the corresponding websites: it first takes the starting URL, extracts URLs after crawling, and returns them to the deduplication module; the scheduling module then distributes URLs from Redis to the Slave nodes;

6. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the network security threat information data set is produced through a distributed threat information crawling system; the method comprises the following steps:

7. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the method for improving the quality of the network security threat information data comprises the following steps:

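for each source $k \in S$, a false positive rate $\phi_k^0$ is generated, whose value is (1 - specificity) and which obeys a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:

$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$

from the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node;

likewise, from the second time node onward, the sensitivity $\phi_k^1$ of each source is replaced by the $\phi_k^1$ of the previous time node;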

θ_f～Beta(β₁,β₀)

t_f～Bernoulli(θ_f)

wherein, if $t_f = 0$, the observation $o_c$ obeys a Bernoulli distribution whose parameter is $\phi^0_{s_c}$, the false positive rate of the source $s_c$;

if $t_f = 1$, the observation $o_c$ obeys a Bernoulli distribution whose parameter is $\phi^1_{s_c}$, the sensitivity of the source $s_c$;

The model solution is as follows:

And

in the above formula: $p$ represents the conditional probability of the entity $o$, the source $s$, the truth label $t$, the prior probability parameter set $\theta$, and the false positive rate and sensitivity parameter sets $\phi^0$, $\phi^1$, given the false positive rate and sensitivity hyperparameters $\alpha_0$, $\alpha_1$ and the prior true probability hyperparameter $\beta$; where $S$ represents the set of all sources, $F$ represents the attack tag set, $f$ represents each attack tag element belonging to $F$, $\theta_f$ denotes the prior probability of $f$, $t_f$ denotes the truth value of $f$, $C_f$ represents the set of observations of $f$, and $c$ represents each observation element in that set;

the following formula is solved:

wherein

where precision represents the accuracy of each source.

8. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the network security entity identification is carried out on the made network security threat intelligence data set; the APT reports are annotated using the BIO labeling method, a sentence in an APT report document being written $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the i-th character in sentence $X$; under the BIO labeling method, identifying the network security entities in sentence $X$ is equivalent to producing a label sequence $L_X = [l]^N$;

9. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the extracting of the network security entity relationship comprises:

$$S = \tanh(B)$$

$$\alpha = \mathrm{softmax}(W^{\top} S)$$

$$r = B \alpha^{\top}$$

10. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the data organization uses a non-relational database, MongoDB, for storage, and stores all data as key-value pairs.