CN113282759B

CN113282759B - Threat information-based network security knowledge graph generation method

Info

Publication number: CN113282759B
Application number: CN202110439459.1A
Authority: CN
Inventors: 李桐; 刘一涛; 刘刚; 王刚; 赵桐; 周小明; 宋进良; 姚羽; 刘扬; 王磊; 李广翱; 陈得丰; 刘莹; 杨智斌; 耿洪碧; 杨巍; 任帅; 陈剑; 李欢; 张彬
Original assignee: State Grid Liaoning Electric Power Co Ltd; Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Current assignee: State Grid Liaoning Electric Power Co Ltd; Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2024-02-20
Anticipated expiration: 2041-04-23
Also published as: CN113282759A

Abstract

The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence. The method comprises the following steps: efficient distributed threat intelligence data collection; producing a network security threat intelligence data set through a distributed threat intelligence crawling system; improving the quality of the network security threat intelligence data; performing network security entity identification on the produced data set; extracting the relations between network security entities; and data organization. Extensive experiments verify that the threat intelligence data quality improvement algorithm and the entity identification and entity relation extraction methods for threat intelligence text provided by the method significantly improve the quality of the generated knowledge graph, and that the method has good local network weakness visualization capability and attack prediction analysis capability.

Description

Threat information-based network security knowledge graph generation method

Technical Field

The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat information.

Background

With the rapid development of network technology, various industries have adopted network technology on a large scale to improve productivity, and network security problems have followed. As the network security situation grows more complex, threat intelligence-driven dynamic network security defense has become a focus of attention in the industry. Threat intelligence is rich in content, highly accurate and timely, and can reflect the attack chain of an entire attack event, so it has very high application and analysis value.

The knowledge graph is used as a comprehensive data integration and organization method, so that attack information can be effectively extracted from massive threat information, and complex behaviors such as reasoning analysis, attack semantic association and the like can be performed on the attack information data. With the continuous updating of threat information, the knowledge-graph network security system based on threat information can realize dynamic defense, and compared with traditional static defense means such as antivirus software, a firewall and the like, the knowledge-graph can sense the network security situation faster and more accurately, so that the overall security of the network is improved, and advanced functions such as attack path prediction, attack tracing, security threat judgment and the like are realized.

In generating a network security knowledge graph from threat intelligence, the difficult research problems are improving the quality of the collected threat intelligence data, reducing its false positive rate, and identifying network security entities and extracting the relations between them from the threat intelligence.

The main problems are as follows:

1. Open source threat intelligence on the network generally suffers from low data quality, a high false positive rate, and missing or erroneous attributes of data entities. Low-quality threat intelligence data inevitably lowers the quality of the generated network security knowledge graph, so that the network security situation cannot be perceived correctly and current network attack behavior is predicted wrongly. Existing data quality improvement algorithms rely mainly on truth discovery algorithms, most of which target single-truth discovery problems. They cannot handle the fact that entities in network security threat intelligence data have multiple true values and that the data is strongly time-varying: traditional truth discovery algorithms assume that the true value does not change over time, and this weak sensitivity to temporal change means that they cannot be applied directly to improving the quality of network security threat intelligence data.

2. Existing entity identification and entity relation extraction methods are mainly based on traditional rule matching, machine learning and, more recently, deep learning; they need a large number of labeled text samples and place high demands on data quality. Although these methods are widely used in other areas of natural language processing, they perform poorly in the network security field, because large-scale, high-quality labeled security entity data is lacking, multiple entity types are mixed in the data, and entity class labels are inconsistent across a text.

At present, no network security entity identification and entity relation extraction method with good effect exists in the network security field.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides a network security knowledge graph generation method based on threat information, which aims to provide a basic model for utilizing and analyzing massive threat information data and realize the aim of predicting an attack means and an attack target of an attacker.

The technical scheme adopted by the invention for achieving the purpose is as follows:

A network security knowledge graph generation method based on threat information comprises the following steps:

step 1, efficient distributed threat intelligence data collection: a distributed threat intelligence data crawling system is built on the Scrapy framework, the crawler programs are scheduled with scrapy-redis to extract structured data, and the data is stored in Redis and MongoDB databases;

step 2, making a network security threat information data set through a distributed threat information crawling system;

step 3, improving the quality of the network security threat information data;

step 4, performing network security entity identification on the produced network security threat intelligence data set;

step 5, extracting the relation of the network security entity;

step 6, data organization.

Further, the efficient distributed threat intelligence data collection includes: distributed crawler system architecture, crawler policy, crawler implementation and data storage.

Further, the distributed crawler system architecture includes: the threat intelligence collection system architecture consists of the distributed crawler system and the deployment of the underlying environment; the distributed crawler system is obtained by modifying the traditional crawler framework Scrapy and adding a Redis database; the underlying environment is a multi-node distributed system running a Docker container cluster, with Kubernetes as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends; the Master end deploys a Redis database to store and schedule the requests to be crawled, each Slave end deploys the crawler main program to crawl web pages and parse and extract data, and every Slave end stores the parsed web page data in the same MongoDB database.

Further, the crawler policy includes: for the Master end, an initial link is first stored in Redis, where the Key identifies the scheduling queue of pages to be crawled next and the URL is generally the link of a page; the crawler is then started, obtains the starting URL from Redis, and downloads the web page corresponding to that URL; the response is parsed according to the defined rules to obtain either page data or detail page links; page data is parsed directly according to the web page format, while for a detail page link the crawler is started again with the link changed to the detail page link to obtain the final detail data; the crawler program then continues to take URLs from the scheduling queue and crawls the next URL; if no URL is available, it enters a waiting state; for the Slave end, the downloader executes the download task and parses and extracts the fields; the crawler program takes a URL from the scheduling queue under the Redis Key and downloads the corresponding web page; the response is parsed according to the defined field rules, the corresponding fields are processed by a text deduplication module and then stored in the MongoDB database, and this repeats until the Key value is empty.

Further, the crawler implementation includes: the scheduler module is responsible for task scheduling of the whole system and mainly has the following functions: accepting requests sent by the engine; deduplicating URLs and storing them in the Redis database; and returning URLs to the downloading module; each crawler subtask passes the URLs it has crawled to the scheduler through the engine, the scheduler deduplicates them and stores them in the Redis queue, and on a request from the engine it returns a URL to the downloader; the crawling downloader module integrates the functions of the spider and the downloader; the downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider; the spider processes the web page information returned by the downloader, extracts the directory URLs and detail page URLs in it, and extracts the key fields and stores them in the MongoDB database; the module is responsible for crawling the corresponding websites: it first takes the starting URL, extracts URLs after crawling and returns them to the deduplication module, after which the scheduling module distributes URLs from Redis to the Slave nodes;

The data storage includes: the storage module implements two functions: URLs are stored in Redis, which is deployed on the Master node; the parsed web page content is stored in a MongoDB database, which is also deployed on the Master node; extracting the stored web page content information is the final goal of the system: the distributed crawlers crawl the web page content so that a data processing program can extract the required information.

Further, producing the network security threat intelligence data set through the distributed threat intelligence crawling system comprises collecting:

(1) Vulnerability data: collected from the main vulnerability publishing platforms; the data types include the type of system in which the vulnerability occurs, the system version and the exploitation method;

(2) APT attack chain data: collected from the APTnotes platform, comprising a total of 528 APT reports from the last 10 years;

(3) Malware text data: including the name, category, common functions, hash and target system platform of the malware in the threat intelligence; this data is collected from the threat intelligence source AlienVault;

(4) Security community discussion data: collected from the Stack Exchange website, consisting of text about recent security events;

(5) Security RSS subscription data: collected from the major network security RSS feeds, consisting of recent network security news.

Further, the method for improving the quality of the network security threat information data comprises the following steps:

step (1) FPR (false positive rate): for each source $k \in S$, generate a corresponding false positive rate $\phi_k^0$, whose value is (1 - specificity), obeying a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:

$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$

hereinafter, from the second time node onward, $\phi_k^0$ is replaced with the $\phi_k^0$ of the previous time node, and the second-order Markov property is used to calibrate the time-varying characteristics of the truth discovery model;

step (2) Sensitivity: for each source $k \in S$, generate a corresponding sensitivity $\phi_k^1$, obeying a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior true positive sample count of each source and $\alpha_{1,0}$ is the prior false negative sample count of each source:

$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$

from the second time node onward, $\phi_k^1$ is replaced with the $\phi_k^1$ of the previous time node, and the second-order Markov property is used to calibrate the time-varying characteristics of the truth discovery model;

step (3) Attack tag: for each attribute of each entity, $f \in F$, where F is the set of observed values of all attributes under the entity, generate a prior truth probability $\theta_f$ obeying a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct samples of the entity attribute and $\beta_0$ is the prior count of erroneous samples of the entity attribute:

$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$

from the second time node onward, $\theta_f$ is replaced with the $\theta_f$ of the previous time node, and the second-order Markov property is used to calibrate the time-varying characteristics of the truth discovery model;

step (4) Truth label: generate the truth label of each entity attribute, that is, whether the observed value is correct; $t_f$ is the attribute truth label and obeys a Bernoulli distribution with parameter $\theta_f$, where $t_f$ is a binary Boolean variable and the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:

$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$

step (5) Observation: entity attribute observation labels; for each entity attribute observation $c \in C_f$, its source is denoted $s_c$; the observation label $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:

$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$

wherein if $t_f = 0$, $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^0$, the false positive rate of source $s_c$;

if $t_f = 1$, $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^1$, the sensitivity of source $s_c$.
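The five generative steps can be exercised by drawing synthetic data from them. The following is a minimal sketch under stated assumptions, not the patented algorithm: every source is assumed to report on every attack tag, the time-varying calibration is left out, and the function name, the argument layout and the hyperparameter values are illustrative only.

```python
import random


def sample_observations(sources, facts, alpha0, alpha1, beta, seed=0):
    """Draw one synthetic data set from generative steps (1) to (5).

    alpha0 = (a01, a00): prior false positive / true negative counts.
    alpha1 = (a11, a10): prior true positive / false negative counts.
    beta   = (b1, b0):   prior correct / erroneous counts for an attribute.
    """
    rng = random.Random(seed)
    fpr = {k: rng.betavariate(*alpha0) for k in sources}    # step (1)
    sens = {k: rng.betavariate(*alpha1) for k in sources}   # step (2)
    truth, observations = {}, []
    for f in facts:
        theta = rng.betavariate(*beta)                      # step (3)
        truth[f] = 1 if rng.random() < theta else 0         # step (4)
        for k in sources:                                   # step (5)
            p = sens[k] if truth[f] == 1 else fpr[k]
            observations.append((f, k, 1 if rng.random() < p else 0))
    return fpr, sens, truth, observations


# Assumed toy priors: sources rarely assert false values and usually
# assert true ones.
fpr, sens, truth, observations = sample_observations(
    ["s1", "s2"], ["f1", "f2", "f3"], (1, 9), (9, 1), (1, 1))
```

Each tuple in `observations` is (attack tag, source, observed label), which is the shape of input a truth discovery solver would consume.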

The model solution is as follows:

the conditional probability of the model given the observations c of each entity attribute is as follows:

in the above formula: p denotes the conditional probability that the observed value of entity o is c, given the prior truth probability $\theta_f$ and the source quality parameters $\phi_{s_c}^0$ and $\phi_{s_c}^1$; where c is the observed value, f is the attack tag, and $s_c$ denotes the source from which observation c originates;

the complete likelihood function containing all variables and super-parameters is written as:

in the above formula: p denotes the joint conditional probability of the entity observations o, the sources s, the truth labels t, the prior probability parameter set $\theta$ and the source quality parameter sets $\phi^0$ and $\phi^1$, given the false positive rate hyperparameter $\alpha_0$, the sensitivity hyperparameter $\alpha_1$ and the prior truth probability hyperparameter $\beta$; where S denotes the set of all sources, F denotes the set of attack tags, f denotes each attack tag element of F, $\theta_f$ denotes the prior probability of f, $t_f$ denotes the truth value of f, $C_f$ denotes the set of observations of f, and c denotes each observation element in that set;
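The conditional probability and the complete likelihood described above follow from generative steps (1) to (5) by marginalising the truth label and by multiplying the stated priors and conditionals. The block below is a reconstruction under that assumption, writing $\phi_s^0$ for the false positive rate and $\phi_s^1$ for the sensitivity of source s; it is not copied from the original formula images, and the original notation may differ.

```latex
% Marginalising the truth label t_f out of step (5):
p(o_c \mid \theta_f, \phi_{s_c}^{0}, \phi_{s_c}^{1})
  = (1 - \theta_f)\, p(o_c \mid \phi_{s_c}^{0})
  + \theta_f\, p(o_c \mid \phi_{s_c}^{1})

% Product of all priors and conditionals of steps (1) to (5):
p(o, s, t, \theta, \phi^{0}, \phi^{1} \mid \alpha_0, \alpha_1, \beta)
  = \prod_{s \in S} p(\phi_s^{0} \mid \alpha_0)\, p(\phi_s^{1} \mid \alpha_1)
    \prod_{f \in F} \Big( p(\theta_f \mid \beta)\,
      \theta_f^{t_f} (1 - \theta_f)^{1 - t_f}
      \prod_{c \in C_f} p(o_c \mid \phi_{s_c}^{t_f}) \Big)
```

The factor $\theta_f^{t_f}(1-\theta_f)^{1-t_f}$ is the Bernoulli probability of the truth label from step (4).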

given observation data of the attribute, solving the likelihood function by using a Gibbs Sampling algorithm in the MCMC algorithm:

$t_{map}$ denotes the result of maximum a posteriori estimation applied to the above formula, and the remaining parameters have the same meaning as the parameters of the same name above;

the following formula solution is obtained:

wherein: p denotes the conditional probability that the truth value $t_f$ of attack tag f equals i, given $t_{-f}$, the entity observations o and the sources s; i is the value of the truth label of attack tag f and takes a value in {0, 1}; $t_{-f}$ is the set of truth values of all elements of F other than f;

the count term in the formula denotes the number of observations of source $s_c$ whose observed value is j, whose attack tag is not f and whose truth label is i; $C_{-f}$ denotes the attack tag set without attack tag f, $c'$ is each element of the set $C_{-f}$, and the corresponding term denotes the truth value when the value of f is $c'$; the remaining parameters have the same meaning as the parameters of the same name above;

after $p(t_f = i \mid t_{-f}, o, s)$ is obtained, the FPR (false positive rate) and the Sensitivity at the next time are estimated, and the following solutions are obtained:

wherein the count term denotes the sum, over the observation set $C_f$ of every attack tag f, of the probabilities that the truth label takes the value i for the source $s_c$ that makes observation c with attack tag value $o_f = j$ of entity o, where $i \in \{0, 1\}$, $j \in \{0, 1, \ldots, |F|\}$ and $|F|$ is the number of elements of the attack set F; the remaining parameters have the same meaning as the parameters of the same name above; finally, the precision of each source can be estimated as well:

where precision represents the accuracy of each source.
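The Gibbs sampling step can be illustrated with a small collapsed sampler. This is a sketch under assumptions rather than the patented procedure: it uses the update commonly given for latent truth models, in which $p(t_f = i \mid t_{-f}, o, s)$ is proportional to $\beta_i$ times a product over the observations of f of (count + prior) / (total count + total prior); it omits the time-varying Markov calibration; and the data layout, function name and prior values are hypothetical.

```python
import random
from collections import defaultdict


def gibbs_truth(claims, alpha, beta, iters=200, burn_in=50, seed=0):
    """Estimate P(t_f = 1) for every attack tag f by collapsed Gibbs sampling.

    claims: list of (fact, source, observation), observation in {0, 1};
            each source is assumed to report on a fact at most once.
    alpha[i][j]: prior count of observation j from a source when the truth is i.
    beta[i]:     prior count of facts whose truth label is i.
    """
    rng = random.Random(seed)
    by_fact = defaultdict(list)
    for f, s, o in claims:
        by_fact[f].append((s, o))
    t = {f: rng.randint(0, 1) for f in by_fact}
    n = defaultdict(int)                      # n[(source, truth, observation)]
    for f, pairs in by_fact.items():
        for s, o in pairs:
            n[(s, t[f], o)] += 1
    hits = defaultdict(int)
    for it in range(iters):
        for f, pairs in by_fact.items():
            for s, o in pairs:                # remove f's own claims from the counts
                n[(s, t[f], o)] -= 1
            weight = []
            for i in (0, 1):                  # unnormalised p(t_f = i | t_-f, o, s)
                w = beta[i]
                for s, o in pairs:
                    total = n[(s, i, 0)] + n[(s, i, 1)] + alpha[i][0] + alpha[i][1]
                    w *= (n[(s, i, o)] + alpha[i][o]) / total
                weight.append(w)
            t[f] = 1 if rng.random() * (weight[0] + weight[1]) < weight[1] else 0
            for s, o in pairs:                # add the claims back under the new label
                n[(s, t[f], o)] += 1
            if it >= burn_in:
                hits[f] += t[f]
    return {f: hits[f] / (iters - burn_in) for f in by_fact}


# Toy data: three sources all assert A and B (observation 1) and all deny C and D.
claims = [(f, s, o) for f, o in (("A", 1), ("B", 1), ("C", 0), ("D", 0))
          for s in ("s1", "s2", "s3")]
alpha = {0: (9, 1), 1: (1, 9)}   # assumed priors: low false positive rate, high sensitivity
beta = (1, 1)
probs = gibbs_truth(claims, alpha, beta)
```

With these priors the tags that every source asserts receive a posterior truth probability close to 1 and the ones every source denies receive a probability close to 0. Source false positive rate, sensitivity and precision could then be estimated from the same counts.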

Further, performing network security entity identification on the produced network security threat intelligence data set means labeling the APT reports with the BIO labeling method; let a sentence in an APT report document be $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the i-th character of sentence X; under the BIO labeling method, identifying the network security entities in sentence X is equivalent to producing a label sequence $L_X = [l]^N$;
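As an illustration of the BIO scheme, the sketch below converts annotated entity spans into one label per token. It works on word tokens for readability, whereas the text above labels characters; the sentence, the label names `MALWARE` and `SYSTEM` and the helper name are hypothetical.

```python
def bio_tags(tokens, entities):
    """Return one BIO tag per token.

    entities: list of (start, end, label) token spans, end exclusive,
              assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label            # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label            # tokens inside the entity
    return tags


tokens = ["The", "group", "used", "Cobalt", "Strike", "against", "Windows", "hosts"]
entities = [(3, 5, "MALWARE"), (6, 7, "SYSTEM")]  # hypothetical label set
tags = bio_tags(tokens, entities)
```

Here `tags` plays the role of the label sequence $L_X$ for the sentence X.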

The labeled APT report documents are used to train a BiLSTM-CRF model; the BiLSTM extracts the word features before the i-th character through a forward pass and, at the same time, the word features after the i-th character through a backward pass; the CRF model gives the conditional probability distribution of a set of output random variables given a set of input random variables;

the CRF model is: given an input sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, let S be the output score matrix of the BiLSTM network, of dimension $N \times K$, where K is the number of label categories and $S_{i,j}$ is the score of the j-th label for the i-th word; the score Z of a predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is then defined as:

where T is the transition matrix of dimension K+2; the probability of the generated label sequence y is:

the log-likelihood of the correct labeling is then maximized by maximum likelihood estimation:
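The formula images for the score, the sequence probability and the log-likelihood are not available here. The sketch below assumes the usual linear-chain form used with BiLSTM-CRF taggers, $Z(X, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} S_{i, y_i}$ with $p(y \mid X) = e^{Z(X, y)} / \sum_{y'} e^{Z(X, y')}$, where $y_0$ and $y_{N+1}$ are start and end labels (which is why T has dimension K+2). The index convention for the two extra labels and all numeric values are assumptions.

```python
import itertools
import math


def sequence_score(S, T, y):
    """Score of label sequence y: transition scores plus emission scores.

    S: N x K emission scores from the BiLSTM; T: (K+2) x (K+2) transition
    scores, where index K is the start label and K+1 the end label
    (an assumed convention).
    """
    K = len(S[0])
    path = [K] + list(y) + [K + 1]
    transitions = sum(T[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(S[i][label] for i, label in enumerate(y))
    return transitions + emissions


def log_likelihood(S, T, y):
    """log p(y | X), with the normaliser computed by brute-force enumeration."""
    K = len(S[0])
    scores = [sequence_score(S, T, cand)
              for cand in itertools.product(range(K), repeat=len(S))]
    m = max(scores)                           # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in scores))
    return sequence_score(S, T, y) - log_norm


S = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]                    # N = 3 words, K = 2 labels
T = [[0.1 * (a - b) for b in range(4)] for a in range(4)]   # arbitrary toy values
total = sum(math.exp(log_likelihood(S, T, y))
            for y in itertools.product(range(2), repeat=3))
```

Enumerating every label sequence is exponential in N and is used only to keep the sketch short; practical implementations compute the normaliser with the forward algorithm.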

Further, extracting the relations between network security entities includes:

the network security entity relation extraction adopts an attention-based BiLSTM (Att-BiLSTM) model, which comprises an input layer, a word embedding layer, a BiLSTM layer, an Attention layer and an output layer;

wherein the word embedding layer represents a sentence in the APT report, $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, as a matrix; words with similar meanings lie close together in the embedding space, which indicates that a relation may exist between them;

wherein the Attention layer introduces weighting in order to highlight the importance of part of the output;

wherein the output of the BiLSTM layer is $B = [b]^T = [b_1, \ldots, b_j, \ldots, b_T]$ and the parameter matrix W satisfies the following formulas:

$$S = \tanh(B)$$

$$\alpha = \mathrm{softmax}(W^{T} S)$$

$$r = B \alpha^{T}$$

$\alpha$ is the attention weight coefficient and r is the weighted sum of the BiLSTM output B; the representation vector $B^{*} = \tanh(r)$ is finally generated by a nonlinear function, $B^{*}$ is then fed into a fully connected neural network that maps it to the label vector, and the predicted label is obtained through a softmax function.

Further, the data organization uses the non-relational database MongoDB for storage and stores all data in the form of key-value pairs.

The invention has the following beneficial effects and advantages:

The invention provides a basic model for utilizing and analyzing massive threat intelligence data. It adapts existing data quality improvement algorithms to network security threat intelligence data, improving the quality of the collected data and reducing its false positive rate. It also adapts existing entity identification and entity relation extraction methods to the characteristics of threat intelligence data, improving the accuracy and efficiency of network security entity identification and security entity relation extraction, and generates a threat intelligence network security knowledge graph of higher data quality. In addition, the invention uses the data reasoning capability of the network security knowledge graph to study an attack graph visualization method that combines the knowledge graph with the local network topology.

The method of the invention first improves the quality of threat intelligence data according to the characteristics of network security threat intelligence data, reducing its false positive rate and improving the overall data quality; next, existing entity identification and entity relation extraction methods are adapted to the characteristics of threat intelligence so as to generate a high-quality threat intelligence knowledge graph; then, recent threat intelligence is combined with local network topology data to perform association analysis of local network weaknesses, giving a visual display of the security weaknesses in the local network topology; finally, an attack prediction method combining the network security knowledge graph with traffic analysis of the inspection building is provided to predict the attack means and attack target of an attacker. Extensive experiments verify that, with the threat intelligence data quality improvement algorithm and the entity identification and entity relation extraction methods for threat intelligence text provided by the method, the generated knowledge graph is of higher quality than with existing methods, and that the method has good local network weakness visualization capability and attack prediction analysis capability.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a process diagram of a threat intelligence based network security knowledge graph generation method of the present invention;

FIG. 2 is a diagram of a distributed crawler architecture for threat intelligence data collection in accordance with the present invention;

FIG. 3 is a probabilistic graphical model of a threat intelligence data quality enhancement algorithm in accordance with the present invention;

FIG. 4 is a schematic diagram of an atomic attack entity and its relationship defined in the present invention;

FIG. 5 is a schematic diagram of the BiLSTM-CRF model structure for network security entity identification in the present invention;

FIG. 6 is a schematic diagram of the Att-BiLSTM model structure for network security entity relationship extraction in the present invention;

FIG. 7 is a data collection time diagram of a distributed crawler system for threat intelligence data collection developed in the present invention;

FIG. 8 is a graph comparing the effects of a distributed crawler system with a stand-alone crawler system for threat intelligence data collection developed in the present invention;

FIG. 9 is a diagram showing an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

The following describes some embodiments of the present invention with reference to fig. 1-9.

Example 1

The invention relates to a threat information-based network security knowledge graph generation method; FIG. 1 is a process diagram of the method. The generation process of the network security knowledge graph comprises: efficient distributed threat intelligence data collection, production of the network security data set, quality improvement of the network security threat intelligence data, network security entity identification, network security entity relation extraction and data organization. Each step is described in detail below:

Step 1: efficient distributed threat intelligence data collection.

Generating the network security knowledge graph requires a large amount of network security threat intelligence data. To collect open source threat intelligence data from the network efficiently and in real time, the following distributed crawler system is implemented. The distributed threat intelligence data crawling system is built on the Scrapy framework; the crawler programs are scheduled with scrapy-redis to extract structured data, which is stored in Redis and MongoDB databases.

(1) Distributed crawler system architecture: the threat intelligence collection system architecture consists of the distributed crawler system and the deployment of the underlying environment. The distributed crawler system is obtained by modifying the traditional crawler framework Scrapy and adding a Redis database, which solves the problem that Scrapy does not natively support distributed crawling. The underlying environment is a multi-node distributed system running a Docker container cluster, with an already established Kubernetes installation as the cluster management tool. The distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends: the Master end deploys a Redis database to store and schedule the requests to be crawled, each Slave end deploys the crawler main program to crawl web pages and parse and extract data, and every Slave end then stores the parsed web page data in the same MongoDB database. FIG. 2 is a diagram of the distributed crawler architecture for threat intelligence data collection in the present invention. Each threat intelligence data item to be crawled is first stored in the Redis database, and the crawler engine uses the scheduler to schedule it; when an item is scheduled, the corresponding crawler program (spider) and its middleware are started to download the threat intelligence data for that item.
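A Scrapy project is typically pointed at a shared Redis scheduler through its settings module. The fragment below is a configuration sketch only: the four scrapy-redis setting names are those of the scrapy-redis package, while the project name, the host names and the two MongoDB keys are placeholders assumed for illustration and do not come from the patent.

```python
# settings.py (sketch): values marked "assumed" are illustrative placeholders.

BOT_NAME = "threat_intel_crawler"  # assumed project name

# Let scrapy-redis replace Scrapy's scheduler and duplicate filter so that
# every Slave node pulls requests from the same Redis queue on the Master.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and the fingerprint set in Redis between runs.
SCHEDULER_PERSIST = True

# Redis instance deployed on the Master node (assumed host name and port).
REDIS_URL = "redis://master-node:6379"

# Custom keys read by a project-specific item pipeline (hypothetical names;
# neither Scrapy nor scrapy-redis defines them).
MONGO_URI = "mongodb://master-node:27017"
MONGO_DATABASE = "threat_intel"
```

Every Slave container would ship the same settings, so adding crawl capacity amounts to starting more containers against the same Redis instance.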

(2) Crawler policy: for the Master end, an initial link is first stored in Redis, where the Key identifies the scheduling queue of pages to be crawled next and the URL is generally the link of a page. The crawler is then started, obtains the starting URL from Redis, and downloads the web page corresponding to that URL. The response is parsed according to the defined rules to obtain either page data or detail page links: page data is parsed directly according to the web page format, while for a detail page link the crawler is started again with the link changed to the detail page link to obtain the final detail data. The crawler then continues to take URLs from the scheduling queue and crawls the next URL. If no URL is available, it enters a waiting state. For the Slave end, the downloader executes the download task and parses and extracts the fields. The crawler program takes a URL from the scheduling queue under the Redis Key and downloads the corresponding web page. The response is parsed according to the defined field rules, and the corresponding fields are processed by a text deduplication module and then stored in the MongoDB database. This repeats until the Key value is empty.
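The crawl loop described in the policy can be simulated in memory. In the sketch below a `deque` stands in for the Redis scheduling queue, a list stands in for the MongoDB collection, and `FAKE_SITE` is a hypothetical site with one list page and two detail pages; none of these names come from the patent.

```python
from collections import deque

# Hypothetical site: the list page links to detail pages, detail pages carry data.
FAKE_SITE = {
    "https://example.org/list": {"links": ["https://example.org/d/1",
                                           "https://example.org/d/2"]},
    "https://example.org/d/1": {"data": {"id": "item-1"}},
    "https://example.org/d/2": {"data": {"id": "item-2"}},
}


def crawl(start_url, site):
    """Drain a scheduling queue in the manner the crawler policy describes."""
    queue = deque([start_url])    # stands in for the Redis list under one Key
    stored = []                   # stands in for the MongoDB collection
    while queue:                  # an empty queue ends the loop (a real Slave would wait)
        url = queue.popleft()
        response = site.get(url, {})
        if "data" in response:                    # page data: parse and store directly
            stored.append({"url": url, **response["data"]})
        for link in response.get("links", []):    # detail links: schedule them next
            queue.append(link)
    return stored


records = crawl("https://example.org/list", FAKE_SITE)
print(records)
```

In a deployment the queue operations would be Redis list commands issued by several Slave processes at once, which is what lets the crawl run in parallel.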

(3) Crawler implementation: the scheduler module is responsible for task scheduling of the whole system and mainly has the following functions: accepting requests sent by the engine; deduplicating URLs and storing them in the Redis database; and returning URLs to the downloading module. Each crawler subtask passes the URLs it has crawled to the scheduler through the engine, and the scheduler deduplicates them and stores them in the Redis queue. On a request from the engine, the scheduler returns a URL to the downloader. The crawling downloader module integrates the functions of the spider and the downloader. The downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider. The spider processes the web page information returned by the downloader and extracts the directory URLs and detail page URLs in it; the key fields in the web page information are extracted and then stored in the MongoDB database. The module is responsible for crawling the corresponding websites: it first takes the starting URL, extracts URLs after crawling and returns them to the deduplication module, after which the scheduling module distributes URLs from Redis to the Slave nodes.
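The scheduler's deduplicate-then-enqueue behaviour can be sketched with a fingerprint set and a queue. This is an in-memory stand-in, assuming SHA-1 fingerprints over a trivially normalised URL; the class name and the normalisation rule are hypothetical, and in a deployment both structures would live in Redis.

```python
import hashlib
from collections import deque


class DedupScheduler:
    """In-memory stand-in for the Redis-backed scheduler described above."""

    def __init__(self):
        self._seen = set()       # fingerprints of URLs already enqueued
        self._queue = deque()    # pending URLs, first in first out

    @staticmethod
    def fingerprint(url):
        # Normalise trivially (strip whitespace and trailing slash), then hash.
        normalised = url.strip().rstrip("/")
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    def enqueue(self, url):
        """Store the URL unless an equivalent one was seen; return True if stored."""
        fp = self.fingerprint(url)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the next URL for the downloader, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


scheduler = DedupScheduler()
results = [scheduler.enqueue(u) for u in (
    "https://example.org/a", "https://example.org/a/", "https://example.org/b")]
```

The second URL differs from the first only by a trailing slash, so it is rejected as a duplicate.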

(4) Data storage: the storage module only needs to implement two functions. URLs are stored in Redis, which is deployed on the Master node. The parsed web page content is stored in a MongoDB database, which is also deployed on the Master node. Extracting the stored web page content information is the final goal of the system: the distributed crawlers crawl the web page content and hand it to the data processing program, which extracts the required information.
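A parsed page might be stored as a key-value document along the following lines. The field names, the use of a URL hash as `_id` and the in-memory dictionary standing in for a MongoDB collection are all assumptions made for this sketch.

```python
import hashlib
import json
from datetime import datetime, timezone


def to_document(url, fields):
    """Wrap parsed fields in a key-value document (field names are assumed)."""
    return {
        "_id": hashlib.sha1(url.encode("utf-8")).hexdigest(),  # stable key per URL
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "fields": dict(fields),
    }


collection = {}  # stands in for a MongoDB collection keyed by _id
doc = to_document("https://example.org/d/1", {"title": "sample advisory"})
collection[doc["_id"]] = doc   # re-crawling the same URL overwrites the entry
print(json.dumps(doc, indent=2))
```

Keying the document by a hash of its URL makes repeated crawls of the same page idempotent at the storage layer.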

Step 2, constructing the network security threat information data set.

The network security data set is obtained by collecting the following 5 types of threat intelligence data using the distributed threat intelligence crawling system of step 1:

(1) Vulnerability data: vulnerability data is collected from the main vulnerability publishing platforms, such as CVE and NVD. The data includes the type of system in which the vulnerability occurs, the system version, the exploitation method, and other data.

(2) APT (advanced persistent threat) attack chain data: the APT attack chain data is collected from the APTnites platform and comprises 528 APT reports from the last 10 years. Of these, 50 reports are manually annotated using the BIO labeling method; 40 of them are used to train the deep learning models for entity recognition and entity relation extraction, and the remaining 10 are used to test the models.

(3) Malware text data: this data includes the name, category, common functions, hash, and system platform of the malware described in threat intelligence. This portion of the data is collected from the threat intelligence source AlienVault.

(4) Security community discussion data: this portion of the data is collected from the Stackexchange website and consists primarily of text in which security researchers discuss recent security events.

(5) Security RSS subscription data: this portion of the data is collected from the major network security RSS feeds and consists primarily of recent network security news.

Step 3, improving the quality of the network security threat information data.

After the network security threat information data set is produced, the quality of the threat information data needs to be improved and its false positive rate reduced, so that a high-quality network security knowledge graph can be generated subsequently.

To handle the time-varying characteristics of threat information, the invention improves an existing truth discovery algorithm by introducing the Markov property, making it suitable for threat information. FIG. 3 is the probabilistic graphical model of the threat information data quality improvement algorithm of the invention. In the figure, $M_i$ represents the set of model parameters at the i-th moment, and $C_i$ represents the prior parameters of the model $M_i$ at the i-th moment, where $i = 1, 2, \ldots, N$; the remaining parameters are as described herein.

The threat information data quality improvement algorithm model provided by the invention comprises the following steps:

Step (1) FPR (False Positive Rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated. Its value is (1 - specificity), and it follows a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:

$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$

From the second time node onward, $\phi_k^0$ is replaced with the $\phi_k^0$ of the previous time node, so that the truth discovery model is calibrated for time-varying characteristics using second-order Markov.

Step (2) Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated. It follows a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior true positive sample count of each source and $\alpha_{1,0}$ is the prior false negative sample count of each source:

$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$

Similar to the FPR, from the second time node onward, $\phi_k^1$ is replaced with the $\phi_k^1$ of the previous time node, so that the truth discovery model is calibrated for time-varying characteristics using second-order Markov.

Step (3) Attack tag: for each attribute $f \in F$ of an entity, where F is the set of observations of all the attributes under that entity (i.e., the collected set of values), a prior truth probability $\theta_f$ is generated. It follows a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct samples of the entity attribute and $\beta_0$ is the prior count of incorrect samples of the entity attribute:

$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$

Similar to the FPR and Sensitivity above, from the second time node onward, $\theta_f$ is replaced with the $\theta_f$ of the previous time node, so that the truth discovery model is calibrated for time-varying characteristics using second-order Markov.

Step (4) Truth label: a truth label is generated for each entity attribute, indicating whether the observed value is correct. The attribute truth label $t_f$ follows a Bernoulli distribution with parameter $\theta_f$, where $t_f$ is a binary Boolean variable and the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:

$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$

Step (5) Observation: entity attribute observation labels. For each entity attribute observation $c \in C_f$, its source is denoted $s_c$. The observation label $o_c$ of c follows a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:

$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
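The five generative steps can be read as a sampling procedure. The sketch below draws one synthetic data set from them using only the standard library. The function name `sample_truth_model`, the argument layout and the example hyperparameter values are assumptions made for illustration, and the time-node replacement of the parameters is not modelled.

```python
import random


def sample_truth_model(sources, attributes, claims, alpha0, alpha1, beta, seed=0):
    """Draw one sample from the five generative steps.

    claims maps each attribute f to the list of sources that report it.
    alpha0 = (a01, a00), alpha1 = (a11, a10), beta = (b1, b0).
    """
    rng = random.Random(seed)
    # Steps (1) and (2): per-source false positive rate and sensitivity.
    phi0 = {k: rng.betavariate(alpha0[0], alpha0[1]) for k in sources}
    phi1 = {k: rng.betavariate(alpha1[0], alpha1[1]) for k in sources}
    theta, truth, obs = {}, {}, {}
    for f in attributes:
        # Step (3): prior truth probability of attribute f.
        theta[f] = rng.betavariate(beta[0], beta[1])
        # Step (4): truth label t_f ~ Bernoulli(theta_f).
        truth[f] = 1 if rng.random() < theta[f] else 0
        # Step (5): a source asserts the value with probability phi1 when
        # the attribute is true and phi0 when it is false.
        obs[f] = []
        for s in claims[f]:
            p = phi1[s] if truth[f] == 1 else phi0[s]
            obs[f].append((s, 1 if rng.random() < p else 0))
    return phi0, phi1, theta, truth, obs


phi0, phi1, theta, truth, obs = sample_truth_model(
    sources=["s1", "s2"],
    attributes=["f1", "f2"],
    claims={"f1": ["s1", "s2"], "f2": ["s1"]},
    alpha0=(1, 9),   # few prior false positives
    alpha1=(9, 1),   # many prior true positives
    beta=(1, 1),
)
print(truth, obs)
```

Sampling forward like this is a convenient way to produce test data with known truth labels before running inference in the opposite direction.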

The model is solved as follows. From the above description, the conditional probability of the model given the observation c of each entity attribute is as follows:

In the above formula, p represents the conditional probability that the observed value of the entity o is c, given the prior truth probability $\theta_f$ and the source parameters $\phi_{s_c}^0$ and $\phi_{s_c}^1$, where c is the observed value, f is the attack tag, and $s_c$ is the source in which observation c occurs;
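The formula itself is not reproduced in the text. As a hedged reconstruction, assume the symbols $\phi^0_{s_c}$ and $\phi^1_{s_c}$ for the false positive rate and sensitivity of source $s_c$ and $o_c \in \{0, 1\}$ for the observation label (these symbol names are assumptions). Marginalizing the truth label $t_f \sim \mathrm{Bernoulli}(\theta_f)$ out of the generative steps then gives a two-component mixture:

```latex
p\left(o_c \mid \theta_f, \phi^0_{s_c}, \phi^1_{s_c}\right)
  = \theta_f \, p\left(o_c \mid \phi^1_{s_c}\right)
  + (1 - \theta_f) \, p\left(o_c \mid \phi^0_{s_c}\right),
\qquad
p\left(o_c \mid \phi^i_{s_c}\right)
  = \left(\phi^i_{s_c}\right)^{o_c} \left(1 - \phi^i_{s_c}\right)^{1 - o_c}
```

This follows directly from steps (4) and (5) and may differ in notation from the formula in the original drawings.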

the complete likelihood function containing all variables and super-parameters can be written as:

In the above formula, p represents the joint probability of the entity observations o, the sources s, the truth labels t, the prior probability parameter set $\theta$, and the source parameter sets $\phi^0$ and $\phi^1$, conditioned on the false positive rate hyperparameter $\alpha_0$, the sensitivity hyperparameter $\alpha_1$, and the prior truth probability hyperparameter $\beta$. Here S represents the set of all sources, F represents the set of attack tags, f represents each attack tag element belonging to F, $\theta_f$ represents the prior probability of f, $t_f$ represents the truth value of f, $C_f$ represents the set of observations of f, and c represents each observation element in the set of observations.
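The likelihood is likewise not reproduced in the text. Under the same assumed symbols ($\phi^0_k$, $\phi^1_k$ for the per-source false positive rate and sensitivity, $o_c$ for the observation label), multiplying the five generative steps together gives one plausible form of the joint distribution, offered as a sketch rather than as the original formula:

```latex
p\left(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta\right)
  = \prod_{k \in S} p\left(\phi^0_k \mid \alpha_0\right) p\left(\phi^1_k \mid \alpha_1\right)
    \times \prod_{f \in F} \left[ p\left(\theta_f \mid \beta\right) \,
      \theta_f^{\,t_f} (1 - \theta_f)^{1 - t_f}
      \prod_{c \in C_f} p\left(o_c \mid \phi^{t_f}_{s_c}\right) \right]
```

The first product covers steps (1) and (2), the bracket covers steps (3) to (5) for each attack tag f.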

Given the observation data of the attributes, the likelihood function can be solved using Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm:

$t_{map}$ denotes the result obtained by maximum a posteriori estimation on the above formula; the remaining parameters have the same meaning as the parameters of the same names above.

The following formula is obtained:

In the above formula, p denotes the conditional probability that the truth value $t_f$ of f is i, given the parameter $t_{-f}$, the entity o, and the source s; i represents the attack tag value of f and takes values in {0, 1}; $t_{-f}$ is the set of truth values of all elements of F other than f.

The count term denotes the number of observations from source $s_c$ whose observed value is j, whose attack tag is not f, and whose truth label is i. $C_{-f}$ represents the attack tag set without the attack tag f, c' is each element in the set $C_{-f}$, and the corresponding truth term denotes the truth value when f takes the value c'; the remaining parameters have the same meaning as the parameters of the same names above.
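The sampling formula is not reproduced in the text, so the following is only a sketch of how such a sampler might look. It assumes a conditional of the collapsed Beta-Bernoulli form used by latent truth models, in which the weight of $t_f = i$ is $\beta_i$ times, for each observation of f, the ratio (count with label i and that observed value, plus prior) over (count with label i, plus priors), with all counts excluding f. The function name `gibbs_truth`, the data layout and the iteration counts are hypothetical, each source is assumed to report on an attribute at most once, and the time-node update is omitted.

```python
import random
from collections import defaultdict


def gibbs_truth(observations, alpha, beta, iters=200, burn_in=50, seed=0):
    """observations: list of (attribute f, source s, observed label o in {0,1}).
    alpha[i][j]: prior count for truth label i and observed value j.
    beta[i]: prior count for truth label i.
    Returns the estimated probability that each attribute is true.
    """
    rng = random.Random(seed)
    by_attr = defaultdict(list)
    for f, s, o in observations:
        by_attr[f].append((s, o))
    # Random initial truth labels and the matching counts n[s][i][j].
    t = {f: rng.randint(0, 1) for f in by_attr}
    n = defaultdict(lambda: [[0, 0], [0, 0]])
    for f, claims in by_attr.items():
        for s, o in claims:
            n[s][t[f]][o] += 1
    hits = {f: 0 for f in by_attr}
    for it in range(iters):
        for f, claims in by_attr.items():
            # Remove attribute f from the counts (the "-f" statistics).
            for s, o in claims:
                n[s][t[f]][o] -= 1
            # Unnormalized conditional weights for t_f = 0 and t_f = 1.
            w = [beta[0], beta[1]]
            for i in (0, 1):
                for s, o in claims:
                    num = n[s][i][o] + alpha[i][o]
                    den = n[s][i][0] + n[s][i][1] + alpha[i][0] + alpha[i][1]
                    w[i] *= num / den
            t[f] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
            # Put attribute f back with its newly sampled label.
            for s, o in claims:
                n[s][t[f]][o] += 1
            if it >= burn_in:
                hits[f] += t[f]
    return {f: hits[f] / (iters - burn_in) for f in by_attr}


obs = [("x", "s1", 1), ("x", "s2", 1), ("x", "s3", 1),
       ("y", "s1", 0), ("y", "s2", 0), ("y", "s3", 0)]
# alpha[0] = (true negatives, false positives), alpha[1] = (false negatives, true positives)
print(gibbs_truth(obs, alpha=[[90, 10], [10, 90]], beta=[1, 1]))
```

Averaging the sampled labels after burn-in gives a per-attribute truth probability; thresholding it at 0.5 yields a maximum a posteriori style decision.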

After $p(t_f = i \mid t_{-f}, o, s)$ is obtained, the FPR (false positive rate) and the sensitivity at the next moment can be estimated, which are solved as follows:

Precision denotes the precision of each source; the remaining parameters have the same meaning as the parameters of the same names above.
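The estimation formulas are not reproduced in the text. One common choice, assumed here, is to take posterior means under the Beta priors, that is, to add the prior counts to the expected confusion counts of each source. The function name `source_quality` and the default prior values are hypothetical.

```python
def source_quality(tp, fp, tn, fn, alpha0=(1.0, 1.0), alpha1=(1.0, 1.0)):
    """Smoothed quality estimates for one source.

    alpha0 = (prior false positives, prior true negatives)
    alpha1 = (prior true positives, prior false negatives)
    """
    fpr = (fp + alpha0[0]) / (fp + tn + alpha0[0] + alpha0[1])
    sensitivity = (tp + alpha1[0]) / (tp + fn + alpha1[0] + alpha1[1])
    # Precision uses the same smoothed positive counts.
    precision = (tp + alpha1[0]) / (tp + alpha1[0] + fp + alpha0[0])
    return fpr, sensitivity, precision


fpr, sensitivity, precision = source_quality(tp=8, fp=1, tn=9, fn=2)
print(round(fpr, 3), round(sensitivity, 3), round(precision, 3))  # 0.167 0.75 0.818
```

Feeding these estimates back in as the parameters of the next time node is one way to realize the time-node replacement described in steps (1) and (2).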

The entities and relationships are defined as follows:

First, the concepts of network security entities and of the relationships between entities are defined. A knowledge graph reflects specific information and the associations between pieces of information, and entities are abstract expressions of concepts, with relationships linking those concepts, so good entity definitions help the knowledge graph express the information and relationships it contains clearly. The invention describes network security entities in terms of atomic attacks, where an atomic attack is the smallest attack unit in a single attack and can be understood as the smallest step in the attack.

FIG. 4 is a schematic diagram of the atomic attack entities and their relationships defined in the invention. In the atomic attack graph, an atomic attack is represented by a vertex and in practice denotes a vulnerability exploitation attack. Vulnerability exploitation depends on both software and hardware. Carrying out the attack depends on the attack condition, the attack mode, the attack effect, and so on. For the atomic attack, the invention defines the entities software, hardware, vulnerability, and attack, where the attack entity has 3 attributes: attack condition, attack mode, and attack effect. The relationships between entities are defined as 2 relationships: "present" and "utilize".

Step 4, performing network security entity identification on the produced network security threat information data set.

As described above, the BIO labeling method is applied to the APT reports of step 2. A sentence in an APT report document is written as $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the i-th character in sentence X. Under the BIO labeling method, identifying the network security entities in sentence X corresponds to giving a label sequence $L_X = [l]^N$.
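To make the label sequence concrete, the sketch below converts entity spans into BIO tags. It works on whitespace tokens rather than characters for readability, and the function name `bio_tags` and the entity type names `VULN` and `SOFT` are hypothetical, not taken from the annotation scheme of the invention.

```python
def bio_tags(tokens, entities):
    """tokens: list of tokens; entities: list of (start, end, type), end exclusive."""
    tags = ["O"] * len(tokens)          # everything starts outside any entity
    for start, end, etype in entities:
        tags[start] = "B-" + etype      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype      # remaining tokens of the entity
    return tags


tokens = ["CVE-2019-1181", "affects", "Remote", "Desktop", "Services"]
print(bio_tags(tokens, [(0, 1, "VULN"), (2, 5, "SOFT")]))
# ['B-VULN', 'O', 'B-SOFT', 'I-SOFT', 'I-SOFT']
```

The resulting list has the same length as the input sentence, which is exactly the property $L_X = [l]^N$ expresses.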

The invention uses the BiLSTM-CRF (bidirectional long short-term memory neural network with conditional random field) model to train on the annotated APT report documents. FIG. 5 is a schematic diagram of the BiLSTM-CRF model structure for network security entity identification in the invention. In the figure, CRF represents the conditional random field; bi represents the output of the i-th backward network; fi represents the output of the i-th forward network; ci represents the i-th text vector; and B-LOC, E-LOC, and O in the CRF layer represent beginning, end, and outside, respectively. Through its forward and backward passes, the model can extract the word features before the i-th character and the word features after the i-th character, which improves its ability to learn words. The CRF (conditional random field) model gives the conditional probability distribution of a set of output random variables given a set of input random variables.

The CRF model is as follows: given an input sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$, let S be the output score matrix of the BiLSTM (bidirectional long short-term memory neural network) with dimension N×K, where K is the number of label categories and $S_{i,j}$ is the score of the j-th label for the i-th word. The score Z of the predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is then defined as follows:
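The defining formula is not reproduced in the text. In the commonly used BiLSTM-CRF formulation, the score of a label sequence is the sum of the emission scores $S_{i, y_i}$ and of transition scores between consecutive labels. The sketch below computes that quantity; the transition matrix `A` and the start and stop boundary labels are assumptions that do not appear in the text, and the function name `sequence_score` is hypothetical.

```python
def sequence_score(S, A, y, start, stop):
    """S[i][j]: BiLSTM score of label j at position i (N x K).
    A[a][b]: assumed transition score from label a to label b,
             sized (K + 2) x (K + 2) to include the boundary labels.
    y: predicted label indices; start/stop: indices of the boundary labels in A.
    """
    path = [start] + list(y) + [stop]
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    emission = sum(S[i][label] for i, label in enumerate(y))
    return transition + emission


S = [[1.0, 2.0], [3.0, 4.0]]            # N = 2 positions, K = 2 labels
A = [[0.0] * 4 for _ in range(4)]       # labels 0, 1 plus start = 2, stop = 3
A[0][1] = 0.5                           # reward the transition 0 -> 1
print(sequence_score(S, A, [0, 1], start=2, stop=3))  # 5.5
```

Training such a model maximizes the score of the annotated sequence relative to all other sequences, and decoding picks the highest-scoring sequence.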

Step 5, extracting the relations of the network security entities.

Network security entity relation extraction uses an attention-based BiLSTM (Att-BiLSTM) model, i.e., a bidirectional long short-term memory neural network with an attention mechanism. The model is mainly divided into 5 layers: the input layer, the word embedding layer, the BiLSTM layer, the Attention layer, and the output layer (relative to the BiLSTM-CRF model, the CRF layer is replaced by the Attention layer and the output layer becomes a softmax layer). FIG. 6 is a schematic diagram of the Att-BiLSTM model structure for network security entity relation extraction in the invention. In the figure, Si represents the i-th text vector, and O, B-A, and I-A in the output layer represent outside, beginning of A, and inside of A, respectively.

The word embedding layer is used to represent the sentences in the APT reports. A sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$ is expressed as a matrix in which words with similar meanings are adjacent in the matrix space, which indicates that they may be related.

The Attention layer introduces weighting to highlight the important parts of the output. Let the output of the BiLSTM layer be $B = [b]^T = [b_1, \ldots, b_j, \ldots, b_T]$; the parameter matrix W satisfies the following formulas:

$$S = \tanh(B)$$

$$\alpha = \mathrm{softmax}(W^T S)$$

$$r = B\alpha^T$$

Here $\alpha$ is the attention weight coefficient and r is the weighted sum of the BiLSTM output B. The representation vector $B^* = \tanh(r)$ is finally generated by a nonlinear function. $B^*$ is then fed into a fully connected neural network that maps it to the label vector, and the predicted label is obtained through a softmax function.
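The three attention formulas can be traced with plain Python lists. This is a sketch, not the PyTorch implementation of the invention: B is taken as a d x T matrix whose columns are the BiLSTM outputs, W is assumed to be a single weight vector of length d, and the function name `attention` is hypothetical.

```python
import math


def attention(B, W):
    """B: d x T matrix (list of d rows); column j is the BiLSTM output b_j.
    W: weight vector of length d. Returns (alpha, B_star)."""
    d, T = len(B), len(B[0])
    S = [[math.tanh(v) for v in row] for row in B]                        # S = tanh(B)
    scores = [sum(W[k] * S[k][j] for k in range(d)) for j in range(T)]    # W^T S
    m = max(scores)                                                       # for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                                     # softmax over time steps
    r = [sum(B[k][j] * alpha[j] for j in range(T)) for k in range(d)]     # r = B alpha^T
    return alpha, [math.tanh(v) for v in r]                               # B* = tanh(r)


B = [[1.0, 3.0], [0.0, 2.0]]      # d = 2 features, T = 2 time steps
alpha, b_star = attention(B, W=[0.0, 0.0])
print(alpha)                       # [0.5, 0.5] because all scores are equal
```

With zero weights every time step receives the same attention, so r reduces to the mean of the BiLSTM outputs; a trained W shifts that weight toward the time steps that matter for the relation.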

Step 6, data organization.

Because threat information data is multi-source and heterogeneous, the invention uses the non-relational database MongoDB for data organization and stores all data in the form of key-value pairs. MongoDB offers high performance and flexible data storage, and is suitable for storing threat information and generating the network security knowledge graph model.

In the implementation of the invention, the software environment is Windows 10, the implementation language is Python 3, the deep learning framework is PyTorch, and the database is the non-relational database MongoDB.

Example 2

The embodiment provides a network security knowledge graph generation method based on threat information, which aims at testing a distributed threat information crawling system.

The invention compares the developed distributed threat intelligence crawling system with a stand-alone threat intelligence collection system and verifies that the distributed system is more efficient. Taking a common open source threat intelligence source as an example, the distributed crawler system is configured with 1 Master node and 2 Slave nodes; after it ran continuously for 5 days, the database stored more than 110,000 pieces of web page data. The number of crawled pages at each point in time is shown in FIG. 7, which is the data collection time chart of the distributed crawler system for threat intelligence data collection developed in the invention.

In the experiment, the total number of pages crawled by the 2 Slave nodes within a given time is far higher than the number crawled in stand-alone operation, which shows that the distributed system does improve operating efficiency. A crawler comparison test is run between the distributed crawler system and a stand-alone environment, and the number of pages crawled by each is recorded. The distributed crawler project is deployed in a Docker container cluster and in a virtual machine cluster, respectively, with the following configuration: 1 Master, 2 Slaves, Ubuntu 16.04, Python 2.7, and 8 GB of memory. The operating efficiency comparison is shown in FIG. 8, which is the efficiency comparison chart of the distributed crawler system and the stand-alone crawler system for threat intelligence data collection developed in the invention. As can be seen from the number of pages crawled at each time point, the distributed crawler system is significantly better than the stand-alone crawler system.

Example 3

The embodiment provides a network security knowledge graph generation method based on threat information, which is used for comparing the threat information data quality improvement algorithm effects.

The algorithm provided by the invention and other truth discovery algorithms are compared on threat information data with respect to how well they improve the quality of entity attributes. The evaluation metrics are the accuracy, recall, and F1 value commonly used for truth discovery models. The truth discovery algorithms compared are 3-Estimates, Voting, and LTM. The comparison results are shown in Table 1. It can be seen that the quality improvement algorithm provided by the invention improves the quality of threat intelligence data more than the existing algorithms do.

Table 1 is a table of results of comparison of different data quality improvement algorithms in accordance with an embodiment of the present invention.

Algorithm	Accuracy	Recall	F1 value
Proposal	0.935	0.960	0.987
3-Estimates	0.874	0.903	0.927
Voting	0.840	0.867	0.913
LTM	0.924	0.865	0.966

In the table: Proposal represents the algorithm proposed by the invention, 3-Estimates represents the 3-sequence parameter estimation algorithm, Voting represents the voting algorithm, and LTM represents the latent truth model algorithm.

Example 4

The embodiment provides a network security knowledge graph generation method based on threat information, which is used for comparing network security entity identification effects in the threat information.

The invention tests the network security entity identification model against existing entity identification models on the remaining 10 annotated APT report documents. The evaluation metrics are the accuracy, precision, recall, and F1 value commonly used in entity identification. The entity recognition models compared are CRF, LSTM and LSTM-CRF. The comparison results are shown in Table 2. It can be seen that the network security entity identification model provided by the invention identifies network security entities in threat information better than the existing models do.

Table 2 shows the comparison results of different network security entity identification models in the embodiment of the present invention.

In the table: CRF represents the conditional random field algorithm, LSTM represents the long short-term memory neural network algorithm, BiLSTM represents the bidirectional long short-term memory neural network algorithm, and BiLSTM-CRF represents the bidirectional long short-term memory neural network with conditional random field algorithm.

Example 5

The embodiment provides a network security knowledge graph generation method based on threat information, which is used for comparing network security entity relation extraction effects in the threat information.

The invention tests the network security entity relation extraction model against existing entity relation extraction models on the remaining 10 APT report documents. The evaluation metrics are the accuracy, recall, and F1 value commonly used in entity relation extraction. The entity relation extraction models compared are CRF, LSTM, BiLSTM and BiLSTM-CRF. The comparison results are shown in Table 3. The network security entity relation extraction model provided by the invention extracts network security entity relations in threat information better than the existing models do.

Table 3 shows the results of the extraction model for different network security entity relationships in the embodiment of the present invention.

Model	Accuracy	Precision	Recall	F1 value
CRF	0.9041	0.8084	0.7963	0.7892
LSTM	0.9163	0.8162	0.8046	0.8018
BiLSTM	0.9265	0.8339	0.8262	0.8491
BiLSTM-CRF	0.9374	0.8674	0.8344	0.8411
BiLSTM-CRF-Attention	0.9405	0.8652	0.8748	0.8751

In the table: BiLSTM-CRF-Attention represents the bidirectional long short-term memory neural network with conditional random field and attention mechanism algorithm.

Example 6

The embodiment provides a network security knowledge graph generating method based on threat information, and a network security knowledge graph instance based on threat information.

As shown in fig. 9, fig. 9 is a diagram showing an example of the organization of threat intelligence data related to Windows system in embodiment 5 of the present invention.

After network security entity identification and relation extraction are performed on the various kinds of threat information data, the threat information-based network security knowledge graph can effectively organize the entity data and relations in each piece of threat information and perform association analysis on the data. The associated data stored in MongoDB is visualized in FIG. 9 using the graphviz module in Python 3. The figure indicates that a remote desktop service remote code execution vulnerability exists in the Win10 system of the Windows family, and that the four vulnerabilities CVE-2019-1222, CVE-2019-1182, CVE-2019-1181 and CVE-09-1126 can be utilized. CVE denotes the Common Vulnerabilities and Exposures identifier.
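As a rough illustration of this visualization step, the sketch below turns entity-relation triples into Graphviz DOT source text using only the standard library. It does not call the graphviz module itself. The triple layout, the function name `to_dot` and the node names are assumptions, and only two of the identifiers from the example are used.

```python
def to_dot(triples):
    """Render (head, relation, tail) triples as Graphviz DOT source text."""
    lines = ["digraph threat_kg {"]
    for head, relation, tail in triples:
        # One labelled edge per relation, e.g. Win10 -present-> CVE.
        lines.append('    "%s" -> "%s" [label="%s"];' % (head, tail, relation))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical triples built from the two relations "present" and "utilize".
triples = [
    ("Win10", "present", "CVE-2019-1181"),
    ("Win10", "present", "CVE-2019-1182"),
    ("Remote code execution attack", "utilize", "CVE-2019-1181"),
    ("Remote code execution attack", "utilize", "CVE-2019-1182"),
]
print(to_dot(triples))
```

The resulting text can be saved to a `.dot` file and rendered with any Graphviz front end; in the setting described here the triples would be read from the key-value documents stored in MongoDB.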

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. A network security knowledge graph generation method based on threat information, characterized in that the method comprises the following steps:

step 1, efficient distributed threat intelligence data collection: a distributed threat intelligence data crawling system is constructed with the Scrapy framework, structured data is extracted using a crawler program scheduled by scrapy-redis, and the data is stored in the Redis and MongoDB databases; efficient distributed threat intelligence data collection includes: distributed crawler system architecture, crawler strategy, crawler implementation, and data storage;

step 3, improving the quality of the network security threat information data;

step 5, extracting the relation of the network security entity;

step 6, data organization;

the distributed crawler system architecture includes: the threat information collection system architecture consists of the distributed crawler system and the deployment of the underlying environment; the distributed crawler system is obtained by modifying the traditional crawler framework Scrapy and adding a Redis database; the underlying environment is a multi-node distributed system that uses a Docker container cluster with Kubernetes as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with a Master end and a plurality of Slave ends; the Master end deploys the Redis database to store and schedule the requests to be crawled, the Slave ends deploy the crawler main program to crawl web pages and parse and extract data, and each Slave end stores the parsed web page data in the same MongoDB database;

the crawler strategy comprises: on the Master end, an initial link is first stored in Redis, where the Key identifies the next page to be crawled in the scheduling queue and the URL is generally the link of a particular page; the crawler is then started, obtains the starting URL from Redis, and downloads the web page data corresponding to that URL; the response is parsed according to the defined rules to obtain either page data or detail page links; in the page data case, the data is parsed directly according to the web page format; in the detail page link case, the crawler is started again with the link changed to the detail page link, and the final detail data is obtained; the crawler continues to take the next URL from the scheduling queue and crawl it; if no URL is available, it enters a waiting state; on the Slave end, the downloader executes the download task and parses and extracts the fields; the crawler program obtains a URL from the scheduling queue under the Redis Key and then downloads the corresponding web page; the response is parsed according to the defined field rules, the corresponding fields are processed by a text de-duplication module, and the processed fields are stored in the MongoDB database, until the Key value is null;

the crawler implementation includes: the scheduler module is responsible for task scheduling for the whole system and mainly has the following functions: accepting requests sent by the engine; returning URLs to the download module; and de-duplicating URLs and storing them in the Redis database; each crawler subtask passes the URLs it has crawled to the scheduler through the engine, and the scheduler de-duplicates them and stores them in the Redis queue; the scheduler accepts requests from the engine and returns URLs to the downloader; the crawling and downloading module integrates the functions of the spider and the downloader; the spider processes the web page information returned by the downloader and extracts the catalog URLs and detail page URLs it contains; the key fields in the web page information are extracted and stored in the MongoDB database; the downloader downloads the URLs returned by the scheduler and passes the downloaded web page information to the spider; this module is responsible for crawling the corresponding websites, first taking the starting URL, extracting further URLs after crawling, and returning them to the de-duplication module; the scheduling module then distributes URLs from Redis to the Slave nodes;

the data storage includes: the storage module implements two functions; URLs are stored in Redis, which is deployed on the Master node; the parsed web page content is stored in a MongoDB database, which is deployed on the Master node; extracting and storing web page content is the final goal of the system, and the distributed crawlers crawl the web page content so that the data processing program can extract the required information;

the network security threat information data set is produced through the distributed threat information crawling system and comprises: (1) vulnerability data: the vulnerability data is collected from the main vulnerability publishing platforms, and the data includes the type of system in which the vulnerability occurs, the system version, and the exploitation method; (2) APT attack chain data: the APT attack chain data is collected from the APTnites platform and comprises 528 APT reports from the last 10 years; (3) malware text data: this data contains the name, category, common functions, hash, and exploited system platform of the malware in threat intelligence, and is collected from the threat intelligence source AlienVault; (4) security community discussion data: this data is collected from the Stackexchange website and is text about recent security events; (5) security RSS subscription data: this data is collected from the major network security RSS feeds and is recent network security news;

the method for improving the quality of the network security threat information data comprises the following steps: step (1) FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated; its value is (1 - specificity), and it follows a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:

$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$

step (2) Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated; it follows a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior true positive sample count of each source and $\alpha_{1,0}$ is the prior false negative sample count of each source:

$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$

$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$

step (4) Truth label: a truth label is generated for each entity attribute, indicating whether the observed value is correct; the attribute truth label $t_f$ follows a Bernoulli distribution with parameter $\theta_f$, where $t_f$ is a binary Boolean variable and the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:

$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$

step (5) Observation: entity attribute observation labels; for each entity attribute observation $c \in C_f$, its source is denoted $s_c$; the observation label $o_c$ of c follows a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:

$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$

the model is solved as follows: the conditional probability of the model given the observation c of each entity attribute is as follows:

in the above formula, p represents the conditional probability that the observed value of the entity o is c, given the prior truth probability $\theta_f$ and the source parameters $\phi_{s_c}^0$ and $\phi_{s_c}^1$; where c is the observed value, f is the attack tag, and $s_c$ is the source in which observation c occurs;

the following formula is obtained:

where precision represents the precision of each source.

2. The threat information-based network security knowledge graph generation method according to claim 1, characterized in that: performing network security entity identification on the produced network security threat information data set comprises applying the BIO labeling method to the APT reports, where a sentence in an APT report document is written as $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$ and $x_i$ is the i-th character in sentence X; under the BIO labeling method, identifying the network security entities in sentence X corresponds to giving a label sequence $L_X = [l]^N$;

extracting the relations of the network security entities comprises: using an attention-based BiLSTM (Att-BiLSTM) model for network security entity relation extraction; the model comprises an input layer, a word embedding layer, a BiLSTM layer, an Attention layer, and an output layer; the word embedding layer is used to represent the sentences in the APT reports, where a sentence $X = [x]^N = [x_1, \ldots, x_i, \ldots, x_N]$ is expressed as a matrix in which words with similar meanings are adjacent in the matrix space, which indicates that they may be related; the Attention layer introduces weighting to highlight the important parts of the output; the output of the BiLSTM layer is $B = [b]^T = [b_1, \ldots, b_j, \ldots, b_T]$, and the parameter matrix W satisfies the following formulas:

$$S = \tanh(B)$$

$$\alpha = \mathrm{softmax}(W^T S)$$

$$r = B\alpha^T$$

$\alpha$ is the attention weight coefficient and r is the weighted sum of the BiLSTM output B; the representation vector $B^* = \tanh(r)$ is finally generated by a nonlinear function; $B^*$ is then fed into a fully connected neural network that maps it to the label vector, and the predicted label is obtained through a softmax function;

and the data organization uses the non-relational database MongoDB for storage and stores all data in the form of key-value pairs.