CN113282759A - Network security knowledge graph generation method based on threat information - Google Patents
- Publication number
- CN113282759A
- Authority
- CN
- China
- Prior art keywords
- data
- network security
- url
- entity
- crawler
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Computational Linguistics (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence. The method comprises the following steps: efficient distributed collection of threat intelligence data; producing a network security threat intelligence data set with a distributed threat intelligence crawling system; improving the data quality of the network security threat intelligence; performing network security entity recognition on the produced data set; extracting network security entity relationships; and organizing the data. Extensive experiments verify that the proposed threat intelligence data quality improvement algorithm raises the quality of the threat intelligence, that the knowledge graph generated by entity recognition and entity relationship extraction from the intelligence text is of markedly higher quality, and that the method has good local network weakness visualization and attack prediction analysis capabilities.
Description
Technical Field
The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence.
Background
With the rapid development of network technologies, many of them have been introduced into various industries to improve productivity, which brings network security problems with it. As the network security situation grows more complex, dynamic network security defense driven by threat intelligence has become a focus of attention in the industry. Threat intelligence is rich in content, highly accurate, and timely, and it can reflect the attack chain of an entire attack event, so it has high application and analysis value.
As a comprehensive method of data integration and organization, the knowledge graph can effectively extract attack information from massive threat intelligence and support complex operations on that data, such as reasoning analysis and semantic association of attacks. As threat intelligence is continuously updated, a knowledge graph network security system built on it can realize dynamic defense. Compared with traditional static defenses such as antivirus software and firewalls, the knowledge graph can perceive the network security situation more quickly and accurately, which improves the overall security of the network and enables advanced functions such as attack path prediction, attack tracing, and security threat evaluation.
When threat intelligence is used to generate a network security knowledge graph, the difficult research problems are improving the quality of the collected threat intelligence data, reducing its false positive rate, and performing network security entity recognition and security entity relationship extraction on it.
The main problems are as follows:
1. Open source threat intelligence on the network generally suffers from low data quality, a high false positive rate, and missing or erroneous attributes of data entities. Low-quality threat intelligence data inevitably leads to a low-quality network security knowledge graph, which cannot correctly perceive the network security situation and may wrongly predict current network attack behavior. Existing data quality improvement algorithms rely mainly on truth discovery algorithms, most of which address single-truth problems; they cannot handle an entity in network security threat intelligence data having multiple true values, nor the strongly time-varying nature of such data.
2. Existing entity recognition and entity relationship extraction methods are mainly based on traditional rule-based recognition, machine learning, and the recently popular deep learning, all of which need a large number of labeled text samples and place high demands on data quality. Although these methods are widely applied in fields such as natural language processing, applying them to entity recognition and entity relationship extraction in the network security field is difficult, because large-scale high-quality labeled security entity data is lacking, multiple entity types are mixed in the data, and entity type labels are inconsistent across a full text.
At present, the field of network security lacks an effective method for network security entity recognition and entity relationship extraction.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a network security knowledge graph generation method based on threat intelligence. It aims to provide a basic model for using and analyzing massive threat intelligence data, and to predict the attack means and attack targets of an attacker.
The technical scheme adopted by the invention to achieve this purpose is as follows:
A network security knowledge graph generation method based on threat intelligence comprises the following steps:
step 1, efficient distributed threat intelligence data collection;
step 2, producing a network security threat intelligence data set through a distributed threat intelligence crawling system;
step 3, improving the data quality of the network security threat intelligence;
step 4, performing network security entity recognition on the produced network security threat intelligence data set;
step 5, extracting the network security entity relationships;
step 6, organizing the data.
Further, the efficient distributed threat intelligence data collection comprises: distributed crawler system architecture, crawler strategy, crawler implementation, and data storage.
Further, the distributed crawler system architecture is as follows: the threat intelligence collection system consists of a distributed crawler system and the deployment of an underlying environment. The distributed crawler system is built by adapting the traditional Scrapy crawler framework and adding a Redis database. The underlying environment is a multi-node distributed system running a Docker container cluster, with Kubernetes as the cluster management tool. The distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends: the Master end deploys a Redis database to store and schedule the requests to be crawled, the Slave ends deploy the crawler main program to crawl web pages and parse out the extracted data, and every Slave end stores the parsed web page data in the same MongoDB database.
Further, the crawler strategy is as follows. For the Master end, an initial link is stored in Redis, where the Key identifies the next page to crawl in the scheduling queue and the URL is usually the link of a page. The crawler is then started, obtains the starting URL from Redis, and downloads the web page data corresponding to that URL. The response is parsed according to predefined rules to obtain either page data or a detail page link: page data is parsed directly according to the page format, whereas for a detail page link the crawler is restarted with the link changed to the detail page link in order to obtain the final detail data. The crawler keeps taking URLs from the scheduling queue and crawling the next URL; if no URL is available, it enters a waiting state. For the Slave end, a downloader executes the download task and parses the extracted fields: the crawler obtains a URL from the scheduling queue under the Redis Key and downloads the corresponding web page, parses the response according to the predefined field rules, and, after processing by the text deduplication module, stores the corresponding fields in the MongoDB database. This continues until the Key value is null.
Further, the crawler implementation is as follows. The scheduler module is responsible for task scheduling across the whole system, and its main functions are: receiving requests sent by the engine; returning URLs to the download module; and storing URLs in the Redis database after deduplication. Each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in a Redis queue; on receiving a request from the engine, the scheduler returns a URL to the downloader. The crawling and downloading module combines the functions of a spider and a downloader. The downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider. The spider processes the web page information returned by the downloader, extracts the directory URLs and detail page URLs it contains, and extracts the key fields, which are then stored in the MongoDB database. This module is responsible for crawling the target websites: it first takes the starting URL, extracts URLs from the crawled page, and passes them through the deduplication module; the scheduling module then distributes URLs from Redis to the Slave nodes.
The data storage is as follows: the storage module realizes two functions. URLs are stored in Redis, which is deployed at the Master node; the parsed web page content is stored in a MongoDB database, which is also deployed at the Master node. Extracting the stored web page content information is the final goal of the system: the distributed crawler crawls the web page content so that a data processing program can extract the required information from it.
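The scheduler behaviour described above (deduplicate each URL, then push it onto a queue from which the downloader is fed) can be sketched in a few lines. This is a minimal illustration, not the patented implementation: an in-memory `deque` and `set` stand in for the Redis queue and fingerprint set, and the class name `InMemoryScheduler`, the SHA-1 fingerprint, and the URL normalisation rule are all assumptions made for the example.

```python
import hashlib
from collections import deque


class InMemoryScheduler:
    """Stand-in for the Redis-backed scheduler: a FIFO queue plus a fingerprint set."""

    def __init__(self):
        self.queue = deque()  # plays the role of the Redis queue of pending URLs
        self.seen = set()     # plays the role of the Redis set of URL fingerprints

    @staticmethod
    def fingerprint(url: str) -> str:
        # Hypothetical normalisation: trim whitespace and drop a trailing slash.
        normalised = url.strip().rstrip("/")
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    def enqueue(self, url: str) -> bool:
        """Store the URL unless an equivalent one was seen; return True if stored."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_url(self):
        """Return the next URL for the downloader, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


scheduler = InMemoryScheduler()
results = [
    scheduler.enqueue(u)
    for u in [
        "https://example.org/advisories",
        "https://example.org/advisories/",   # duplicate after normalisation
        "https://example.org/advisories/1",
    ]
]
first = scheduler.next_url()
```

In a real deployment the set and queue would live in the Master's Redis instance so that every Slave shares the same deduplication state.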
Further, the network security threat intelligence data set produced through the distributed threat intelligence crawling system comprises the following:
(1) vulnerability data: collected from the main vulnerability publishing platforms; the data types include the type and version of the system in which the vulnerability occurs and the exploitation method;
(2) APT attack chain data: collected from the APTnotes platform; a total of 528 APT reports from the last 10 years;
(3) malware text data: includes the name, category, common functions, hash, and targeted system platform of the malware in threat intelligence; this part of the data is collected from the threat intelligence source AlienVault;
(4) security community discussion data: this part of the data is collected from the StackExchange website and consists of text about recent security events;
(5) security RSS subscription data: this part of the data is collected from the major network security RSS feeds and consists of recent network security news.
Further, improving the data quality of the network security threat intelligence comprises the following steps:
step (1), FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated. Its value is (1 - specificity), and it obeys a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:
$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$
From the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated using second-order Markov;
step (2), Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated, obeying a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior true positive sample count of each source and $\alpha_{1,0}$ is the prior false negative sample count of each source:
$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$
From the second time node onward, $\phi_k^1$ is replaced by the $\phi_k^1$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated using second-order Markov;
step (3), Att fact (attack label): for each attribute $f \in F$ of an entity, where $F$ is the set of observed values of all attributes of the entity, a prior truth probability $\theta_f$ is generated, obeying a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct entity attribute samples and $\beta_0$ is the prior count of erroneous entity attribute samples:
$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$
From the second time node onward, $\theta_f$ is replaced by the $\theta_f$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated using second-order Markov;
step (4), Truth label: the attribute truth label indicates, for each entity attribute, whether the observed value is correct. $t_f$ is the attribute truth label, a binary Boolean variable obeying a Bernoulli distribution with parameter $\theta_f$, where the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:
$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$
step (5), Observation: the entity attribute observation label. For each observation $c \in C_f$ of an entity attribute, its source is denoted $s_c$. The observation label $o_c$ is generated from a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
If $t_f = 0$, $o_c$ obeys a Bernoulli distribution whose parameter $\phi_{s_c}^{0}$ is the false positive rate of source $s_c$.
If $t_f = 1$, $o_c$ obeys a Bernoulli distribution whose parameter $\phi_{s_c}^{1}$ is the sensitivity of source $s_c$.
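To make the five generation steps concrete, the following sketch draws one synthetic data set from the generative process for a single time node. It is an illustration under stated assumptions: the function name `sample_truth_model`, the example hyperparameter values, and the choice that every source reports on every attribute are hypothetical, and the time-node replacement rule is not modelled.

```python
import random


def sample_truth_model(sources, facts, alpha0, alpha1, beta, seed=0):
    """Draw one sample from the generative process for a single time node.

    alpha0 = (alpha_{0,1}, alpha_{0,0}), alpha1 = (alpha_{1,1}, alpha_{1,0}),
    beta = (beta_1, beta_0), matching the order used in the text.
    """
    rng = random.Random(seed)
    # Steps (1) and (2): per-source false positive rate and sensitivity.
    fpr = {k: rng.betavariate(alpha0[0], alpha0[1]) for k in sources}
    sens = {k: rng.betavariate(alpha1[0], alpha1[1]) for k in sources}
    truth, observations = {}, []
    for f in facts:
        # Step (3): prior truth probability of the attribute.
        theta = rng.betavariate(beta[0], beta[1])
        # Step (4): Bernoulli truth label.
        truth[f] = 1 if rng.random() < theta else 0
        # Step (5): each source reports with its sensitivity or its FPR.
        for k in sources:
            p = sens[k] if truth[f] == 1 else fpr[k]
            observations.append((f, k, 1 if rng.random() < p else 0))
    return fpr, sens, truth, observations


fpr, sens, truth, observations = sample_truth_model(
    sources=["s1", "s2"],
    facts=["f1", "f2", "f3"],
    alpha0=(1, 9),   # assumed prior: sources rarely assert false attributes
    alpha1=(9, 1),   # assumed prior: sources usually report true attributes
    beta=(1, 1),     # assumed uniform prior on attribute truth
)
```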
The model solution is as follows:
The conditional probability of the model given the observed value c of each entity attribute is:
$$p(o_c \mid \theta_f, \phi_{s_c}^0, \phi_{s_c}^1) = p(o_c \mid \phi_{s_c}^0)\,(1 - \theta_f) + p(o_c \mid \phi_{s_c}^1)\,\theta_f$$
In the above formula, p is the conditional probability that the observed value of entity o is c, given the prior truth probability $\theta_f$, the source false positive rate $\phi_{s_c}^0$, and the source sensitivity $\phi_{s_c}^1$; c is the observation, f is the attack label, and $s_c$ is the source that made observation c.
The full likelihood function containing all variables and hyperparameters is written as:
$$p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta) = \prod_{s \in S} p(\phi_s^0 \mid \alpha_0)\, p(\phi_s^1 \mid \alpha_1) \times \prod_{f \in F} \Big( p(\theta_f \mid \beta)\, p(t_f \mid \theta_f) \prod_{c \in C_f} p(o_c \mid \phi_{s_c}^{t_f}) \Big)$$
In the above formula, p is the conditional probability of the entity observations o, the sources s, the truth labels t, the prior probability parameter set $\theta$, and the false positive rate and sensitivity parameter sets $\phi^0, \phi^1$, given the hyperparameters $\alpha_0, \alpha_1$ and the prior truth probability hyperparameter $\beta$; S is the set of all sources, F is the attack label set, f is each attack label element of F, $\theta_f$ is the prior probability of f, $t_f$ is the truth value of f, $C_f$ is the set of observations of f, and c is each observation element in that set.
Given the observation data of the attributes, the likelihood function is solved using the Gibbs Sampling algorithm from the MCMC family:
$$t_{\mathrm{MAP}} = \arg\max_{t} \iiint p(o, s, t, \theta, \phi^0, \phi^1)\, d\theta\, d\phi^0\, d\phi^1$$
where $t_{\mathrm{MAP}}$ is the maximum a posteriori estimate, and the remaining parameters have the same meanings as the parameters of the same names above.
The following formula is solved:
$$p(t_f = i \mid t_{-f}, o, s) \propto \beta_i \prod_{c \in C_f} \frac{\alpha_{i,o_c} + n^{-f}_{s_c,i,o_c}}{\alpha_{i,1} + \alpha_{i,0} + n^{-f}_{s_c,i,1} + n^{-f}_{s_c,i,0}}$$
where p is the conditional probability that the truth value $t_f$ of f takes the value i, given $t_{-f}$, the entity observations o, and the sources s; i is the value of the attack label f, with range {0,1}; and $t_{-f}$ is the set of truth values of all elements of F except f.
$$n^{-f}_{s_c,i,j} = \big|\{\, c' \in C_{-f} : s_{c'} = s_c,\ t_{f_{c'}} = i,\ o_{c'} = j \,\}\big|$$
is the number of observations made by source $s_c$ whose observed value is j, whose attack label is not f, and whose truth label is i; $C_{-f}$ is the set of observations that excludes attack label f, c' is each element of $C_{-f}$, and $t_{f_{c'}}$ is the truth value of the attack label that c' observes; the remaining parameters have the same meanings as the parameters of the same names above.
Having obtained $p(t_f = i \mid t_{-f}, o, s)$, the FPR and the Sensitivity at the next time node are estimated as:
$$\phi_s^0 = \frac{E[n_{s,0,1}] + \alpha_{0,1}}{E[n_{s,0,0}] + E[n_{s,0,1}] + \alpha_{0,0} + \alpha_{0,1}}, \qquad \phi_s^1 = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,1,0}] + E[n_{s,1,1}] + \alpha_{1,0} + \alpha_{1,1}}$$
where $E[n_{s,i,j}]$ is, over the elements c of the observation sets $C_f$ of all attack labels f, the sum of the probabilities that the truth label takes the value i for those observations c that are made by source $s_c = s$ and whose observed value $o_c$ for the attack label of entity o is j, with i ∈ {0,1}, j ∈ {0, 1, ..., |F|}, and |F| the number of elements of the attack set F; the remaining parameters have the same meanings as the parameters of the same names above. Finally, the precision of each source can be estimated in the same way:
$$\mathrm{precision}(s) = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,0,1}] + E[n_{s,1,1}] + \alpha_{0,1} + \alpha_{1,1}}$$
where precision represents the accuracy of each source.
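The sampling step can be illustrated with a small collapsed Gibbs sampler for a single time node. It is a sketch under assumptions rather than the patented algorithm: the function name `gibbs_truth`, the iteration counts, the toy claims, and the hyperparameter values are hypothetical. The update follows the standard Beta-Bernoulli collapsed form, in which each truth label is resampled from a weight proportional to its prior count times, for every claim on that attribute, the smoothed fraction of the source's other claims that agree with the hypothesised label.

```python
import random


def gibbs_truth(claims, alpha0, alpha1, beta, iters=200, burn_in=50, seed=0):
    """Estimate P(t_f = 1) for each attribute f from (attribute, source, observation) claims.

    alpha0 = (alpha_{0,1}, alpha_{0,0}), alpha1 = (alpha_{1,1}, alpha_{1,0}),
    beta = (beta_1, beta_0). Observations are 0 or 1.
    """
    rng = random.Random(seed)
    alpha = {0: {1: alpha0[0], 0: alpha0[1]}, 1: {1: alpha1[0], 0: alpha1[1]}}
    prior = {1: beta[0], 0: beta[1]}
    by_fact = {}
    for f, s, o in claims:
        by_fact.setdefault(f, []).append((s, o))
    t = {f: rng.randint(0, 1) for f in by_fact}  # random initial truth labels
    n = {}  # n[(s, i, j)]: claims by source s with value j on attributes labelled i
    for f, cl in by_fact.items():
        for s, o in cl:
            n[(s, t[f], o)] = n.get((s, t[f], o), 0) + 1
    prob_true = {f: 0.0 for f in by_fact}
    for it in range(iters):
        for f, cl in by_fact.items():
            for s, o in cl:  # exclude attribute f's own claims from the counts
                n[(s, t[f], o)] -= 1
            weight = {}
            for i in (0, 1):
                w = prior[i]
                for s, o in cl:
                    num = alpha[i][o] + n.get((s, i, o), 0)
                    den = (alpha[i][1] + alpha[i][0]
                           + n.get((s, i, 1), 0) + n.get((s, i, 0), 0))
                    w *= num / den
                weight[i] = w
            p1 = weight[1] / (weight[0] + weight[1])
            t[f] = 1 if rng.random() < p1 else 0
            for s, o in cl:  # add the claims back under the new label
                n[(s, t[f], o)] = n.get((s, t[f], o), 0) + 1
            if it >= burn_in:
                prob_true[f] += p1 / (iters - burn_in)
    return prob_true


claims = [
    ("a", "s1", 1), ("a", "s2", 1), ("a", "s3", 1),  # every source asserts a
    ("b", "s1", 0), ("b", "s2", 0), ("b", "s3", 0),  # every source denies b
    ("c", "s1", 1), ("c", "s2", 1), ("c", "s3", 0),  # sources disagree on c
]
posterior = gibbs_truth(claims, alpha0=(1, 9), alpha1=(9, 1), beta=(1, 1))
```

With these assumed priors, an attribute asserted by every source ends up with a high posterior truth probability and one denied by every source with a low one; the per-source rates can then be re-estimated from the expected counts.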
Further, network security entity recognition is performed on the produced network security threat intelligence data set. The APT reports are annotated with the BIO labeling method. A sentence in an APT report document is denoted $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the i-th character of sentence X. Under the BIO labeling method, recognizing the network security entities in sentence X is equivalent to producing a label sequence $L_X = [l]_N$.
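As an illustration of BIO labeling, the sketch below converts annotated entity spans into a label sequence of the same length as the sentence. The helper `bio_tags`, the entity type names, and the example sentence are hypothetical, and the example labels whole tokens for readability whereas the text above labels individual characters.

```python
def bio_tags(tokens, entities):
    """Return one BIO label per token.

    entities: list of (start, end_exclusive, entity_type) token spans.
    """
    tags = ["O"] * len(tokens)           # O: outside any entity
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"       # B: first token of an entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"       # I: continuation of the same entity
    return tags


# Hypothetical sentence and annotations, for illustration only.
tokens = ["GroupX", "deployed", "Dropper", "Y", "on", "Windows", "10"]
entities = [(0, 1, "ACTOR"), (2, 4, "MALWARE"), (5, 7, "SYSTEM")]
tags = bio_tags(tokens, entities)
# tags == ["B-ACTOR", "O", "B-MALWARE", "I-MALWARE", "O", "B-SYSTEM", "I-SYSTEM"]
```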
Model training is performed on the labeled APT report documents using a BiLSTM-CRF model. The forward pass extracts the word features before the i-th character, and the backward pass extracts the word features after it. The CRF model is used to obtain the conditional probability distribution of one set of output random variables given a set of input random variables.
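The CRF model is as follows: given an input sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$, let S be the output score matrix of the BiLSTM network, of dimension N×K, where K is the number of label categories and $S_{i,j}$ is the score of the j-th label for the i-th word. The score Z of a predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is defined as:
$$Z(X, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} S_{i, y_i}$$
where T is a (K+2)-dimensional probability transition matrix, with $y_0$ and $y_{N+1}$ denoting the start and end labels of the sequence. The probability of generating the label sequence y is:
$$p(y \mid X) = \frac{e^{Z(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{Z(X, \tilde{y})}}$$
where $Y_X$ is the set of all possible label sequences for X. The log-likelihood of the correct labeling is then solved using maximum likelihood estimation:
$$\log p(y \mid X) = Z(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{Z(X, \tilde{y})}$$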
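The score and log-likelihood can be checked numerically on a toy example. The sketch below is an assumption-laden illustration: it enumerates every label sequence by brute force, which is only feasible for very small N and K (a trained model would use dynamic programming), and the matrices `S` and `T` are made-up numbers with indices K and K+1 reserved for the start and end labels.

```python
import itertools
import math


def sequence_score(S, T, y):
    """Score Z(X, y): transition scores (including start/end) plus emission scores."""
    K = len(S[0])
    start, end = K, K + 1                      # the two extra labels in T
    path = [start] + list(y) + [end]
    transition = sum(T[a][b] for a, b in zip(path, path[1:]))
    emission = sum(S[i][tag] for i, tag in enumerate(y))
    return transition + emission


def log_likelihood(S, T, y):
    """log p(y | X) = Z(X, y) - log sum over all sequences of exp(Z)."""
    K = len(S[0])
    scores = [sequence_score(S, T, cand)
              for cand in itertools.product(range(K), repeat=len(S))]
    m = max(scores)                            # log-sum-exp for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return sequence_score(S, T, y) - log_norm


# Toy values: N = 3 words, K = 2 labels, T is (K+2) x (K+2).
S = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
T = [[0.0] * 4 for _ in range(4)]
T[0][1] = 0.5                                  # assumed bonus for label 0 -> label 1
total_probability = sum(
    math.exp(log_likelihood(S, T, y))
    for y in itertools.product(range(2), repeat=3)
)
```

Because the probabilities are normalised over all $K^N$ sequences, `total_probability` comes out as 1 up to floating-point error.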
Further, extracting the network security entity relationships comprises:
The network security entity relationships are extracted with an attention-based BiLSTM (Att-BiLSTM) model, which comprises an input layer, a word embedding layer, a BiLSTM layer, an attention layer, and an output layer;
The word embedding layer is used to represent a sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$ in the APT report: the sentence is expressed as a matrix in which words with similar meanings lie close together in the matrix space, so that relationships between the representations can be captured;
The attention layer introduces a weighting idea to highlight the significant parts of the output;
Let the output of the BiLSTM layer be $B = [b]_T = [b_1, \ldots, b_j, \ldots, b_T]$; the parameter matrix W then satisfies the following formulas:
$$S = \tanh(B)$$
$$\alpha = \mathrm{softmax}(W^{T} S)$$
$$r = B\alpha^{T}$$
where $\alpha$ is the attention weight coefficient and r is the weighted sum of the BiLSTM output B. Finally, the representation vector $B^{*} = \tanh(r)$ is generated through a nonlinear function; $B^{*}$ is then fed into a fully connected neural network and mapped to a label vector, and the predicted label is obtained through a softmax function.
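The three attention formulas can be traced with plain Python lists. This is a minimal sketch assuming that B is stored as a d×T matrix whose columns are the per-step BiLSTM outputs and that the parameter W is a d-dimensional vector; the function name `attention_pool` and the numeric values are made up for illustration.

```python
import math


def attention_pool(B, w):
    """Return (alpha, b_star) for a d x T matrix B and a d-dimensional vector w."""
    d, T = len(B), len(B[0])
    S = [[math.tanh(v) for v in row] for row in B]            # S = tanh(B)
    scores = [sum(w[k] * S[k][j] for k in range(d)) for j in range(T)]  # w^T S
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]                  # stable softmax
    total = sum(exps)
    alpha = [e / total for e in exps]                         # attention weights
    r = [sum(B[k][j] * alpha[j] for j in range(T)) for k in range(d)]   # r = B alpha^T
    b_star = [math.tanh(v) for v in r]                        # B* = tanh(r)
    return alpha, b_star


B = [[0.1, 0.4, -0.2],
     [0.3, -0.1, 0.5]]                     # d = 2 features, T = 3 time steps
alpha_uniform, _ = attention_pool(B, [0.0, 0.0])   # zero weights give uniform attention
alpha, b_star = attention_pool(B, [1.0, -0.5])
```

The vector `b_star` is what would be passed to the fully connected layer and the final softmax classifier.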
Furthermore, the data is organized and stored in a non-relational database, namely MongoDB, with all data stored as key-value pairs.
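One way to picture the key-value organization is to group extracted (head entity, relation, tail entity) triples into one document per head entity. The sketch below is only an assumed layout: the field names `entity`, `relations`, `type`, and `target`, the helper `triples_to_documents`, and the placeholder triples are all hypothetical, and the resulting dictionaries are shown as JSON rather than written to a database.

```python
import json


def triples_to_documents(triples):
    """Group (head, relation, tail) triples into one key-value document per head entity."""
    docs = {}
    for head, relation, tail in triples:
        doc = docs.setdefault(head, {"entity": head, "relations": []})
        doc["relations"].append({"type": relation, "target": tail})
    return list(docs.values())


# Placeholder triples, for illustration only.
triples = [
    ("MalwareX", "exploits", "VulnerabilityPlaceholder"),
    ("MalwareX", "targets", "Windows"),
    ("GroupY", "uses", "MalwareX"),
]
documents = triples_to_documents(triples)
as_json = json.dumps(documents, indent=2)
```

Each dictionary in `documents` has the shape of a document that could be inserted into a MongoDB collection.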
The invention has the following beneficial effects and advantages:
The invention provides a basic model for using and analyzing massive threat intelligence data. Its key point is to improve the existing data quality improvement algorithm so that it suits network security threat intelligence data, which raises the quality of the collected data and lowers its false positive rate. The invention also improves the existing entity recognition and entity relationship extraction methods according to the characteristics of threat intelligence data, improving the accuracy and efficiency of network security entity recognition and security entity relationship extraction, and generates a threat intelligence network security knowledge graph of higher data quality. In addition, the invention draws on the data reasoning ability of the network security knowledge graph to study an attack graph visualization method that combines the knowledge graph with the local network topology.
The method first improves the quality of threat intelligence data according to the characteristics of network security threat intelligence data, reducing its false positive rate and raising its overall quality. It then improves the existing entity recognition and entity relationship extraction methods according to the characteristics of threat intelligence, so as to generate a high-quality threat intelligence knowledge graph. Next, it performs association analysis of local network vulnerabilities using recent threat intelligence together with local network topology data, and visualizes the vulnerable nodes in the local network topology. Finally, it provides an attack prediction method that combines the network security knowledge graph with observed traffic flow analysis to predict the attack means and attack targets of an attacker. Extensive experiments verify that the threat intelligence data quality improvement algorithm, the resulting threat intelligence quality, and the knowledge graph generated by entity recognition and entity relationship extraction from the intelligence text all outperform existing methods, and that the method has good local network weakness visualization and attack prediction analysis capabilities.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a process diagram of a threat intelligence-based network security knowledge-graph generation method of the present invention;
FIG. 2 is a diagram of a distributed crawler architecture for threat intelligence data collection in the present invention;
FIG. 3 is a probabilistic graphical model diagram of the threat intelligence data quality improvement algorithm in the present invention;
FIG. 4 is a schematic diagram of atomic attack entities and their relationships defined in the present invention;
FIG. 5 is a schematic structural diagram of the BiLSTM-CRF model for network security entity recognition in the present invention;
FIG. 6 is a schematic structural diagram of the Att-BiLSTM model for network security entity relationship extraction in the present invention;
FIG. 7 is a data collection time chart for a distributed crawler system for threat intelligence data collection as developed in the present invention;
FIG. 8 is a graph comparing the effectiveness of a distributed crawler system and a stand-alone crawler system for threat intelligence data collection as developed in the present invention;
FIG. 9 is a diagram of an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The solution of some embodiments of the invention is described below with reference to fig. 1-9.
Example 1
The invention relates to a network security knowledge graph generation method based on threat intelligence, which is shown in figure 1. figure 1 is a process diagram of the network security knowledge graph generation method based on the threat intelligence. The specific generation process of the network security knowledge graph comprises the following steps: the method comprises the steps of efficient distributed threat intelligence data collection, network security data set production, network security threat intelligence data quality improvement, network security entity identification, network security entity relation extraction and data organization. The following steps are described in detail:
The generation of the network security knowledge graph requires a large amount of network security threat information data, and in order to efficiently collect the open source threat information data on the network in real time, the following distributed crawler system is realized for collecting the open source threat information data on the network. The distributed threat intelligence data crawling system is built by a script framework, and a script-redis scheduling crawler program is used for extracting data structuralization and storing the data structuralization into a redis and mongodb database.
(1) Distributed crawler system architecture: the threat intelligence collection system architecture consists of the distributed crawler system and the deployment of the underlying environment. The distributed crawler system is obtained by adapting the conventional Scrapy crawler framework: a Redis database is added to solve the problem that Scrapy does not natively support distributed operation. The underlying environment is a multi-node distributed system built on a Docker container cluster, with the mature Kubernetes used as the cluster management tool. The distributed crawler system adopts a Master/Slave structure with one Master terminal and a plurality of Slave terminals. The Master terminal deploys a Redis database that stores and schedules the requests to be crawled; the Slave terminals deploy the main crawler programs, which crawl web pages and parse the extracted data, and every Slave terminal stores the parsed web page data in the same MongoDB database. FIG. 2 is a diagram of the distributed crawler architecture for threat intelligence data collection in the present invention. Each threat intelligence data item to be crawled is stored in the Redis database and scheduled by the Scrapy engine through the scheduler; when an item is scheduled, the corresponding crawler program (spider) and its middleware are started to download the threat intelligence data.
(2) Crawler strategy: for the Master terminal, an initial link is stored in Redis under a Key; the Key identifies the scheduling queue of pages to be crawled next, and each URL is generally the link of a page. The crawler is then started, obtains the initial URL from Redis, and downloads the data of the web page corresponding to the URL. The response is parsed according to the defined rules to obtain either page data or detail-page links. Page data is parsed directly according to the page format; for a detail-page link, the crawler is started again with the link changed to the detail-page link, and the final detail data is obtained. The crawler keeps fetching URLs from the scheduling queue and crawling the next URL; if there is no URL, it enters a waiting state. For the Slave terminal, the downloader executes the download task and parses the extracted fields. The crawler program obtains a URL from the scheduling queue under the Redis Key and downloads the corresponding web page. The response is parsed according to the defined field rules, and the resulting fields are processed by a text deduplication module and then stored in the MongoDB database. This continues until the Key value is empty.
(3) Crawler implementation: the scheduler module is responsible for scheduling the tasks of the whole system and has the following main functions: receiving requests sent by the engine; returning URLs to the download module; and deduplicating URLs and storing them in the Redis database. Each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in the Redis queue. On receiving a request from the engine, the scheduler returns a URL to the downloader. The crawling and downloading module integrates the functions of the spider and the downloader. The downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider. The spider processes the web page information returned by the downloader, extracts the directory URLs and detail-page URLs it contains, and extracts the key fields, which are then stored in the MongoDB database. The system crawls the corresponding website by first taking the initial URL, extracting URLs after crawling, and returning them to the deduplication module; the scheduling module then distributes URLs from Redis to the Slave nodes.
(4) Data storage: the storage module only needs to implement two functions. First, URLs are stored in Redis, which is deployed on the Master node. Second, the parsed web page content is stored in the MongoDB database, which is also deployed on the Master node. Extracting and storing the web page content is the final goal of the system: after the distributed crawler has crawled the web page content, the content is made available to a data processing program that extracts the required information.
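The scheduling, deduplication and storage flow described in items (2) to (4) can be sketched without any external service. The following is a minimal illustration, not the patented implementation: a Python set and deque stand in for Redis, a list stands in for the MongoDB collection, and the toy `SITE`, `fetch` and `parse` names are hypothetical. A real Slave would wait when the queue is empty; this sketch simply stops.

```python
import hashlib
from collections import deque


class Scheduler:
    """Master-side queue with URL deduplication (a set and a deque stand in for Redis)."""

    def __init__(self, start_urls):
        self.seen = set()
        self.queue = deque()
        for url in start_urls:
            self.push(url)

    def push(self, url):
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.seen:
            return False          # duplicate URL: drop it
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None


def run_slave(scheduler, fetch, parse, store):
    """Slave loop: take a URL, download, parse, store fields, feed new URLs back."""
    while True:
        url = scheduler.pop()
        if url is None:
            break                 # a real Slave would wait here instead of stopping
        fields, links = parse(url, fetch(url))
        if fields is not None:
            store.append(fields)  # the list stands in for the MongoDB collection
        for link in links:
            scheduler.push(link)


# Toy site (hypothetical): a directory page linking to two detail pages.
SITE = {
    "/list": ["/item/1", "/item/2", "/item/1"],
    "/item/1": "first advisory",
    "/item/2": "second advisory",
}


def fetch(url):
    return SITE[url]


def parse(url, page):
    if isinstance(page, list):             # directory page: only links
        return None, page
    return {"url": url, "text": page}, []  # detail page: extracted fields


store = []
run_slave(Scheduler(["/list"]), fetch, parse, store)
```

The repeated link to `/item/1` is dropped by the fingerprint check, so each detail page is stored once.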
Step 2, producing the network security threat intelligence data set.
The network security data is obtained by using the distributed threat intelligence crawling system of step 1 to collect the following 5 kinds of threat intelligence data:
(1) Vulnerability data: collected from the main vulnerability publishing platforms, such as CVE and NVD. The data includes the type of system in which the vulnerability occurs, the system version, the exploitation method and the like.
(2) APT (advanced persistent threat) attack chain data: collected from the APTnodes platform, comprising 528 APT reports from the last 10 years. 50 reports are labeled manually using the BIO labeling method; 40 of them are used to train the deep learning models for entity recognition and entity relationship extraction, and the remaining 10 are used to test the models.
(3) Malware text data: includes the name, category, common functions, hash, targeted system platform and the like of the malware described in the threat intelligence. This data is collected from the threat intelligence source AlienVault.
(4) Security community discussion data: collected from the StackExchange website; the data is mainly text in which security researchers discuss recent security events.
(5) Security RSS subscription data: collected from major network security RSS feeds; the data is mainly recent network security news.
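The BIO labeling mentioned in item (2) can be illustrated as follows. This is a minimal sketch that assumes word-level tokens and a hypothetical label name `SW` (software); the example sentence and spans are made up for illustration and are not taken from the labeled reports.

```python
def bio_tags(tokens, spans):
    """Convert entity spans into BIO tags.

    spans: list of (start, end_exclusive, label) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


# Hypothetical example sentence with two software entities.
tokens = ["The", "attacker", "exploited", "Remote", "Desktop",
          "Services", "on", "Windows", "10"]
spans = [(3, 6, "SW"), (7, 9, "SW")]
tags = bio_tags(tokens, spans)
```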
Step 3, improving the data quality of the network security threat intelligence.
After the network security threat intelligence data set is produced, the quality of the threat intelligence data needs to be improved and its false positive rate reduced, so that a high-quality network security knowledge graph can be generated in the subsequent steps.
The invention adapts an existing truth discovery algorithm to the time-varying characteristics of threat intelligence by introducing the Markov property. FIG. 3 is a probabilistic graphical model diagram of the threat intelligence data quality improvement algorithm in the invention. In the figure, $M_i$ represents the set of model parameters at the $i$-th time instant and $C_i$ represents the prior parameters of model $M_i$ at the $i$-th time instant, where $i = 1, 2, \ldots, N$; the remaining parameters have the same meanings as in the description below.
The invention provides a threat intelligence data quality improvement algorithm model, which comprises the following steps:
Step (1) FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated. Its value is (1 - specificity), and it obeys a Beta distribution with hyper-parameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior count of false positive samples of each source and $\alpha_{0,0}$ is the prior count of true negative samples of each source:
$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$
From the second time node onward, $\phi_k^0$ uses the $\phi_k^0$ of the previous time node instead; the time-varying characteristics of the truth discovery model are calibrated using the second-order Markov property.
Step (2) Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated, obeying a Beta distribution with hyper-parameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior count of true positive samples of each source and $\alpha_{1,0}$ is the prior count of false negative samples of each source:
$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$
Similar to the FPR, from the second time node onward $\phi_k^1$ uses the $\phi_k^1$ of the previous time node instead; the time-varying characteristics of the truth discovery model are calibrated using the second-order Markov property.
Step (3) Att fact (attack tag): for each attribute $f \in F$ belonging to an entity, where $F$ is the set of observations (i.e., the set of collected values) of all attributes under that entity, a prior truth probability $\theta_f$ is generated, obeying a Beta distribution with hyper-parameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of samples in which the entity attribute is correct and $\beta_0$ is the prior count of samples in which the entity attribute is wrong:
$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$
Similar to the FPR and Sensitivity above, from the second time node onward $\theta_f$ uses the $\theta_f$ of the previous time node instead; the time-varying characteristics of the truth discovery model are calibrated using the second-order Markov property.
Step (4) Truth label: the attribute truth label. A truth label is generated for each entity attribute, indicating whether the observed value is correct. $t_f$ is the attribute truth label and obeys a Bernoulli distribution with parameter $\theta_f$, where $t_f$ is a binary Boolean variable and the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:
$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$
Step (5) Observation: the observation label of an entity attribute. For each observation $c \in C_f$ of an entity attribute, its source is denoted $s_c$. The observation label $o_c$ is generated from a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
If $t_f = 0$, $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^0$, the false positive rate of source $s_c$.
If $t_f = 1$, $o_c$ obeys a Bernoulli distribution with parameter $\phi_{s_c}^1$, the sensitivity of source $s_c$.
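Steps (1) to (5) describe a generative process, which can be sampled directly. The sketch below is an illustration under stated assumptions, not the patented implementation: it covers a single time node only (no Markov update), and the default prior counts, the function name `sample_model` and the attribute names are hypothetical.

```python
import random


def sample_model(num_sources, claims, alpha0=(1, 9), alpha1=(9, 1), beta=(1, 1), seed=0):
    """Sample source qualities, truth labels and observations.

    claims: dict mapping an attribute to the list of source ids that report on it.
    alpha0 = (false positive count, true negative count) prior for the FPR.
    alpha1 = (true positive count, false negative count) prior for the sensitivity.
    beta   = (correct count, wrong count) prior for the truth probability.
    """
    rng = random.Random(seed)
    # Steps (1) and (2): one false positive rate and one sensitivity per source.
    fpr = [rng.betavariate(alpha0[0], alpha0[1]) for _ in range(num_sources)]
    sens = [rng.betavariate(alpha1[0], alpha1[1]) for _ in range(num_sources)]
    truths, observations = {}, {}
    for f, sources in claims.items():
        theta = rng.betavariate(beta[0], beta[1])      # step (3): prior truth probability
        t = 1 if rng.random() < theta else 0           # step (4): truth label
        truths[f] = t
        # Step (5): each source reports 1 with probability sens (t = 1) or fpr (t = 0).
        observations[f] = [
            (s, 1 if rng.random() < (sens[s] if t else fpr[s]) else 0)
            for s in sources
        ]
    return fpr, sens, truths, observations


fpr, sens, truths, observations = sample_model(3, {"f1": [0, 1, 2], "f2": [0, 2]})
```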
The model is solved as follows. From the above description, the conditional probability of the observation of each entity attribute under the model is:
$$p(o_c \mid \theta_f, \phi_{s_c}^0, \phi_{s_c}^1) = p(o_c \mid \phi_{s_c}^0)\,(1 - \theta_f) + p(o_c \mid \phi_{s_c}^1)\,\theta_f$$
In the above formula, $p$ is the conditional probability that the observed value of entity $o$ is $c$, given the prior truth probability $\theta_f$, the source sensitivity $\phi_{s_c}^1$ and the source false positive rate $\phi_{s_c}^0$. Here $c$ is the observation, $f$ is the attack tag, and $s_c$ is the source from which observation $c$ comes.
The full likelihood function containing all variables and hyper-parameters can be written as:
$$p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta) = \prod_{s \in S} p(\phi_s^0 \mid \alpha_0)\, p(\phi_s^1 \mid \alpha_1) \times \prod_{f \in F} \Big( p(\theta_f \mid \beta)\, \theta_f^{t_f} (1 - \theta_f)^{1 - t_f} \prod_{c \in C_f} p(o_c \mid \phi_{s_c}^{t_f}) \Big)$$
In the above formula, $p$ is the joint conditional probability of the entity $o$, the source $s$, the truth label $t$, the prior probability parameter set $\theta$ and the source quality parameter sets $\phi^0, \phi^1$, given the false positive rate and sensitivity hyper-parameters $\alpha_0, \alpha_1$ and the prior truth probability hyper-parameter $\beta$. Here $S$ is the set of all sources, $F$ is the attack tag set, $f$ is each attack tag element of $F$, $\theta_f$ is the prior probability of $f$, $t_f$ is the truth value of $f$, $C_f$ is the set of observations of $f$, and $c$ is each observation element in that set.
Given the observed attribute data, the likelihood function can be solved using the Gibbs sampling algorithm from the MCMC family:
$$t_{map} = \arg\max_{t} \int\!\!\int\!\!\int p(o, s, t, \theta, \phi^0, \phi^1)\, d\theta\, d\phi^0\, d\phi^1$$
where $t_{map}$ is the result of maximum a posteriori estimation of the above formula, and the remaining parameters have the same meanings as the parameters with the same names above.
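Solving yields the following formula:
$$p(t_f = i \mid t_{-f}, o, s) \propto \beta_i \prod_{c \in C_f} \frac{\alpha_{i,o_c} + n^{-f}_{s_c,i,o_c}}{\alpha_{i,1} + \alpha_{i,0} + n^{-f}_{s_c,i,1} + n^{-f}_{s_c,i,0}}$$
where $p$ is the conditional probability that the truth value $t_f$ of $f$ takes the value $i$, given $t_{-f}$, the entity $o$ and the source $s$; $i$ is the attack label value of $f$, with value range $\{0, 1\}$; and $t_{-f}$ is the set of truth values of all elements of $F$ except $f$.
$$n^{-f}_{s_c,i,j} = \big|\{\, c' \in C_{-f} \mid s_{c'} = s_c,\ t_{f_{c'}} = i,\ o_{c'} = j \,\}\big|$$
is the number of observations made by source $s_c$ whose observed value is $j$, whose attack tag is not $f$, and whose truth label is $i$. $C_{-f}$ is the set of observations excluding those of attack tag $f$, $c'$ is each element of $C_{-f}$, $t_{f_{c'}}$ is the truth value of the attack tag to which $c'$ belongs, and the remaining parameters have the same meanings as the parameters with the same names above.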
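A collapsed Gibbs sampler for the conditional probability $p(t_f = i \mid t_{-f}, o, s)$ can be sketched as follows. This is a minimal illustration under assumptions: observations are binary, each source makes at most one claim per attribute, only one time node is handled, and the prior counts and toy claims are hypothetical. Here `alpha[i][j]` is the prior count for truth label `i` and observation `j`, and `beta[i]` is the prior count for truth label `i`.

```python
import random


def gibbs_truth(claims, num_facts, num_sources, alpha, beta, iters=200, burn_in=50, seed=0):
    """Collapsed Gibbs sampling of truth labels.

    claims: list of (fact, source, observation) with observation in {0, 1}.
    Returns the estimated probability that each fact is true.
    """
    rng = random.Random(seed)
    by_fact = [[] for _ in range(num_facts)]
    for f, s, o in claims:
        by_fact[f].append((s, o))
    truth = [rng.randint(0, 1) for _ in range(num_facts)]
    # n[s][i][j]: number of claims by source s with truth label i and observation j
    n = [[[0, 0], [0, 0]] for _ in range(num_sources)]
    for f, s, o in claims:
        n[s][truth[f]][o] += 1
    prob_true = [0.0] * num_facts
    for it in range(iters):
        for f in range(num_facts):
            # Remove the claims of fact f so that n holds the counts without f.
            for s, o in by_fact[f]:
                n[s][truth[f]][o] -= 1
            weight = [0.0, 0.0]
            for i in (0, 1):
                w = beta[i]
                for s, o in by_fact[f]:
                    w *= (n[s][i][o] + alpha[i][o]) / (
                        n[s][i][0] + n[s][i][1] + alpha[i][0] + alpha[i][1])
                weight[i] = w
            truth[f] = 1 if rng.random() < weight[1] / (weight[0] + weight[1]) else 0
            for s, o in by_fact[f]:
                n[s][truth[f]][o] += 1
            if it >= burn_in:
                prob_true[f] += truth[f] / (iters - burn_in)
    return prob_true


# Toy data (hypothetical): three sources assert facts 0-2 and deny facts 3-5.
claims = [(f, s, 1) for f in range(3) for s in range(3)]
claims += [(f, s, 0) for f in range(3, 6) for s in range(3)]
alpha = [[9, 1], [1, 9]]   # alpha[0] = (true negative, false positive), alpha[1] = (false negative, true positive)
beta = [1, 1]
prob_true = gibbs_truth(claims, num_facts=6, num_sources=3, alpha=alpha, beta=beta)
```

The asymmetric prior counts in `alpha` encode the assumption that sources are mostly reliable, which keeps the sampler from settling on the mirrored labeling.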
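Once $p(t_f = i \mid t_{-f}, o, s)$ is obtained, the FPR (false positive rate) and Sensitivity at the next time can be estimated as:
$$\mathrm{sensitivity}(s) = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,1,0}] + E[n_{s,1,1}] + \alpha_{1,0} + \alpha_{1,1}}, \qquad \mathrm{FPR}(s) = \frac{E[n_{s,0,1}] + \alpha_{0,1}}{E[n_{s,0,0}] + E[n_{s,0,1}] + \alpha_{0,0} + \alpha_{0,1}}$$
where
$$E[n_{s,i,j}] = \sum_{c \in C,\ s_c = s,\ o_c = j} p(t_{f_c} = i)$$
is, over the elements $c$ of the observation sets $C_f$ of all attack labels $f$ for which the source $s_c$ making observation $c$ is $s$ and the observed value $o_c$ of the attack tag of entity $o$ is $j$, the sum of the probabilities that the truth label takes the value $i$, where $i \in \{0, 1\}$, $j \in \{0, 1, \ldots, |F|\}$, $|F|$ is the number of elements of the attack set $F$, and the remaining parameters have the same meanings as the parameters with the same names above. Finally, the precision of each source can be estimated in the same way:
$$\mathrm{precision}(s) = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,0,1}] + E[n_{s,1,1}] + \alpha_{0,1} + \alpha_{1,1}}$$
where precision is the precision of each source, and the remaining parameters have the same meanings as above.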
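The source quality estimates can be computed from the truth probabilities as in the sketch below. It assumes binary observations and uses the same hypothetical `alpha[i][j]` prior-count layout (truth label `i`, observation `j`); the toy claims and probabilities are made up for illustration.

```python
def source_quality(claims, prob_true, num_sources, alpha):
    """Return (sensitivity, false positive rate, precision) for each source.

    claims: list of (fact, source, observation) with observation in {0, 1}.
    prob_true[f]: estimated probability that fact f is true.
    """
    # expected[s][i][j]: expected number of claims by s with truth i and observation j
    expected = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(num_sources)]
    for f, s, o in claims:
        expected[s][1][o] += prob_true[f]
        expected[s][0][o] += 1.0 - prob_true[f]
    result = []
    for s in range(num_sources):
        e = expected[s]
        sens = (e[1][1] + alpha[1][1]) / (e[1][0] + e[1][1] + alpha[1][0] + alpha[1][1])
        fpr = (e[0][1] + alpha[0][1]) / (e[0][0] + e[0][1] + alpha[0][0] + alpha[0][1])
        prec = (e[1][1] + alpha[1][1]) / (e[0][1] + e[1][1] + alpha[0][1] + alpha[1][1])
        result.append((sens, fpr, prec))
    return result


# Toy data (hypothetical): fact 0 is true, fact 1 is false;
# source 0 reports both correctly, source 1 reports both wrongly.
claims = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
quality = source_quality(claims, prob_true=[1.0, 0.0], num_sources=2,
                         alpha=[[1, 1], [1, 1]])
```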
Entities and relationships are defined as follows:
First, the concepts of network security entities and of the relationships between entities are defined. A knowledge graph reflects specific pieces of information and the associations between them, and an entity is an abstract expression of a concept and of the relations between concepts, so a good entity definition helps to express clearly the information and relationships contained in the knowledge graph. Here, an atomic attack is used to describe a network security entity; an atomic attack is the smallest attack unit in a single attack and can be understood as the smallest step of the attack.
FIG. 4 is a schematic diagram of the atomic attack entities and their relationships defined in the present invention. In the atomic attack graph, an atomic attack is represented by a vertex and in practice denotes one exploitation of a vulnerability. Vulnerabilities are tied to software and hardware, and carrying out an attack depends on the attack condition, the attack mode, the attack effect and the like. The invention therefore designs 4 entities for atomic attacks: software, hardware, vulnerability and attack, where an attack has 3 attributes: attack condition, attack mode and attack effect. 2 relationships between entities are defined: "existence" and "utilization".
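The four entity types and two relationships can be captured in a small schema. The following sketch is illustrative only: the direction of each relationship (software or hardware "existence" vulnerability, attack "utilization" vulnerability) is an assumption, as are the class names and the example attribute values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str   # "software", "hardware", "vulnerability" or "attack"
    name: str


@dataclass(frozen=True)
class Attack(Entity):
    condition: str = ""   # attack condition
    mode: str = ""        # attack mode
    effect: str = ""      # attack effect


# Assumed relationship directions (not stated explicitly in the description).
ALLOWED = {
    ("software", "existence", "vulnerability"),
    ("hardware", "existence", "vulnerability"),
    ("attack", "utilization", "vulnerability"),
}


def add_relation(graph, head, relation, tail):
    """Add a (head, relation, tail) triple after checking it against the schema."""
    if (head.kind, relation, tail.kind) not in ALLOWED:
        raise ValueError(f"relation not allowed: {head.kind} {relation} {tail.kind}")
    graph.add((head, relation, tail))


graph = set()
win10 = Entity("software", "Windows 10")
vuln = Entity("vulnerability", "remote desktop services remote code execution")
attack = Attack("attack", "remote code execution attack",
                condition="remote desktop service reachable",   # hypothetical values
                mode="crafted request", effect="code execution")
add_relation(graph, win10, "existence", vuln)
add_relation(graph, attack, "utilization", vuln)
```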
Step 4, performing network security entity identification on the produced network security threat intelligence data set.
As described above, for the APT reports of step 2, each sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$ in an APT report document is annotated with the BIO labeling method, where $x_i$ is the $i$-th character of sentence $X$. Under BIO labeling, identifying the network security entities in sentence $X$ is equivalent to producing a label sequence $L_X = [l]_N$.
The invention trains a BiLSTM-CRF (bidirectional long short-term memory artificial neural network with conditional random field) model on the labeled APT report documents. FIG. 5 is a structural schematic diagram of the BiLSTM-CRF model used for network security entity identification in the invention. In the figure, CRF represents the conditional random field; bi represents the output of the i-th backward network; fi represents the output of the i-th forward network; ci represents the i-th text vector; and B-LOC, E-LOC and O in the CRF layer represent start, end and outside, respectively. Through its forward and backward passes, the model simultaneously extracts the features of the words before and after the i-th character, which improves its ability to learn each word. The CRF (conditional random field) model is used to obtain the conditional probability distribution of one set of output random variables given a set of input random variables.
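The CRF model is as follows: given an input sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$, let $S$ be the output score matrix of the BiLSTM (bidirectional long short-term memory artificial neural network) network, with dimension $N \times K$, where $K$ is the number of label types and $S_{i,j}$ is the score of the $j$-th label for the $i$-th word. The score $Z$ of a predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is defined as:
$$Z(X, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} S_{i, y_i}$$
where $T$ is a probability transition matrix of dimension $K + 2$; the two additional labels $y_0$ and $y_{N+1}$ mark the start and end of the sentence. The probability of generating the tag sequence $y$ is:
$$p(y \mid X) = \frac{\exp(Z(X, y))}{\sum_{\tilde{y} \in Y_X} \exp(Z(X, \tilde{y}))}$$
where $Y_X$ is the set of all possible label sequences for $X$. The log-likelihood of the correct labeling is then obtained using maximum likelihood estimation:
$$\log p(y \mid X) = Z(X, y) - \log \sum_{\tilde{y} \in Y_X} \exp(Z(X, \tilde{y}))$$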
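The sequence score and its log-likelihood can be computed with the forward algorithm, as sketched below in plain Python. This is an illustration under assumptions: indices `K` and `K + 1` of the transition matrix are taken to be the start and end labels, and the small score matrices are made-up numbers rather than outputs of a trained BiLSTM.

```python
import math
from itertools import product


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def path_score(S, T, y):
    """Z(X, y): transition scores (with start tag K and end tag K + 1) plus emission scores."""
    K = len(S[0])
    tags = [K] + list(y) + [K + 1]
    score = sum(T[a][b] for a, b in zip(tags, tags[1:]))
    score += sum(S[i][tag] for i, tag in enumerate(y))
    return score


def log_partition(S, T):
    """log of the sum of exp(Z) over all label sequences, by the forward algorithm."""
    K = len(S[0])
    start, end = K, K + 1
    alpha = [T[start][j] + S[0][j] for j in range(K)]
    for i in range(1, len(S)):
        alpha = [logsumexp([alpha[k] + T[k][j] for k in range(K)]) + S[i][j]
                 for j in range(K)]
    return logsumexp([alpha[j] + T[j][end] for j in range(K)])


def log_likelihood(S, T, y):
    return path_score(S, T, y) - log_partition(S, T)


# Made-up scores: N = 3 words, K = 2 labels, so T is 4 x 4.
S = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
T = [[0.1, -0.2, 0.0, 0.3],
     [0.4, 0.2, 0.0, -0.1],
     [0.5, -0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

# The probabilities of all K**N label sequences should sum to 1.
total = sum(math.exp(log_likelihood(S, T, list(y)))
            for y in product(range(2), repeat=len(S)))
```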
Step 5, extracting the network security entity relationships.
Network security entity relationship extraction uses an attention-based bidirectional long short-term memory artificial neural network (Att-BiLSTM) model. The model is mainly divided into 5 layers: an input layer, a word embedding layer, a BiLSTM layer, an Attention layer and an output layer (the CRF layer of the BiLSTM-CRF model is replaced by the Attention layer, and the output layer becomes a softmax layer). FIG. 6 is a structural schematic diagram of the Att-BiLSTM model for network security entity relationship extraction in the present invention. In the figure, Si represents the i-th text vector; O, B-A and I-A in the output layer represent outside, beginning of A and inside of A, respectively.
The word embedding layer represents a sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$ of the APT report as a matrix; words with similar meanings lie close together in the space of the matrix, which indicates that they may be related.
The Attention layer introduces the idea of weighting to highlight the significant parts of the output. Let the output of the BiLSTM layer be $B = [b]_T = [b_1, \ldots, b_j, \ldots, b_T]$; the parameter matrix $W$ then satisfies the following formulas:
$$S = \tanh(B)$$
$$\alpha = \mathrm{softmax}(W^T S)$$
$$r = B \alpha^T$$
where $\alpha$ is the attention weight coefficient and $r$ is the result of the weighted summation of the BiLSTM output $B$. Finally, a representation vector $B^* = \tanh(r)$ is generated by a nonlinear function. $B^*$ is then input into a fully connected neural network and mapped to a label vector, and the predicted label is obtained through a softmax function.
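The attention computation can be sketched in plain Python as follows. This is a minimal illustration, assuming the parameter $W$ is a single column vector `w` so that each time step gets one scalar score; the hidden vectors are made-up numbers, not BiLSTM outputs.

```python
import math


def attention_pool(B, w):
    """Attention pooling over hidden states.

    B: list of T hidden vectors b_j (each of length d); w: parameter vector of length d.
    Returns the attention weights and the representation tanh(r).
    """
    # Score of each time step: w^T tanh(b_j)
    scores = [sum(wk * math.tanh(bk) for wk, bk in zip(w, b)) for b in B]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]     # numerically stable softmax
    total = sum(exps)
    alpha = [e / total for e in exps]
    d = len(w)
    # r = B alpha^T: weighted sum of the hidden vectors
    r = [sum(a * b[k] for a, b in zip(alpha, B)) for k in range(d)]
    return alpha, [math.tanh(x) for x in r]


# Made-up hidden states for T = 3 time steps and d = 2 dimensions.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, representation = attention_pool(B, w=[1.0, 1.0])
```

In this toy input the third time step has the highest score and therefore receives the largest weight.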
Step 6, organizing data.
Because threat intelligence data is multi-source and heterogeneous, the method uses a non-relational database, MongoDB, to organize and store the data, with all data stored as key-value pairs. MongoDB offers high performance and flexible data storage, and is suitable for storing the threat intelligence and the generated network security knowledge graph model.
In the implementation of the invention, the software environment is Windows 10, the implementation language is Python 3, the deep learning framework is PyTorch, and the database is the non-relational database MongoDB.
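One possible key-value layout for the extracted triples is sketched below. The document shape (one document per head entity with a `relations` map) and the function name are assumptions for illustration, not the storage schema of the invention; the resulting dictionaries could then be written to a MongoDB collection, for example with pymongo's `insert_many`.

```python
import json


def triples_to_documents(triples):
    """Group (head, relation, tail) triples into one key-value document per head entity."""
    docs = {}
    for head, relation, tail in triples:
        doc = docs.setdefault(head, {"_id": head, "relations": {}})
        doc["relations"].setdefault(relation, []).append(tail)
    return list(docs.values())


# Hypothetical triples for illustration.
triples = [
    ("Windows 10", "existence", "remote desktop services remote code execution vulnerability"),
    ("attack-1", "utilization", "remote desktop services remote code execution vulnerability"),
]
documents = triples_to_documents(triples)
serialized = json.dumps(documents, indent=2)
```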
Example 2
The embodiment provides a network security knowledge graph generating method based on threat intelligence, and the method is used for testing a distributed threat intelligence crawling system.
The developed distributed threat intelligence crawling system is compared with a single-machine threat intelligence collection system to verify that it is more efficient. Taking a common open-source threat intelligence source as an example, the distributed crawler system is configured with 1 master node and 2 slave nodes; after 5 days of continuous operation, a total of 11 thousand pieces of web page data are stored in the database. The number of pages crawled at each point in time is shown in FIG. 7, which is a time chart of data collection by the distributed crawler system for threat intelligence data collection developed in the present invention.
In the experiment, the total number of pages crawled by the 2 Slave nodes within a given time is far higher than the number crawled in single-machine operation, which demonstrates that the distributed system does improve operating efficiency. The distributed crawler system is also tested against a crawler running in a single-machine environment, and the number of pages crawled by each is recorded. The distributed crawler project is deployed in a Docker container cluster and in a virtual machine cluster, with the following configuration: Master1 and Slave2 run Ubuntu 16.04 with Python 2.7, each with 8 GB of memory. FIG. 8 compares the operating efficiency over time of the distributed crawler system for threat intelligence data collection developed in the present invention with that of a stand-alone crawler system. The number of pages grabbed by the crawlers at each time point shows that the distributed crawler system is clearly superior to the single-machine crawler system.
Example 3
This embodiment provides a network security knowledge graph generation method based on threat intelligence and compares the effect of the threat intelligence data quality improvement algorithm.
The algorithm proposed by the invention is compared with other truth discovery algorithms on improving the quality of entity attributes in threat intelligence data. The test criteria are the accuracy, recall and F1 value commonly used for truth discovery models. The truth discovery algorithms compared are 3-Estimates, Voting and LTM. The comparison results are shown in Table 1. It can be seen that the quality improvement algorithm of the invention improves the quality of threat intelligence data better than the existing algorithms.
Table 1 compares the effects of different data quality improvement algorithms in the embodiment of the present invention.
Algorithm | Accuracy | Recall | F1 value |
Proposal | 0.935 | 0.960 | 0.987 |
3-Estimates | 0.874 | 0.903 | 0.927 |
Voting | 0.840 | 0.867 | 0.913 |
LTM | 0.924 | 0.865 | 0.966 |
In the table: Proposal represents the algorithm proposed by the invention, 3-Estimates represents the 3-Estimates parameter estimation algorithm, Voting represents the voting algorithm, and LTM represents the latent truth model algorithm.
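For reference, the evaluation criteria named above can be computed from binary predictions as in the sketch below. The labels are made-up numbers for illustration and are unrelated to the values reported in Table 1.

```python
def classification_metrics(predicted, actual):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 2 false positives, 1 false negative, 2 true negatives.
predicted = [1, 1, 1, 1, 0, 0, 1, 0]
actual = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(predicted, actual)
```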
Example 4
This embodiment provides a network security knowledge graph generation method based on threat intelligence and compares network security entity identification effects in threat intelligence.
The network security entity identification model of the invention and existing entity identification models are tested on the remaining 10 labeled APT report documents. The test criteria are the accuracy, precision, recall and F1 value commonly used in entity identification. The entity recognition models compared are CRF, LSTM and LSTM-CRF. The comparison results are shown in Table 2. It can be seen that the network security entity identification model provided by the invention identifies network security entities in threat intelligence better than the existing models.
Table 2 compares the effects of different network security entity identification models in the embodiment of the present invention.
In the table: CRF represents the conditional random field algorithm, LSTM represents the long short-term memory artificial neural network algorithm, BiLSTM represents the bidirectional long short-term memory artificial neural network algorithm, and BiLSTM-CRF represents the bidirectional long short-term memory artificial neural network with conditional random field algorithm.
Example 5
This embodiment provides a network security knowledge graph generation method based on threat intelligence and compares network security entity relationship extraction effects in threat intelligence.
The network security entity relationship extraction model of the invention and existing entity relationship extraction models are tested on the remaining 10 APT report documents. The test criteria are the precision, recall and F1 value commonly used in entity relationship extraction. The entity relationship extraction models compared are CRF, LSTM, BiLSTM and BiLSTM-CRF. The comparison results are shown in Table 3. It can be seen that the network security entity relationship extraction model provided by the invention extracts network security entity relationships in threat intelligence better than the existing models.
Table 3 compares the effects of different network security entity relationship extraction models in the embodiment of the present invention.
Model | Accuracy | Precision | Recall | F1 value |
CRF | 0.9041 | 0.8084 | 0.7963 | 0.7892 |
LSTM | 0.9163 | 0.8162 | 0.8046 | 0.8018 |
BiLSTM | 0.9265 | 0.8339 | 0.8262 | 0.8491 |
BiLSTM-CRF | 0.9374 | 0.8674 | 0.8344 | 0.8411 |
BiLSTM-CRF-Attention | 0.9405 | 0.8652 | 0.8748 | 0.8751 |
In the table: BiLSTM-CRF-Attention represents the bidirectional long short-term memory artificial neural network with conditional random field and attention mechanism algorithm.
Example 6
The embodiment provides a network security knowledge graph generating method based on threat intelligence, and relates to a network security knowledge graph example based on the threat intelligence.
FIG. 9 is a diagram of an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.
After network security entity identification and relationship extraction have been performed on the various threat intelligence data, the threat-intelligence-based network security knowledge graph can effectively organize the entity data and relationships in the various threat intelligence and support association analysis of the data. The associated data stored in MongoDB is displayed visually in FIG. 9 using the graphviz module in Python 3. The graph shows that the Win10 system in the Windows family has remote desktop services remote code execution vulnerabilities, and that four vulnerabilities, namely CVE-2019-., can be exploited. CVE stands for Common Vulnerabilities and Exposures.
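A graph of this kind can be described in the Graphviz DOT language, which the `graphviz` Python package can render (for example through its `Source` class). The sketch below only builds the DOT text with the standard library; the node names and the placeholder identifier `CVE-XXXX-NNNN` are hypothetical and are not the vulnerabilities shown in FIG. 9, and node names are assumed not to contain double quotes.

```python
def to_dot(triples):
    """Render (head, relation, tail) triples as Graphviz DOT text."""
    lines = ["digraph knowledge_graph {"]
    for head, relation, tail in triples:
        lines.append(f'  "{head}" -> "{tail}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical triples; CVE-XXXX-NNNN is a placeholder, not a real identifier.
triples = [
    ("Win10", "existence", "remote desktop services remote code execution vulnerability"),
    ("attack", "utilization", "CVE-XXXX-NNNN"),
]
dot_text = to_dot(triples)
```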
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (10)
1. A network security knowledge graph generation method based on threat intelligence, characterized by comprising the following steps:
step 1, efficient distributed threat intelligence data collection, wherein a distributed threat intelligence data crawling system is built on the Scrapy framework, crawler programs are scheduled with scrapy-redis, and the extracted data is structured and then stored in Redis and MongoDB databases;
step 2, producing a network security threat intelligence data set through the distributed threat intelligence crawling system;
step 3, improving the data quality of the network security threat intelligence;
step 4, performing network security entity identification on the network security threat intelligence data set produced from the threat intelligence data;
step 5, extracting the network security entity relationships;
step 6, organizing data.
2. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the high efficiency distributed threat intelligence data collection comprises: distributed crawler system architecture, crawler strategy, crawler implementation and data storage.
3. The method of claim 2, wherein the distributed crawler system architecture comprises: the threat intelligence collection system architecture consists of the distributed crawler system and the deployment of the underlying environment; the distributed crawler system is obtained by adapting the conventional Scrapy crawler framework, with a Redis database added; the underlying environment is a multi-node distributed system built on a Docker container cluster, with Kubernetes used as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with one Master terminal and a plurality of Slave terminals, the Master terminal deploys a Redis database that stores and schedules the requests to be crawled, the Slave terminals deploy the main crawler programs, which crawl web pages and parse the extracted data, and every Slave terminal stores the parsed web page data in the same MongoDB database.
4. The method of claim 2, wherein the method comprises: the crawler strategy comprises: for a Master terminal, storing an initial link in Redis, wherein Key is a next crawled page in a scheduling queue, and URL is a link of a certain page generally; then, a crawler is started, a starting URL is obtained from the Redis, and data of a webpage corresponding to the URL are downloaded; analyzing according to a defined relevant rule from response to obtain page data or a detail page link, analyzing according to a page format for the condition of directly being the page data, restarting a crawler for the condition of the detail page link, modifying the link into the detail page link, and acquiring final detail data; the crawler program continuously acquires the URL from the scheduling queue and crawls the next URL; if no URL exists, entering a waiting state; for the Slave end, a downloader executes a downloading task and analyzes an extracted field; the crawler program acquires URL from a scheduling queue of Key of Redis, and then downloads a corresponding webpage; and resolving the response according to the well-defined field rule, and storing the corresponding field into the MongoDB database after the corresponding field is processed by the text duplication removal module until the Key value is null.
5. The method of claim 2, wherein the crawler implementation comprises: a scheduler module, which is responsible for task scheduling of the whole system and mainly performs the following functions: receiving requests sent by the engine; returning URLs to the download module; and deduplicating URLs and storing them in the Redis database; each crawler subtask passes the URLs it obtains to the scheduler through the engine, the scheduler deduplicates them and stores them in the Redis queue, and on receiving a request from the engine the scheduler returns a URL to the downloader; a crawling and downloading module, which integrates the functions of the spider and the downloader: the downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider; the spider processes the web page information returned by the downloader, extracts the directory URLs and detail-page URLs it contains, and extracts the key fields, which are then stored in the MongoDB database; this module is responsible for crawling the corresponding websites: it first takes the start URL, crawls it, extracts further URLs, and deduplicates them; the scheduling module then distributes URLs from Redis to the Slave nodes;
the data storage comprises: a storage module that implements two functions: URLs are stored in Redis, which is deployed on the Master node; the parsed web page content is stored in a MongoDB database, which is deployed on the Master node; extracting information from the stored web page content is the ultimate goal of the system, and the distributed crawler crawls the web page content so that a data processing program can extract the required information.
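The scheduler's deduplicate-then-queue behaviour can be illustrated with a short sketch. It assumes that deduplication is done on a hash fingerprint of a lightly normalised URL; the patent does not say how duplicates are detected, so the SHA-1 fingerprint, the `Scheduler` class, and its method names are all hypothetical. A `set` and a `deque` stand in for the Redis structures.

```python
import hashlib
from collections import deque

class Scheduler:
    """Deduplicating URL scheduler; a set and a deque stand in for Redis."""

    def __init__(self):
        self._fingerprints = set()
        self._queue = deque()

    @staticmethod
    def fingerprint(url):
        # Trivial normalisation (strip whitespace and trailing slash) before hashing.
        canonical = url.strip().rstrip("/")
        return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

    def enqueue(self, url):
        """Queue the URL unless it was seen before; return True if it was queued."""
        fp = self.fingerprint(url)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the next URL for a downloader, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None
```

Storing fingerprints instead of full URLs keeps the deduplication set compact, which matters when the set is shared by all Slave nodes.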
6. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the network security threat intelligence data set is produced by the distributed threat intelligence crawling system and comprises:
(1) vulnerability data: collected from the main vulnerability publishing platforms; the data fields include the type of system in which the vulnerability occurs, the system version, and the exploitation method;
(2) APT attack chain data: collected from the APTnodes platform; a total of 528 APT reports from the last 10 years;
(3) malware text data: including the name, category, common functions, hash, and targeted system platform of the malware named in threat intelligence; this data is collected from the threat intelligence source AlienVault;
(4) security community discussion data: collected from the StackExchange website, consisting of text about recent security events;
(5) security RSS subscription data: collected from the major network security RSS feeds, consisting of recent network security news.
7. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: improving the quality of the network security threat intelligence data comprises the following steps:
step (1), FPR (false positive rate): for each source $k \in S$, a false positive rate $\phi_k^0$ is generated, whose value is (1 − specificity); it follows a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior count of false positive samples for each source and $\alpha_{0,0}$ is the prior count of true negative samples for each source:
$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$
from the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node, so that the time-varying characteristic of the truth discovery model is calibrated using a second-order Markov scheme;
step (2), Sensitivity: for each source $k \in S$, a sensitivity $\phi_k^1$ is generated; it follows a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior count of true positive samples for each source and $\alpha_{1,0}$ is the prior count of false negative samples for each source:
$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$
from the second time node onward, $\phi_k^1$ is replaced by the $\phi_k^1$ of the previous time node, so that the time-varying characteristic of the truth discovery model is calibrated using a second-order Markov scheme;
step (3), Att (attack label of a fact): for each entity attribute $f \in F$, where $F$ is the set of observed values of all attributes under the entity, a prior truth probability $\theta_f$ is generated; it follows a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of samples in which the entity attribute is correct and $\beta_0$ is the prior count of samples in which it is wrong:
$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$
from the second time node onward, $\theta_f$ is replaced by the $\theta_f$ of the previous time node, so that the time-varying characteristic of the truth discovery model is calibrated using a second-order Markov scheme;
step (4), Truth label: the attribute truth label states, for each entity attribute, whether the observed value is correct; $t_f$ is the attribute truth label and follows a Bernoulli distribution with parameter $\theta_f$, where $t_f$ is a binary Boolean variable and the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:
$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$
step (5), Observation: the entity attribute observation label; for each observation $c \in C_f$ of an entity attribute, its source is denoted $s_c$; the observation label $o_c$ of observation $c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
wherein, if $t_f = 0$, $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^0$, the false positive rate of source $s_c$;
if $t_f = 1$, $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^1$, the sensitivity of source $s_c$.
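The generative process of steps (1) to (5) can be made concrete with a small sampling sketch. It uses only Python's standard `random` module; the function name `sample_model`, the argument layout, and the representation of observations as `(fact, source)` pairs are illustrative assumptions, and the time-node update is left out.

```python
import random

def sample_model(sources, facts, claims, alpha0, alpha1, beta, seed=0):
    """Draw one sample from the generative process of steps (1)-(5).

    claims: list of (fact, source) pairs, one per observation c.
    alpha0 = (a01, a00), alpha1 = (a11, a10), beta = (b1, b0),
    in the same order as the hyperparameters in the text.
    """
    rng = random.Random(seed)
    # Step (1): false positive rate phi_k^0 ~ Beta(a01, a00) for each source
    fpr = {k: rng.betavariate(alpha0[0], alpha0[1]) for k in sources}
    # Step (2): sensitivity phi_k^1 ~ Beta(a11, a10) for each source
    sens = {k: rng.betavariate(alpha1[0], alpha1[1]) for k in sources}
    # Step (3): prior truth probability theta_f ~ Beta(b1, b0) for each fact
    theta = {f: rng.betavariate(beta[0], beta[1]) for f in facts}
    # Step (4): truth label t_f ~ Bernoulli(theta_f)
    truth = {f: int(rng.random() < theta[f]) for f in facts}
    # Step (5): observation o_c ~ Bernoulli(phi_{s_c}^{t_f})
    obs = []
    for f, s in claims:
        p = sens[s] if truth[f] == 1 else fpr[s]
        obs.append((f, s, int(rng.random() < p)))
    return fpr, sens, theta, truth, obs
```

Sampling forward in this way is mainly useful for generating synthetic data on which an inference procedure can be checked before it is applied to crawled intelligence.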
The model is solved as follows:
the conditional probability of the observation $c$ of each entity attribute under the model is:
$$p(o_c \mid \theta_f, \phi_{s_c}^0, \phi_{s_c}^1) = p(o_c \mid \phi_{s_c}^0)\,(1 - \theta_f) + p(o_c \mid \phi_{s_c}^1)\,\theta_f$$
in the above formula, $p$ is the conditional probability that the observed value of entity $o$ is $c$, given the prior truth probability $\theta_f$, the source false positive rate $\phi_{s_c}^0$, and the source sensitivity $\phi_{s_c}^1$; $c$ is the observation, $f$ is the attack label, and $s_c$ is the source that made observation $c$;
the full likelihood function containing all variables and hyperparameters is written as:
$$p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta) = \prod_{s \in S} p(\phi_s^0 \mid \alpha_0)\, p(\phi_s^1 \mid \alpha_1) \times \prod_{f \in F} \Big( p(\theta_f \mid \beta)\, \theta_f^{t_f} (1 - \theta_f)^{1 - t_f} \prod_{c \in C_f} p(o_c \mid \phi_{s_c}^{t_f}) \Big)$$
in the above formula, $p$ is the joint probability of the entity $o$, the source $s$, the truth label $t$, the prior probability parameter set $\theta$, and the false positive rate and sensitivity parameter sets $\phi^0, \phi^1$, given the hyperparameters $\alpha_0, \alpha_1$ and the prior truth probability hyperparameter $\beta$; $S$ is the set of all sources, $F$ is the set of attack labels, $f$ is each attack label element of $F$, $\theta_f$ is the prior probability of $f$, $t_f$ is the truth value of $f$, $C_f$ is the set of observations of $f$, and $c$ is each observation element in that set;
given the observation data of the attributes, the likelihood function is solved with the Gibbs Sampling algorithm, one of the MCMC algorithms:
$$t_{map} = \arg\max_{t} \iiint p(o, s, t, \theta, \phi^0, \phi^1)\, d\theta\, d\phi^0\, d\phi^1$$
where $t_{map}$ is the result of maximum a posteriori estimation, and the remaining parameters have the same meanings as the parameters of the same names above;
the following formula is solved:
$$p(t_f = i \mid t_{-f}, o, s) \propto \beta_i \prod_{c \in C_f} \frac{n^{-f}_{s_c, i, o_c} + \alpha_{i, o_c}}{n^{-f}_{s_c, i, 1} + n^{-f}_{s_c, i, 0} + \alpha_{i,1} + \alpha_{i,0}}$$
wherein $p$ is the conditional probability that the truth value $t_f$ of $f$ takes the value $i$, given $t_{-f}$, the entity $o$, and the source $s$; $i$ is the attack label value of $f$, with range $\{0, 1\}$; $t_{-f}$ is the set of truth values of all elements of $F$ except $f$;
$$n^{-f}_{s_c, i, j} = \big|\{\, c' \in C_{-f} : s_{c'} = s_c,\ t_{f_{c'}} = i,\ o_{c'} = j \,\}\big|$$
is the number of observations made by source $s_c$ whose observed value is $j$, whose attack label is not $f$, and whose truth label is $i$; $C_{-f}$ is the set of observations that do not belong to attack label $f$, $c'$ is each element of $C_{-f}$, $t_{f_{c'}}$ is the truth value of the attack label to which $c'$ belongs, and the remaining parameters have the same meanings as the parameters of the same names above;
once $p(t_f = i \mid t_{-f}, o, s)$ has been obtained, the FPR false positive rate and the Sensitivity for the next time node are estimated as:
$$\phi_s^0 = \frac{E[n_{s,0,1}] + \alpha_{0,1}}{E[n_{s,0,0}] + E[n_{s,0,1}] + \alpha_{0,0} + \alpha_{0,1}}, \qquad \phi_s^1 = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,1,0}] + E[n_{s,1,1}] + \alpha_{1,0} + \alpha_{1,1}}$$
wherein
$$E[n_{s,i,j}] = \sum_{c \in C_f,\ s_c = s,\ o_c = j} p(t_f = i)$$
is the sum, over every element $c$ of the observation set $C_f$ of each attack label $f$ for which the source $s_c$ is $s$ and the observed value $o_c$ is $j$, of the probability that the truth label takes the value $i$, where $i \in \{0, 1\}$, $j \in \{0, 1, \dots, |F|\}$, and $|F|$ is the number of elements of the attack set $F$; the remaining parameters have the same meanings as the parameters of the same names above; finally, the precision of each source can be estimated in the same way:
$$\mathrm{precision}(s) = \frac{E[n_{s,1,1}] + \alpha_{1,1}}{E[n_{s,0,1}] + E[n_{s,1,1}] + \alpha_{0,1} + \alpha_{1,1}}$$
where precision is the precision of each source.
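A compact sketch of how such a model could be solved in practice is given below. It implements a collapsed Gibbs sampler for binary truth labels and the posterior-mean source quality estimates, under several assumptions that are not stated in the patent: observations are binary, each source makes at most one observation per fact, the time-node update is omitted, and the names `gibbs_truth`, `source_quality`, `alpha[i][j]`, and `beta[i]` are illustrative.

```python
import random

def gibbs_truth(claims, alpha, beta, iters=200, burn_in=50, seed=0):
    """Collapsed Gibbs sampler sketch for binary truth labels.

    claims: list of (fact, source, observation), observation in {0, 1}.
    alpha[i][j]: prior count of a source reporting j when the truth is i.
    beta[i]: prior count of the truth label being i.
    Returns {fact: estimated probability that the fact is true}.
    """
    rng = random.Random(seed)
    by_fact = {}
    for f, s, o in claims:
        by_fact.setdefault(f, []).append((s, o))
    truth = {f: rng.randint(0, 1) for f in by_fact}
    n = {}  # n[s][i][j]: claims by source s with truth label i and observation j
    for f, s, o in claims:
        n.setdefault(s, [[0, 0], [0, 0]])[truth[f]][o] += 1
    prob_true = {f: 0.0 for f in by_fact}
    for it in range(iters):
        for f, cs in by_fact.items():
            for s, o in cs:                 # remove the claims of f from the counts
                n[s][truth[f]][o] -= 1
            weight = []
            for i in (0, 1):                # unnormalised p(t_f = i | rest)
                w = beta[i]
                for s, o in cs:
                    w *= (n[s][i][o] + alpha[i][o]) / (
                        n[s][i][0] + n[s][i][1] + alpha[i][0] + alpha[i][1])
                weight.append(w)
            p1 = weight[1] / (weight[0] + weight[1])
            truth[f] = int(rng.random() < p1)
            for s, o in cs:                 # add them back under the new label
                n[s][truth[f]][o] += 1
            if it >= burn_in:               # average the conditional probability
                prob_true[f] += p1 / (iters - burn_in)
    return prob_true

def source_quality(claims, prob_true, alpha):
    """Posterior-mean sensitivity, false positive rate and precision per source."""
    e = {}  # e[s][i][j]: expected count of claims with truth i and observation j
    for f, s, o in claims:
        cnt = e.setdefault(s, [[0.0, 0.0], [0.0, 0.0]])
        cnt[1][o] += prob_true[f]
        cnt[0][o] += 1.0 - prob_true[f]
    out = {}
    for s, c in e.items():
        out[s] = {
            "sensitivity": (c[1][1] + alpha[1][1])
            / (c[1][0] + c[1][1] + alpha[1][0] + alpha[1][1]),
            "fpr": (c[0][1] + alpha[0][1])
            / (c[0][0] + c[0][1] + alpha[0][0] + alpha[0][1]),
            "precision": (c[1][1] + alpha[1][1])
            / (c[0][1] + c[1][1] + alpha[0][1] + alpha[1][1]),
        }
    return out
```

With `alpha = [[9, 1], [1, 9]]` the prior says that sources are mostly reliable, so a source that keeps disagreeing with the others ends up with a lower estimated precision. The estimates returned by `source_quality` are the quantities that would be carried forward as priors to the next time node.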
8. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: network security entity recognition is performed on the produced network security threat intelligence data set; the APT reports are annotated with the BIO labeling method, a sentence in an APT report document being $X = [x]_N = [x_1, \dots, x_i, \dots, x_N]$, where $x_i$ is the $i$-th character of sentence $X$; under the BIO labeling method, identifying the network security entities in sentence $X$ is equivalent to producing a label sequence $L_X = [l]_N$;
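As an illustration of the BIO scheme, the sketch below turns entity spans into a label sequence of the same length as the sentence. It works on word tokens rather than characters for readability, and the entity type names `ORG` and `MAL`, the example sentence, and the function name `bio_labels` are hypothetical.

```python
def bio_labels(tokens, entities):
    """Assign BIO labels to tokens.

    entities: list of (start, end, type) token spans, with end exclusive.
    The first token of a span gets B-type, the rest I-type, all others O.
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = "B-" + etype
        for i in range(start + 1, end):
            labels[i] = "I-" + etype
    return labels

tokens = ["Attacker", "Group", "X", "deployed", "ToolY"]
entities = [(0, 3, "ORG"), (4, 5, "MAL")]
labels = bio_labels(tokens, entities)
```

The resulting `labels` list plays the role of the label sequence $L_X$ that the tagger is trained to predict.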
a BiLSTM-CRF model is trained on the annotated APT report documents; the forward and backward passes of the BiLSTM extract the features of the words before and after the $i$-th character, respectively; the CRF model gives the conditional probability distribution of one set of output random variables given a set of input random variables;
the CRF model is: given an input sentence $X = [x]_N = [x_1, \dots, x_i, \dots, x_N]$, let $S$ be the $N \times K$ output score matrix of the BiLSTM network, where $K$ is the number of label categories and $S_{i,j}$ is the score of the $j$-th label for the $i$-th word; the score $Z$ of a predicted label sequence $y = [y_1, \dots, y_i, \dots, y_N]$ is defined as:
$$Z(X, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} S_{i, y_i}$$
where $T$ is the $(K+2)$-dimensional transition matrix, the two additional labels $y_0$ and $y_{N+1}$ marking the start and the end of the sentence; the probability of the generated label sequence $y$ is:
$$p(y \mid X) = \frac{e^{Z(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{Z(X, \tilde{y})}}$$
where $Y_X$ is the set of all possible label sequences for $X$; then the log-likelihood of the correct labeling is maximized by maximum likelihood estimation:
$$\log p(y \mid X) = Z(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{Z(X, \tilde{y})}$$
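The CRF score and log-likelihood can be computed as in the following sketch. It assumes that the two extra rows and columns of the transition matrix are a start tag (index `K`) and an end tag (index `K + 1`), which is the usual convention for BiLSTM-CRF taggers but is not spelled out in the claim; the function names are illustrative, and plain Python lists are used instead of a tensor library.

```python
import math

def _logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def sequence_score(S, T, y):
    """Z(X, y): transition scores (with start/end tags) plus emission scores."""
    K = len(S[0])
    start, end = K, K + 1
    path = [start] + list(y) + [end]
    trans = sum(T[a][b] for a, b in zip(path, path[1:]))
    emit = sum(S[i][tag] for i, tag in enumerate(y))
    return trans + emit

def log_partition(S, T):
    """Log of the sum of exp(Z) over all tag sequences, via the forward algorithm."""
    K = len(S[0])
    start, end = K, K + 1
    alpha = [T[start][k] + S[0][k] for k in range(K)]
    for i in range(1, len(S)):
        alpha = [
            _logsumexp([alpha[j] + T[j][k] for j in range(K)]) + S[i][k]
            for k in range(K)
        ]
    return _logsumexp([alpha[k] + T[k][end] for k in range(K)])

def log_likelihood(S, T, y):
    """log p(y | X) = Z(X, y) - log sum over all sequences of exp(Z)."""
    return sequence_score(S, T, y) - log_partition(S, T)
```

The forward recursion keeps the cost linear in the sentence length, whereas enumerating every label sequence would grow as $K^N$. Training would minimise the negative of `log_likelihood` over the annotated sentences.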
9. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: extracting the network security entity relationships comprises:
extracting the network security entity relationships with an attention-based BiLSTM (Att-BiLSTM) model, which comprises an input layer, a word embedding layer, a BiLSTM layer, an Attention layer, and an output layer;
wherein the word embedding layer represents a sentence in the APT report, $X = [x]_N = [x_1, \dots, x_i, \dots, x_N]$, as a matrix in which words with similar meanings lie close together in the embedding space, so that the representations can capture relations between words;
wherein the Attention layer introduces weighting to highlight the significant parts of the output;
wherein the output of the BiLSTM layer is $B = [b]_T = [b_1, \dots, b_j, \dots, b_T]$, and the parameter matrix $W$ satisfies the following formulas:
$$S = \tanh(B)$$
$$\alpha = \mathrm{softmax}(W^{T} S)$$
$$r = B \alpha^{T}$$
where $\alpha$ is the attention weight coefficient and $r$ is the weighted sum of the BiLSTM output $B$; finally a representation vector $B^{*} = \tanh(r)$ is generated through a nonlinear function, and $B^{*}$ is input to a fully connected neural network, which maps it to a label vector; the predicted label is obtained through a softmax function.
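The attention step can be sketched with plain Python lists. In this illustration the parameter is treated as a single weight vector `w` of the same length as each BiLSTM output vector, which is an assumption (the claim calls $W$ a parameter matrix), and the function name `attention_pool` is hypothetical.

```python
import math

def attention_pool(B, w):
    """Attention pooling over BiLSTM outputs.

    B: list of T output vectors b_j, each of length d.
    w: weight vector of length d (assumed shape of the parameter W).
    Returns (alpha, b_star), with alpha the attention weights over the T
    positions and b_star = tanh(r), where r is the weighted sum of the b_j.
    """
    # Score of position j: w^T tanh(b_j)
    scores = [sum(wk * math.tanh(bk) for wk, bk in zip(w, b)) for b in B]
    # Softmax over positions, shifted by the maximum for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # r = B alpha^T: weighted sum of the output vectors
    d = len(w)
    r = [sum(alpha[j] * B[j][k] for j in range(len(B))) for k in range(d)]
    return alpha, [math.tanh(x) for x in r]
```

The vector returned as `b_star` is what would be fed to the fully connected layer and the final softmax to predict the relation label.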
10. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the data organization uses a non-relational database, namely a MongoDB database, for storage, and stores all data as key-value pairs.
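As a rough illustration of key-value organisation, the sketch below builds documents for an extracted entity and an extracted relation. The field names (`name`, `type`, `head`, `relation`, `tail`, `source`), the example values, and the helper functions are hypothetical; the patent does not specify a schema.

```python
import json

def make_entity_doc(name, entity_type, source):
    """Build a key-value document for one extracted entity (hypothetical fields)."""
    return {"name": name, "type": entity_type, "source": source}

def make_relation_doc(head, relation, tail, source):
    """Build a key-value document for one extracted relation (hypothetical fields)."""
    return {"head": head, "relation": relation, "tail": tail, "source": source}

doc = make_relation_doc("ToolY", "exploits", "VulnZ", "report-001")
serialized = json.dumps(doc, sort_keys=True)
```

A dictionary of this shape is what a MongoDB client library would accept as a document, for example through pymongo's `insert_one` on a collection.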
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110439459.1A CN113282759B (en) | 2021-04-23 | 2021-04-23 | Threat information-based network security knowledge graph generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113282759A (en) | 2021-08-20 |
CN113282759B (en) | 2024-02-20 |
Family
ID=77277242
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110439459.1A (Active) CN113282759B (en) | 2021-04-23 | 2021-04-23 | Threat information-based network security knowledge graph generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113282759B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN113282759B (en) | 2024-02-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||