CN113282759A - Network security knowledge graph generation method based on threat information - Google Patents


Info

Publication number
CN113282759A
Authority
CN
China
Prior art keywords
data
network security
url
entity
crawler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110439459.1A
Other languages
Chinese (zh)
Other versions
CN113282759B (en)
Inventor
李桐
刘一涛
刘刚
王刚
赵桐
周小明
宋进良
姚羽
刘扬
王磊
李广翱
陈得丰
刘莹
杨智斌
耿洪碧
杨巍
任帅
陈剑
李欢
张彬
王琛
佟昊松
孙茜
孙赫阳
何立帅
赵玲玲
李菁菁
姜力行
杨滢璇
范维
杨璐羽
刘芮彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Liaoning Electric Power Co Ltd
Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Original Assignee
State Grid Liaoning Electric Power Co Ltd
Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Liaoning Electric Power Co Ltd and Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Priority to CN202110439459.1A
Publication of CN113282759A
Application granted
Publication of CN113282759B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence. The method comprises the following steps: efficient distributed threat intelligence data collection; producing a network security threat intelligence data set through a distributed threat intelligence crawling system; improving the data quality of the network security threat intelligence; performing network security entity identification on the produced network security threat intelligence data set; extracting network security entity relationships; and organizing the data. A large number of experiments verify that the quality of the network security threat intelligence produced by the threat intelligence data quality improvement algorithm, and the quality of the knowledge graph generated by entity identification and entity relationship extraction from the intelligence text, are remarkably improved, and that the method has good local network weakness visualization and attack prediction analysis capabilities.

Description

Network security knowledge graph generation method based on threat information
Technical Field
The invention belongs to the technical field of industrial control network security, and particularly relates to a network security knowledge graph generation method based on threat intelligence.
Background
With the rapid development of network technologies, many of them have been introduced into various industries to improve productivity, bringing network security problems with them. As the network security situation grows more complex, dynamic network security defense driven by threat intelligence has become a focus of the industry. Threat intelligence is rich in content, highly accurate and timely, and can reflect the attack chain of an entire attack event, so it has extremely high application and analysis value.
As a comprehensive method of data integration and organization, a knowledge graph can effectively extract attack information from massive threat intelligence and supports complex operations on the attack data such as reasoning analysis and attack semantic association. As threat intelligence is continuously updated, a knowledge graph network security system based on it can realize dynamic defense. Compared with traditional static defenses such as antivirus software and firewalls, the knowledge graph can sense the network security situation more quickly and accurately, which improves the overall security of the network and enables advanced functions such as attack path prediction, attack tracing and security threat evaluation.
In generating a network security knowledge graph from threat intelligence, the difficult research problems are improving the quality of the collected threat intelligence data, reducing its false positive rate, and identifying network security entities and extracting security entity relationships from the threat intelligence.
The main problems are as follows:
1. Open source threat intelligence on the network generally suffers from low data quality, a high false positive rate, and missing or erroneous attributes of data entities. Low-quality threat intelligence data inevitably yields a low-quality network security knowledge graph, with which the network security situation cannot be sensed correctly and current network attack behavior may be predicted wrongly. Existing data quality improvement algorithms rely mainly on truth discovery algorithms, which mostly address single-truth problems; they cannot handle entities in network security threat intelligence data that have multiple true values, nor the strong time-varying characteristics of such data.
2. Existing entity identification and entity relationship extraction methods are mainly based on traditional rule-based identification, machine learning and the recently popular deep learning. They need a large number of labeled text samples and place high demands on data quality. Although these methods are widely used in other fields such as natural language processing, they are difficult to apply to entity identification and entity relationship extraction in the network security field, because large-scale high-quality labeled security entity data is lacking, multiple entity types are mixed in the data, and entity type labels are inconsistent across the full text of the data.
At present, the network security field has no effective method for network security entity identification and entity relationship extraction.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a network security knowledge graph generation method based on threat intelligence. It aims to provide a basic model for utilizing and analyzing massive threat intelligence data and to predict the attack means and attack targets of an attacker.
The technical scheme adopted by the invention for realizing the purpose is as follows:
A network security knowledge graph generation method based on threat intelligence comprises the following steps:
step 1, efficient distributed threat intelligence data collection, wherein a distributed threat intelligence data crawling system is built on the Scrapy framework, and a crawler program scheduled by scrapy-redis structures the extracted data and stores it in Redis and MongoDB databases;
step 2, producing a network security threat intelligence data set through the distributed threat intelligence crawling system;
step 3, improving the data quality of the network security threat intelligence;
step 4, performing network security entity identification on the produced network security threat intelligence data set;
step 5, extracting network security entity relationships;
step 6, organizing the data.
Further, the efficient distributed threat intelligence data collection comprises: a distributed crawler system architecture, a crawler strategy, a crawler implementation and data storage.
Further, the distributed crawler system architecture comprises: the threat intelligence collection system architecture consists of a distributed crawler system and the deployment of an underlying environment; the distributed crawler system is obtained by rebuilding the traditional Scrapy crawler framework with an added Redis database; the underlying environment is a multi-node distributed system that uses a Docker container cluster with Kubernetes as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends, wherein the Master end deploys a Redis database to store the scheduled requests waiting to be crawled, the Slave ends deploy the crawler main program to crawl webpages and parse the extracted data, and every Slave end stores the parsed webpage data in the same MongoDB database.
Further, the crawler strategy comprises: for the Master end, an initial link is stored in Redis, where the Key identifies the next page to be crawled in the scheduling queue and the URL is generally the link of a page; a crawler is then started, which obtains the starting URL from Redis and downloads the webpage corresponding to that URL; the response is parsed according to predefined rules to obtain either page data or a detail-page link; page data is parsed directly according to the page format, whereas for a detail-page link the crawler is started again with the link changed to the detail-page link in order to obtain the final detail data; the crawler program keeps obtaining URLs from the scheduling queue and crawling the next URL, and enters a waiting state if no URL remains; for the Slave end, a downloader executes the download task and parses the extracted fields; the crawler program obtains a URL from the scheduling queue under the Redis Key and then downloads the corresponding webpage; the response is parsed according to the predefined field rules, and the corresponding fields are processed by the text deduplication module and then stored in the MongoDB database, until the Key value is null.
Further, the crawler implementation comprises: the scheduler module is responsible for scheduling the tasks of the whole system and mainly has the following functions: receiving requests sent by the engine; returning URLs to the download module; and storing URLs in the Redis database after deduplication; each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in the Redis queue; on receiving a request from the engine, the scheduler returns a URL to the downloader; the crawling downloader module integrates the functions of a spider and a downloader: the spider processes the webpage information returned by the downloader and extracts the directory URLs and detail-page URLs from it; key fields in the webpage information are extracted and then stored in the MongoDB database; the downloader downloads the URLs returned by the scheduler and passes the downloaded webpage information to the spider; the module is responsible for crawling the corresponding websites: it first takes the starting URL, extracts URLs after crawling and returns them to the deduplication module; the scheduling module then distributes URLs from Redis to the Slave nodes;
the data storage comprises: the storage module realizes two functions: URLs are stored in Redis, which is deployed on the Master node; the webpage content obtained by parsing is stored in a MongoDB database, which is deployed on the Master node; extracting information from the stored webpage content is the final goal of the system, and the distributed crawler crawls the webpage content so that a data processing program can extract the required information.
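The Master/Slave division of work described above can be sketched in a few lines. This is a minimal illustration only, not the patented implementation: an in-memory queue and set stand in for the Redis scheduling queue, a plain list stands in for the shared MongoDB collection, and the names `Master`, `run_slave`, `FAKE_SITE` and the page contents are all hypothetical.

```python
from collections import deque


class Master:
    """Stand-in for the Master end: a deque and a set replace Redis."""

    def __init__(self, start_urls):
        self.queue = deque()
        self.seen = set()
        for url in start_urls:
            self.push(url)

    def push(self, url):
        # Deduplicate before the URL enters the scheduling queue.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def pop(self):
        return self.queue.popleft() if self.queue else None


def run_slave(name, master, fetch, store, max_pages):
    """One Slave end: pull URLs, parse pages, write to the shared store."""
    for _ in range(max_pages):
        url = master.pop()
        if url is None:
            break  # nothing is queued for this Slave right now
        record, links = fetch(url)
        store.append({"url": url, "slave": name, **record})
        for link in links:
            master.push(link)


# Hypothetical three-page site: url -> (parsed fields, outgoing links).
FAKE_SITE = {
    "/index": ({"title": "index"}, ["/a", "/b"]),
    "/a": ({"title": "a"}, ["/b"]),
    "/b": ({"title": "b"}, ["/index"]),
}

master = Master(["/index"])
store = []  # stand-in for the single shared MongoDB collection
while master.queue:
    for slave_name in ("slave-1", "slave-2"):
        run_slave(slave_name, master, FAKE_SITE.get, store, max_pages=1)
```

Every Slave writes to the same `store`, which mirrors the statement that all Slave ends save parsed data in the same MongoDB database, while only the Master holds the queue.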
Further, producing the network security threat intelligence data set through the distributed threat intelligence crawling system comprises collecting the following data:
(1) vulnerability data: collected from the main vulnerability publishing platforms; the data types comprise the type and version of the system in which the vulnerability occurs and the exploitation method;
(2) APT attack chain data: collected from the APTnotes platform; a total of 528 APT reports from the last 10 years;
(3) malware text data: comprises the name, category, common functions, hash and targeted system platform of the malware in threat intelligence; this data is collected from the threat intelligence source AlienVault;
(4) discussion data from the security community: collected from the StackExchange website; it is the text of recent security events;
(5) security RSS subscription data: collected from the major network security RSS feeds; it is recent network security news.
Further, the method for improving the quality of the network security threat intelligence data comprises the following steps:
step (1), FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated. Its value is (1 - specificity), and it follows a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior count of false positive samples per source and $\alpha_{0,0}$ is the prior count of true negative samples per source:
$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$
from the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated by using second-order Markov;
step (2), Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated. It follows a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior count of true positive samples per source and $\alpha_{1,0}$ is the prior count of false negative samples per source:
$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$
from the second time node onward, $\phi_k^1$ is replaced by the $\phi_k^1$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated by using second-order Markov;
step (3), Att fact (attack tag): for each attribute $f \in F$ of each entity, where $F$ is the set of observed values of all attributes of the entity, a prior truth probability $\theta_f$ is generated. It follows a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct entity attribute samples and $\beta_0$ is the prior count of incorrect entity attribute samples:
$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$
from the second time node onward, $\theta_f$ is replaced by the $\theta_f$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated by using second-order Markov;
step (4), Truth label: the attribute truth label states, for each entity attribute, whether the observed value is correct; the truth label $t_f$ is a binary Boolean variable that follows a Bernoulli distribution with parameter $\theta_f$, where the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:
$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$
step (5), Observation: the entity attribute observation label; each observed value $c \in C_f$ of an entity attribute has a source denoted $s_c$; the observation label $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
wherein if $t_f = 0$, $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^{0}$, the false positive rate of source $s_c$; if $t_f = 1$, $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^{1}$, the sensitivity of source $s_c$.
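Steps (1) to (5) describe a generative process, which can be simulated directly. The following is a minimal sketch under stated assumptions: observations are binary, the time-node update is left out, and the function name, argument layout and example hyperparameters are hypothetical rather than taken from the patent.

```python
import random


def simulate_observations(num_sources, facts, alpha0, alpha1, beta, seed=0):
    """Draw one sample from generative steps (1)-(5).

    facts maps a fact id to the list of source indices reporting on it.
    alpha0 = (a01, a00), alpha1 = (a11, a10), beta = (b1, b0).
    """
    rng = random.Random(seed)
    # Steps (1) and (2): per-source false positive rate and sensitivity.
    phi0 = [rng.betavariate(alpha0[0], alpha0[1]) for _ in range(num_sources)]
    phi1 = [rng.betavariate(alpha1[0], alpha1[1]) for _ in range(num_sources)]
    truths, observations = {}, {}
    for fact, sources in facts.items():
        theta_f = rng.betavariate(beta[0], beta[1])  # step (3): prior truth probability
        t_f = 1 if rng.random() < theta_f else 0     # step (4): truth label
        truths[fact] = t_f
        for s in sources:                            # step (5): observation per source
            p = phi1[s] if t_f == 1 else phi0[s]
            observations[(fact, s)] = 1 if rng.random() < p else 0
    return phi0, phi1, truths, observations


# Hypothetical setup: three sources, two facts, priors favouring reliable sources.
phi0, phi1, truths, observations = simulate_observations(
    num_sources=3,
    facts={"fact-a": [0, 1], "fact-b": [1, 2]},
    alpha0=(1, 9),
    alpha1=(9, 1),
    beta=(1, 1),
)
```

Sampling forward like this is mainly useful for checking an inference routine against data whose true labels are known.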
The model solution is as follows:
the conditional probability of the model given the observed value c of each entity attribute is as follows:
Figure BDA0003034480930000045
in the above formula: p represents the prior probability theta when the parameter truth value is givenfSource sensitivity rate
Figure BDA0003034480930000046
And
Figure BDA0003034480930000047
when the observed value of the entity o is c, the conditional probability; where c is the observation, f is the attack tag, scA source representing the occurrence of an observation c;
the full likelihood function containing all variables and hyperparameters is written as:
Figure BDA0003034480930000048
in the above formula: p represents the hyperparameter alpha when given the parameter false positive rate0,α1And the prior true probability over-parameter beta, the entity o, the source s, the true label t, the prior probability parameter set theta and the sensitivity parameter set phi0,φ1The conditional probability of (a); where S represents a set of all sources, F represents an attack tag set, F represents each attack tag element belonging to F, θfDenotes the f prior probability, tfDenotes the true value of f, CfA set of observations representing f, c representing each observation element in the set of observations;
Given the observed attribute data, the likelihood function is solved with Gibbs sampling, a Markov chain Monte Carlo (MCMC) algorithm:
Figure BDA0003034480930000049
$t_{map}$, the maximum a posteriori estimate, is obtained from this formula; the remaining parameters have the same meanings as the identically named parameters above;
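As a hedged sketch, the distributions stated in steps (1) to (5) imply the following forms for the conditional probability, the full likelihood and the MAP estimate. They are derived here from those stated distributions, assuming the symbols $\phi_s^0$ and $\phi_s^1$ for the false positive rate and sensitivity of source $s$; the patent's own formulas may differ in detail.

```latex
% Observation probability with the truth label t_f marginalized out
p(o_c \mid \theta_f, \phi^0_{s_c}, \phi^1_{s_c})
  = (1 - \theta_f)\, p(o_c \mid \phi^0_{s_c})
  + \theta_f\, p(o_c \mid \phi^1_{s_c})

% Full likelihood: source priors times per-fact terms
p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta)
  = \prod_{s \in S} p(\phi^0_s \mid \alpha_0)\, p(\phi^1_s \mid \alpha_1)
    \prod_{f \in F} \Big( p(\theta_f \mid \beta)\,
      \theta_f^{t_f} (1 - \theta_f)^{1 - t_f}
      \prod_{c \in C_f} p(o_c \mid \phi^{t_f}_{s_c}) \Big)

% MAP estimate of the truth labels, integrating out the continuous parameters
t_{map} = \arg\max_{t} \int\!\!\int\!\!\int
  p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta)\,
  d\theta \, d\phi^0 \, d\phi^1
```

The first line follows from $t_f \sim \mathrm{Bernoulli}(\theta_f)$ and $o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$; the second is the factorization implied by drawing each source parameter once and each fact independently.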
the following formula is solved:
Figure BDA00030344809300000410
wherein: $p$ is the conditional probability that the truth value $t_f$ of $f$ equals $i$, given $t_{-f}$, the entity observations $o$ and the sources $s$; $i \in \{0, 1\}$ is the value of the truth label of attack tag $f$; $t_{-f}$ is the set of truth values of all elements of $F$ except $f$;
Figure BDA00030344809300000411
denotes the number of observations from source $s_c$ whose observed value is $j$ and whose truth label is $i$, counted over attack tags other than $f$; $C_{-f}$ is the set of attack tags excluding attack tag $f$, and $c'$ is an element of $C_{-f}$;
Figure BDA0003034480930000057
denotes the truth value when $f$ is $c'$; the remaining parameters have the same meanings as the identically named parameters above;
having obtained $p(t_f = i \mid t_{-f}, o, s)$, the FPR (false positive rate) and the sensitivity at the next time node are estimated as follows:
Figure BDA0003034480930000051
Figure BDA0003034480930000052
wherein
Figure BDA0003034480930000053
denotes the sum, over the elements $c$ of the observation set $C_f$ of every attack tag $f$, of the probabilities that the truth label takes the value $i$ when the source $s_c$ making observation $c$ reports the value $j$ for attack tag $o_f$ of entity $o$, where $i \in \{0, 1\}$, $j \in \{0, 1, \dots, |F|\}$ and $|F|$ is the number of elements of the attack tag set $F$; the remaining parameters have the same meanings as the identically named parameters above. Finally, the accuracy of each source can be estimated in the same way:
Figure BDA0003034480930000054
where precision represents the accuracy of each source.
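One common way to turn posterior truth probabilities into per-source quality estimates is to accumulate expected counts and take Beta posterior means. The sketch below shows that approach under simplifying assumptions that are not the patent's exact formulas: observations are binary (the text allows $j$ up to $|F|$), and `source_quality`, `truth_prob` and the data layout are hypothetical names.

```python
def source_quality(observations, truth_prob, alpha0, alpha1):
    """Estimate FPR, sensitivity and precision for every source.

    observations: {(fact, source): 0 or 1}
    truth_prob:   {fact: p(t_f = 1)} taken from the sampler output
    alpha0 = (a01, a00), alpha1 = (a11, a10) are the Beta prior counts.
    """
    counts = {}
    for (fact, source), o in observations.items():
        n = counts.setdefault(
            source, {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
        )
        # Expected count n[(i, j)]: truth label i, observed value j.
        n[(1, o)] += truth_prob[fact]
        n[(0, o)] += 1.0 - truth_prob[fact]
    a01, a00 = alpha0
    a11, a10 = alpha1
    result = {}
    for source, n in counts.items():
        result[source] = {
            "fpr": (n[(0, 1)] + a01) / (n[(0, 0)] + n[(0, 1)] + a00 + a01),
            "sensitivity": (n[(1, 1)] + a11) / (n[(1, 0)] + n[(1, 1)] + a10 + a11),
            "precision": (n[(1, 1)] + a11) / (n[(0, 1)] + n[(1, 1)] + a01 + a11),
        }
    return result


# Hypothetical example: source "a" asserts a true fact and denies a false one.
quality = source_quality(
    observations={("f1", "a"): 1, ("f2", "a"): 0},
    truth_prob={"f1": 1.0, "f2": 0.0},
    alpha0=(1, 1),
    alpha1=(1, 1),
)
```

With uniform priors and only two observations the estimates stay close to the prior mean, which is the smoothing effect the prior counts are meant to provide.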
Further, network security entity identification is performed on the produced network security threat intelligence data set. The BIO labeling method is applied to the APT reports: a sentence $X$ in an APT report document is written as $X = [x]_N = [x_1, \dots, x_i, \dots, x_N]$, where $x_i$ is the $i$-th character of sentence $X$; under BIO labeling, identifying the network security entities in sentence $X$ is equivalent to producing a label sequence $L_X = [l]_N$.
Model training is performed on the labeled APT report documents with a BiLSTM-CRF model; the forward pass of the BiLSTM extracts features of the words before the $i$-th character and the backward pass extracts features of the words after it; the CRF model gives the conditional probability distribution of one set of output random variables given a set of input random variables;
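To make the BIO scheme concrete, the sketch below converts entity spans into a label sequence of the same length as the input. It works on word tokens rather than single characters for readability, and the sentence, the entity type names and the function name `bio_tags` are hypothetical illustrations, not data from the patent.

```python
def bio_tags(tokens, entities):
    """Return one BIO label per token.

    entities is a list of (start, end, type) spans over token indices,
    with end exclusive. Tokens outside every span are labeled "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # tokens inside the entity
    return tags


# Hypothetical sentence and annotation.
tokens = ["GroupX", "used", "ToolY", "loader", "against", "Windows", "hosts"]
entities = [(0, 1, "Actor"), (2, 4, "Malware"), (5, 6, "Platform")]
labels = bio_tags(tokens, entities)
# labels == ["B-Actor", "O", "B-Malware", "I-Malware", "O", "B-Platform", "O"]
```

The label sequence has exactly one entry per input position, which is what lets entity identification be treated as sequence labeling.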
the CRF model is: given an input sentence $X = [x]_N = [x_1, \dots, x_i, \dots, x_N]$, let $S$ be the $N \times K$ output score matrix of the BiLSTM network, where $K$ is the number of label categories and $S_{i,j}$ is the score of the $j$-th label for the $i$-th word; the judgment score $Z$ of a predicted label sequence $y = [y_1, \dots, y_i, \dots, y_N]$ is defined as:
Figure BDA0003034480930000055
where $T$ is a $(K+2)$-dimensional probability transition matrix; the probability of the generated tag sequence $y$ is:
Figure BDA0003034480930000056
then the log-likelihood of the correct labeling is solved by maximum likelihood estimation:
Figure BDA0003034480930000061
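In the usual BiLSTM-CRF formulation, which matches the definitions of $S$ and the $(K+2)$-dimensional $T$ above, the path score is $Z(X, y) = \sum_i T_{y_i, y_{i+1}} + \sum_i S_{i, y_i}$ with added start and end tags, and $\log p(y \mid X) = Z(X, y) - \log \sum_{y'} \exp Z(X, y')$. The sketch below assumes that standard form rather than the patent's exact formulas, and it enumerates every path by brute force instead of using the forward algorithm, so it only suits tiny examples. All names are hypothetical.

```python
import itertools
import math


def sequence_score(emissions, transitions, tags):
    """Z(X, y): transition scores plus emission scores along one tag path.

    emissions[i][k] plays the role of S[i][k]; transitions is a
    (K+2) x (K+2) matrix whose last two indices are the start and end tags.
    """
    k = len(emissions[0])
    start, end = k, k + 1
    path = [start] + list(tags) + [end]
    score = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    score += sum(emissions[i][t] for i, t in enumerate(tags))
    return score


def log_likelihood(emissions, transitions, tags):
    """log p(y | X) = Z(X, y) - log of the sum of exp(Z) over all paths."""
    k = len(emissions[0])
    all_scores = [
        sequence_score(emissions, transitions, path)
        for path in itertools.product(range(k), repeat=len(emissions))
    ]
    m = max(all_scores)  # subtract the maximum for numerical stability
    log_partition = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return sequence_score(emissions, transitions, tags) - log_partition


# Hypothetical scores: two positions, two labels, zero transition scores.
emissions = [[1.0, 0.0], [0.0, 2.0]]
transitions = [[0.0] * 4 for _ in range(4)]
best = log_likelihood(emissions, transitions, (0, 1))
```

Training maximizes this log-likelihood over the labeled sentences; a practical implementation would replace the enumeration with dynamic programming.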
Further, extracting the network security entity relationships comprises:
extracting the network security entity relationships with an attention-based BiLSTM (Att-BiLSTM) model, which comprises an input layer, a word embedding layer, a BiLSTM layer, an Attention layer and an output layer;
wherein the word embedding layer represents a sentence $X = [x]_N = [x_1, \dots, x_i, \dots, x_N]$ of the APT report as a matrix, in which words with similar meanings are close together so that the representations capture relations between words;
wherein the Attention layer introduces weighting to highlight the salient parts of the output;
wherein, with the output of the BiLSTM layer written as $B = [b]_T = [b_1, \dots, b_j, \dots, b_T]$, the parameter matrix $W$ satisfies the following formulas:
$$S = \tanh(B)$$
$$\alpha = \mathrm{softmax}(W^{T} S)$$
$$r = B \alpha^{T}$$
where $\alpha$ is the attention weight coefficient and $r$ is the weighted sum of the BiLSTM output $B$; finally a characterization vector $B^{*} = \tanh(r)$ is generated through a nonlinear function, and $B^{*}$ is fed into a fully connected neural network that maps it to a label vector, from which the predicted label is obtained through a softmax function.
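The attention formulas can be traced with a few lines of plain Python. This is a minimal sketch that assumes $B$ is a $d \times T$ matrix whose column $j$ is the BiLSTM output $b_j$ and that $W$ is a length-$d$ weight vector; the function name and the numbers are hypothetical, and the fully connected output layer is omitted.

```python
import math


def attention_pool(B, w):
    """Return (alpha, b_star) for BiLSTM outputs B and attention weights w.

    B is a d x T list of lists (column j is b_j); w has length d.
    """
    d, T = len(B), len(B[0])
    S = [[math.tanh(v) for v in row] for row in B]              # S = tanh(B)
    scores = [sum(w[i] * S[i][j] for i in range(d)) for j in range(T)]
    m = max(scores)                                             # stable softmax
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                           # alpha = softmax(W^T S)
    r = [sum(B[i][j] * alpha[j] for j in range(T)) for i in range(d)]  # r = B alpha^T
    return alpha, [math.tanh(x) for x in r]                     # B* = tanh(r)


# Hypothetical 2-dimensional outputs for a 3-step sentence.
B = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
alpha, b_star = attention_pool(B, [0.5, -0.5])
```

Because `alpha` sums to one, `r` is a convex combination of the time-step outputs, so the sentence vector keeps the dimension `d` regardless of sentence length.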
Furthermore, the data organization uses a non-relational database, namely MongoDB, which stores all data as key-value pairs.
The invention has the following beneficial effects and advantages:
The invention provides a basic model for utilizing and analyzing massive threat intelligence data. Its key point is to improve the existing data quality improvement algorithm so that it suits network security threat intelligence data, which raises the quality of the collected data and lowers its false positive rate. The invention also improves the existing entity identification and entity relationship extraction methods according to the characteristics of threat intelligence data, which raises the accuracy and efficiency of network security entity identification and security entity relationship extraction and produces a threat intelligence network security knowledge graph with higher data quality. In addition, the invention draws on the data reasoning ability of the network security knowledge graph to study an attack graph visualization method that combines the knowledge graph with the local network topology.
The method first improves the quality of threat intelligence data according to the characteristics of network security threat intelligence data, reducing its false positive rate and improving its overall quality; it then improves the existing entity identification and entity relationship extraction methods according to the characteristics of threat intelligence, so as to generate a high-quality threat intelligence knowledge graph; next, it performs correlation analysis of local network weaknesses using recent threat intelligence together with local network topology data, and visually displays the security weakness nodes in the local network topology; finally, it provides an attack prediction method that combines the network security knowledge graph with flow analysis of the observation building, predicting the attack means and attack targets of the attacker. A large number of experiments verify that the threat intelligence data quality improvement algorithm, the resulting quality of the network security threat intelligence, and the quality of the knowledge graph generated by entity identification and entity relationship extraction from the intelligence text are all better than those of existing methods, and that the method has good local network weakness visualization and attack prediction analysis capabilities.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a process diagram of a threat intelligence-based network security knowledge-graph generation method of the present invention;
FIG. 2 is a diagram of a distributed crawler architecture for threat intelligence data collection in the present invention;
FIG. 3 is a probability map model diagram of threat intelligence data quality improvement algorithm in the present invention;
FIG. 4 is a schematic diagram of atomic attack entities and their relationships defined in the present invention;
FIG. 5 is a schematic structural diagram of a BiLSTM-CRF model for network security entity identification in the present invention;
FIG. 6 is a schematic structural diagram of an Att-BiLSTM model for network security entity relationship extraction in the present invention;
FIG. 7 is a data collection time chart for a distributed crawler system for threat intelligence data collection as developed in the present invention;
FIG. 8 is a graph comparing the effectiveness of a distributed crawler system and a stand-alone crawler system for threat intelligence data collection as developed in the present invention;
FIG. 9 is a diagram of an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
The solution of some embodiments of the invention is described below with reference to FIGS. 1 to 9.
Example 1
The invention relates to a network security knowledge graph generation method based on threat intelligence. As shown in FIG. 1, which is a process diagram of the method, the generation of the network security knowledge graph comprises: efficient distributed threat intelligence data collection, network security data set production, network security threat intelligence data quality improvement, network security entity identification, network security entity relationship extraction and data organization. The steps are described in detail below:
Step 1, efficient distributed threat intelligence data collection.
Generating the network security knowledge graph requires a large amount of network security threat intelligence data. To collect open source threat intelligence data from the network efficiently and in real time, the following distributed crawler system is implemented. The distributed threat intelligence data crawling system is built on the Scrapy framework, and a crawler program scheduled by scrapy-redis structures the extracted data and stores it in Redis and MongoDB databases.
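For orientation, a Scrapy project is typically switched to Redis-backed scheduling through a few scrapy-redis settings. The fragment below is a sketch of such a `settings.py`, not the configuration used in the patent; the Redis host name is a hypothetical placeholder and persistence is an assumed choice.

```python
# settings.py fragment for a Scrapy project that uses scrapy-redis.

# Replace Scrapy's in-process scheduler with the Redis-backed one so that
# every crawler node pulls requests from the same queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share the request fingerprint set through Redis so a URL crawled by one
# node is not crawled again by another.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis between runs (assumed choice).
SCHEDULER_PERSIST = True

# Redis instance on the Master node; the host name is hypothetical.
REDIS_URL = "redis://master-node:6379/0"
```

Writing the parsed items to MongoDB would be handled separately, for example in an item pipeline, which is outside this fragment.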
(1) Distributed crawler system architecture: the threat intelligence collection system architecture consists of a distributed crawler system and the deployment of an underlying environment. The distributed crawler system is obtained by rebuilding the traditional Scrapy crawler framework with a newly added Redis database, which solves the problem that Scrapy does not natively support distributed crawling. The underlying environment is a multi-node distributed system that uses a Docker container cluster with the mature Kubernetes as the cluster management tool. The distributed crawler system adopts a Master/Slave structure with one Master end and a plurality of Slave ends: the Master end deploys a Redis database to store the scheduled requests waiting to be crawled, the Slave ends deploy the crawler main program to crawl webpages and parse the extracted data, and every Slave end then stores the parsed webpage data in the same MongoDB database. FIG. 2 is a diagram of the distributed crawler architecture for threat intelligence data collection. Each threat intelligence data item to be crawled is stored in the Redis database; the Scrapy engine schedules the items with the scheduler, and when an item is scheduled, the corresponding crawler program (spider) and its middleware are started to download the threat intelligence data.
(2) Crawler strategy: on the Master side, the initial links are stored in Redis. The Key identifies the scheduling queue that holds the pages to be crawled next, and each URL is usually the link to a single page. The crawler is then started, obtains the start URL from Redis, and downloads the web page that the URL points to. The response is parsed according to predefined rules and yields either page data or detail-page links. Page data is parsed directly according to the page format; for a detail-page link, the crawler is restarted with the link changed to the detail-page link in order to obtain the final detail data. The crawler keeps fetching URLs from the scheduling queue and crawling the next one; if no URL is available, it enters a waiting state. On the Slave side, the downloader executes the download tasks and the extracted fields are parsed. The crawler program obtains a URL from the scheduling queue under the Redis Key and downloads the corresponding web page. The response is parsed according to the predefined field rules, and the resulting fields are passed through a text deduplication module and stored in the MongoDB database. This continues until the value under the Key is empty.
(3) Crawler implementation: the scheduler module is responsible for task scheduling across the whole system. Its main functions are receiving requests sent by the engine, returning URLs to the download module, and deduplicating URLs before storing them in the Redis database. Each crawler subtask passes the URLs it obtains to the scheduler through the engine; the scheduler deduplicates them and stores them in the Redis queue. On receiving a request from the engine, the scheduler returns a URL to the downloader. The crawling-and-downloading module combines the functions of the spider and the downloader. The downloader downloads the URL returned by the scheduler and passes the downloaded web page information to the spider. The spider processes the web page information returned by the downloader, extracts the directory URLs and detail-page URLs it contains, and extracts the key fields, which are then stored in the MongoDB database. The system crawls each target website by first taking the start URL, extracting further URLs after crawling, and returning them to the deduplication module; the scheduling module then distributes URLs from Redis to the Slave nodes.
(4) Data storage: the storage module only needs to implement two functions. First, URLs are stored in Redis, which is deployed on the Master node. Second, the parsed web page content is stored in the MongoDB database, which is also deployed on the Master node. Extracting and storing the web page content is the final goal of the system: after the distributed crawler has crawled the web page content, it is made available to a data processing program that extracts the required information.
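The scheduling, deduplication, and storage flow described in items (2) to (4) can be illustrated with a minimal single-process sketch. It is not the Scrapy or scrapy-redis implementation: the class `InMemoryScheduler`, the function `crawl`, the `FAKE_SITE` dictionary, and the `example.test` URLs are all hypothetical stand-ins, with a Python deque and set playing the role of the Redis queue and deduplication set, and a plain list playing the role of the MongoDB collection.

```python
import hashlib
from collections import deque


class InMemoryScheduler:
    """Stand-in for the Redis-backed scheduler: a FIFO queue plus a fingerprint set."""

    def __init__(self):
        self.queue = deque()  # plays the role of the Redis scheduling queue
        self.seen = set()     # plays the role of the Redis deduplication set

    def push(self, url):
        """Enqueue a URL unless an identical one was scheduled before."""
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def crawl(start_url, fetch, scheduler, store):
    """Drain the queue: schedule detail-page links and store parsed fields."""
    scheduler.push(start_url)
    while (url := scheduler.pop()) is not None:
        page = fetch(url)                    # downloader step
        for link in page.get("links", []):   # directory page: schedule detail pages
            scheduler.push(link)
        if "fields" in page:                 # detail page: keep the extracted fields
            store.append({"url": url, **page["fields"]})
    return store


# Hypothetical site: one listing page linking to two detail pages (one link repeated).
FAKE_SITE = {
    "https://example.test/list": {"links": [
        "https://example.test/item/1",
        "https://example.test/item/2",
        "https://example.test/item/1",
    ]},
    "https://example.test/item/1": {"fields": {"title": "advisory 1"}},
    "https://example.test/item/2": {"fields": {"title": "advisory 2"}},
}

documents = crawl("https://example.test/list", FAKE_SITE.__getitem__,
                  InMemoryScheduler(), [])
print([d["title"] for d in documents])  # ['advisory 1', 'advisory 2']
```

In a distributed deployment the queue and the fingerprint set would live in the shared Redis instance on the Master node, so that several Slave processes can pop from the same queue without crawling a URL twice.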
Step 2: Build the network security threat intelligence data set.
The network security data consists of the following 5 kinds of threat intelligence data, collected with the distributed threat intelligence crawling system of step 1:
(1) Vulnerability data: collected from the main vulnerability publishing platforms, such as CVE and NVD. The data includes the type of system affected by the vulnerability, the system version, the exploitation method, and similar fields.
(2) APT (advanced persistent threat) attack chain data: collected from the APTnodes platform and comprising 528 APT reports from the last 10 years. 50 reports are labeled manually with the BIO labeling method; 40 of them are used to train the deep learning models for entity recognition and entity relation extraction, and the remaining 10 are used to test the models.
(3) Malware text data: the name, category, common functions, hash, targeted system platform, and similar fields of the malware described in threat intelligence. This data is collected from the threat intelligence source AlienVault.
(4) Security community discussion data: collected from the StackExchange website; it is mainly text in which security researchers discuss recent security events.
(5) Security RSS subscription data: collected from major network security RSS feeds; it is mainly recent network security news.
Step 3: Improve the quality of the network security threat intelligence data.
After the network security threat intelligence data set has been generated, the quality of the threat intelligence data must be improved and its false positive rate reduced, so that a high-quality network security knowledge graph can be generated in the subsequent steps.
The invention adapts an existing truth discovery algorithm to the time-varying nature of threat intelligence by introducing the Markov property, as shown in FIG. 3, which is the probabilistic graphical model of the threat intelligence data quality improvement algorithm. In the figure, $M_i$ denotes the set of model parameters at the $i$-th time instant and $C_i$ denotes the prior parameters of model $M_i$ at the $i$-th time instant, where $i = 1, 2, \ldots, N$; the remaining parameters have the same meanings as in the text.
The threat intelligence data quality improvement model provided by the invention comprises the following steps:
Step (1) FPR (false positive rate): for each source $k \in S$, a corresponding false positive rate $\phi_k^0$ is generated. Its value is (1 - specificity), and it follows a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior false positive sample count of each source and $\alpha_{0,0}$ is the prior true negative sample count of each source:
$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$
From the second time node onward, $\phi_k^0$ is replaced by the $\phi_k^0$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated using the second-order Markov property.
Step (2) Sensitivity: for each source $k \in S$, a corresponding sensitivity $\phi_k^1$ is generated. It follows a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior true positive sample count of each source and $\alpha_{1,0}$ is the prior false negative sample count of each source:
$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$
As with the FPR, from the second time node onward $\phi_k^1$ is replaced by the $\phi_k^1$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated using the second-order Markov property.
Step (3) Att fact (attack tag): for each attribute $f \in F$ of an entity, where $F$ is the set of observations (i.e. the set of collected values) of all attributes under that entity, a prior truth probability $\theta_f$ is generated. It follows a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct entity attribute samples and $\beta_0$ is the prior count of incorrect entity attribute samples:
$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$
As with the FPR and Sensitivity above, from the second time node onward $\theta_f$ is replaced by the $\theta_f$ of the previous time node, and the time-varying characteristic of the truth discovery model is calibrated using the second-order Markov property.
Step (4) Truth label: the attribute truth label, generated for each entity attribute to indicate whether the observed value is correct. $t_f$ is a binary Boolean variable that follows a Bernoulli distribution with parameter $\theta_f$, where the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:
$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$
Step (5) Observation: the entity attribute observation label. For each observation $c \in C_f$ of an entity attribute, its source is denoted $s_c$. The observation label $o_c$ follows a Bernoulli distribution whose parameter is selected by the truth label $t_f$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
If $t_f = 0$, $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^0$, the false positive rate of source $s_c$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^0)$$
If $t_f = 1$, $o_c$ follows a Bernoulli distribution with parameter $\phi_{s_c}^1$, the sensitivity of source $s_c$.
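To make the generative process of steps (1) to (5) concrete, the following minimal sketch draws one sample from it for a single time node. The function name `sample_generative_model`, the source and fact names, and the hyperparameter values are illustrative assumptions, and the replacement of parameters by their values from the previous time node is not modeled here.

```python
import random


def sample_generative_model(sources, facts, claims, alpha0, alpha1, beta, rng):
    """Draw one sample of (phi0, phi1, theta, truth, observed) for one time node.

    alpha0 = (a01, a00), alpha1 = (a11, a10), beta = (b1, b0) are Beta pseudo-counts.
    claims maps each attribute value f to the list of sources that report on it.
    """
    phi0 = {k: rng.betavariate(alpha0[0], alpha0[1]) for k in sources}  # step (1): FPR
    phi1 = {k: rng.betavariate(alpha1[0], alpha1[1]) for k in sources}  # step (2): sensitivity
    theta, truth, observed = {}, {}, {}
    for f in facts:
        theta[f] = rng.betavariate(beta[0], beta[1])        # step (3): prior truth probability
        truth[f] = 1 if rng.random() < theta[f] else 0      # step (4): t_f ~ Bernoulli(theta_f)
        for s in claims[f]:
            p = phi1[s] if truth[f] == 1 else phi0[s]        # rate selected by t_f
            observed[(f, s)] = 1 if rng.random() < p else 0  # step (5): o_c ~ Bernoulli(p)
    return phi0, phi1, theta, truth, observed


# Hypothetical setup: two sources that each report on three attribute values.
rng = random.Random(0)
sources = ["feed_a", "feed_b"]
facts = ["f1", "f2", "f3"]
claims = {f: sources for f in facts}
phi0, phi1, theta, truth, observed = sample_generative_model(
    sources, facts, claims, alpha0=(1, 9), alpha1=(9, 1), beta=(1, 1), rng=rng)
print(len(observed))  # 6
```

The priors chosen here, (1, 9) for the false positive rate and (9, 1) for the sensitivity, encode the belief that sources are mostly reliable; they are example values only.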
The model solution is as follows: from the above description, the conditional probability of the model given the observed value c of each entity attribute is as follows:
Figure BDA0003034480930000101
In the above formula, p denotes the conditional probability that the observation of entity o is c, given the prior truth probability $\theta_f$, the source sensitivity $\phi_{s_c}^1$, and the source false positive rate $\phi_{s_c}^0$. Here c is the observation, f is the attack tag, and $s_c$ is the source that produced observation c.
the full likelihood function containing all variables and hyperparameters can be written as:
Figure BDA0003034480930000104
In the above formula, p denotes the joint conditional probability of the entity o, the source s, the truth label t, the prior probability parameter set $\theta$, and the rate parameter sets $\phi^0$, $\phi^1$, given the false positive rate hyperparameter $\alpha_0$, the sensitivity hyperparameter $\alpha_1$, and the prior truth probability hyperparameter $\beta$. Here S is the set of all sources, F is the attack tag set, f is an attack tag element of F, $\theta_f$ is the prior probability of f, $t_f$ is the truth value of f, $C_f$ is the set of observations of f, and c is an observation element of that set.
Given the observed value data of the attribute, the likelihood function can be solved using the Gibbs Sampling algorithm in the MCMC algorithm:
Figure BDA0003034480930000105
$t_{map}$ denotes the maximum a posteriori estimate given by the above formula; the remaining parameters have the same meanings as the identically named parameters above.
It can be solved with the following formula:
Figure BDA0003034480930000106
where p denotes the conditional probability that the truth value $t_f$ of f takes the value i, given $t_{-f}$, the entity o, and the source s; i is the attack tag value of f, with range {0, 1}; and $t_{-f}$ is the set of truth values of all elements of F except f.
Figure BDA0003034480930000107
denotes the number of observations with value j made by source $s_c$ on attack tags other than f whose truth label is i. $C_{-f}$ denotes the set of observations that excludes attack tag f, and c' is an element of $C_{-f}$.
Figure BDA0003034480930000108
is the truth value of f when the observation is c'; the remaining parameters have the same meanings as the identically named parameters above.
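The sampling step can be sketched in code under a stated assumption. Because the update equation itself is only available as an image, the sketch below assumes it has the same form as the collapsed Gibbs update of the Latent Truth Model (LTM) that the invention extends: the weight for $t_f = i$ is $\beta_i$ multiplied, over the observations of f, by the smoothed fraction of matching observations that each source made under truth label i. The function name `truth_conditional` and the example counts are hypothetical.

```python
def truth_conditional(claims_f, counts, alpha, beta):
    """Return p(t_f = 1 | t_-f, o, s) for one attribute value f.

    claims_f : list of (source, o) pairs, o in {0, 1}, the observations of f
    counts   : counts[source][i][j] = number of observations with value j made by
               that source on other attribute values whose current truth label is i
    alpha    : alpha[i][j] pseudo-counts (alpha[0] for FPR, alpha[1] for sensitivity)
    beta     : beta[i] pseudo-counts of the prior on t_f
    """
    weights = []
    for i in (0, 1):
        w = beta[i]
        for source, o in claims_f:
            n = counts[source][i]
            # Smoothed share of this source's observations equal to o under label i.
            w *= (alpha[i][o] + n[o]) / (alpha[i][0] + alpha[i][1] + n[0] + n[1])
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])


# Hypothetical source "a": on other values labelled true it reported 1 three times
# and 0 once; on values labelled false it reported 0 four times.
counts = {"a": {0: [4, 0], 1: [1, 3]}}
alpha = {0: [1.0, 1.0], 1: [1.0, 1.0]}
beta = {0: 1.0, 1: 1.0}
p_true = truth_conditional([("a", 1)], counts, alpha, beta)
print(round(p_true, 3))  # 0.8
```

In a full sampler this probability would be used to redraw $t_f$ for every attribute value in turn, updating the counts after each draw; only the single conditional is shown here.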
Once $p(t_f = i \mid t_{-f}, o, s)$ has been obtained, the FPR (false positive rate) and the Sensitivity at the next time instant can be estimated as follows:
Figure BDA0003034480930000109
Figure BDA0003034480930000111
where
Figure BDA0003034480930000112
denotes, summed over the elements c of the observation sets $C_f$ of all attack tags f, the probability that the truth label of attack tag $o_f$ of entity o takes the value i when source $s_c$ makes an observation with value j. Here $i \in \{0, 1\}$, $j \in \{0, 1, \ldots, |F|\}$, and $|F|$ is the number of elements of the attack set F; the remaining parameters have the same meanings as the identically named parameters above. Finally, the precision of each source can be estimated in the same way:
Figure BDA0003034480930000113
where precision denotes the precision of each source, and the remaining parameters have the same meanings as the identically named parameters above.
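As an illustration of how the per-source quality measures can be computed from expected counts, the sketch below uses smoothed ratios in which the prior pseudo-counts are added to the expected counts. This form is an assumption borrowed from posterior-mean estimates in LTM-style truth discovery models, since the original equations are only available as images; the function name `source_quality` and the example counts are hypothetical.

```python
def source_quality(expected, alpha0, alpha1):
    """Smoothed quality estimates for one source.

    expected[i][j] : expected number of the source's observations with value j on
                     attribute values whose truth label is i
    alpha0 = (a01, a00) and alpha1 = (a11, a10), as in steps (1) and (2)
    """
    a01, a00 = alpha0
    a11, a10 = alpha1
    tn, fp = expected[0][0], expected[0][1]   # truth 0: observed 0 / observed 1
    fn, tp = expected[1][0], expected[1][1]   # truth 1: observed 0 / observed 1
    fpr = (fp + a01) / (fp + tn + a01 + a00)
    sensitivity = (tp + a11) / (tp + fn + a11 + a10)
    precision = (tp + a11) / (tp + fp + a11 + a01)
    return fpr, sensitivity, precision


# Hypothetical expected counts for one source.
fpr, sensitivity, precision = source_quality(
    {0: [8.0, 1.0], 1: [2.0, 17.0]}, alpha0=(1.0, 1.0), alpha1=(1.0, 1.0))
print(round(precision, 3))  # 0.9
```

Following the time-varying scheme of the model, estimates of this kind would serve as the parameters carried over to the next time node.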
Entities and relationships are defined as follows:
First, the concepts of network security entity and entity relationship are defined. A knowledge graph captures specific pieces of information and the associations between them, and entities are the abstract expression of concepts and of the relationships between concepts, so a good entity definition helps express clearly the information and relationships that the knowledge graph contains. Here, a network security entity is described in terms of an atomic attack. An atomic attack is the smallest attack unit within a single attack and can be understood as the smallest step of the attack.
FIG. 4 is a schematic diagram of the atomic attack entities and their relationships as defined in the present invention. In the atomic attack graph, an atomic attack is represented by a vertex and corresponds in practice to one exploitation of a vulnerability. Vulnerabilities are tied to software and hardware, and carrying out an attack depends on the attack condition, the attack mode, the attack effect, and so on. For atomic attacks the invention defines 4 entities, namely software, hardware, vulnerability, and attack, where an attack has 3 attributes: attack condition, attack mode, and attack effect. The relationships between entities are of 2 kinds: "existence" and "utilization".
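The entity and relationship definitions above can be written down as a small schema. The sketch below is only one possible encoding: the class names `Entity` and `Relation`, the identifiers `exists` and `exploits` chosen for the "existence" and "utilization" relationships, and all example values are assumptions made for illustration.

```python
from dataclasses import dataclass

ENTITY_TYPES = ("software", "hardware", "vulnerability", "attack")
RELATION_TYPES = ("exists", "exploits")  # "existence" and "utilization"


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str
    attributes: tuple = ()  # (key, value) pairs; used by attack entities

    def __post_init__(self):
        if self.kind not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.kind}")


@dataclass(frozen=True)
class Relation:
    head: Entity
    kind: str
    tail: Entity

    def __post_init__(self):
        if self.kind not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.kind}")


# Placeholder example: a vulnerability exists in a piece of software,
# and an atomic attack exploits that vulnerability.
software = Entity("example operating system", "software")
vulnerability = Entity("example remote code execution flaw", "vulnerability")
attack = Entity("example atomic attack", "attack", (
    ("condition", "network access to the vulnerable service"),
    ("mode", "remote code execution"),
    ("effect", "attacker-controlled code runs on the host"),
))
graph = [
    Relation(software, "exists", vulnerability),
    Relation(attack, "exploits", vulnerability),
]
print(len(graph))  # 2
```

Validating the type names at construction time keeps malformed triples out of the graph before they reach storage.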
Step 4: Perform network security entity recognition on the prepared network security threat intelligence data set.
As described above, the APT reports from step 2 are annotated with the BIO labeling method. A sentence in an APT report document is written $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$, where $x_i$ is the $i$-th character of sentence X. Under the BIO labeling method, recognizing the network security entities in sentence X is equivalent to producing a label sequence $L_X = [l]_N$.
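A short sketch of BIO labeling may help: given entity spans over a sequence, each position receives `B-<type>` at the start of an entity, `I-<type>` inside it, and `O` elsewhere. The function name `bio_tags`, the entity type name `SOFTWARE`, and the example sentence are hypothetical, and the example labels word tokens rather than individual characters for readability.

```python
def bio_tags(tokens, spans):
    """Convert entity spans into a BIO label sequence.

    spans: list of (start, end_exclusive, entity_type) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # inside the entity
    return tags


tokens = ["The", "malware", "targets", "Windows", "10", "hosts"]
tags = bio_tags(tokens, [(3, 5, "SOFTWARE")])
print(tags)  # ['O', 'O', 'O', 'B-SOFTWARE', 'I-SOFTWARE', 'O']
```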
The invention trains a BiLSTM-CRF (bidirectional long short-term memory network with conditional random field) model on the labeled APT report documents, as shown in FIG. 5, which is a structural schematic diagram of the BiLSTM-CRF model used for network security entity recognition. In the figure, CRF denotes the conditional random field; bi denotes the output of the i-th backward network; fi denotes the output of the i-th forward network; ci denotes the i-th text vector; and B-LOC, E-LOC, and O in the CRF layer denote start, end, and outside. Through its forward and backward passes, the model extracts both the features of the text before the i-th character and the features of the text after it, which improves its ability to learn the character's representation. The CRF (conditional random field) model gives the conditional probability distribution of a set of output random variables given a set of input random variables.
The CRF model is as follows. Given an input sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$, let S be the output score matrix of the BiLSTM (bidirectional long short-term memory) network, of dimension $N \times K$, where K is the number of label types and $S_{i,j}$ is the score of the j-th label for the i-th word. The score Z of a predicted label sequence $y = [y_1, \ldots, y_i, \ldots, y_N]$ is defined as:
Figure BDA0003034480930000114
where T is a transition matrix of dimension K + 2. The probability of the generated label sequence y is:
Figure BDA0003034480930000121
The log-likelihood of the correct label sequence is then maximized by maximum likelihood estimation:
Figure BDA0003034480930000122
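The scoring and normalization just described can be sketched as follows, assuming the widely used BiLSTM-CRF formulation in which the score of a label sequence is the sum of its emission scores from S and its transition scores from T, with two extra indices of T reserved for start and end markers. That formulation, the function names, and the example matrices are assumptions; the partition function is computed by brute-force enumeration, which is only feasible for tiny inputs and stands in for the forward algorithm used in practice.

```python
import math
from itertools import product


def sequence_score(S, T, y):
    """Z(X, y): emission scores plus transition scores, with START/END markers.

    S : N x K emission score matrix from the BiLSTM
    T : (K+2) x (K+2) transition scores; index K is START and K+1 is END
    y : list of N label indices in range(K)
    """
    K = len(S[0])
    path = [K] + list(y) + [K + 1]
    transitions = sum(T[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(S[i][label] for i, label in enumerate(y))
    return transitions + emissions


def log_likelihood(S, T, y):
    """log p(y | X) = Z(X, y) - log of the sum of exp(Z(X, y')) over all y'."""
    N, K = len(S), len(S[0])
    scores = [sequence_score(S, T, list(c)) for c in product(range(K), repeat=N)]
    m = max(scores)  # subtract the maximum for numerical stability
    log_partition = m + math.log(sum(math.exp(s - m) for s in scores))
    return sequence_score(S, T, y) - log_partition


# Hypothetical scores for a 3-position sentence and K = 2 labels.
S = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
T = [[0.1, 0.4, 0.0, 0.2],
     [0.3, 0.2, 0.0, 0.1],
     [0.5, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
total = sum(math.exp(log_likelihood(S, T, list(y)))
            for y in product(range(2), repeat=3))
print(round(total, 6))  # 1.0
```

The final check confirms that the probabilities of all label sequences sum to one, which is the defining property of the normalization.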
Step 5: Extract network security entity relationships.
Network security entity relationship extraction uses an attention-based bidirectional long short-term memory (Att-BiLSTM) model. The model has 5 layers: an input layer, a word embedding layer, a BiLSTM layer, an Attention layer, and an output layer (the CRF layer of the BiLSTM-CRF model is replaced by the Attention layer, and the output layer becomes a softmax layer). FIG. 6 is a structural schematic diagram of the Att-BiLSTM model used for network security entity relationship extraction in the present invention. In the figure, Si denotes the i-th text vector; O, B-A, and I-A in the output layer denote outside, beginning of A, and inside of A.
The word embedding layer represents a sentence $X = [x]_N = [x_1, \ldots, x_i, \ldots, x_N]$ of an APT report as a matrix. Words with similar meanings lie close together in the embedding space, which indicates that they may be related.
The Attention layer introduces weighting in order to highlight the important parts of the output. Let the output of the BiLSTM layer be $B = [b]_T = [b_1, \ldots, b_j, \ldots, b_T]$. The parameter matrix W then satisfies:
$$S = \tanh(B)$$
$$\alpha = \mathrm{softmax}(W^{T} S)$$
$$r = B \alpha^{T}$$
Here $\alpha$ is the attention weight coefficient and r is the weighted sum of the BiLSTM output B. Finally, the representation vector $B^{*} = \tanh(r)$ is produced by a nonlinear function. $B^{*}$ is then fed into a fully connected neural network that maps it to a label vector, and the predicted label is obtained through a softmax function.
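The three attention formulas can be traced in a dependency-free sketch. It treats the parameter W as a single weight vector `w` of the same dimension as each hidden state, which is an assumption, as are the function name `attention_pool` and the example values.

```python
import math


def attention_pool(B, w):
    """Attention pooling over BiLSTM outputs.

    B : list of T hidden vectors b_1..b_T, each of dimension d
    w : weight vector of dimension d
    Returns (alpha, b_star): attention weights and the pooled representation.
    """
    # S = tanh(B), scored against w: one scalar score per time step.
    scores = [sum(wk * math.tanh(bk) for wk, bk in zip(w, b)) for b in B]
    m = max(scores)                              # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]            # alpha = softmax(w^T S)
    d = len(w)
    r = [sum(alpha[j] * B[j][k] for j in range(len(B))) for k in range(d)]  # r = B alpha^T
    return alpha, [math.tanh(x) for x in r]      # B* = tanh(r)


# Hypothetical BiLSTM outputs for a 2-step sequence with hidden size 2.
alpha, b_star = attention_pool([[0.0, 0.0], [1.0, 1.0]], [1.0, 1.0])
print(alpha[1] > alpha[0])  # True
```

The second time step receives the larger weight because its hidden state aligns more strongly with `w`; in a trained model `w` is learned so that the steps most indicative of a relation dominate the pooled vector.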
Step 6: Organize the data.
Because threat intelligence data is multi-source and heterogeneous, the invention uses a non-relational database, MongoDB, to organize and store the data, and all data is stored as key-value pairs. MongoDB offers very high performance and flexible data storage, which makes it suitable for storing both the threat intelligence and the generated network security knowledge graph model.
In the implementation of the invention, the software environment is Windows 10, the implementation language is Python 3, the deep learning framework is PyTorch, and the database is the non-relational database MongoDB.
Example 2
This embodiment provides a network security knowledge graph generation method based on threat intelligence and tests the distributed threat intelligence crawling system.
The developed distributed threat intelligence crawling system is compared with a single-machine threat intelligence collection system to verify that it is considerably more efficient. Taking common open-source threat intelligence sources as an example, the distributed crawler system was configured with 1 master node and 2 slave nodes; after 5 days of continuous operation, a total of 11 thousand web page records had been stored in the database. The number of pages crawled at each point in time is shown in FIG. 7, which is the data collection time chart of the distributed crawler system for threat intelligence data collection developed in the present invention.
The distributed crawler system and a crawler running in a single-machine environment were tested side by side, and the number of pages each crawled was recorded. The distributed crawler project was deployed in a Docker container cluster and in a virtual machine cluster, configured with 1 Master and 2 Slave nodes, each running Ubuntu 16.04 and Python 2.7 with 8 GB of memory. In the experiment, the total number of pages crawled by the 2 Slave nodes within a given time was far higher than the number crawled in single-machine operation, which demonstrates that the distributed system does improve operating efficiency. The comparison of operating efficiency over time is shown in FIG. 8, which compares the distributed crawler system for threat intelligence data collection developed in the present invention with a single-machine crawler system. The number of pages grabbed at each point in time shows that the distributed crawler system clearly outperforms the single-machine crawler system.
Example 3
The embodiment provides a network security knowledge graph generating method based on threat intelligence, and comparison is carried out aiming at the effect of a threat intelligence data quality improvement algorithm.
The algorithm provided by the invention is compared with other truth discovery algorithms in terms of how well each improves the quality of entity attributes in threat intelligence data. The test criteria are the accuracy, recall, and F1 values commonly used for truth discovery models. The truth discovery algorithms used for comparison are 3-Estimates, Voting, and LTM. The results are shown in Table 1. The quality improvement algorithm of the invention improves the quality of threat intelligence data more than the existing algorithms do.
Table 1 is a table of comparison results of different data quality improvement algorithm effects in the embodiment of the present invention.
Algorithm     Accuracy   Recall   F1
Proposal      0.935      0.960    0.987
3-Estimates   0.874      0.903    0.927
Voting        0.840      0.867    0.913
LTM           0.924      0.865    0.966
In the table: Proposal denotes the algorithm provided by the invention, 3-Estimates denotes the 3-Estimates parameter estimation algorithm, Voting denotes the voting algorithm, and LTM denotes the latent truth model algorithm.
Example 4
The embodiment provides a network security knowledge graph generating method based on threat intelligence, and the network security knowledge graph generating method is used for comparing the network security entity identification effects in the threat intelligence.
The network security entity recognition model of the invention and existing entity recognition models are tested on the remaining 10 labeled APT report documents. The test criteria are the accuracy, precision, recall, and F1 values commonly used in entity recognition. The entity recognition models used for comparison are CRF, LSTM, and LSTM-CRF. The results are shown in Table 2. The network security entity recognition model provided by the invention recognizes network security entities in threat intelligence better than the existing models do.
Table 2 is a table comparing the effects of different network security entity identification models in the embodiment of the present invention.
Figure BDA0003034480930000131
Figure BDA0003034480930000141
In the table: CRF denotes the conditional random field algorithm, LSTM denotes the long short-term memory network algorithm, BiLSTM denotes the bidirectional long short-term memory network algorithm, and BiLSTM-CRF denotes the bidirectional long short-term memory network with conditional random field algorithm.
Example 5
The embodiment provides a network security knowledge graph generating method based on threat intelligence, and the network security knowledge graph generating method is used for comparing network security entity relation extraction effects in the threat intelligence.
The network security entity relationship extraction model of the invention and existing entity relationship extraction models are tested on the remaining 10 APT report documents. The test criteria are the precision, recall, and F1 values commonly used in entity relationship extraction. The entity relationship extraction models used for comparison are CRF, LSTM, BiLSTM, and BiLSTM-CRF. The results are shown in Table 3. The network security entity relationship extraction model provided by the invention extracts network security entity relationships from threat intelligence better than the existing models do.
Table 3 is a table comparing the effects of the different network security entity relationship extraction models in the embodiment of the present invention.
Model                  Accuracy   Precision   Recall   F1
CRF                    0.9041     0.8084      0.7963   0.7892
LSTM                   0.9163     0.8162      0.8046   0.8018
BiLSTM                 0.9265     0.8339      0.8262   0.8491
BiLSTM-CRF             0.9374     0.8674      0.8344   0.8411
BiLSTM-CRF-Attention   0.9405     0.8652      0.8748   0.8751
In the table: BiLSTM-CRF-Attention denotes the bidirectional long short-term memory network with conditional random field and attention mechanism algorithm.
Example 6
The embodiment provides a network security knowledge graph generating method based on threat intelligence, and relates to a network security knowledge graph example based on the threat intelligence.
As shown in fig. 9, fig. 9 is a diagram of an example of the organization of threat intelligence data related to the Windows system in embodiment 5 of the present invention.
After network security entity recognition and relationship extraction have been carried out on the various kinds of threat intelligence data, the threat-intelligence-based network security knowledge graph can effectively organize the entity data and relationships found in them and support association analysis of the data. FIG. 9 is a visualization, produced with the graphviz module in Python 3, of the associated data stored in MongoDB. The figure shows that the Win10 system in the Windows family has remote desktop services remote code execution vulnerabilities, and that four vulnerabilities, namely CVE-2019-, can be exploited. CVE stands for Common Vulnerabilities and Exposures.
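As a rough illustration of the visualization step, the sketch below turns (head, relation, tail) triples into Graphviz DOT text using only string formatting, so it does not depend on any particular binding. The function name `to_dot`, the relation names, and the triples are hypothetical; the vulnerability identifier is a placeholder, not one of the identifiers shown in FIG. 9.

```python
def to_dot(triples):
    """Render (head, relation, tail) triples as Graphviz DOT source text."""
    lines = ["digraph knowledge_graph {"]
    for head, relation, tail in triples:
        lines.append(f'  "{head}" -> "{tail}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)


# Placeholder triples in the spirit of the Windows example.
triples = [
    ("Windows", "includes", "Win10"),
    ("Win10", "exists", "remote desktop services RCE vulnerability"),
    ("remote desktop services RCE vulnerability", "identified as", "CVE-YYYY-NNNN"),
]
dot_source = to_dot(triples)
print(dot_source.splitlines()[1])  #   "Windows" -> "Win10" [label="includes"];
```

The resulting text can be passed to any Graphviz renderer to produce an image of the subgraph.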
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A network security knowledge graph generation method based on threat intelligence, characterized in that the method comprises the following steps:
step 1, efficient distributed threat intelligence data collection, wherein a distributed threat intelligence data crawling system is built on the Scrapy framework, and a scrapy-redis scheduler dispatches the crawler programs, which extract the data into structured form and then store it in Redis and MongoDB databases;
step 2, building a network security threat intelligence data set through the distributed threat intelligence crawling system;
step 3, improving the quality of the network security threat intelligence data;
step 4, performing network security entity recognition on the prepared network security threat intelligence data set;
step 5, extracting network security entity relationships;
step 6, organizing the data.
2. The method for generating network security knowledge graph based on threat intelligence as claimed in claim 1, wherein: the high efficiency distributed threat intelligence data collection comprises: distributed crawler system architecture, crawler strategy, crawler implementation and data storage.
3. The method of claim 2, wherein the distributed crawler system architecture comprises: the threat intelligence collection system consists of a distributed crawler system and the deployment of an underlying environment; the distributed crawler system is obtained by adapting the conventional Scrapy crawler framework and adding a Redis database; the underlying environment is a multi-node distributed system built on a Docker container cluster, with Kubernetes as the cluster management tool; the distributed crawler system adopts a Master/Slave structure with one Master node and several Slave nodes, wherein the Master node deploys a Redis database that stores and schedules the requests to be crawled, the Slave nodes deploy the main crawler program to crawl web pages and parse out the extracted data, and every Slave node stores the parsed web page data in the same MongoDB database.
4. The method of claim 2, wherein the crawler strategy comprises: on the Master side, the initial links are stored in Redis, the Key identifies the scheduling queue holding the pages to be crawled next, and each URL is usually the link to a single page; the crawler is then started, obtains the start URL from Redis, and downloads the web page that the URL points to; the response is parsed according to predefined rules to obtain either page data or detail-page links, wherein page data is parsed directly according to the page format, and for a detail-page link the crawler is restarted with the link changed to the detail-page link in order to obtain the final detail data; the crawler program keeps fetching URLs from the scheduling queue and crawling the next URL; if no URL is available, it enters a waiting state; on the Slave side, the downloader executes the download tasks and the extracted fields are parsed; the crawler program obtains a URL from the scheduling queue under the Redis Key and then downloads the corresponding web page; the response is parsed according to the predefined field rules, and the resulting fields are passed through a text deduplication module and stored in the MongoDB database, until the value under the Key is empty.
5. The method of claim 2, wherein the crawler implementation comprises: a scheduler module responsible for task scheduling across the whole system, whose main functions are: receiving requests sent by the engine; returning URLs to the download module; and storing URLs in the Redis database after deduplication; each crawler subtask passes the URLs it obtains to the scheduler through the engine, and the scheduler deduplicates them and stores them in a Redis queue; on receiving a request from the engine, the scheduler returns a URL to the downloader; a crawling and downloading module that integrates the functions of the spider and the downloader, wherein the spider processes the web page information returned by the downloader, extracts the directory URLs and detail-page URLs it contains, and extracts the key fields, which are then stored in the MongoDB database; the downloader downloads the URLs returned by the scheduler and passes the downloaded web page information to the spider; the module is responsible for crawling the corresponding websites: it first takes the start URL, extracts further URLs after crawling, and passes them through deduplication; the scheduling module then distributes URLs from Redis to the Slave nodes;
the data storage comprises: a storage module that implements two functions, wherein URLs are stored in Redis, which is deployed on the Master node, and the parsed web page content is stored in the MongoDB database, which is deployed on the Master node; extracting the stored web page content is the final objective of the system, and the distributed crawler crawls the web page content so that a data processing program can extract the required information.
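The scheduler's deduplicate-then-enqueue behaviour can be illustrated with a short sketch. It is an assumption-laden stand-in rather than the claimed system: the class name `Scheduler`, the SHA-1 fingerprint, the trailing-slash normalisation and the example URLs are all hypothetical, and a Python set and deque replace the Redis structures.

```python
import hashlib
from collections import deque


class Scheduler:
    """In-memory stand-in for a Redis-backed scheduler (illustrative only)."""

    def __init__(self):
        self._fingerprints = set()  # would be a Redis set shared by all nodes
        self._queue = deque()       # would be a Redis list holding pending URLs

    @staticmethod
    def fingerprint(url):
        # Normalise trivially (strip whitespace and a trailing slash), then hash.
        normalised = url.strip().rstrip("/")
        return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

    def enqueue(self, url):
        """Store the URL unless an equivalent one was already seen."""
        fp = self.fingerprint(url)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        self._queue.append(url)
        return True

    def next_url(self):
        """Return the next URL for a downloader, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


scheduler = Scheduler()
accepted = [
    scheduler.enqueue(u)
    for u in [
        "https://example.org/a",
        "https://example.org/a/",   # duplicate after normalisation
        "https://example.org/b",
    ]
]
```

Storing fingerprints rather than full URLs keeps the deduplication set compact, which matters when the set is shared by many crawler nodes.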
6. The method for generating a network security knowledge graph based on threat intelligence as claimed in claim 1, wherein the network security threat intelligence data set is produced by the distributed threat intelligence crawling system and comprises:
(1) vulnerability data: collected from the major vulnerability publishing platforms; the data types include the type of system in which the vulnerability occurs, the system version, and the exploitation method;
(2) APT attack chain data: collected from the APTnodes platform, comprising a total of 528 APT reports from the last 10 years;
(3) malware text data: including the name, category, common functions, hash, and targeted system platform of the malware in threat intelligence; this data is collected from the threat intelligence source AlienVault;
(4) security community discussion data: collected from the StackExchange website, consisting of text about recent security events;
(5) security RSS subscription data: collected from the major network security RSS feeds, consisting of recent network security news.
7. The method for generating a network security knowledge graph based on threat intelligence as claimed in claim 1, wherein improving the quality of the network security threat intelligence data comprises the following steps:
step (1), FPR (false positive rate): for each source k ∈ S, a corresponding false positive rate $\phi_k^0$ is generated, whose value is (1 − specificity); it follows a Beta distribution with hyperparameter $\alpha_0 = (\alpha_{0,1}, \alpha_{0,0})$, where $\alpha_{0,1}$ is the prior count of false positive samples for each source and $\alpha_{0,0}$ is the prior count of true negative samples for each source:
$$\phi_k^0 \sim \mathrm{Beta}(\alpha_{0,1}, \alpha_{0,0})$$
from the second time node onward, $\phi_k^0$ is replaced by its value from the previous time node, so that the time-varying characteristics of the truth discovery model are calibrated with a second-order Markov process;
step (2), Sensitivity: for each source k ∈ S, a corresponding sensitivity $\phi_k^1$ is generated; it follows a Beta distribution with hyperparameter $\alpha_1 = (\alpha_{1,1}, \alpha_{1,0})$, where $\alpha_{1,1}$ is the prior count of true positive samples for each source and $\alpha_{1,0}$ is the prior count of false negative samples for each source:
$$\phi_k^1 \sim \mathrm{Beta}(\alpha_{1,1}, \alpha_{1,0})$$
from the second time node onward, $\phi_k^1$ is replaced by its value from the previous time node, so that the time-varying characteristics of the truth discovery model are calibrated with a second-order Markov process;
step (3), Att (attack label of a fact): for each entity attribute f ∈ F, where F is the set of observed values of all attributes under the entity, a prior truth probability $\theta_f$ is generated; it follows a Beta distribution with hyperparameter $\beta = (\beta_1, \beta_0)$, where $\beta_1$ is the prior count of correct entity-attribute samples and $\beta_0$ is the prior count of incorrect entity-attribute samples:
$$\theta_f \sim \mathrm{Beta}(\beta_1, \beta_0)$$
from the second time node onward, $\theta_f$ is replaced by the $\theta_f$ of the previous time node, so that the time-varying characteristics of the truth discovery model are calibrated with a second-order Markov process;
step (4), Truth label: the attribute truth label of each entity attribute is generated, indicating whether the observed value is correct; $t_f$ is the attribute truth label, a binary Boolean variable that follows a Bernoulli distribution with parameter $\theta_f$, where the prior probability $\theta_f$ is the probability that the attribute label $t_f$ is correct:
$$t_f \sim \mathrm{Bernoulli}(\theta_f)$$
step (5), Observation: the entity-attribute observation label; for each observation $c \in C_f$ of an entity attribute, its source is denoted $s_c$; the observation label $o_c$ of c follows a Bernoulli distribution with parameter $\phi_{s_c}^{t_f}$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^{t_f})$$
wherein, if $t_f = 0$, $o_c$ follows a Bernoulli distribution whose parameter $\phi_{s_c}^0$ is the false positive rate of source $s_c$:
$$o_c \sim \mathrm{Bernoulli}(\phi_{s_c}^0)$$
and if $t_f = 1$, $o_c$ follows a Bernoulli distribution whose parameter $\phi_{s_c}^1$ is the sensitivity of source $s_c$.
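Steps (1) to (5) describe a generative process, which can be simulated directly. The sketch below is a hedged illustration using only the Python standard library; the function name `sample_observations`, the dictionary layout and the example hyperparameter values are assumptions made for the example, not values taken from the claim.

```python
import random


def sample_observations(sources_of, alpha0, alpha1, beta, seed=0):
    """Draw one synthetic data set from the five-step generative process.

    sources_of maps each attribute f to the list of sources that report on it.
    alpha0 = (a01, a00), alpha1 = (a11, a10) and beta = (b1, b0) are the
    Beta prior counts named in steps (1) to (3).
    """
    rng = random.Random(seed)
    sources = sorted({s for srcs in sources_of.values() for s in srcs})
    fpr = {s: rng.betavariate(*alpha0) for s in sources}    # step (1)
    sens = {s: rng.betavariate(*alpha1) for s in sources}   # step (2)

    truth, obs = {}, []
    for f, srcs in sources_of.items():
        theta = rng.betavariate(*beta)                      # step (3)
        truth[f] = int(rng.random() < theta)                # step (4)
        for s in srcs:                                      # step (5)
            p = sens[s] if truth[f] == 1 else fpr[s]
            obs.append((f, s, int(rng.random() < p)))
    return fpr, sens, truth, obs


# Hypothetical setup: two attributes, two sources, fairly reliable priors.
layout = {"f1": ["A", "B"], "f2": ["A"]}
fpr, sens, truth, obs = sample_observations(layout, (1, 9), (9, 1), (1, 1))
```

Simulated data of this kind is mainly useful for checking that an inference routine recovers the truth labels it was given.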
The model is solved as follows:
given the observation c of each entity attribute, the conditional probability under the model is:
$$p(o_c \mid \theta_f, \phi_{s_c}^0, \phi_{s_c}^1) = p(o_c \mid \phi_{s_c}^0)\,(1 - \theta_f) + p(o_c \mid \phi_{s_c}^1)\,\theta_f$$
in the above formula, p denotes the conditional probability that the observation label takes the value $o_c$, given the prior truth probability $\theta_f$ and the false positive rate $\phi_{s_c}^0$ and sensitivity $\phi_{s_c}^1$ of the source; c is the observation, f is the attack label, and $s_c$ is the source from which observation c originates;
the full likelihood function containing all variables and hyperparameters is written as:
$$p(o, s, t, \theta, \phi^0, \phi^1 \mid \alpha_0, \alpha_1, \beta) = \prod_{s \in S} p(\phi_s^0 \mid \alpha_0)\, p(\phi_s^1 \mid \alpha_1) \times \prod_{f \in F} \Big( p(\theta_f \mid \beta)\, \theta_f^{t_f} (1 - \theta_f)^{1 - t_f} \prod_{c \in C_f} p(o_c \mid \phi_{s_c}^{t_f}) \Big)$$
in the above formula, p denotes the conditional probability of the entity observations o, the sources s, the truth labels t, the prior probability parameter set $\theta$ and the parameter sets $\phi^0$ and $\phi^1$, given the false positive rate hyperparameter $\alpha_0$, the sensitivity hyperparameter $\alpha_1$ and the prior truth probability hyperparameter $\beta$; S is the set of all sources, F is the set of attack labels, f is each attack label element of F, $\theta_f$ is the prior probability of f, $t_f$ is the truth value of f, $C_f$ is the set of observations of f, and c is each observation element of that set;
given the observed attribute data, the likelihood function is solved with the Gibbs sampling algorithm, an MCMC method:
$$\hat{t}_{MAP} = \arg\max_{t} \int\!\!\int\!\!\int p(o, s, t, \theta, \phi^0, \phi^1)\, d\theta\, d\phi^0\, d\phi^1$$
where $\hat{t}_{MAP}$ is the maximum a posteriori estimate obtained from this formula, and the remaining parameters have the same meanings as the identically named parameters above;
the following formula is then solved:
$$p(t_f = i \mid t_{-f}, o, s) \propto \beta_i \prod_{c \in C_f} \frac{\alpha_{i,o_c} + n^{-f}_{s_c,i,o_c}}{\alpha_{i,1} + \alpha_{i,0} + n^{-f}_{s_c,i,1} + n^{-f}_{s_c,i,0}}$$
where p denotes the conditional probability that the truth value $t_f$ of f equals i, given $t_{-f}$, the entity observations o and the sources s; i is the attack label value of f, with range {0, 1}; $t_{-f}$ is the set of all values in F except f; $n^{-f}_{s_c,i,j}$ denotes the number of observations from source $s_c$ whose observation value is j, whose attack label is not f, and whose truth label is i; $C_{-f}$ denotes the set without attack label f, c' is each element of $C_{-f}$, and $t_{f_{c'}}$ denotes the truth value of the attack label to which c' belongs; the remaining parameters have the same meanings as the identically named parameters above;
after $p(t_f = i \mid t_{-f}, o, s)$ is obtained, the FPR (false positive rate) and the Sensitivity for the next time node are estimated as follows:
Figure FDA0003034480920000043
Figure FDA0003034480920000044
wherein
Figure FDA0003034480920000045
denotes, over the elements c of the observation sets $C_f$ of all attack labels f, with $s_c$ the source that made observation c and the observation of the entity taking the value j, the sum of the probabilities that the truth label takes the value i, where i ∈ {0, 1}, j ∈ {0, 1, ..., |F|}, and |F| is the number of elements of the attack set F; the remaining parameters have the same meanings as the identically named parameters above; finally, the precision of each source can be estimated in the same way:
Figure FDA0003034480920000046
where precision denotes the precision of each source.
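The solution procedure of this claim can be illustrated with a small collapsed Gibbs sampler. The sketch is written under explicit assumptions: it uses the standard collapsed update for a Beta-Bernoulli truth discovery model and posterior-mean estimates built from expected counts, which may differ in detail from the formulas shown in the claim's figures; the function names, the `alpha[i][j]` and `beta[i]` indexing, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def gibbs_truth(claims, alpha, beta, iters=100, burn_in=20, seed=0):
    """Estimate p(t_f = 1) for every attribute f by collapsed Gibbs sampling.

    claims: list of (f, s, o) with o in {0, 1}.
    alpha[i][j]: prior count for truth label i and observation j.
    beta[i]: prior count for truth label i.
    """
    rng = random.Random(seed)
    by_fact = defaultdict(list)
    for f, s, o in claims:
        by_fact[f].append((s, o))
    t = {f: rng.randint(0, 1) for f in by_fact}      # random initial labels
    n = defaultdict(lambda: [[0, 0], [0, 0]])        # n[s][i][j] counts
    for f, s, o in claims:
        n[s][t[f]][o] += 1

    p_true = dict.fromkeys(by_fact, 0.0)
    for it in range(iters):
        for f, cs in by_fact.items():
            for s, o in cs:                          # drop f's own claims: n^{-f}
                n[s][t[f]][o] -= 1
            w = [beta[0], beta[1]]
            for i in (0, 1):
                for s, o in cs:
                    w[i] *= (n[s][i][o] + alpha[i][o]) / (sum(n[s][i]) + sum(alpha[i]))
            p1 = w[1] / (w[0] + w[1])
            t[f] = int(rng.random() < p1)            # resample the truth label
            for s, o in cs:
                n[s][t[f]][o] += 1
            if it >= burn_in:                        # average the conditional probability
                p_true[f] += p1 / (iters - burn_in)
    return p_true


def source_quality(claims, p_true, alpha):
    """Posterior-mean sensitivity, false positive rate and precision per source."""
    e = defaultdict(lambda: [[0.0, 0.0], [0.0, 0.0]])  # expected counts
    for f, s, o in claims:
        e[s][1][o] += p_true[f]
        e[s][0][o] += 1.0 - p_true[f]
    quality = {}
    for s, c in e.items():
        tp, fn = c[1][1] + alpha[1][1], c[1][0] + alpha[1][0]
        fp, tn = c[0][1] + alpha[0][1], c[0][0] + alpha[0][0]
        quality[s] = {"sensitivity": tp / (tp + fn),
                      "fpr": fp / (fp + tn),
                      "precision": tp / (tp + fp)}
    return quality


# Toy data: three sources agree that "a" holds and "b" does not.
claims = [(f, s, o) for s in ("s1", "s2", "s3") for f, o in (("a", 1), ("b", 0))]
alpha = [[9, 1], [1, 9]]   # assumed priors: low false positive rate, high sensitivity
beta = [1, 1]
p_true = gibbs_truth(claims, alpha, beta)
quality = source_quality(claims, p_true, alpha)
```

Averaging the conditional probability `p1` instead of the sampled label is a design choice that reduces the variance of the estimate for short chains.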
8. The method for generating a network security knowledge graph based on threat intelligence as claimed in claim 1, wherein network security entity recognition is performed on the produced network security threat intelligence data set; the APT reports are annotated with the BIO labeling method, and a sentence in an APT report document is X = [x]_N = [x_1, ..., x_i, ..., x_N], where x_i is the i-th character of sentence X; under the BIO labeling method, identifying the network security entities in sentence X is equivalent to producing a label sequence L_X = [l]_N;
a BiLSTM-CRF model is trained on the labeled APT report documents, in which the forward pass extracts the features of the words before the i-th character and the backward pass extracts the features of the words after it; the CRF model is used to obtain the conditional probability distribution of one set of output random variables given a set of input random variables;
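A short example may make the BIO labeling concrete. The sketch below labels word tokens rather than individual characters for readability, and the sentence, the entity spans and the type names (`ACTOR`, `MALWARE`, `SYSTEM`) are hypothetical.

```python
def bio_tags(tokens, entities):
    """Label tokens with B-/I-/O tags.

    entities: list of (start, end, type) in token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


tokens = ["GroupX", "used", "FooLoader", "trojan", "against", "Windows", "hosts"]
entities = [(0, 1, "ACTOR"), (2, 4, "MALWARE"), (5, 6, "SYSTEM")]
tags = bio_tags(tokens, entities)
```

The resulting tag list plays the role of the label sequence L_X that the sequence model is trained to predict.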
the CRF model is as follows: given an input sentence X = [x]_N = [x_1, ..., x_i, ..., x_N], let S be the N×K output score matrix of the BiLSTM network, where K is the number of label categories and $S_{i,j}$ is the score of the j-th label for the i-th word; the score Z of a predicted label sequence y = [y_1, ..., y_i, ..., y_N] is defined as:
$$Z(X, y) = \sum_{i=0}^{N} T_{y_i, y_{i+1}} + \sum_{i=1}^{N} S_{i, y_i}$$
where T is a transition matrix of dimension K+2 (the K labels plus a start state $y_0$ and an end state $y_{N+1}$); the probability of generating the label sequence y is:
$$p(y \mid X) = \frac{e^{Z(X, y)}}{\sum_{\tilde{y}} e^{Z(X, \tilde{y})}}$$
then the log-likelihood of the correct labeling is obtained by maximum likelihood estimation:
$$\log p(y \mid X) = Z(X, y) - \log \sum_{\tilde{y}} e^{Z(X, \tilde{y})}$$
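The CRF scoring and log-likelihood can be checked numerically with a few lines of plain Python. This is a sketch under assumptions: it uses the common linear-chain formulation with explicit start and end states and a forward-algorithm partition function, and all score values below are made-up numbers.

```python
import math


def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(S, T, y, start, end):
    """Transition scores along start -> y -> end plus the emission scores."""
    path = [start] + list(y) + [end]
    trans = sum(T[a][b] for a, b in zip(path, path[1:]))
    emit = sum(S[i][tag] for i, tag in enumerate(y))
    return trans + emit


def log_partition(S, T, start, end):
    """Log of the summed exponentiated scores of all label sequences."""
    K = len(S[0])
    alpha = [T[start][k] + S[0][k] for k in range(K)]
    for i in range(1, len(S)):
        alpha = [
            _logsumexp([alpha[j] + T[j][k] for j in range(K)]) + S[i][k]
            for k in range(K)
        ]
    return _logsumexp([alpha[k] + T[k][end] for k in range(K)])


def log_likelihood(S, T, y, start, end):
    return sequence_score(S, T, y, start, end) - log_partition(S, T, start, end)


START, END = 3, 4                         # labels 0..2 plus start and end states
S = [[2.0, 0.1, 0.5], [0.3, 1.5, 0.2]]    # N x K scores for a two-token sentence
T = [
    [0.0, 1.0, -0.5, 0.0, 0.2],
    [-0.3, 0.5, 0.4, 0.0, 0.1],
    [0.6, -2.0, 0.3, 0.0, 0.0],
    [0.5, -2.0, 0.4, 0.0, 0.0],           # transitions out of the start state
    [0.0, 0.0, 0.0, 0.0, 0.0],            # end state row is never used
]
ll = log_likelihood(S, T, [0, 1], START, END)
```

Training maximises this log-likelihood over the labeled sentences; the forward recursion keeps the normaliser linear in sentence length instead of enumerating every label sequence.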
9. The method for generating a network security knowledge graph based on threat intelligence as claimed in claim 1, wherein extracting the network security entity relationships comprises:
extracting the network security entity relationships with an attention-based BiLSTM (Att-BiLSTM) model, which comprises an input layer, a word embedding layer, a BiLSTM layer, an Attention layer and an output layer;
wherein the word embedding layer represents a sentence X = [x]_N = [x_1, ..., x_i, ..., x_N] of an APT report as a matrix in which words with similar meanings are adjacent in the matrix space, so that the representations can express relations between words;
wherein the Attention layer introduces a weighting scheme to emphasize the important parts of the output;
wherein the output of the BiLSTM layer is B = [b]_T = [b_1, ..., b_j, ..., b_T], and the parameter matrix W satisfies the following formulas:
$$S = \tanh(B)$$
$$\alpha = \mathrm{softmax}(W^{T} S)$$
$$r = B \alpha^{T}$$
where $\alpha$ is the attention weight coefficient and r is the weighted sum of the BiLSTM output B; a representation vector $B^{*} = \tanh(r)$ is finally generated through a nonlinear function, and $B^{*}$ is then fed into a fully connected neural network that maps it to a label vector, from which the predicted label is obtained through a softmax function.
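The attention pooling formulas can be traced on a tiny example. The sketch assumes that B is stored as a d×T matrix (one column per time step) and that the parameter W reduces to a single weight vector `w` of length d; both choices, and the numbers used, are assumptions made for illustration.

```python
import math


def attention_pool(B, w):
    """Return (alpha, b_star) for a d x T matrix B and a length-d vector w."""
    d, T = len(B), len(B[0])
    S = [[math.tanh(v) for v in row] for row in B]                    # S = tanh(B)
    scores = [sum(w[k] * S[k][j] for k in range(d)) for j in range(T)]  # w^T S
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                                 # softmax over time steps
    r = [sum(B[k][j] * alpha[j] for j in range(T)) for k in range(d)]   # r = B alpha^T
    b_star = [math.tanh(v) for v in r]                                # tanh(r)
    return alpha, b_star


# Two hidden dimensions, three time steps (made-up values).
B = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
w = [1.0, 0.0]
alpha, b_star = attention_pool(B, w)
```

Because `w` only looks at the first hidden dimension here, the first time step receives the largest weight, while the constant second dimension passes through the weighted sum unchanged.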
10. The method for generating a network security knowledge graph based on threat intelligence as claimed in claim 1, wherein the data is organized and stored in a non-relational database, namely a MongoDB database, in which all data is stored as key-value pairs.
CN202110439459.1A 2021-04-23 2021-04-23 Threat information-based network security knowledge graph generation method Active CN113282759B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110439459.1A CN113282759B (en) 2021-04-23 2021-04-23 Threat information-based network security knowledge graph generation method

Publications (2)

Publication Number Publication Date
CN113282759A true CN113282759A (en) 2021-08-20
CN113282759B CN113282759B (en) 2024-02-20

Family

ID=77277242

Country Status (1)

Country Link
CN (1) CN113282759B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant