CN115237978A

CN115237978A - Open source threat information aggregation platform

Info

Publication number: CN115237978A
Application number: CN202210796520.2A
Authority: CN
Inventors: 何清林; 杨黎斌; 胡金灿; 王梦涵; 崔琳; 蔡晓妍; 戴航
Original assignee: Northwestern Polytechnical University; National Computer Network and Information Security Management Center
Current assignee: Northwestern Polytechnical University; National Computer Network and Information Security Management Center
Priority date: 2022-07-06
Filing date: 2022-07-06
Publication date: 2022-10-25

Abstract

The invention discloses an open source threat intelligence aggregation platform comprising a multi-source heterogeneous intelligence data acquisition module, a multi-source heterogeneous intelligence data fusion evaluation module, and a threat intelligence deep mining module. The fusion evaluation module performs identification and extraction, fusion evaluation, and normalized storage on the intelligence data collected by the acquisition module: identification and extraction generates structured threat intelligence; fusion evaluation screens the threat intelligence; normalized storage expands the information dimensions of the screened threat intelligence to form an Internet of Things (IoT) vulnerability aggregation database and a threat intelligence database. The deep mining module builds a fuzzy pattern matching model from the existing matching relationships between threat data and traffic information, matches the threat intelligence in the IoT vulnerability aggregation database and the threat intelligence database against traffic information, and thereby mines potential attack behavior.

Description

Open source threat information aggregation platform

Technical Field

The invention belongs to the technical field of information security, and particularly relates to an open source threat information aggregation platform.

Background

With the arrival of the Internet of Everything era, emerging industries represented by the Internet of Things (IoT) have become one of the seven strategic industries that China vigorously supports and develops. According to a forecast by the Qianzhan Industry Research Institute, China's IoT industry will exceed 2.7 trillion yuan by 2025. At the same time, because of inherent characteristics such as multi-source heterogeneity and ubiquitous openness, the IoT lets people enjoy the convenience of new technologies (cloud computing, big data, IoT, mobile internet, and artificial intelligence) while facing network threats that are increasingly complex and changeable, and new kinds of security attacks occur frequently. Specifically, the security of IoT devices faces the following challenges: (1) Open environment: users interact with terminals over open wireless channels, so IoT communication is vulnerable to malicious attacks such as eavesdropping, tampering, and replay, and private information such as the identity and location of the device owner is easily leaked. (2) Many device types and rapid turnover: large numbers of new IoT devices appear every day, hardware manufacturers generally have weak software and upgrade capabilities, and vulnerabilities in security protection are unavoidable. (3) Limited resources: the storage and computing capacity of IoT devices is often limited, so the IoT system environment is simplified, and complex security technologies designed for the traditional internet cannot be used directly on IoT devices.

Existing IoT systems lack targeted monitoring and protection and rely mainly on traditional internet security defenses: security devices such as firewalls and intrusion detection systems deployed at the network boundary or at special nodes, together with static detection methods such as heuristics and signatures, treat each attack vector as an independent path and inspect it separately, stage by stage. Lacking a global view, such defenses struggle against novel network threats that are carefully planned and frequently updated.

An important means of protecting IoT systems against these novel security threats is to mine network threat information in depth and introduce it into the full cycle of IoT security detection, so that malicious and hard-to-detect attack behavior can be actively discovered and defended against. IoT cyber threat intelligence (CTI) mining technology collects, mines, and identifies real-time threat information targeting IoT devices and converts it into threat intelligence. Threat intelligence generally refers to knowledge that can be used to resolve threats or respond to hazards, including threat sources, attack intentions, attack techniques, and attack target information. It is characterized by high knowledge density, high accuracy, and strong relevance, can provide strong data support for every stage of IoT device security analysis, and enables timely response and defense against polymorphic, complex, and highly intelligent threats and attacks.

The development of threat intelligence systems in China is still at an early stage. A number of strong threat intelligence companies have emerged in recent years, such as Qianxin, Microstep Online, and Tianji Youmeng, which are steadily investing in and strengthening the development of threat intelligence products, and threat intelligence has begun to be deployed in real scenarios at some vendors. However, these companies focus mainly on developing and applying commercial threat intelligence; their products mainly provide localized threat intelligence capabilities to enterprise users in vertical industries, and the data sources they collect and the application scenarios they serve are narrow. The depth and breadth of the intelligence urgently need to be improved. Open source threat intelligence receives comparatively little attention; its data dimensions are limited, its collection sources are few, and it is atomic (for example, IPs and hashes) and unstable. Effective and reliable means of mining, collecting, and evaluating the quality of threat intelligence are lacking, the corresponding network security analysis techniques based on open source threat intelligence lag behind, and no comprehensive threat intelligence service platform integrating intelligence mining, analysis, evaluation, and use has yet been formed. Some open source threat intelligence collection tools exist, such as ThreatIngestor and OSTrICa, which provide interfaces for automatically gathering information from public, internal, and commercial sources, but the intelligence data sources and information dimensions these tools can obtain are very limited.
As shown in the figure, taking ThreatIngestor as an example, its data can be obtained only from Twitter and a small number of RSS feeds, and its data covers only four or five dimensions, such as URLs and reference links, which falls far short of what threat behavior correlation analysis requires.

Disclosure of Invention

The invention aims to provide an open source threat intelligence aggregation platform that addresses problems of existing IoT threat intelligence platforms, such as insufficient levels of intelligence. It studies acquisition and parsing techniques for multi-source heterogeneous threat intelligence, collecting and aggregating multiple kinds of data such as security-related social media accounts, vulnerability PoC (proof of concept) disclosure platforms, and cutting-edge security scientific and technical literature; it studies multi-source heterogeneous intelligence data fusion techniques, achieving integration, extraction, refinement, and standardized output of threat intelligence data; and it studies threat intelligence deep mining based on traffic correlation, achieving deep association, collision, and analysis of intelligence and traffic data to form verifiable, high-quality threat intelligence.

The technical scheme adopted by the invention is as follows:

An open source threat intelligence aggregation platform comprises a multi-source heterogeneous intelligence data acquisition module, a multi-source heterogeneous intelligence data fusion evaluation module, and a threat intelligence deep mining module.

The multi-source heterogeneous intelligence data acquisition module comprises an IoT vulnerability aggregation platform and an IoT threat intelligence mining platform; the IoT vulnerability aggregation platform collects open source IoT vulnerability data, and the IoT threat intelligence mining platform collects threat intelligence data.

The multi-source heterogeneous intelligence data fusion evaluation module performs identification and extraction, fusion evaluation, and normalized storage on the intelligence data collected by the acquisition module: identification and extraction generates structured threat intelligence; fusion evaluation screens the threat intelligence; normalized storage expands the information dimensions of the screened threat intelligence to form an IoT vulnerability aggregation database and a threat intelligence database.

The threat intelligence deep mining module builds a fuzzy pattern matching model from the existing matching relationships between threat data and traffic information, matches the threat intelligence in the IoT vulnerability aggregation database and the threat intelligence database against traffic information, and thereby mines potential attack behavior.

Optionally, in the multi-source heterogeneous intelligence data fusion evaluation module, the multi-source heterogeneous threat intelligence is first given a structured definition; the structured threat intelligence then receives an intelligence source quality assessment score and an intelligence content quality level grading score; finally, a comprehensive credibility score is calculated by combining the intelligence source and the intelligence content.

Optionally, the intelligence source quality assessment score is obtained from the authority value and the hub value of each intelligence source; the intelligence source quality assessment score $S_{source}$ is the percentage conversion of the authority value obtained for each intelligence source.

The content authority vector A and the link authority (hub) vector H are initialized so that each has length 1. In the k-th iteration, the authority value of each intelligence source S is updated first; once the new vector A is obtained, the hub value of S is updated from it. The vectors A and H are then normalized, and the iteration repeats until convergence.
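The update formulas themselves are not reproduced in this text. The description (authority and hub values, normalization, iteration to convergence) matches the standard HITS scheme; assuming that scheme, and writing $T \to S$ for "source T references source S" (notation introduced here for illustration), the k-th iteration could be sketched as:

$$a^{(k)}(S) = \sum_{T \to S} h^{(k-1)}(T), \qquad h^{(k)}(S) = \sum_{S \to T} a^{(k)}(T)$$

$$A \leftarrow \frac{A}{\lVert A \rVert}, \qquad H \leftarrow \frac{H}{\lVert H \rVert}$$

Under this reading, a source that is referenced by many good hubs gains authority, and a source that references many good authorities gains hub value.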

Optionally, the intelligence content quality level grading assignment comprises calculating the similarity between threat intelligence items:

$$S(v_t, v_i) = \theta_1 \times S_{source} + \theta_2 \times S_{time} + \theta_3 \times S_{category} + \theta_4 \times S_{tag}$$

$$\theta_1 + \theta_2 + \theta_3 + \theta_4 = 1$$

where $S(v_t, v_i)$ is the similarity of two threat intelligence items, $S_{source}$ is the intelligence source similarity, $S_{time}$ is the time similarity, $S_{category}$ is the threat category similarity, and $S_{tag}$ is the threat description tag similarity; $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ are the weights of the four factors. The source similarity and the category similarity of two threat intelligence items are calculated as follows: $S_{source}$ is 1 when the two items come from the same source and 0 otherwise; $S_{category}$ is assigned in the same way.

Optionally, the time similarity of two threat intelligence items $v_t$ and $v_i$ is calculated as follows:

$$\Delta t(v_t, v_i) = |t(v_t, v_i)|$$

$$\mathrm{Min} = \min(\Delta t(v_t, v_i))$$

$$\mathrm{Max} = \max(\Delta t(v_t, v_i))$$

$$S_{time}(v_t, v_i) = 1 - D_{time}(v_t, v_i)$$
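The definition of the time distance $D_{time}$ is not reproduced in this text. Since Min and Max are defined and the distance is said to lie between 0 and 1, one plausible reading, offered here only as an assumption, is min-max normalization of the release-time difference:

$$D_{time}(v_t, v_i) = \frac{\Delta t(v_t, v_i) - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}$$

With this reading, the pair with the smallest time difference has $S_{time} = 1$ and the pair with the largest has $S_{time} = 0$.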

Optionally, the threat description tag similarity of two threat intelligence items $v_t$ and $v_i$ is calculated as follows:

$$S_{tag}(v_t, v_i) = \cos(X_t, X_i) = \frac{X_t \cdot X_i}{\lVert X_t \rVert \, \lVert X_i \rVert}$$

where $X_t$ and $X_i$ are the vector representations of the threat description tags of $v_t$ and $v_i$, respectively, and $\cos(X_t, X_i)$ is the cosine similarity of the two vectors, with range [0, 1]. When the tags of the two items are identical, the cosine similarity is 1; if the threat tags of both items are empty, their similarity is defined as 0.5.
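The four similarity factors can be combined in a few lines. The following is a minimal sketch, not the patent's implementation: the `Intel` record and its field names are hypothetical, the equal weights are placeholders, the time distance is assumed to be min-max normalized, and the handling of a single empty tag set is an assumption.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Intel:
    """Hypothetical structured intelligence item (field names are assumptions)."""
    source: str
    category: str
    timestamp: float                          # release time, e.g. Unix seconds
    tags: dict = field(default_factory=dict)  # tag -> non-negative weight


def tag_similarity(x, y):
    """Cosine similarity of two tag vectors; 0.5 when both are empty."""
    if not x and not y:
        return 0.5
    dot = sum(w * y.get(tag, 0.0) for tag, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: exactly one empty tag set means no overlap
    return dot / (norm_x * norm_y)


def time_similarity(dt, dt_min, dt_max):
    """1 - D_time, with D_time assumed to be min-max normalized."""
    if dt_max == dt_min:
        return 1.0  # assumption: all time differences are equal
    return 1.0 - (dt - dt_min) / (dt_max - dt_min)


def similarity(a, b, dt_min, dt_max, theta=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of source, time, category and tag similarity."""
    if abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    s_source = 1.0 if a.source == b.source else 0.0
    s_category = 1.0 if a.category == b.category else 0.0
    s_time = time_similarity(abs(a.timestamp - b.timestamp), dt_min, dt_max)
    s_tag = tag_similarity(a.tags, b.tags)
    return (theta[0] * s_source + theta[1] * s_time
            + theta[2] * s_category + theta[3] * s_tag)
```

Here `dt_min` and `dt_max` are the smallest and largest release-time differences over the set of items being compared, so they would be computed once per comparison set rather than per pair.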

Optionally, each threat intelligence item has two base credibility scores, one based on intelligence source quality and one based on intelligence content, and its comprehensive credibility is evaluated as the weighted sum of its base credibility scores:

$$S(n) = \sum_{i} w_i \, s_i(n)$$

where $S(n)$ is the comprehensive credibility score of the n-th intelligence sample, $s_i(n)$ is the i-th base credibility score of that sample, and $w_i$ is the weight of the factor $s_i(n)$.

Optionally, identification and extraction is performed by an open source intelligence data identification and extraction module, which comprises a content preprocessor, a relationship selector, a relationship checker, and an IOC generator.

The content preprocessor examines the source intelligence content using NLP topic analysis, selects the source articles related to security topics, and filters out content that contains no IOC (indicator of compromise). The relationship selector identifies IOC information that the source content may contain, mainly using NER to locate sentences that may contain IOC terms and analyzing the relationships between IOC entities with the Stanford parser. The relationship checker checks each IOC item: during detection and identification it converts the IOC entity relationships into a dependency graph and mines that graph to determine whether an IOC relationship exists between an IOC entity and an IOC candidate. The IOC generator uses the tagged content to automatically create the header and definition components according to the OpenIOC standard, including all indicator entries for the identified IOCs.

Optionally, the normalized storage includes:

labeling the extracted IOC information with its quality evaluation score and associating it with the vulnerability information by collision, thereby expanding the dimensions of the threat intelligence with vulnerability information, quality evaluation score, attack method, attacker attributes, assets, and activity traces, and packaging and outputting the analyzed and assessed threat intelligence data in a normalized form.

The invention has the beneficial effects that:

The invention explores and implements an authoritative, domestically leading advanced IoT threat intelligence platform that combines high-level and low-level intelligence and provides capabilities such as attack event reconstruction and continuous threat intelligence output.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a schematic diagram of an open source threat intelligence aggregation platform architecture of the present invention;

FIG. 2 is a flowchart of vulnerability source analysis of the Internet of things;

FIG. 3 is a logic diagram of a vulnerability analysis flow of the Internet of things;

FIG. 4 is a multi-source heterogeneous intelligence data acquisition framework;

FIG. 5 is a schematic diagram of an open source intelligence data identification and extraction structure;

FIG. 6 is a flow chart of quality assessment of an intelligence source;

FIG. 7 is a flow chart of an intelligence content quality level rating function;

FIG. 8 is a logic diagram of a threat intelligence statistical analysis function;

FIG. 9 is a logic diagram of a process for preliminary analysis of threat intelligence;

FIG. 10 is a logic diagram of a threat intelligence detection model generation flow.

Detailed Description

In order to make the technical problems solved, technical solutions adopted and technical effects achieved by the present invention clearer, the technical solutions of the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of embodiments of the invention, but not all embodiments. All other solutions, which can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present invention, belong to the protection scope of the present invention.

The invention is described in detail below with reference to the drawings and the detailed description.

The embodiment is set against research on IoT open source threat intelligence extraction and correlation analysis. Combined with machine-learning-based attack behavior detection and analysis, it analyzes and assesses attack threats that may occur, so that the security of mainstream IoT systems can be evaluated. It specifically relates to an open source threat intelligence aggregation platform comprising three functional modules: a multi-source heterogeneous intelligence data acquisition module, a multi-source heterogeneous intelligence data fusion evaluation module, and a threat intelligence deep mining module. The work covers four aspects: multi-source heterogeneous intelligence data acquisition, multi-source heterogeneous intelligence fusion evaluation, statistical analysis of threat intelligence data, and deep mining analysis of threat intelligence, as shown in FIG. 1.

Part 1: Multi-source heterogeneous intelligence data acquisition module

Traditional threat intelligence acquisition generally follows fixed channels and relies mainly on past network attack data extracted from security devices, for example log data generated inside an enterprise network or by detection devices and high-interaction honeypots deployed at terminals; most threat intelligence comes from threat data produced by subscribed security vendors and industry organizations. As network attacks grow rapidly in number and complexity, these traditional, internally oriented collection methods show defects such as few acquisition sources, atomicity (for example, IPs and hashes), and instability. Multi-source heterogeneous data acquisition for open source threat intelligence offers an effective new way to overcome these inherent defects. The acquisition work has two main functions: an IoT vulnerability aggregation platform and an IoT threat intelligence mining platform. The IoT vulnerability aggregation platform designs a normalized data structure for IoT security vulnerabilities, studies a content extraction algorithm for vulnerability websites, and builds an open source IoT vulnerability aggregation platform. The IoT threat intelligence mining platform designs automated crawling and parsing techniques to collect and aggregate multiple kinds of intelligence-related data, including security-related social media accounts, vulnerability PoC (proof of concept) disclosure platforms, and cutting-edge security scientific and technical literature.

(1) IoT vulnerability aggregation platform

The IoT vulnerability aggregation platform uses techniques such as dynamic crawling, update detection, and similarity checking to dynamically capture and collect vulnerability data for IoT products currently popular on the internet, which keeps the IoT vulnerability aggregation database timely and complete and gives IoT security analysts a comprehensive, accurate, and detailed vulnerability database. The platform crawls a number of well-known domestic and foreign vulnerability databases, such as Secunia, Exploit-DB, OSVDB, and SecurityFocus, as data sources from which vulnerabilities are extracted and downloaded. Vulnerability source websites generally publish vulnerabilities in XML/HTML format, but because there are many data sources and each vulnerability database uses a different description format, the traditional download approach based on hard-coded extraction rules cannot handle vulnerability pages with differing format specifications.

The basic descriptive data in a typical vulnerability database is limited. Taking the vulnerability information published by Exploit-DB as an example, an entry generally contains only brief information such as ID, author, and platform, which is not enough for the subsequent correlation of threat intelligence with traffic. PoC code, by contrast, generally contains rich threat behavior content, so web page information extraction techniques, using regular expressions combined with the XML Path Language (XPath), can be applied to extract the relevant threat content from the PoC code in each vulnerability database entry. The specific operation flow is shown in FIG. 2.

XML/HTML documents are built from tags nested layer by layer. Named entity recognition (NER) can be used to identify the content to be extracted, and structured modeling of the page as a Document Object Model (DOM) yields the DOM tree of the vulnerability information page, whose nodes can then be edited and otherwise manipulated through the API that the DOM provides. XPath is then combined with different regular expression rules to construct a specific XPath expression for the vulnerability publishing page of each vulnerability database; each matching node in the document tree is converted to a string, and the strings are filtered according to a configuration file to obtain the required vulnerability information.

During node matching, extraction, and filtering, NER can be used to screen fields such as keywords, vulnerability names, vulnerability types, and vulnerability descriptions, and to extract the entries that concern IoT vulnerabilities. Regular expressions are used at the same time to extract the attributes of each vulnerability, which makes it easier for administrators to organize and maintain the vulnerability information. In addition, because the publishing page formats and vulnerability descriptions differ between vulnerability databases, a corresponding XPath-based configuration file can be written for each database so that XML/HTML documents in different formats can be matched. The flow logic is shown in FIG. 3.
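The per-site configuration idea can be illustrated as follows. This is a minimal sketch under stated assumptions: the sample page, the `SITE_CONFIG` layout, the field names, and the keyword pattern are all hypothetical, and a simple keyword regex stands in for the NER-based screening. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a vulnerability listing page.
PAGE = """<html><body>
<table id="vuln">
<tr><td class="id">50123</td><td class="title">Acme IP Camera 1.2 - Remote Command Execution</td><td class="platform">hardware</td></tr>
<tr><td class="id">50124</td><td class="title">ExampleCMS 3.0 - SQL Injection</td><td class="platform">php</td></tr>
</table></body></html>"""

# Hypothetical per-site configuration: XPath expressions plus a filter regex.
SITE_CONFIG = {
    "row": ".//table[@id='vuln']/tr",
    "fields": {
        "id": "td[@class='id']",
        "title": "td[@class='title']",
        "platform": "td[@class='platform']",
    },
    # Keyword filter standing in for NER-based screening of IoT entries.
    "iot_pattern": r"(?i)\b(camera|router|firmware|iot)\b",
}


def extract(page, config):
    """Match nodes with the site's XPath rules, then keep IoT-related rows."""
    root = ET.fromstring(page)
    keep = re.compile(config["iot_pattern"])
    records = []
    for row in root.findall(config["row"]):
        record = {
            name: (row.findtext(path) or "").strip()
            for name, path in config["fields"].items()
        }
        if keep.search(record["title"]):
            records.append(record)
    return records


records = extract(PAGE, SITE_CONFIG)
```

Supporting another vulnerability database then means adding another configuration dictionary rather than new extraction code; real-world HTML that is not well-formed would need a more tolerant parser than the one used here.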

(2) Acquisition of open source threat intelligence

With reference to FIG. 4, which shows the overall acquisition framework, open source intelligence acquisition centers on automated crawling and parsing techniques that collect and aggregate multiple kinds of intelligence-related data, including security-related social media accounts and cutting-edge security scientific and technical literature. The intelligence sources crawled by the platform include social media accounts (bloggers writing about network security) and blogs. Because open source intelligence acquisition involves many media platforms whose characteristics and web page structures differ, a crawler strategy adapted to the page structure of each specific media website is written during development and implementation.

Part 2: Multi-source heterogeneous intelligence data fusion evaluation module

Most existing open source threat intelligence is multi-source and heterogeneous, and its quality is uneven; this hinders its storage and sharing and can cause problems such as missed detections and false alarms when it is applied to detection in real scenarios. The work on multi-source heterogeneous intelligence data fusion has three parts: identification and extraction, fusion evaluation, and normalized storage of intelligence data. Identification and extraction automatically extracts and generates structured threat intelligence. Fusion evaluation provides a data fusion method and a quality evaluation mechanism for screening high-quality open source threat intelligence that meets practical needs such as threat detection. Normalized storage expands the information dimensions of the intelligence and improves its usability. Each part is briefly described below.

(1) Open source intelligence data identification and extraction

This function automatically extracts and generates structured threat intelligence, removing the manual steps of analysis and intelligence collation, which is of great value for improving security capabilities.

As shown in FIG. 5, the open source intelligence data identification and extraction module comprises a content preprocessor, a relationship selector, a relationship checker, and an IOC generator. The content preprocessor examines the source intelligence content using NLP topic analysis, selects the source articles related to security topics, and filters out content that contains no IOC (indicator of compromise). The relationship selector identifies IOC information that the source content may contain, mainly using NER to locate sentences that may contain IOC terms and analyzing the relationships between IOC entities with the Stanford parser. The relationship checker checks each IOC item: during detection and identification it converts the IOC entity relationships into a dependency graph and mines that graph to determine whether an IOC relationship exists between an IOC entity and an IOC candidate. The IOC generator uses the tagged content to automatically create the header and definition components according to the OpenIOC standard, including all indicator entries for the identified IOCs.

The workflow of open source intelligence data identification and extraction mainly comprises the following steps:

(1) preprocessing the acquired threat information to generate text source articles;

(2) performing topic analysis on the text source articles and selecting those related to security topics;

(3) using NER (named entity recognition) to select, from the security-related articles, the security sentences related to IOCs (indicators of compromise);

(4) analyzing, with the Stanford parser, whether each security sentence is an IOC entry and, if it is, performing IOC relationship mining on the sentence;

(5) standardizing the security sentences to generate standard IOC data.
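The shape of this workflow can be sketched with deliberately simple stand-ins. In the sketch below, which is an illustration under stated assumptions rather than the patent's method, a keyword check replaces topic analysis, regular expressions replace NER and the dependency-graph check, and a plain dictionary replaces the OpenIOC document; the term list, patterns, and output fields are all hypothetical.

```python
import re

# Hypothetical security vocabulary standing in for topic analysis.
SECURITY_TERMS = {"malware", "botnet", "exploit", "c2", "vulnerability"}

# Hypothetical IOC patterns standing in for NER-based IOC recognition.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(
        r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org)\b", re.I
    ),
}


def is_security_article(text):
    """Step (2): keep only articles that mention a security term."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(words & SECURITY_TERMS)


def extract_iocs(text):
    """Steps (3)-(5): find IOC candidates per sentence and emit records."""
    if not is_security_article(text):
        return []
    entries = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for kind, pattern in IOC_PATTERNS.items():
            for value in pattern.findall(sentence):
                entries.append(
                    {"type": kind, "value": value, "context": sentence}
                )
    return entries


article = (
    "The botnet contacts 203.0.113.7 on port 23. "
    "Its dropper is hosted at evil-updates.example.com today."
)
iocs = extract_iocs(article)
```

Keeping the sentence as `context` preserves the text that a later relationship check would need in order to decide whether the candidate is really used as an indicator.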

(2) Threat intelligence quality assessment

When open source threat intelligence is used to support security decisions or analysis, its credibility and usability directly affect the results, so screening and evaluating intelligence quality is particularly important. A threat intelligence quality evaluation method based on intelligence sources and intelligence content is proposed. The method divides the credibility assessment of threat intelligence into two logically ordered stages: basic credibility assessment and comprehensive credibility assessment. Basic credibility comprises an assessment based on the quality of the intelligence source and an assessment based on the intelligence content. Comprehensive credibility combines the basic credibility scores of the source and the content to evaluate the overall quality of the threat intelligence.

First, the multi-source heterogeneous threat intelligence is given a structured definition. The structured threat intelligence then receives an intelligence source quality assessment score and an intelligence content quality level grading score. Finally, a comprehensive dynamic assessment of intelligence quality is carried out by combining the source and the content. See FIGS. 6 and 7.

a. Structured definition of threat intelligence

For multi-source heterogeneous threat intelligence data, specified attributes are extracted to form structured threat intelligence, which is written to a file for storage so that the subsequent quality evaluation can use it. This function integrates threat data from different intelligence sources and different structural standards into structured threat intelligence. The flow logic is: 1. read the file; 2. traverse each record and extract the specified attributes; 3. add the result to the structured intelligence list.
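The three-step flow can be sketched as below. This is a minimal sketch assuming one JSON object per input line; the attribute list and the sample record are hypothetical, since the text does not specify which attributes are extracted.

```python
import json

# Hypothetical set of attributes kept in the structured form.
ATTRIBUTES = ("value", "type", "source", "timestamp", "category", "tags")


def structure_records(lines, attributes=ATTRIBUTES):
    """Read raw records, keep only the specified attributes, collect a list."""
    structured = []
    for line in lines:                # 1. read the input line by line
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)        # 2. traverse each record ...
        structured.append(            # ... and extract the specified attributes
            {name: raw.get(name) for name in attributes}
        )                             # 3. add to the structured list
    return structured


sample = [
    '{"value": "203.0.113.7", "type": "ipv4", "source": "blogA", "extra": 1}',
    "",
]
items = structure_records(sample)
```

Attributes missing from a source record come out as `None`, so records from sources with different structures still share one schema; an open file object can be passed in place of the list.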

b. Information source quality assessment score

This step constructs a relationship graph of threat intelligence sources and iteratively calculates the quality assessment score of each intelligence source.

The algorithm is as follows:

The flow logic: 1. read the structured threat intelligence; 2. determine the reference relationships between intelligence sources from attributes such as intelligence values and timestamps; 3. iteratively calculate the authority value and hub value of each intelligence source until convergence; 4. calculate the intelligence source quality assessment score.
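Iterating authority and hub values over a reference graph until convergence matches the general shape of the HITS algorithm, so steps 3 and 4 can be sketched as below. This is an illustrative sketch, not the patented procedure: the edge representation, the L2 normalisation, the convergence tolerance and the way the authority value is converted to a percentage (scaling so that the best source scores 100) are all assumptions.

```python
import math


def hits(edges, max_iter=100, tol=1e-9):
    """Authority and hub values for a directed reference graph.

    edges: iterable of (citing_source, cited_source) pairs.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Authority: sum of the hub values of the sources that reference this one.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub: sum of the updated authority values of the sources this one references.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # L2-normalise both vectors so the values stay bounded.
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
        delta = max(abs(new_auth[n] - auth[n]) for n in nodes)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub


def source_quality_scores(auth):
    """Assumed percentage conversion: the highest authority maps to 100."""
    top = max(auth.values()) or 1.0
    return {n: 100.0 * a / top for n, a in auth.items()}


# Hypothetical reference relationships: b and c reference a, c references b.
auth, hub = hits([("b", "a"), ("c", "a"), ("c", "b")])
scores = source_quality_scores(auth)
```

In this toy graph, source `a` is referenced most and receives the highest score, while `c`, which nobody references, scores 0.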

c. Intelligence content quality level grading and scoring

This step extracts intelligence features and grades the intelligence quality level with a machine learning algorithm. The algorithm comprises the threat intelligence source similarity calculation and the KNN algorithm.

(1) Calculating the similarity of the source of the threat intelligence;

$$S(v_t, v_i) = \theta_1 S_{source} + \theta_2 S_{time} + \theta_3 S_{category} + \theta_4 S_{tag}$$

$$\theta_1 + \theta_2 + \theta_3 + \theta_4 = 1$$

In the formula, $S(v_t, v_i)$ is the similarity of two pieces of threat intelligence, $S_{source}$ is the intelligence source similarity, $S_{time}$ is the time similarity, $S_{category}$ is the threat category similarity, and $S_{tag}$ is the threat description tag similarity; $\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$ are the weights of these four factors, respectively. The intelligence source similarity and the threat category similarity of two pieces of threat intelligence are calculated as follows: when the sources of the two pieces of threat intelligence are the same, $S_{source}$ takes 1, otherwise it takes 0; $S_{category}$ takes its value in the same way.

To calculate the time similarity of two pieces of threat intelligence, the difference between their release times is calculated first and then normalized to obtain a time distance between 0 and 1, from which the time similarity is obtained.

The time similarity of two pieces of threat intelligence $v_t$ and $v_i$ is calculated as follows:

$$\Delta t(v_t, v_i) = |t(v_t, v_i)|$$

$$Min = \min(\Delta t(v_t, v_i))$$

$$Max = \max(\Delta t(v_t, v_i))$$

$$D_{time}(v_t, v_i) = \frac{\Delta t(v_t, v_i) - Min}{Max - Min}$$

$$S_{time}(v_t, v_i) = 1 - D_{time}(v_t, v_i)$$
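A small worked sketch of the time similarity follows. It assumes that the normalization is a min-max normalization of the release time differences between the target item and the other items, that timestamps are given in seconds, and that all items are treated as equally close when every difference is the same; the function name is hypothetical.

```python
def time_similarities(target_time, other_times):
    """Return S_time between a target item and each of the other items."""
    deltas = [abs(target_time - t) for t in other_times]
    lo, hi = min(deltas), max(deltas)
    if hi == lo:
        # Assumed convention: identical differences give distance 0, similarity 1.
        return [1.0 for _ in deltas]
    # Min-max normalised distance D_time in [0, 1], then S_time = 1 - D_time.
    return [1.0 - (d - lo) / (hi - lo) for d in deltas]


# Differences of 0 s, 600 s and 1200 s relative to the target.
sims = time_similarities(1000, [1000, 1600, 2200])
```

With these inputs the normalised distances are 0, 0.5 and 1, so the item released at the same time as the target gets similarity 1.0 and the most distant item gets 0.0.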

Cosine similarity is used to calculate the similarity between the words of the threat tags; that is, the cosine of the angle between two vectors in space measures the difference between two individuals. The closer the cosine is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. The threat description tag similarity of two pieces of threat intelligence $v_t$ and $v_i$ is calculated as follows:

$$S_{tag}(v_t, v_i) = \cos(X_t, X_i) = \frac{X_t \cdot X_i}{\lVert X_t \rVert \, \lVert X_i \rVert}$$

In the formula, $X_t$ and $X_i$ are the vector representations of the threat description tags of threat intelligence $v_t$ and $v_i$, respectively, and $\cos(X_t, X_i)$ is the cosine similarity of the two vectors, with a value range of [0, 1]. When the tags of the two pieces of threat intelligence are identical, the cosine similarity is 1; if the threat tags of both pieces of threat intelligence are empty, their similarity is set to 0.5.
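The tag similarity and the weighted combination of the four similarities can be sketched as follows. The bag-of-words vectorisation of the tags, the value 0.0 when only one tag set is empty, and the equal default weights are assumptions made for illustration; the source only fixes the value 0.5 for two empty tag sets and the requirement that the weights sum to 1.

```python
import math
from collections import Counter


def tag_similarity(tags_t, tags_i):
    """Cosine similarity of bag-of-words tag vectors; 0.5 when both are empty."""
    if not tags_t and not tags_i:
        return 0.5  # value given in the text for two empty tag sets
    x_t, x_i = Counter(tags_t), Counter(tags_i)
    dot = sum(x_t[w] * x_i[w] for w in x_t)
    norm = math.sqrt(sum(c * c for c in x_t.values())) * math.sqrt(
        sum(c * c for c in x_i.values())
    )
    # Assumption: exactly one empty tag set yields similarity 0.
    return dot / norm if norm else 0.0


def overall_similarity(s_source, s_time, s_category, s_tag,
                       thetas=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four similarities; the weights must sum to 1."""
    if abs(sum(thetas) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = (s_source, s_time, s_category, s_tag)
    return sum(theta * s for theta, s in zip(thetas, parts))


# Same source, half-similar in time, different category, both tag sets empty.
s = overall_similarity(1, 0.5, 0, tag_similarity([], []))
```

In practice the weights would be tuned rather than left equal; the equal split is only a placeholder.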

Let $R_{pos}$ (support) and $R_{neg}$ (objection) denote the results of threat intelligence verification. Using the optimal parameter $k$ obtained by experiment, when $S(v_t, v_i) \ge k$, the other intelligence $v_i$ expresses a supporting attitude toward $v_t$; conversely, when $S(v_t, v_i) < k$, the other intelligence $v_i$ expresses an opposing attitude.

Suppose that, for a target piece of threat intelligence $v_t$, all verifiable intelligence can be collected, that is, all intelligence whose IP value is the same as that of $v_t$, and that its number is $N_{verify}$. If the number of pieces of intelligence that express support after verification is $\sum R_{pos}$, then the proportion of supporting intelligence is $\sum R_{pos} / N_{verify}$.
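The multi-source verification step can be illustrated with a short sketch. The function name and the convention of returning 0.0 when no verifiable intelligence exists are assumptions; the threshold `k` stands for the experimentally obtained parameter, whose value the source does not state.

```python
def support_ratio(similarities, k):
    """Share of verifiable items whose similarity to the target is at least k.

    similarities: S(v_t, v_i) for every item v_i sharing the target's IP value.
    """
    n_verify = len(similarities)
    if n_verify == 0:
        return 0.0  # assumed convention when nothing can be verified
    r_pos = sum(1 for s in similarities if s >= k)  # supporting items
    return r_pos / n_verify


# Four verifiable items, two of which reach the (hypothetical) threshold 0.5.
ratio = support_ratio([0.9, 0.7, 0.2, 0.4], k=0.5)
```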

Quality evaluation indexes based on the intelligence content are then constructed, covering multiple dimensions such as content richness, multi-source verification of content, and timeliness:

TABLE 1 Summary of evaluation indexes based on intelligence content

The preprocessed feature data is used as the input of a classifier. The KNN algorithm divides the intelligence samples into five classes according to credibility, and each class is assigned a content-based intelligence quality score $s_{content}$.
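A minimal K-nearest-neighbours sketch of this classification step is shown below. The Euclidean distance, the value of `k`, the two-dimensional toy features and the mapping from class to score in `CLASS_SCORES` are all assumptions for illustration; the source states only that five credibility classes are used and that each receives a score.

```python
import math
from collections import Counter

# Hypothetical mapping from credibility class to content score; not from the source.
CLASS_SCORES = {1: 20, 2: 40, 3: 60, 4: 80, 5: 100}


def knn_classify(train_x, train_y, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy features, e.g. (content richness, support ratio), already scaled to [0, 1].
train_x = [(0.9, 0.9), (0.8, 1.0), (1.0, 0.8), (0.1, 0.2), (0.2, 0.1), (0.0, 0.0)]
train_y = [5, 5, 5, 1, 1, 1]

label = knn_classify(train_x, train_y, (0.85, 0.9))
s_content = CLASS_SCORES[label]
```

A production system would more likely rely on an established library implementation and on labelled samples for all five classes; the hand-written version only makes the voting logic explicit.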

d. Comprehensive evaluation of intelligence quality

This step combines the intelligence source and the intelligence content to evaluate threat intelligence quality comprehensively. The flow logic: 1. read the intelligence source quality assessment score and the intelligence content assessment score; 2. determine the weights of the two scores by the coefficient of variation method; 3. calculate the comprehensive intelligence quality assessment score.

The algorithm is as follows:

For each piece of threat intelligence, the basic credibility consists of the credibility based on intelligence source quality and the credibility based on intelligence content, and the overall credibility is evaluated as the weighted sum of these basic credibility scores, as shown in the following formula:

$$S(n) = \sum_{i} w_i \, s_i(n)$$

where $S(n)$ is the comprehensive credibility score of the $n$-th intelligence sample, $s_i(n)$ is the $i$-th basic credibility score of that sample, and $w_i$ is the weight of factor $s_i(n)$. For example, $s_1(1)$ is the credibility score $s_{source}$ of the first intelligence sample based on intelligence source quality, and $s_2(1)$ is its credibility score $s_{content}$ based on intelligence content. The values of the weights $w_i$ need to be adjusted according to the actual situation.

The weight $w_i$ of each factor can be determined by the coefficient of variation method. This is an objective weighting method that uses the information contained in each index directly and obtains the index weights by calculation. Because the indexes in the evaluation system have different dimensions, their degrees of variation cannot be compared directly. To eliminate the influence of the different dimensions, the coefficient of variation of each index is used to measure how much its values vary. The coefficient of variation can determine the weights because, in an evaluation system, a factor whose values differ widely is hard to attain and is a key factor in reflecting the differences between the evaluated objects; such a factor is therefore given a higher weight.

The coefficient of variation is numerically equal to the quotient of the standard deviation and the mean, i.e.

$$v_i = \frac{\sigma_i}{\bar{x}_i}$$

where $v_i$ is the coefficient of variation of the $i$-th basic credibility score, $\sigma_i$ is the standard deviation of the $i$-th basic credibility score, and $\bar{x}_i$ is the mean of the $i$-th basic credibility score.

The weight of each factor is equal to the coefficient of variation of that factor divided by the sum of the coefficients of variation of all factors, i.e.

$$w_i = \frac{v_i}{\sum_{j} v_j}$$

In summary, both the intelligence source quality and the intelligence content influence the credibility of intelligence, and the importance of each factor to the comprehensive credibility score must be taken into account. A weighted average model is therefore used to calculate a comprehensive credibility score from the credibility score based on intelligence source quality and the credibility score based on intelligence content. The score is finally divided into five grades: A (80-100), B (60-80), C (40-60), D (20-40), E (0-20).
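The weighting and grading procedure can be sketched end to end as follows. The use of the population standard deviation, the sample values, and the rule that a boundary score such as 80 falls into the higher grade are assumptions, since the source does not specify them; the sketch also assumes non-zero means and at least one factor with non-zero variation.

```python
import statistics


def cv_weights(score_columns):
    """Weights w_i = v_i / sum(v), where v_i = standard deviation / mean."""
    cvs = [statistics.pstdev(col) / statistics.mean(col) for col in score_columns]
    total = sum(cvs)
    return [cv / total for cv in cvs]


def composite_scores(score_columns):
    """Weighted sum of the basic credibility scores for every sample."""
    weights = cv_weights(score_columns)
    n_samples = len(score_columns[0])
    return [
        sum(w * col[j] for w, col in zip(weights, score_columns))
        for j in range(n_samples)
    ]


def grade(score):
    """Map a 0-100 score to A-E; boundary values go to the higher grade (assumption)."""
    for letter, lower in (("A", 80), ("B", 60), ("C", 40), ("D", 20)):
        if score >= lower:
            return letter
    return "E"


# Hypothetical scores for three intelligence samples.
s_source = [60, 80, 100]   # credibility based on intelligence source quality
s_content = [70, 80, 90]   # credibility based on intelligence content
weights = cv_weights([s_source, s_content])
scores = composite_scores([s_source, s_content])
grades = [grade(s) for s in scores]
```

Because the source-based scores vary twice as much as the content-based scores in this example, the source factor receives roughly two thirds of the weight.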

(3) Normalized storage output

This part mainly tags the extracted IOC information with quality assessment scores and performs collision association with vulnerability information. In the execution of specific projects, the dimensions that open source intelligence can cover and the scenarios and elements of threat attacks are studied comprehensively in order to expand the dimensions of the threat intelligence. The main added dimensions are vulnerability information, quality assessment score, attack technique, attacker attributes, affected assets and activity traces, and the analyzed threat intelligence data is packaged and output in a normalized form. This information serves as a standardized intelligence format that provides input clues for the center's subsequent threat association mining based on the existing threat knowledge base and traffic context.

a. Tagging IOCs with quality scores

The calculated quality assessment score is tagged onto the extracted IOC information. The flow logic: 1. read the comprehensive intelligence quality assessment score; 2. write the score into the IOC information.

b. Collision association with vulnerability information

This step studies the dimensions that open source intelligence can cover and the scenarios and elements of threat attacks in order to expand the dimensions of the threat intelligence. It mainly adds dimensions such as vulnerability information, quality assessment score, attack technique, attacker attributes, affected assets and activity traces, and packages and outputs the analyzed threat intelligence data in a normalized form. The flow logic: 1. read the quality assessment score, vulnerability information, attack techniques, attacker attributes, affected assets, activity traces and similar information; 2. write the enrichment information into the intelligence; 3. package the threat intelligence data in normalized form.
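The tagging, collision association and normalized packaging steps can be sketched as below. The source does not define how the collision with vulnerability information is performed; this sketch assumes a simple lookup of the IOC value in a vulnerability index. All field names, the function names and the placeholder identifier `CVE-XXXX-0001` are hypothetical.

```python
import copy
import json

# Hypothetical field names for the added dimensions; the source names the dimensions only.
CONTEXT_FIELDS = ("attack_techniques", "attacker_attributes",
                  "affected_assets", "activity_traces")


def enrich_ioc(ioc, quality_score, vulnerability_index, context=None):
    """Return a copy of the IOC tagged with its score and the added dimensions."""
    enriched = copy.deepcopy(ioc)
    enriched["quality_score"] = quality_score
    # Assumed "collision": look the IOC value up in the vulnerability index.
    enriched["vulnerabilities"] = list(vulnerability_index.get(ioc["value"], []))
    for field in CONTEXT_FIELDS:
        enriched[field] = list((context or {}).get(field, []))
    return enriched


def normalize(enriched):
    """Serialize with sorted keys so every record has the same layout."""
    return json.dumps(enriched, sort_keys=True, ensure_ascii=False)


ioc = {"value": "198.51.100.7", "type": "ip"}
index = {"198.51.100.7": ["CVE-XXXX-0001"]}  # placeholder identifier
enriched = enrich_ioc(ioc, 72.5, index, {"affected_assets": ["camera"]})
packaged = normalize(enriched)
```

Working on a deep copy leaves the extracted IOC untouched, so the same IOC can be re-scored later without losing the original record.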

The third part: threat intelligence deep mining module

(1) Open source threat intelligence statistical analysis

Comprehensive statistical analysis of the quantity, distribution, frequency and other properties of each type of threat intelligence is carried out across dimensions such as time, space and attributes, giving a preliminary grasp of the historical traceability and trend prediction of security threats. In the implementation, the normalized and packaged data is stored in an SQL database; for display, the data is read from the tables and returned to the front-end page.

The functional logic: communication with the target node is attempted by continuously sending DHT find_node messages to it. The returned information may be a configuration file sent by the Mozi node (collected uniformly), or the routing table stored by the current node, which contains information about other Mozi nodes (IP address, port number, node ID, etc.). The returned information is parsed to judge whether the target node is a Mozi node, and find_node messages are then sent to further nodes according to the routing information. The functional logic diagram is shown in FIG. 8: multiple DHT find_node requests are sent to a target node; whether the target node is a Mozi node is judged from the flag field; and the information returned by the target node is parsed.
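The message handling in this logic can be illustrated with a sketch that builds a find_node query and parses the routing information in a reply. It assumes the standard BitTorrent DHT (KRPC) encoding, in which messages are bencoded and each returned node occupies 26 bytes (20-byte node ID, 4-byte IPv4 address, 2-byte port). Network I/O and the check of the Mozi flag field are deliberately omitted, because the source does not describe the flag format; all function names are hypothetical.

```python
import socket
import struct


def bencode(obj):
    """Minimal bencoder for ints, bytes/str, lists and dicts."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, str):
        obj = obj.encode()
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        items = [(k.encode() if isinstance(k, str) else k, v) for k, v in obj.items()]
        items.sort(key=lambda kv: kv[0])  # bencoded dict keys must be sorted
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(obj).__name__}")


def find_node_query(node_id, target_id, transaction_id=b"aa"):
    """Build a KRPC find_node query for the given 20-byte IDs."""
    return bencode({
        "t": transaction_id,
        "y": "q",
        "q": "find_node",
        "a": {"id": node_id, "target": target_id},
    })


def parse_compact_nodes(blob):
    """Split the 'nodes' value of a reply into (node_id, ip, port) tuples."""
    nodes = []
    for off in range(0, len(blob) - len(blob) % 26, 26):
        node_id = blob[off:off + 20]
        ip = socket.inet_ntoa(blob[off + 20:off + 24])
        (port,) = struct.unpack("!H", blob[off + 24:off + 26])
        nodes.append((node_id, ip, port))
    return nodes


query = find_node_query(b"abcdefghij0123456789", b"mnopqrstuvwxyz123456")
sample = b"A" * 20 + socket.inet_aton("192.0.2.1") + struct.pack("!H", 6881)
neighbours = parse_compact_nodes(sample)
```

Each tuple returned by `parse_compact_nodes` is a candidate for the next round of find_node requests, which is how the traversal follows the routing information.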

(2) Open source threat intelligence correlation analysis

With reference to FIG. 9 and FIG. 10, statistical analysis of the quantity, distribution, frequency and other properties of each type of threat data is carried out across dimensions such as time, space and attributes, and techniques for historical tracing and trend prediction of security threats are studied. A deep mining technique based on traffic-associated threat intelligence is also studied: combining the information in the Internet of Things vulnerability aggregation base and the threat intelligence database, relevant models are studied and applied, a fuzzy pattern matching model is constructed from the center's existing threat data and traffic information, and a traffic association matching algorithm (such as the algorithm described in patent application No. 2021110988474) is designed. This enables the mining of potential attack behavior and reveals hidden threat information such as attack chains through inference mining.

a. Preliminary collection and analysis of threat intelligence (FIG. 9)

Because the processing resources for threat intelligence are limited, the amount of threat intelligence that can be analyzed at one time is fixed. To use resources reasonably, the collection of threat intelligence from multiple intelligence sources is scheduled based on freshness. When threat intelligence needs to be analyzed, the following scheduling algorithm is used: weights are set according to the credibility of the intelligence sources, a priority formula is designed for scheduling, and in each scheduling round intelligence sources are selected according to the priority formula to obtain threat intelligence, until the required amount of threat intelligence has been collected. The scheduling algorithm is ultimately judged by whether the threat intelligence obtained through scheduling reflects the characteristics of current mainstream threat intelligence well.
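The scheduling idea can be sketched as follows. The source does not give the priority formula, so the one used here (credibility weight multiplied by a freshness term that decays with the age of the source's latest update) is purely an assumption, as are the field names and the function name.

```python
def schedule(sources, capacity, now):
    """Pick sources in priority order until `capacity` items are collected.

    sources: list of dicts with 'name', 'credibility' (0-1),
             'last_update' (epoch seconds) and 'pending' (items available).
    """
    def priority(src):
        age_hours = max(now - src["last_update"], 0) / 3600.0
        freshness = 1.0 / (1.0 + age_hours)  # newer updates score closer to 1
        return src["credibility"] * freshness  # hypothetical priority formula

    plan, remaining = [], capacity
    for src in sorted(sources, key=priority, reverse=True):
        if remaining <= 0:
            break
        take = min(src["pending"], remaining)
        if take > 0:
            plan.append((src["name"], take))
            remaining -= take
    return plan


sources = [
    {"name": "A", "credibility": 0.9, "last_update": 7200, "pending": 5},
    {"name": "B", "credibility": 0.5, "last_update": 7200, "pending": 10},
    {"name": "C", "credibility": 0.9, "last_update": 0, "pending": 10},
]
plan = schedule(sources, capacity=8, now=7200)
```

Here source C is as credible as A but two hours stale, so it ranks below the fresher but less credible B and is not reached before the capacity of 8 items is filled.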

b. Generating a threat detection model (FIG. 10)

The threat intelligence obtained by scheduling is then analyzed further using multi-step feature change detection. Through the scheduling algorithm, the data center can determine priorities by identifying the samples collected by the different intelligence sources of the intelligent devices. In the training module, the detection system learns the statistical characteristics of threats and trains models that can identify similar threats. In feature change detection, feature changes are identified by comparing the feature differences of threats. Once threat intelligence has been obtained, further exploratory analysis can be performed on it, and a threat detection model can be generated from it for detecting threats. The module also monitors the characteristics of the threat intelligence and retrains the threat detection model if they change, so as to improve the threat detection rate.
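One simple way to decide when retraining is needed is to compare the average feature vector of recent intelligence with that of the data the model was trained on. The sketch below illustrates this idea only; the drift measure (relative Euclidean distance between mean vectors), the threshold of 0.2 and the function names are assumptions and do not come from the source.

```python
import math


def mean_vector(samples):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def features_changed(reference, recent, threshold=0.2):
    """True when the mean feature vector drifts by more than `threshold`
    relative to the length of the reference mean vector."""
    ref, cur = mean_vector(reference), mean_vector(recent)
    drift = math.dist(ref, cur)
    scale = math.hypot(*ref) or 1.0  # avoid division by zero
    return drift / scale > threshold


reference = [[1.0, 2.0], [1.0, 2.0]]          # features seen at training time
stable = features_changed(reference, [[1.0, 2.0], [1.1, 2.0]])
drifted = features_changed(reference, [[2.0, 4.0]])
```

When `features_changed` returns `True`, the detection model would be retrained on the recent intelligence, which corresponds to the retraining trigger described above.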

The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto, and various simple modifications can be made to the technical solution of the present disclosure within the technical idea scope of the present disclosure, and these simple modifications all belong to the scope of the present disclosure.

Claims

1. An open source threat intelligence aggregation platform, characterized by comprising: a multi-source heterogeneous intelligence data acquisition module, a multi-source heterogeneous intelligence data fusion evaluation module and a threat intelligence deep mining module;

wherein the threat intelligence deep mining module is used for constructing a fuzzy pattern matching model by combining the existing matching relationships between threat data and traffic information, and for matching the threat intelligence in the Internet of Things vulnerability aggregation base and the threat intelligence database against the traffic information, thereby mining potential attack behavior.

2. The open source threat intelligence aggregation platform of claim 1, wherein the multi-source heterogeneous intelligence data fusion evaluation module is configured to: first give the multi-source heterogeneous threat intelligence a structured definition, perform intelligence source quality assessment scoring and intelligence content quality level grading and scoring on the structured threat intelligence, and then calculate the comprehensive credibility score of the intelligence by combining the intelligence source and the intelligence content.

3. The open source threat intelligence aggregation platform of claim 2, wherein the intelligence source quality assessment scoring specifically comprises: calculating the authority value and hub value of each intelligence source, the intelligence source quality assessment score $S_{source}$ being the percentage conversion of the obtained authority value of the intelligence source;

Calculating; after obtaining new vector A, the hub value of the information source S is expressed by the formula

4. The open source threat intelligence aggregation platform of claim 2, wherein the intelligence content quality level rating assignment comprises: calculating the similarity of the threat information source;

$$S(v_t, v_i) = \theta_1 S_{source} + \theta_2 S_{time} + \theta_3 S_{category} + \theta_4 S_{tag}$$

$$\theta_1 + \theta_2 + \theta_3 + \theta_4 = 1$$

in the formula, $S(v_t, v_i)$ is the similarity of two pieces of threat intelligence, $S_{source}$ is the intelligence source similarity, $S_{time}$ is the time similarity, $S_{category}$ is the threat category similarity, and $S_{tag}$ is the threat description tag similarity; $\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$ are the weights of these four factors, respectively; the intelligence source similarity and the threat category similarity of two pieces of threat intelligence are calculated as follows: when the sources of the two pieces of threat intelligence are the same, $S_{source}$ takes 1, otherwise it takes 0; $S_{category}$ takes its value in the same way.

5. The open source threat intelligence aggregation platform of claim 4, wherein the time similarity of two pieces of threat intelligence $v_t$ and $v_i$ is calculated as follows:

$$\Delta t(v_t, v_i) = |t(v_t, v_i)|$$

$$Min = \min(\Delta t(v_t, v_i))$$

$$Max = \max(\Delta t(v_t, v_i))$$

$$D_{time}(v_t, v_i) = \frac{\Delta t(v_t, v_i) - Min}{Max - Min}$$

$$S_{time}(v_t, v_i) = 1 - D_{time}(v_t, v_i)$$

6. The open-source threat intelligence aggregation platform of claim 4, wherein the threat description tag similarity of two pieces of threat intelligence $v_t$ and $v_i$ is calculated as follows:

$$S_{tag}(v_t, v_i) = \cos(X_t, X_i) = \frac{X_t \cdot X_i}{\lVert X_t \rVert \, \lVert X_i \rVert}$$

in the formula, $X_t$ and $X_i$ are the vector representations of the threat description tags of threat intelligence $v_t$ and $v_i$, respectively, and $\cos(X_t, X_i)$ is the cosine similarity of the two vectors, with a value range of [0, 1]; when the tags of the two pieces of threat intelligence are identical, the cosine similarity is 1; if the threat tags of both pieces of threat intelligence are empty, their similarity is set to 0.5.

7. The open-source threat intelligence aggregation platform of any of claims 2-7, wherein for each piece of threat intelligence, the basic credibility consists of the credibility based on intelligence source quality and the credibility based on intelligence content, and the overall credibility is evaluated as the weighted sum of these basic credibility scores, as shown in the following formula:

$$S(n) = \sum_{i} w_i \, s_i(n)$$

wherein $S(n)$ is the comprehensive credibility score of the $n$-th intelligence sample, $s_i(n)$ is the $i$-th basic credibility score of that sample, and $w_i$ is the weight of factor $s_i(n)$.

8. The open-source threat intelligence aggregation platform according to any one of claims 1 to 7, wherein the identification extraction includes an open-source intelligence data identification extraction module, and the open-source intelligence data identification extraction module is provided with a content preprocessor, a relationship selector, a relationship checker and an IOC generator;

the content preprocessor mainly uses topic analysis, an NLP technique, to examine the source intelligence content, screening out the source articles related to security topics and filtering out content that contains no IOC; the relationship selector identifies IOC information that the source intelligence content may contain, mainly using NER to locate the sentences that may contain IOC terms and analyzing the relationships among IOC entities in combination with the Stanford parser; the relationship checker checks each item in the IOC, specifically converting the IOC entity relationships into a dependency graph during detection and identification and determining, by mining that graph, whether an IOC relationship exists between the IOC entity and the IOC candidate; the IOC generator uses the content so tagged to automatically create the title and definition components according to the OpenIOC standard, including all indicator entries of the identified IOC.

9. The open source threat intelligence aggregation platform of any one of claims 1-7, wherein the normalized storage comprises:

tagging the extracted IOC information with a quality assessment score, performing collision association with vulnerability information, expanding the dimensions of the threat intelligence by adding vulnerability information, quality assessment score, attack technique, attacker attribute, affected asset and activity trace information, and packaging and outputting the analyzed threat intelligence data in normalized form.