CN104166680B

CN104166680B - Parallel vulnerability mining method based on open source library and text mining

Info

Publication number: CN104166680B
Application number: CN201410332588.0A
Authority: CN
Inventors: 赵向辉; 刘林; 刘晖; 易锦; 陈海强; 张磊; 姚原岗; 李维杰; 偰赓; 刘彦钊
Original assignee: China Information Technology Security Evaluation Center
Current assignee: China Information Technology Security Evaluation Center
Priority date: 2014-07-12
Filing date: 2014-07-12
Publication date: 2017-05-17
Anticipated expiration: 2034-07-12
Also published as: CN104166680A

Abstract

The invention relates to a parallel vulnerability mining method based on an open source library and text mining, and belongs to the technical field of computer information security. The method comprises the steps of obtaining vulnerability data from open source libraries and preprocessing it, extracting vulnerability sets, vectorizing the text, calculating thresholds, and discovering parallel vulnerabilities. The method has the following advantages: based on the open source libraries, the vulnerability information related to the same attack pattern is extracted, which makes it convenient to analyze potential parallel relationships between vulnerabilities; the text descriptions of the vulnerabilities are vectorized, so that a computer system can process vulnerability records intelligently; unlike queries based on keyword matching, the similarity between vulnerabilities is examined against thresholds obtained from a training set; and the parallel relationships between vulnerabilities can be calculated, so that when one vulnerability is found to be exploited, the other parallel vulnerabilities can be remediated quickly, thereby remedying the weakness of the whole network and strengthening its defense capability, which is of great significance for information security.

Description

A parallel vulnerability mining method based on an open source library and text mining

Technical field

The invention belongs to the technical field of computer information security. Specifically, the invention relates to a parallel vulnerability mining method based on an open source library and text mining.

Background art

With the continuing spread and rapid development of the Internet, people use and depend on it more and more. At the same time, the economic losses caused by Internet information security problems have risen sharply, and the resulting harm has grown significantly. Vulnerabilities in software and operating systems of all kinds give hackers the conditions to launch network attacks, steal user information, and even destroy industrial infrastructure. For this reason, countries around the world are sparing no effort in studying effective methods of guarding against security risks in networks.

Since 1996, information security experts have been summarizing and generalizing Internet attack patterns in an attempt to understand network attacks at a higher level. Among these efforts, the most widely recognized is the CAPEC (Common Attack Pattern Enumeration and Classification) project supported by the United States Department of Homeland Security (DHS). In CAPEC, each attack pattern is assigned a unique ID and is provided with the following information: attack pattern description, attack steps, preconditions (the technical conditions that must be met before the attack, such as the platform and software version relied upon), postconditions (such as obtaining administrator privileges after the attack), attack examples, and related CWE (Common Weakness Enumeration) weaknesses and CVE (Common Vulnerabilities and Exposures) vulnerabilities. CAPEC not only sets out the details of network attack patterns but also indicates their links to CWE and CVE. CWE is a collection that describes weaknesses in software. For each weakness, it provides the following information: weakness description, applicable platforms, resulting consequences, examples, and related CVE vulnerabilities. One CWE weakness corresponds to multiple related CVE vulnerabilities.

Parallel vulnerabilities, as referred to in the present invention, are combinations of vulnerabilities that share the same basic attack conditions and can achieve the same attack purpose and effect, but are exploited along different attack paths. In short, parallel vulnerabilities take a multi-path approach to analyzing the routes an attack may use, thereby improving the probability and coverage of successfully defending against network attacks.

Closely related to parallel vulnerabilities is the attack graph: an attack graph describes, from the perspective of an attacker or a defender, how the vulnerabilities present in a system can be used to achieve the purpose of attack or defense; system administrators can use the attack graph to evaluate the security of the system and decide what remedial measures to take. Parallel vulnerabilities derive from the parallel branches between the same start point and end point in an attack graph. By finding vulnerability combinations that have such a parallel relationship, the other parallel vulnerabilities can be remediated quickly when one vulnerability is found to be exploited, so that the corresponding vulnerabilities can be patched in time and the weakness of the whole network remedied, which has high application value for network defense.

Summary of the invention

An object of the present invention is, based on open source information libraries, to extract the CWE weaknesses associated with the same attack pattern, to associate those CWE weaknesses with CVE vulnerabilities, and then to mine parallel vulnerabilities by means of text mining and natural language processing techniques, which has high application value for network defense.

The technical scheme of the invention is as follows:

A parallel vulnerability mining method based on an open source library and text mining, comprising the following steps:

1. Data acquisition and preprocessing:

A) Obtain the original attack pattern information, weakness information and vulnerability information from the open source libraries.

B) Clean the obtained open source library data, including unifying the data format, purifying the data, filling in missing attributes and removing noisy data.

C) Build data sets from the cleaned data and form a mapping from identifiers to the unstructured text information, giving three sets: Pattern {p₁,p₂,…,p_n}, Weak {w₁,w₂,…,w_m} and Vulnerability {v₁,v₂,…,v_s}, where p_i represents an attack pattern, w_i represents a weakness, and v_i represents a vulnerability.

2. Extraction of mapping relations and vulnerability sets

Parse the mapping from attack pattern p_i to weakness w_j (one-to-many) and the mapping from weakness w_j to vulnerability v_k (one-to-many), then form the mapping from attack pattern p_i to vulnerability v_k, so that the large vulnerability library is grouped into different vulnerability sets according to attack pattern.
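The composition of the two one-to-many mappings in step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `compose_mappings` and the identifiers `p1`, `w1`, `v1` and so on are hypothetical, and the input dictionaries stand in for whatever structure the cleaned data sets take.

```python
from collections import defaultdict


def compose_mappings(pattern_to_weaknesses, weakness_to_vulns):
    """Compose pattern -> weakness and weakness -> vulnerability mappings."""
    pattern_to_vulns = defaultdict(set)
    for pattern, weaknesses in pattern_to_weaknesses.items():
        for weakness in weaknesses:
            # A weakness without recorded vulnerabilities adds nothing.
            pattern_to_vulns[pattern].update(weakness_to_vulns.get(weakness, ()))
    return dict(pattern_to_vulns)


# Hypothetical identifiers, for illustration only.
pattern_to_weaknesses = {"p1": ["w1", "w2"], "p2": ["w2", "w3"]}
weakness_to_vulns = {"w1": ["v1", "v2"], "w2": ["v3"]}

vuln_sets = compose_mappings(pattern_to_weaknesses, weakness_to_vulns)
```

Each value of `vuln_sets` is the vulnerability set for one attack pattern; using a set means a vulnerability reached through several weaknesses is counted once.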

3. Mathematical modeling of the text descriptions of the vulnerabilities

A) Extract the text description of each vulnerability, segment it into words, extract stems, and vectorize the description according to a feature dictionary. The description can be represented as D_i=(w_1,i,w_2,i,w_3,i,…,w_n,i), where w_n,i represents the quantitative index of term n in document D_i.

B) Calculate the TF/IDF (term frequency/inverse document frequency) values of the document terms from the vectorized description, so that the document can be represented as d_i=(w_1,i,w_2,i,w_3,i,…,w_n,i), where w_n,i represents the TF/IDF weighted value of term n in document d_i.

4. Screen the vulnerabilities in the vulnerability sets extracted in step 2 according to the preconditions and postconditions of the attack pattern and the thresholds, obtaining the set of parallel vulnerability combinations

Parallel {(v_i,v_j) | i≠j, v_i,v_j ∈ Vulnerability}

A) From a manually labeled training set of parallel vulnerabilities, calculate the threshold of the cosine similarity of parallel vulnerabilities (the calculation is described under step 4-a) and the threshold of the Euclidean distance of the CVSS assessment scores.

B) Within the vulnerability sets obtained in step 2, using the vulnerability vectors obtained in step 3, calculate the cosine similarity of each pair of vulnerabilities with respect to the preconditions and postconditions of the corresponding attack pattern (the calculation is described under step 4-a).

C) According to the cosine similarity of the two vulnerabilities and the Euclidean distance of their CVSS assessment scores, judge two vulnerabilities that both fall within the respective threshold ranges to be parallel vulnerabilities.

The word segmentation method of step 3-a) separates the text stream on spaces to obtain a word sequence, removes characters with special meaning, replaces "//" and ",/" with "/", and removes trailing digits from words.
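The segmentation rules above can be sketched in a few lines. This is an illustrative reading only: the function name `tokenize` is hypothetical, "special" characters are assumed to be anything other than ASCII letters, digits and "/", the slash rule is assumed to mean collapsing "//" and ",/" to "/", and lowercasing is an added assumption that the text does not state.

```python
import re


def tokenize(text):
    """Split a vulnerability description into cleaned word tokens."""
    # Collapse the slash variants to a single "/" (assumed reading of the rule).
    text = text.replace("//", "/").replace(",/", "/")
    tokens = []
    for raw in text.split():
        # Drop characters other than letters, digits and "/" (assumed
        # definition of "special" characters).
        word = re.sub(r"[^A-Za-z0-9/]", "", raw)
        # Remove digits at the end of the word.
        word = re.sub(r"\d+$", "", word)
        if word:
            tokens.append(word.lower())  # lowercasing is an assumption
    return tokens


tokens = tokenize("Buffer overflow in httpd2, via a long URL!")
```

Stemming (for example with a Krovetz stemmer) would be applied to the resulting tokens as a separate step and is not shown here.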

The stem extraction method of step 3-b) processes the text with the Krovetz Stemmer.

The method for calculating the TF/IDF values of the text in step 3-b) is:

$$\mathrm{TF}(t,d) = 0.5 + \frac{0.5 \cdot f(t,d)}{\max\{f(w,d) : w \in d\}}$$

$$\mathrm{IDF}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$

where f(t,d) is the frequency of term t in document d; f(w,d) is the frequency of term w in document d, with w being any term belonging to document d; |D| is the total number of documents contained in corpus D; and |{d ∈ D : t ∈ d}| is the number of documents in corpus D that contain term t.
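The two formulas above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the logarithm is taken as the natural logarithm (the text does not give a base), the final weight is assumed to be the product TF × IDF, and terms absent from a document are assumed to get weight 0 rather than the 0.5 floor of the TF formula. The toy corpus is invented for illustration.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Augmented term frequency: 0.5 + 0.5 * f(t, d) / max_w f(w, d)."""
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())


def idf(term, corpus):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document gets zero weight
    return math.log(len(corpus) / containing)


def tfidf_vector(doc_tokens, corpus, vocabulary):
    """TF/IDF weights of one document over a fixed vocabulary."""
    present = set(doc_tokens)
    # Assumption: terms absent from the document get weight 0.
    return [tf(t, doc_tokens) * idf(t, corpus) if t in present else 0.0
            for t in vocabulary]


# Invented toy corpus of already tokenized descriptions.
corpus = [
    ["sql", "injection", "login"],
    ["buffer", "overflow", "login"],
    ["sql", "injection", "search", "sql"],
]
vocabulary = sorted({t for doc in corpus for t in doc})
vectors = [tfidf_vector(doc, corpus, vocabulary) for doc in corpus]
```

Fixing the vocabulary order up front keeps every document vector the same length, which the cosine similarity of step 4 relies on.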

The method for calculating the cosine similarity in step 4-a) is the cosine of the angle between the two TF/IDF vectors:

$$\mathrm{sim}(v_j, v_k) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$

where v_j represents the description text of vulnerability j and v_k represents the description text of vulnerability k; w_i,j represents the TF/IDF weighted value of term i in document d_j, and w_i,k represents the TF/IDF weighted value of term i in document d_k.
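A cosine similarity over two weight vectors of equal length can be sketched as below. The function name `cosine_similarity` is hypothetical, and returning 0.0 when either vector is all zeros is an assumption made to avoid division by zero; the text does not say how that case is handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty description is similar to nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because TF/IDF weights are non-negative, the result lies between 0 and 1, with larger values meaning more similar descriptions.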

The main advantages of the invention are:

1. The text description field of existing vulnerability records is unstructured text, and because vulnerability descriptions come from different sources, their format and wording differ, so a computer cannot automatically understand the text description field. The invention numerically vectorizes the text description field according to a feature lexicon, making the imprecise text description explicit and facilitating intelligent, automated and large-scale processing of vulnerability records by computer systems.

2. The vulnerability records of existing vulnerability databases carry only a uniform, explicit classification and a priority ordering by severity and discovery time; the security attack dependencies between vulnerability records cannot be expressed. The invention can mine the parallel association patterns and rules implicit in vulnerability data, providing a basis for further applications of vulnerability data.

3. Existing attack patterns cover only network attacks that have already occurred, and the corresponding vulnerability information is single and fixed, so extensibility and predictive power are weak and discovering the corresponding attack patterns is difficult. Based on the mappings from attack pattern to weakness and from weakness to vulnerability in the open source libraries, the invention uses text mining to find parallel vulnerabilities; for attack patterns that already exist it can extend to a variety of vulnerability combinations, and it can also anticipate certain attack patterns, providing a foundation for network security protection.

Description of the drawings

Fig. 1 is a flow chart of the system.

Fig. 2 is a flow chart of the data acquisition and preprocessing of block 1001 in Fig. 1.

Fig. 3 is a flow chart of the vulnerability set extraction of block 1005 in Fig. 1.

Fig. 4 is a flow chart of the text vectorization of block 1009 in Fig. 1.

Fig. 5 is a flow chart of the threshold acquisition of block 1013 in Fig. 1.

Fig. 6 is a flow chart of obtaining the parallel vulnerability results of block 1017 in Fig. 1.

Fig. 7 is a diagram of the data structure of the data format used in Fig. 2.

Specific embodiment

The invention is further described below with reference to the drawings and embodiments.

Fig. 1 is the flow chart of the invention and illustrates one embodiment of parallel vulnerability discovery. Parallel vulnerability discovery begins with acquiring and preprocessing vulnerability data. The second step extracts the vulnerability sets. The third step vectorizes the vulnerability information. The fourth step obtains the thresholds from the training set. The fifth step obtains the parallel vulnerability results.

Fig. 2 is the flow chart of data acquisition and preprocessing in Fig. 1 and illustrates one embodiment of how vulnerability data is obtained and preprocessed. The flow begins by obtaining data from two public data sources (MITRE and CVSS). Next, the raw data is input into the system and its format is unified, forming four data sets (CAPEC, CWE, CVE and CVSS scores). Finally, the mapping from identifiers to data is formed.

Fig. 7 shows the data structure used in Fig. 2.

Fig. 3 is the flow chart of vulnerability set extraction in Fig. 1. The mapping from CAPEC to CWE is formed first (according to Related_Weakness), then the mapping from CWE to CVE (according to Observed_Example), and finally the mapping from CAPEC to CVE, so that the large vulnerability set is partitioned by attack pattern.

Fig. 4 is the flow chart of text vectorization in Fig. 1. The first step obtains the dictionary: the vulnerability library is traversed, the stems of the vulnerability descriptions are extracted, and a keyword table is filtered out according to rules. The second step segments the description of each vulnerability, extracts the stems, and removes non-keywords according to the dictionary to form a keyword sequence. The third step vectorizes the keyword sequence of each vulnerability, using the corresponding TF/IDF values as the vector elements.

Fig. 5 is the flow chart of threshold acquisition in Fig. 1. A training set is first selected manually; the similarity of each training pair is calculated according to the cosine similarity principle, the Euclidean distance is calculated from the CVSS assessment scores, and the values are averaged to give the thresholds.
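The averaging step of Fig. 5 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, each training pair is assumed to be supplied as two TF/IDF vectors plus two lists of CVSS scores, and the layout of the CVSS score list (a single base score here) is invented, since the text does not say which scores are compared.

```python
import math
from statistics import mean


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean(a, b):
    """Euclidean distance between two equal-length score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def learn_thresholds(training_pairs):
    """Average similarity and CVSS distance over labeled parallel pairs."""
    sims = [cosine(vec_a, vec_b) for vec_a, vec_b, _, _ in training_pairs]
    dists = [euclidean(cvss_a, cvss_b) for _, _, cvss_a, cvss_b in training_pairs]
    return mean(sims), mean(dists)


# Invented training pairs: (vector_a, vector_b, cvss_a, cvss_b).
training_pairs = [
    ([1.0, 0.0], [1.0, 0.0], [7.5], [7.5]),
    ([1.0, 0.0], [0.0, 1.0], [5.0], [9.0]),
]
sim_threshold, dist_threshold = learn_thresholds(training_pairs)
```

A plain mean is the simplest reading of "averaged"; a real training set would contain only manually confirmed parallel pairs, unlike the second toy pair here, which is included only to make the arithmetic visible.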

Fig. 6 is the flow chart of obtaining the parallel vulnerability results in Fig. 1. Using the vulnerability sets obtained in Fig. 3 as the test set, two different vulnerabilities are selected in sequence. The assessment score value (the Euclidean distance of the CVSS assessment scores) is calculated first; if it is less than its threshold, processing continues and the cosine similarity is calculated; if the cosine similarity is greater than its threshold, the two vulnerabilities are parallel vulnerabilities.
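The two-stage check of Fig. 6 can be sketched as a loop over all pairs within one attack pattern's vulnerability set. The function names, the identifiers `V1` to `V4`, the vectors, the scores and both threshold values are hypothetical; the sketch only shows the order of the tests, with the cheaper CVSS distance check placed before the cosine similarity.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def euclidean(a, b):
    """Euclidean distance between two equal-length score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_parallel_pairs(vulns, sim_threshold, dist_threshold):
    """Return pairs that pass the CVSS distance test, then the cosine test."""
    pairs = []
    for a, b in combinations(sorted(vulns), 2):
        vec_a, cvss_a = vulns[a]
        vec_b, cvss_b = vulns[b]
        # Stage 1: the CVSS score distance must be below its threshold.
        if euclidean(cvss_a, cvss_b) >= dist_threshold:
            continue
        # Stage 2: the description similarity must exceed its threshold.
        if cosine(vec_a, vec_b) > sim_threshold:
            pairs.append((a, b))
    return pairs


# Invented vulnerability set for one attack pattern: id -> (vector, CVSS scores).
vulns = {
    "V1": ([1.0, 0.0, 1.0], [7.5]),
    "V2": ([1.0, 0.0, 0.9], [7.5]),
    "V3": ([0.0, 1.0, 0.0], [7.5]),
    "V4": ([1.0, 0.0, 1.0], [2.0]),
}
parallel = find_parallel_pairs(vulns, sim_threshold=0.25, dist_threshold=1.0)
```

Here only V1 and V2 survive both tests: V3 has a dissimilar description and V4 has a distant CVSS score.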

Following the approach and implementation steps of the parallel vulnerability mining method described above, the run results of the prototype system for parallel vulnerability mining based on open source libraries and text mining show that 2013 parallel vulnerability combinations were ultimately found. Manual verification based on practical experience found that 192 of the combinations obtained by the system were not valid parallel vulnerabilities, giving an accuracy of about 90.5%.
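As a check on the reported figure, the accuracy follows directly from the two counts given above:

$$\frac{2013 - 192}{2013} = \frac{1821}{2013} \approx 0.905$$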

For the performance test of the prototype system, the System.currentTimeMillis() method was used to obtain time differences and calculate the corresponding speed, giving the system response time when the data flow reaches the tens of thousands and tens of thousands of extractions are required. For example, for threshold calculation, processing the training set of 29 groups of data, drawn from CAPEC attack patterns that contain two or more related vulnerabilities, took 40 ms; for parallel vulnerability mining, processing 62407 vulnerabilities across 400 attack patterns took 35 s. In light of practical engineering experience, the system has acceptable performance.

In addition, consider an example of the prototype system's mining results. For attack pattern CAPEC-157: Sniffing Attacks, its attack steps, associated weaknesses (CWE), preconditions, postconditions, related techniques, applicable technical environment and other important information were analyzed. Following the method steps above, the cosine similarity of the vulnerabilities was calculated to be about 0.26 and the Euclidean distance of the CVSS assessment scores to be 0. Compared with the thresholds obtained from training for the cosine similarity of parallel vulnerabilities and for the Euclidean distance of the CVSS assessment scores, both values fall within the threshold ranges, so vulnerabilities CVE-2009-1466 and CVE-2008-1567 are found to be a group of parallel vulnerabilities.

The above experimental analysis shows that the method described herein can mine the corresponding parallel vulnerabilities fairly accurately and with good system performance.

This work addresses the information security field and the problem of association mining of information security vulnerabilities. Based on open source libraries such as CAPEC, CWE and CVE, it introduces the idea of parallel vulnerability mining and proposes a parallel vulnerability mining method based on open source libraries and text mining. First, based on the open source information libraries, the CWE weaknesses associated with the same attack pattern are extracted and then associated with CVE vulnerabilities, which facilitates analysis of potential parallel relationships between vulnerabilities. Then, the text descriptions of the vulnerabilities are vectorized so that software systems can process vulnerability records intelligently. Unlike queries based on keyword matching, the similarity between vulnerabilities is examined against thresholds derived from a training set, the parallel relationships between vulnerabilities are calculated, and parallel vulnerabilities are mined. Finally, experiments and practice verify the correctness and effectiveness of the method. With the results obtained by the method, the other parallel vulnerabilities can be remediated quickly when one vulnerability is found to be exploited, thereby remedying the weakness of the whole network, which has high application value for network defense.

Claims

1. A parallel vulnerability mining method based on an open source library and text mining, characterized in that the steps comprise:

1) data acquisition and preprocessing;

A) obtaining original attack pattern information, weakness information and vulnerability information from the open source libraries;

B) cleaning the obtained open source library data, including unifying the data format, purifying the data, filling in missing attributes and removing noisy data;

C) building data sets from the cleaned data and forming a mapping from identifiers to the unstructured text information, giving three sets Pattern {p₁, p₂, ..., p_n}, Weak {w₁, w₂, ..., w_m} and Vulnerability {v₁, v₂, ..., v_s}; wherein p_i represents an attack pattern, w_i represents a weakness, and v_i represents a vulnerability;

2) extracting mapping relations and vulnerability sets;

parsing the mapping from attack pattern p_i to weakness w_j and the mapping from weakness w_j to vulnerability v_k, then forming the mapping from attack pattern p_i to vulnerability v_k, so that the large vulnerability library is grouped into different vulnerability sets according to attack pattern;

3) mathematically modeling the text descriptions of the vulnerabilities;

D) extracting the text description of each vulnerability, segmenting it into words, extracting stems, and vectorizing the text description according to a feature dictionary;

E) calculating the tf/idf, that is, term frequency/inverse document frequency, values of the document terms from the vectorized description;

4) screening the vulnerabilities in the vulnerability sets extracted in step 2) according to the preconditions and postconditions of the attack pattern and the thresholds, to obtain the set of parallel vulnerability combinations;

F) calculating, from a manually labeled training set of parallel vulnerabilities, the threshold of the cosine similarity of parallel vulnerabilities and the threshold of the Euclidean distance of the CVSS assessment scores;

H) within the vulnerability sets obtained in step 2), using the vulnerability vectors obtained in step 3), calculating the cosine similarity of each pair of vulnerabilities with respect to the preconditions and postconditions of the corresponding attack pattern;

G) according to the cosine similarity of two vulnerabilities and the Euclidean distance of their CVSS assessment scores, judging two vulnerabilities that both fall within the respective threshold ranges to be parallel vulnerabilities.