CN104166680B - Parallel vulnerability mining method based on open source library and text mining - Google Patents

Parallel vulnerability mining method based on open source library and text mining

Info

Publication number
CN104166680B
Authority
CN
China
Prior art keywords
leak
parallel
vulnerability
information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410332588.0A
Other languages
Chinese (zh)
Other versions
CN104166680A (en)
Inventor
赵向辉
刘林
刘晖
易锦
陈海强
张磊
姚原岗
李维杰
偰赓
刘彦钊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Information Technology Security Evaluation Center
Original Assignee
China Information Technology Security Evaluation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Information Technology Security Evaluation Center
Priority to CN201410332588.0A
Publication of CN104166680A
Application granted
Publication of CN104166680B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a parallel vulnerability mining method based on an open source library and text mining, and belongs to the technical field of computer information security. The method comprises the steps of obtaining vulnerability data from the open source library and preprocessing it, extracting vulnerability sets, vectorizing the text, calculating thresholds and discovering parallel vulnerabilities. Its advantages are as follows: on the basis of the open source library, the related vulnerability information under the same attack pattern is extracted, which makes it convenient to analyse potential parallel relationships between vulnerabilities; the text descriptions of the vulnerabilities are vectorized, so that a computer system can process vulnerability records intelligently; unlike queries based on keyword matching, the similarity between vulnerabilities is examined against thresholds obtained from a training set; and the parallel relationships between vulnerabilities can be computed, so that when one vulnerability is found to be exploited the other parallel vulnerabilities can be remedied rapidly, thereby remedying the weakness of the whole network and strengthening its defence capability, which is of great significance for information security.

Description

A parallel vulnerability mining method based on open source libraries and text mining
Technical field
The invention belongs to the technical field of computer information security. Specifically, the invention relates to a parallel vulnerability mining method based on open source libraries and text mining.
Background art
With the continued spread and rapid development of the Internet, people increasingly use and depend on it. At the same time, the economic losses caused by Internet information security problems have risen sharply, and the resulting harm has grown significantly. Vulnerabilities in software and operating systems give hackers the conditions to launch network attacks, steal user information and even damage industrial infrastructure. For this reason, countries around the world spare no effort in studying effective methods for guarding against security risks in networks.
Since 1996, information security experts have been classifying and summarizing Internet attack patterns in an attempt to understand network attacks at a higher level. Among these efforts, the widely recognized one is the CAPEC (Common Attack Pattern Enumeration and Classification) project supported by the United States Department of Homeland Security (DHS). In CAPEC, each attack pattern is assigned a unique ID and is provided with the following information: attack pattern description, attack steps, preconditions (technical conditions that must be met before the attack, such as the platform and software version relied on), postconditions (such as obtaining administrator privileges after the attack), attack examples, and the related CWE (Common Weakness Enumeration) weaknesses and CVE (Common Vulnerabilities and Exposures) vulnerabilities. CAPEC not only details network attack patterns but also indicates their links to CWE and CVE. CWE is a collection that describes weaknesses in software. For each weakness it provides the following information: weakness description, applicable platforms, resulting consequences, examples and related CVE vulnerabilities. One CWE weakness corresponds to multiple related CVE vulnerabilities.
Parallel vulnerabilities, as the term is used in the invention, are combinations of vulnerabilities that share the same attack preconditions and can achieve the same attack purpose and effect, but are exploited along different attack paths. In brief, parallel vulnerabilities use a multi-path approach to analyse the routes an attack may take, thereby improving the probability and coverage of successful defence against network attacks.
Closely related to parallel vulnerabilities is the attack graph. An attack graph describes, from the perspective of an attacker or defender, how the vulnerabilities present in a system can be used to achieve the goal of attack or defence; system administrators can use attack graphs to evaluate the security of a system and decide which remedial measures to take. Parallel vulnerabilities derive from parallel branches between the same start node and end node in an attack graph. Finding combinations of vulnerabilities with this parallel relationship makes it possible, when one vulnerability is found to be exploited, to remedy the other parallel vulnerabilities quickly, patch the corresponding vulnerabilities in time and thereby remedy the weakness of the whole network, which has high practical value for network defence.
Summary of the invention
The object of the invention is, based on open source information libraries, to extract the CWE weaknesses associated with the same attack pattern, to associate CVE vulnerabilities from those CWE weaknesses, and then to mine parallel vulnerabilities using text mining and natural language processing techniques, which has high practical value for network defence.
The technical solution of the invention is:
A parallel vulnerability mining method based on open source libraries and text mining, comprising the following steps:
1. Data acquisition and preprocessing:
a) Obtain raw attack pattern information, weakness information and vulnerability information from the open source libraries.
b) Clean the obtained open source library data, including unifying the data format, purifying the data, filling in missing attributes and removing noisy data.
c) Build data sets from the cleaned data and form a mapping from identifiers to unstructured text information, yielding three sets Pattern $\{p_1, p_2, \ldots, p_n\}$, Weak $\{w_1, w_2, \ldots, w_m\}$ and Vulnerability $\{v_1, v_2, \ldots, v_s\}$, where $p_i$ denotes an attack pattern, $w_i$ a weakness and $v_i$ a vulnerability.
2. Extraction of mapping relations and vulnerability sets
Parse the mapping from attack pattern $p_i$ to weakness $w_j$ (one-to-many) and the mapping from weakness $w_j$ to vulnerability $v_k$ (one-to-many), then form the mapping from attack pattern $p_i$ to vulnerability $v_k$, so that the huge vulnerability library is grouped into different vulnerability sets according to attack pattern.
3. Mathematical modelling of the vulnerability text descriptions
a) Extract the text description of each vulnerability, tokenize it, extract word stems, and vectorize the description according to a feature dictionary, giving $D_i = (w_{1,i}, w_{2,i}, w_{3,i}, \ldots, w_{n,i})$, where $w_{n,i}$ is the count-based index of term n in document $D_i$.
b) Compute the TF/IDF (term frequency / inverse document frequency) value of each document term from the vectorized description, so that the document is represented as $d_i = (w_{1,i}, w_{2,i}, w_{3,i}, \ldots, w_{n,i})$, where $w_{n,i}$ is the TF/IDF weight of term n in document $d_i$.
4. Screen the vulnerabilities in the vulnerability sets extracted in step 2 according to the preconditions and postconditions of the attack pattern and the thresholds, obtaining the set of parallel vulnerability combinations
Parallel $\{(v_i, v_j) \mid i \neq j,\ v_i, v_j \in \text{Vulnerability}\}$
a) From a manually labelled training set of parallel vulnerabilities, compute the threshold for the cosine similarity of parallel vulnerabilities (see step 4-a for the method) and the threshold for the Euclidean distance between CVSS assessment scores.
b) Within each vulnerability set obtained in step 2, using the vulnerability vectors obtained in step 3 and with reference to the preconditions and postconditions of the corresponding attack pattern, compute the cosine similarity of each pair of vulnerabilities (see step 4-a for the method).
c) Based on the cosine similarity of the two vulnerabilities and the Euclidean distance between their CVSS assessment scores, judge two vulnerabilities that both fall within the respective threshold ranges to be parallel vulnerabilities.
The tokenization method in step 3-a) splits the text stream on spaces to obtain a word sequence. Characters with special meaning are removed, the sequences //, ,/ are replaced with /, and trailing digits are removed from words.
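The tokenization rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tokenize` is hypothetical, and the text does not list which characters count as special, so the sketch assumes that everything other than letters, digits, `/` and `-` is dropped, and that `//` and `\/` are the slash sequences to normalise.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split a vulnerability description into cleaned word tokens."""
    # Normalise repeated or escaped slashes to a single "/"
    # (which sequences are meant is an assumption).
    text = re.sub(r"/{2,}|\\/", "/", text)
    tokens = []
    for raw in text.split():  # split the text stream on whitespace
        # Drop special characters; the kept character set is an assumption.
        word = re.sub(r"[^A-Za-z0-9/\-]", "", raw)
        # Remove digits at the end of the word.
        word = re.sub(r"\d+$", "", word)
        if word:
            tokens.append(word)
    return tokens
```

For example, `tokenize("Buffer overflow in foo2 allows remote attackers.")` yields `["Buffer", "overflow", "in", "foo", "allows", "remote", "attackers"]`. Stemming would then be applied to each token.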
The stemming method in step 3-b) processes the text with the Krovetz stemmer.
The method for computing the TF/IDF values of the text in step 3-b) is:
$$\mathrm{TF}(t, d) = 0.5 + \frac{0.5\, f(t, d)}{\max\{f(w, d) : w \in d\}}$$
$$\mathrm{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
where $f(t, d)$ is the frequency of word t in document d; $f(w, d)$ is the frequency of word w in document d, with w ranging over every word in d; $|D|$ is the total number of documents in corpus D; and $|\{d \in D : t \in d\}|$ is the number of documents in corpus D that contain word t.
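A minimal Python sketch of this weighting follows. It reads the IDF as the logarithm of the ratio of corpus size to document frequency (the usual form), uses the natural logarithm because the text does not give a base, and returns 0 for a word found in no document; all three choices, and the function names, are assumptions made for illustration.

```python
import math
from collections import Counter


def tf(term: str, doc: list[str]) -> float:
    """Augmented term frequency: 0.5 + 0.5 * f(t, d) / max_w f(w, d)."""
    counts = Counter(doc)
    return 0.5 + 0.5 * counts[term] / max(counts.values())


def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse document frequency: log(|D| / number of documents containing t)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: words absent from the corpus get weight 0
    return math.log(len(corpus) / containing)


def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF/IDF weight of one term in one document."""
    return tf(term, doc) * idf(term, corpus)
```

With the toy corpus `[["sql", "injection", "sql"], ["buffer", "overflow"], ["sql", "overflow"]]`, the word "sql" in the first document has TF 1.0 and IDF log(3/2).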
The method for computing cosine similarity in step 4-a) is:
$$\mathrm{sim}(v_j, v_k) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\ \sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$
where $v_j$ denotes the description text of vulnerability j and $v_k$ the description text of vulnerability k; $w_{i,j}$ is the TF/IDF weight of term i in document $d_j$, and $w_{i,k}$ is the TF/IDF weight of term i in document $d_k$.
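As an illustration, cosine similarity between two TF/IDF weight vectors can be computed as below. The function name is hypothetical, and returning 0 for an all-zero vector is an assumption, since the text does not cover that case.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty description is similar to nothing
    return dot / (norm_a * norm_b)
```

Identical directions give 1, and vectors with no shared terms give 0.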
The main advantages of the invention are:
1. The text description field of existing vulnerability records is unstructured text, and different description sources lead to different formats and wording, so a computer cannot automatically understand the field. The invention numerically vectorizes the text description field according to a feature lexicon, making ambiguous descriptions explicit and facilitating intelligent, automated, large-scale processing of vulnerability records by computer systems.
2. Records in existing vulnerability databases relate to one another only through a uniform explicit classification and the priority order given by severity level; security attack dependencies between vulnerability records cannot be expressed. The invention can mine the parallel association patterns and rules implicit in vulnerability data, providing a basis for further application of that data.
3. Existing attack patterns cover only network attacks that have already occurred; the corresponding vulnerability information is single and fixed, extensibility and predictability are weak, and discovering the corresponding attack pattern is difficult. The invention uses the mappings in the open source libraries from attack patterns to weaknesses and from weaknesses to vulnerabilities, and then applies text mining to discover parallel vulnerabilities, so that multiple vulnerability combinations can be derived for attack patterns that have already occurred and certain attack patterns can also be anticipated, providing a basis for network security protection.
Description of the drawings
Fig. 1 is the flow chart of the system
Fig. 2 is the flow chart of data acquisition and preprocessing, block 1001 in Fig. 1
Fig. 3 is the flow chart of vulnerability set extraction, block 1005 in Fig. 1
Fig. 4 is the flow chart of text vectorization, block 1009 in Fig. 1
Fig. 5 is the flow chart of threshold acquisition, block 1013 in Fig. 1
Fig. 6 is the flow chart of obtaining the parallel vulnerability results, block 1017 in Fig. 1
Fig. 7 is the data structure diagram of the data format used in Fig. 2
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Fig. 1 is the flow chart of the invention and illustrates one embodiment of parallel vulnerability discovery. Discovery starts with the acquisition and preprocessing of vulnerability data. The second step extracts the vulnerability sets. The third step vectorizes the vulnerability information. The fourth step obtains the thresholds from a training set. The fifth step obtains the parallel vulnerability results.
Fig. 2 is the flow chart of data acquisition and preprocessing in Fig. 1 and illustrates one embodiment of how vulnerability data are obtained and preprocessed. The flow starts by obtaining data from two public data sources (MITRE and CVSS). Next, the raw data are loaded into the system and their format is unified, forming four data sets (CAPEC, CWE, CVE and CVSS scores). Finally, the mapping from identifiers to data is formed.
Fig. 7 shows the data structure used in Fig. 2.
Fig. 3 is the flow chart of vulnerability set extraction in Fig. 1. The mapping from CAPEC to CWE is formed first (from Related_Weakness), then the mapping from CWE to CVE (from Observed_Example), and finally the mapping from CAPEC to CVE, so that the huge vulnerability set is partitioned by attack pattern.
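The composition of the two one-to-many mappings can be sketched as follows. The function name and the placeholder identifiers in the usage note are hypothetical; how the two input dictionaries are parsed from the CAPEC and CWE data is not shown.

```python
def attack_pattern_to_cves(
    capec_to_cwe: dict[str, set[str]],
    cwe_to_cve: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Compose CAPEC -> CWE and CWE -> CVE into CAPEC -> CVE."""
    result: dict[str, set[str]] = {}
    for capec_id, cwe_ids in capec_to_cwe.items():
        cves: set[str] = set()
        for cwe_id in cwe_ids:
            # A weakness with no observed examples contributes nothing.
            cves.update(cwe_to_cve.get(cwe_id, set()))
        result[capec_id] = cves
    return result
```

With placeholder data `{"CAPEC-A": {"CWE-X", "CWE-Y"}}` and `{"CWE-X": {"CVE-1", "CVE-2"}, "CWE-Y": {"CVE-2", "CVE-3"}}`, the result groups CVE-1, CVE-2 and CVE-3 under CAPEC-A; a vulnerability reachable through several weaknesses appears only once because sets are used.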
Fig. 4 is the flow chart of text vectorization in Fig. 1. The first step obtains the dictionary: the vulnerability library is traversed, the stems of the vulnerability descriptions are extracted, and a keyword list is filtered out according to rules. The second step tokenizes each vulnerability description, extracts the stems, and removes non-keywords according to the dictionary to form a keyword sequence. The third step vectorizes the keyword sequence of each vulnerability, using the corresponding TF/IDF values as the vector elements.
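The dictionary and keyword-sequence steps might look like the sketch below. The filtering rules are not given in the text, so the rule used here (keep stems that occur in at least `min_docs` descriptions) is purely an assumption, as are the function names; the input is taken to be already tokenized and stemmed.

```python
from collections import Counter


def build_dictionary(docs: list[list[str]], min_docs: int = 2) -> list[str]:
    """Collect keyword stems; the document-frequency rule is an assumed filter."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each stem once per description
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)


def count_vector(doc: list[str], dictionary: list[str]) -> list[int]:
    """Represent one description as keyword counts in dictionary order."""
    counts = Counter(doc)
    return [counts[term] for term in dictionary]
```

Stems outside the dictionary are ignored by `count_vector`, which corresponds to removing non-keywords; replacing each count with its TF/IDF weight gives the final vector.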
Fig. 5 is the flow chart of threshold acquisition in Fig. 1. A training set is first selected manually; the similarity is computed according to the cosine similarity principle and the Euclidean distance is computed from the CVSS scores; the averages of these values are taken as the thresholds.
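A sketch of the averaging step, assuming each manually labelled training pair has already been reduced to its cosine similarity and to the CVSS score tuples of its two vulnerabilities. Which CVSS scores make up the tuple is not specified in the text; the names below are hypothetical.

```python
import math


def learn_thresholds(
    training_pairs: list[tuple[float, tuple[float, ...], tuple[float, ...]]],
) -> tuple[float, float]:
    """Average cosine similarity and CVSS Euclidean distance over labelled pairs.

    Each item is (cosine_similarity, cvss_scores_a, cvss_scores_b).
    """
    sims = [sim for sim, _, _ in training_pairs]
    # math.dist is the Euclidean distance between two points (Python 3.8+).
    dists = [math.dist(a, b) for _, a, b in training_pairs]
    return sum(sims) / len(sims), sum(dists) / len(dists)
```

The function returns the pair (similarity threshold, distance threshold) used in the screening step.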
Fig. 6 is the flow chart of obtaining the parallel vulnerability results in Fig. 1. The vulnerability sets obtained in Fig. 3 serve as the test set, and two different vulnerabilities are selected in sequence. The distance between their assessment scores is computed first; if it is less than the threshold, processing continues and the cosine similarity is computed; if that is greater than the threshold, the two vulnerabilities are parallel vulnerabilities.
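The screening loop for one attack pattern's vulnerability set can be sketched as follows. The data layout (an identifier mapped to a TF/IDF vector and a CVSS score tuple) and the function names are assumptions; the comparison order follows the flow just described.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; zero vectors are treated as dissimilar (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_parallel(
    vulns: dict[str, tuple[list[float], tuple[float, ...]]],
    sim_threshold: float,
    dist_threshold: float,
) -> list[tuple[str, str]]:
    """Return pairs whose CVSS distance is below and cosine similarity above threshold."""
    pairs = []
    for (id_a, (vec_a, cvss_a)), (id_b, (vec_b, cvss_b)) in combinations(
        sorted(vulns.items()), 2
    ):
        # Cheap check first: skip pairs whose assessment scores are too far apart.
        if math.dist(cvss_a, cvss_b) >= dist_threshold:
            continue
        if cosine(vec_a, vec_b) > sim_threshold:
            pairs.append((id_a, id_b))
    return pairs
```

Checking the score distance before the cosine similarity avoids the more expensive vector computation for pairs that are already ruled out.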
Following the approach and implementation steps of the parallel vulnerability mining method above, a run of the prototype system for parallel vulnerability mining based on open source libraries and text mining found 2013 groups of parallel vulnerability combinations. Manual verification based on practical experience showed that 192 of the combinations returned by the system are not valid parallel vulnerabilities, giving an accuracy of about 90.5%.
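The accuracy figure follows directly from the two counts reported above:

```latex
\[
\text{accuracy} = \frac{2013 - 192}{2013} = \frac{1821}{2013} \approx 0.905
\]
```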
For performance testing of the prototype system, the time difference obtained with the System.currentTimeMillis() method was used to calculate the corresponding speed, giving the system response time when the data volume reaches tens of thousands of records and tens of thousands of extractions are required. For example, for threshold calculation, the response time for processing a training set of 29 groups of data, drawn from CAPEC attack patterns with two or more related vulnerabilities, was 40 ms; for parallel vulnerability mining, the response time for processing 62407 vulnerabilities across 400 attack patterns was 35 s. Judged against practical engineering experience, the system has acceptable performance.
In addition, a mining result of the prototype system is analysed as an example. For attack pattern CAPEC-157: Sniffing Attacks, important information such as its attack steps, associated weaknesses (CWE), preconditions, postconditions, related techniques and applicable technical environment was analysed. Following the steps of the method above, the cosine similarity of the vulnerabilities is computed as about 0.26 and the Euclidean distance between their CVSS assessment scores as 0. Compared with the trained thresholds for the cosine similarity of parallel vulnerabilities and for the Euclidean distance between CVSS assessment scores, both values fall within the threshold ranges, so vulnerabilities CVE-2009-1466 and CVE-2008-1567 are identified as a group of parallel vulnerabilities.
The above experimental analysis shows that the method can mine the corresponding parallel vulnerabilities fairly accurately and with good system performance.
This work is oriented to the security field and addresses the problem of association mining for information security vulnerabilities. Based on open source libraries such as CAPEC, CWE and CVE, it introduces the idea of parallel vulnerability mining and proposes a parallel vulnerability mining method based on open source libraries and text mining. First, based on the open source information libraries, the CWE weaknesses associated with the same attack pattern are extracted and CVE vulnerabilities are associated from those weaknesses, which makes it convenient to analyse potential parallel relationships between vulnerabilities. Then the text descriptions of the vulnerabilities are vectorized so that software systems can process vulnerability records intelligently. Unlike queries based on keyword matching, the similarity between vulnerabilities is examined against thresholds derived from a training set, the parallel relationships between vulnerabilities are computed, and parallel vulnerabilities are mined. Finally, experimental verification and practice demonstrate the correctness and validity of the method. With the results obtained by the method, when one vulnerability is found to be exploited the other parallel vulnerabilities can be remedied rapidly, thereby remedying the weakness of the whole network, which has high practical value for network defence.

Claims (1)

1. A parallel vulnerability mining method based on open source libraries and text mining, characterized in that the steps comprise:
1) data acquisition and preprocessing;
a) obtaining raw attack pattern information, weakness information and vulnerability information from the open source libraries;
b) cleaning the obtained open source library data, including unifying the data format, purifying the data, filling in missing attributes and removing noisy data;
c) building data sets from the cleaned data and forming a mapping from identifiers to unstructured text information, yielding three sets Pattern $\{p_1, p_2, \ldots, p_n\}$, Weak $\{w_1, w_2, \ldots, w_m\}$ and Vulnerability $\{v_1, v_2, \ldots, v_s\}$; where $p_i$ denotes an attack pattern, $w_i$ a weakness and $v_i$ a vulnerability;
2) extracting mapping relations and vulnerability sets;
parsing the mapping from attack pattern $p_i$ to weakness $w_j$ and the mapping from weakness $w_j$ to vulnerability $v_k$, then forming the mapping from attack pattern $p_i$ to vulnerability $v_k$, so that the huge vulnerability library is grouped into different vulnerability sets according to attack pattern;
3) mathematically modelling the text descriptions of the vulnerabilities;
d) extracting the text description of each vulnerability, tokenizing it, extracting word stems, and vectorizing the text description according to a feature dictionary;
e) computing the tf/idf value, that is, the term frequency / inverse document frequency value, of each document term from the vectorized description;
4) screening the vulnerabilities in the vulnerability sets extracted in step 2) according to the preconditions and postconditions of the attack pattern and the thresholds, obtaining the set of parallel vulnerability combinations;
f) from a manually labelled training set of parallel vulnerabilities, computing the threshold for the cosine similarity of parallel vulnerabilities and the threshold for the Euclidean distance between CVSS assessment scores;
h) within each vulnerability set obtained in step 2), using the vulnerability vectors obtained in step 3) and with reference to the preconditions and postconditions of the corresponding attack pattern, computing the cosine similarity of each pair of vulnerabilities;
g) based on the cosine similarity of the two vulnerabilities and the Euclidean distance between their CVSS assessment scores, judging two vulnerabilities that both fall within the respective threshold ranges to be parallel vulnerabilities.
CN201410332588.0A 2014-07-12 2014-07-12 Parallel vulnerability mining method based on open source library and text mining Active CN104166680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410332588.0A CN104166680B (en) 2014-07-12 2014-07-12 Parallel vulnerability mining method based on open source library and text mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410332588.0A CN104166680B (en) 2014-07-12 2014-07-12 Parallel vulnerability mining method based on open source library and text mining

Publications (2)

Publication Number Publication Date
CN104166680A CN104166680A (en) 2014-11-26
CN104166680B true CN104166680B (en) 2017-05-17

Family

ID=51910493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410332588.0A Active CN104166680B (en) 2014-07-12 2014-07-12 Parallel vulnerability mining method based on open source library and text mining

Country Status (1)

Country Link
CN (1) CN104166680B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615542B (en) * 2015-02-11 2017-12-01 中国科学院软件研究所 A kind of method of the fragility association analysis auxiliary bug excavation based on function call
CN107871078A (en) * 2016-09-27 2018-04-03 北京计算机技术及应用研究所 The method that vulnerability information is extracted in non-structured text
CN109347801B (en) * 2018-09-17 2021-03-16 武汉大学 Vulnerability exploitation risk assessment method based on multi-source word embedding and knowledge graph
CN111914259B (en) * 2019-05-09 2024-07-02 阿里巴巴集团控股有限公司 Data processing method and computing device
CN110378126B (en) * 2019-07-26 2021-03-26 北京中科微澜科技有限公司 Vulnerability detection method and system
CN110855676B (en) * 2019-11-15 2021-08-31 腾讯科技(深圳)有限公司 Network attack processing method and device and storage medium
CN111258968B (en) * 2019-12-30 2020-09-11 广州博士信息技术研究院有限公司 Enterprise redundant data cleaning method and device and big data platform
CN112528294A (en) * 2020-12-21 2021-03-19 网神信息技术(北京)股份有限公司 Vulnerability matching method and device, computer equipment and readable storage medium
CN112765031B (en) * 2021-01-25 2022-08-19 北京航空航天大学 Decomposition method of crowd-sourcing vulnerability mining task

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102446254A (en) * 2011-12-30 2012-05-09 中国信息安全测评中心 Similar loophole inquiry method based on text mining
CN103268279A (en) * 2013-02-27 2013-08-28 中国信息安全测评中心 Compound poisson process-based software reliability prediction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8789192B2 (en) * 2011-05-23 2014-07-22 Lockheed Martin Corporation Enterprise vulnerability management

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ranking Attacks Based on Vulnerability Analysis;Ju An Wang et al;《Proceedings of the 43rd Hawaii International Conference on System Sciences》;20101231;第1-10页 *
基于关联性的漏洞评估方法;张凤荔等;《计算机应用研究》;20140331;第31卷(第3期);第811-814页 *
漏洞威胁的关联评估方法;谢丽霞等;《计算机应用》;20120301;第32卷(第3期);第679-682页 *

Also Published As

Publication number Publication date
CN104166680A (en) 2014-11-26

Similar Documents

Publication Publication Date Title
CN104166680B (en) Parallel vulnerability mining method based on open source library and text mining
CN110233849B (en) Method and system for analyzing network security situation
CN103559235B (en) A kind of online social networks malicious web pages detection recognition methods
CN105072089B (en) A kind of WEB malice scanning behavior method for detecting abnormality and system
CN109347801A (en) A kind of vulnerability exploit methods of risk assessment based on multi-source word insertion and knowledge mapping
CN103780614B (en) A kind of SQL injection loophole method for digging based on simulated strike extension
Perry et al. No-doubt: Attack attribution based on threat intelligence reports
CN110765459A (en) Malicious script detection method and device and storage medium
CN105072214A (en) C&C domain name identification method based on domain name feature
CN107437026B (en) Malicious webpage advertisement detection method based on advertisement network topology
CN105718795B (en) Malicious code evidence collecting method and system under Linux based on condition code
CN112492059A (en) DGA domain name detection model training method, DGA domain name detection device and storage medium
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
CN105138916A (en) Multi-track malicious program feature detecting method based on data mining
CN108985061A (en) A kind of webshell detection method based on Model Fusion
CN116405246A (en) Vulnerability exploitation chain construction technology based on attack and defense combination
CN112784279B (en) Software product safety risk assessment method based on dependency library version information
CN112817877B (en) Abnormal script detection method and device, computer equipment and storage medium
CN105045715A (en) Programming mode and mode matching based bug clustering method
CN103455754B (en) A kind of malicious searches keyword recognition methods based on regular expression
Bai et al. Dynamic k-gram based software birthmark
CN108171057B (en) Android platform malicious software detection method based on feature matching
CN117785993A (en) Graph mode mining method and device
CN111898126A (en) Android repackaging application detection method based on dynamically acquired user interface
CN116471098A (en) Method, device and storage medium for reconstructing vulnerability exploitation process based on traceability graph

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant