CN104166680B - Parallel vulnerability mining method based on open source library and text mining - Google Patents
- Publication number: CN104166680B
- Application number: CN201410332588.0A
- Authority: CN (China)
- Prior art keywords: leak, parallel, vulnerability, information, text
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
Landscapes
- Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a parallel vulnerability mining method based on open source libraries and text mining, and belongs to the technical field of computer information security. The method comprises the following steps: vulnerability data are obtained from the open source libraries and pre-processed, vulnerability sets are extracted, the text is vectorized, thresholds are calculated, and parallel vulnerabilities are discovered. The method has the following advantages: based on the open source libraries, the vulnerability information related to the same attack pattern is extracted, which makes it easier to analyze potential parallel relationships between vulnerabilities; the text descriptions of the vulnerabilities are vectorized, so that a computer system can process vulnerability records intelligently; unlike queries based on keyword matching, the similarity between vulnerabilities is evaluated against thresholds obtained from a training set; and the parallel relationships between vulnerabilities can be calculated, so that when one vulnerability is found to be exploited, the other parallel vulnerabilities can be remedied quickly, which remedies the weaknesses of the whole network, strengthens its defenses, and is of great significance for information security.
Description
Technical field
The invention belongs to the technical field of computer information security. Specifically, the invention relates to a parallel vulnerability mining method based on open source libraries and text mining.
Background
With the continuing spread and rapid development of the Internet, people increasingly use and depend on it. At the same time, the economic losses caused by Internet information security problems have risen sharply, and the resulting harm has grown significantly. Vulnerabilities in software and operating systems give hackers the means to launch network attacks, steal user information, and even destroy industrial infrastructure. For this reason, countries around the world spare no effort in studying effective ways to guard against security risks in networks.
Since 1996, information security experts have worked to catalogue and summarize Internet attack patterns in order to understand network attacks at a higher level. The most widely recognized of these efforts is the CAPEC (Common Attack Pattern Enumeration and Classification) project, supported by the United States Department of Homeland Security (DHS). In CAPEC, each attack pattern is assigned a unique ID and comes with the following information: a description of the attack pattern, the attack steps, preconditions (the technical conditions that must be met before the attack, such as the platform and software version it depends on), postconditions (such as obtaining administrator privileges after the attack), attack examples, related CWE (Common Weakness Enumeration) weaknesses, and related CVE (Common Vulnerabilities and Exposures) vulnerabilities. CAPEC therefore not only sets out the details of each attack pattern but also indicates its links to CWE and CVE. CWE is a collection that describes weaknesses in software. For each weakness it provides a description, the applicable platforms, the resulting consequences, examples, and the related CVE vulnerabilities. One CWE weakness corresponds to multiple related CVE vulnerabilities.
The parallel vulnerabilities (Parallel Vulnerabilities) referred to in the invention are combinations of vulnerabilities that share the same basic attack conditions and can achieve the same attack purpose and effect, but are exploited along different attack paths. In short, parallel vulnerabilities take a multi-path view of the routes an attack may use, which improves both the probability of successfully defending against network attacks and the coverage of the defense.
Closely related to parallel vulnerabilities is the attack graph. An attack graph describes, from the attacker's or the defender's point of view, how vulnerabilities present in a system can be used to carry out or prevent an attack. System administrators can use attack graphs to evaluate the security of a system and to decide which remedial measures to take. Parallel vulnerabilities derive from the parallel branches between the same start point and end point in an attack graph. Once combinations of vulnerabilities with this parallel relationship have been found, the other parallel vulnerabilities can be remedied quickly when one vulnerability is found to be exploited, the corresponding vulnerabilities can be patched in time, and the weaknesses of the whole network can be remedied, which gives the approach high practical value for network defense.
Summary of the invention
The object of the invention is to use open source information libraries to extract the CWE weaknesses associated with the same attack pattern, to follow the associations from those CWE weaknesses to CVE vulnerabilities, and then to mine parallel vulnerabilities using text mining and natural language processing techniques, which gives the method high practical value for network defense.
The technical solution of the invention is as follows:
A parallel vulnerability mining method based on open source libraries and text mining, comprising the following steps:
1. Data acquisition and preprocessing:
A) Obtain the raw attack pattern information, weakness information, and vulnerability information from the open source libraries.
B) Clean the obtained open source library data, including unifying the data format, purifying the data, filling in missing attributes, and removing noisy data.
C) Build data sets from the cleaned data and form mappings from identifiers to the unstructured text information, giving three sets Pattern{p1, p2, …, pn}, Weak{w1, w2, …, wm}, and Vulnerability{v1, v2, …, vs}, where pi represents an attack pattern, wi represents a weakness, and vi represents a vulnerability.
2. Extraction of mapping relations and vulnerability sets:
Parse the mapping from attack pattern pi to weakness wj (one-to-many) and the mapping from weakness wj to vulnerability vk (one-to-many), then form the mapping from attack pattern pi to vulnerability vk, so that the huge vulnerability library is grouped into different vulnerability sets according to attack pattern.
3. Mathematical modelling of the text descriptions of the vulnerabilities:
A) Extract the text description of each vulnerability, tokenize it, extract word stems, and vectorize the description according to a feature dictionary. The description can be represented as Di = (w1,i, w2,i, w3,i, …, wn,i), where wn,i is the quantitative index of term n in document Di.
B) Calculate the TF/IDF (term frequency / inverse document frequency) value of each word in the document from the vectorized description, so that the document can be represented as di = (w1,i, w2,i, w3,i, …, wn,i), where wn,i is the TF/IDF weight of term n in document di.
4. Screen the vulnerabilities in the vulnerability sets extracted in step 2 according to the preconditions and postconditions of the attack pattern and the thresholds, obtaining the set of parallel vulnerability combinations
Parallel{(vi, vj) | i≠j, vi ∈ Vulnerability}
A) From a manually labelled training set of parallel vulnerabilities, calculate the threshold for the cosine similarity of parallel vulnerabilities (the calculation is described under step 4-a) and the threshold for the Euclidean distance between CVSS assessment scores.
B) Within the vulnerability sets obtained in step 2, and using the vulnerability vectors obtained in step 3, calculate the cosine similarity of each pair of vulnerabilities (the calculation is described under step 4-a) for the preconditions and postconditions of the corresponding attack pattern.
C) Based on the cosine similarity of the two vulnerabilities and the Euclidean distance between their CVSS assessment scores, judge two vulnerabilities whose values fall within the respective threshold ranges to be parallel vulnerabilities.
The tokenization method of step 3-a) splits the text stream on spaces to obtain a word sequence. Characters with special meaning are removed, the sequences // and ,/ are replaced with /, and digits at the end of words are removed.
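The tokenization step can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `tokenize` is hypothetical, the set of "special" characters is an assumption (everything except letters, digits, `/` and `-`), and only the `//` replacement is shown because the second replacement sequence is unclear in the text.

```python
import re

def tokenize(description: str) -> list[str]:
    """Split a vulnerability description into cleaned word tokens."""
    # Collapse a doubled slash into a single slash before splitting.
    text = description.replace("//", "/")
    tokens = []
    for raw in text.split():  # split the text stream on whitespace
        # Assumption: a "special" character is anything except letters,
        # digits, '/' and '-'.
        word = re.sub(r"[^A-Za-z0-9/\-]", "", raw)
        # Remove digits at the end of the word, e.g. "foo2" -> "foo".
        word = re.sub(r"\d+$", "", word)
        if word:  # tokens that become empty are dropped
            tokens.append(word)
    return tokens

print(tokenize("Overflow in foo2 (via a//b), again!"))
```

Stemming would follow this step; it is left out here because the Krovetz stemmer is not part of the Python standard library.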
The stem extraction method of step 3-b) processes the text with the Krovetz Stemmer.
The TF/IDF values of the text in step 3-b) are calculated as follows:
$$\mathrm{TF}(t,d) = 0.5 + \frac{0.5 \cdot f(t,d)}{\max\{f(w,d) : w \in d\}}$$
$$\mathrm{IDF}(t,D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$
where f(t, d) is the frequency of word t in document d; f(w, d) is the frequency of word w in document d, with w ranging over every word in document d; |D| is the total number of documents in corpus D; and |{d ∈ D : t ∈ d}| is the number of documents in corpus D that contain word t.
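The two formulas can be illustrated with a short Python sketch. Several points are assumptions rather than statements from the patent: IDF is read as the logarithm of the ratio of corpus size to the number of documents containing the term, the weight of a term is taken as the product TF × IDF, terms absent from a document get weight 0, and the names `tf`, `idf`, and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Augmented term frequency: 0.5 + 0.5 * f(t, d) / max_w f(w, d)."""
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus):
    """Inverse document frequency: log(|D| / number of documents containing t)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term unseen in the corpus carries no weight
    return math.log(len(corpus) / containing)

def tfidf_vector(doc_tokens, corpus, vocabulary):
    """One weight per vocabulary term; absent terms get 0 (assumption)."""
    present = set(doc_tokens)
    return [tf(t, doc_tokens) * idf(t, corpus) if t in present else 0.0
            for t in vocabulary]

# Toy corpus of already tokenized and stemmed descriptions.
corpus = [["sql", "injection"], ["sql", "overflow", "overflow"]]
print(tfidf_vector(corpus[1], corpus, ["sql", "injection", "overflow"]))
```

A term that occurs in every document, such as `sql` in the toy corpus, gets an IDF of zero and so does not contribute to the similarity between documents.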
The cosine similarity in step 4-a) is calculated as follows:
$$\mathrm{sim}(v_j, v_k) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$
where vj is the description text of vulnerability j and vk is the description text of vulnerability k; wi,j is the TF/IDF weight of term i in document dj, and wi,k is the TF/IDF weight of term i in document dk.
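Cosine similarity over two TF/IDF weight vectors can be written directly from its definition. The sketch below is illustrative only; the function name and the convention of returning 0.0 for an all-zero vector are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty description is similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Because TF/IDF weights are non-negative, the result lies between 0 (no shared terms) and 1 (identical direction), which is what makes a single similarity threshold meaningful.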
The main advantages of the invention are:
1. The text description field of existing vulnerability records is unstructured text, and because the descriptions come from different sources their format and wording differ, so a computer cannot interpret the description field automatically. The invention converts the text description field into numeric vectors according to a feature dictionary, making ambiguous descriptions explicit, which allows a computer system to process vulnerability records intelligently, automatically, and at scale.
2. Records in existing vulnerability databases are related only by a consistent, explicit classification, by severity level, and by order of discovery, and the security attack dependencies between vulnerability records cannot be expressed. The invention can mine the parallel association patterns and rules implicit in vulnerability data, providing a basis for further applications of that data.
3. Existing attack patterns cover only network attacks that have already occurred, and the corresponding vulnerability information is fixed, so they are hard to extend, have little predictive power, and make new attack patterns difficult to discover. The invention follows the mappings in the open source libraries from attack pattern to weakness and from weakness to vulnerability, and then uses text mining to find parallel vulnerabilities. For an existing attack pattern it can derive a variety of vulnerability combinations and can also anticipate certain attack patterns, providing a foundation for network security protection.
Brief description of the drawings
Fig. 1 is the flow chart of the system.
Fig. 2 is the flow chart of data acquisition and preprocessing (block 1001 in Fig. 1).
Fig. 3 is the flow chart of vulnerability set extraction (block 1005 in Fig. 1).
Fig. 4 is the flow chart of text vectorization (block 1009 in Fig. 1).
Fig. 5 is the flow chart of threshold acquisition (block 1013 in Fig. 1).
Fig. 6 is the flow chart of obtaining the parallel vulnerability results (block 1017 in Fig. 1).
Fig. 7 is the data structure diagram of the data format used in Fig. 2.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Fig. 1 is the flow chart of the invention and illustrates one embodiment of parallel vulnerability discovery. Discovery begins with the acquisition and preprocessing of vulnerability data. The second step extracts the vulnerability sets. The third step vectorizes the vulnerability information. The fourth step obtains the thresholds from the training set. The fifth step obtains the parallel vulnerability results.
Fig. 2 is the flow chart of data acquisition and preprocessing in Fig. 1 and illustrates one embodiment of obtaining and preprocessing vulnerability data. The flow begins by obtaining data from two public data sources (MITRE and CVSS). Next, the raw data are loaded into the system and converted to a uniform format, forming four data sets (CAPEC, CWE, CVE, and CVSS scores). Finally, the mappings from identifiers to data are formed.
Fig. 7 shows the data structure used in Fig. 2.
Fig. 3 is the flow chart of vulnerability set extraction in Fig. 1. First the mapping from CAPEC to CWE is formed (from Related_Weakness), then the mapping from CWE to CVE (from Observed_Example), and finally the mapping from CAPEC to CVE, so that the huge vulnerability set is split by attack pattern.
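Composing the two one-to-many mappings into a CAPEC-to-CVE mapping can be sketched as follows. The function name and the dictionary layout are assumptions, and the identifiers in the example are placeholders rather than real CAPEC, CWE, or CVE entries.

```python
from collections import defaultdict

def compose_mappings(capec_to_cwe, cwe_to_cve):
    """Group CVE identifiers by attack pattern via the shared CWE identifiers."""
    capec_to_cve = defaultdict(set)
    for capec_id, cwe_ids in capec_to_cwe.items():
        for cwe_id in cwe_ids:
            # A weakness with no recorded vulnerabilities contributes nothing.
            capec_to_cve[capec_id].update(cwe_to_cve.get(cwe_id, ()))
    return dict(capec_to_cve)

# Placeholder identifiers, for illustration only.
capec_to_cwe = {"CAPEC-x": ["CWE-a", "CWE-b"]}
cwe_to_cve = {"CWE-a": ["CVE-1", "CVE-2"], "CWE-b": ["CVE-2", "CVE-3"]}
print(compose_mappings(capec_to_cwe, cwe_to_cve))
```

Using a set per attack pattern removes the duplicates that arise when one vulnerability is reachable through several weaknesses of the same pattern.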
Fig. 4 is the flow chart of text vectorization in Fig. 1. The first step obtains the dictionary: the vulnerability library is traversed, the word stems of the vulnerability descriptions are extracted, and a keyword list is selected according to rules. The second step tokenizes the description of each vulnerability, extracts the word stems, and removes the words that are not keywords according to the dictionary, giving a keyword sequence. The third step vectorizes the keyword sequence of each vulnerability into a vector whose elements are the corresponding TF/IDF values.
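The first two steps of this flow (building the dictionary, then reducing each description to its keyword sequence) might look like the sketch below. The patent does not state the selection rules, so the rule used here, keeping stems that occur in at least `min_df` documents, is purely an assumption, as are the function names. The input is taken to be already tokenized and stemmed.

```python
from collections import Counter

def build_dictionary(stemmed_docs, min_df=2):
    """Select keywords: stems that appear in at least min_df documents."""
    document_frequency = Counter()
    for doc in stemmed_docs:
        document_frequency.update(set(doc))  # count each stem once per document
    return sorted(t for t, n in document_frequency.items() if n >= min_df)

def keyword_sequence(stemmed_doc, dictionary):
    """Drop every stem that is not in the keyword dictionary."""
    keep = set(dictionary)
    return [t for t in stemmed_doc if t in keep]

docs = [["buffer", "overflow", "remote"],
        ["buffer", "overflow", "local"],
        ["sql", "inject", "remote"]]
dictionary = build_dictionary(docs)
print(dictionary)
print(keyword_sequence(docs[2], dictionary))
```

The sorted dictionary fixes the order of the vector components, so every vulnerability is represented in the same coordinate system.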
Fig. 5 is the flow chart of threshold acquisition in Fig. 1. A training set is first chosen manually. The similarity is then calculated using cosine similarity, the Euclidean distance is calculated from the CVSS scores, and the average of each is taken as the corresponding threshold.
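The threshold step can be sketched as two averages over the labelled training pairs. The data layout is an assumption: each vulnerability is given as a TF/IDF vector plus a tuple of CVSS scores, and the name `learn_thresholds` is hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
from statistics import mean

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def learn_thresholds(training_pairs):
    """Return (similarity threshold, distance threshold) as training-set means.

    Each pair is ((vector_a, cvss_a), (vector_b, cvss_b)) for two
    vulnerabilities manually labelled as parallel.
    """
    sims = [cosine_similarity(va, vb) for (va, _), (vb, _) in training_pairs]
    dists = [math.dist(ca, cb) for (_, ca), (_, cb) in training_pairs]
    return mean(sims), mean(dists)

# Toy training pairs, not real vulnerability data.
pairs = [
    (([1.0, 0.0], (5.0, 5.0)), ([1.0, 0.0], (5.0, 5.0))),
    (([1.0, 0.0], (7.0, 3.0)), ([0.0, 1.0], (4.0, 7.0))),
]
print(learn_thresholds(pairs))
```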
Fig. 6 is the flow chart of obtaining the parallel vulnerability results in Fig. 1. The vulnerability sets obtained in Fig. 3 serve as the test set. Pairs of different vulnerabilities are selected in order. The distance between their assessment scores is calculated first; if it is below the threshold, processing continues and the cosine similarity is calculated; if that is above the threshold, the pair is a parallel vulnerability pair.
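The screening loop within one attack pattern's vulnerability set can be sketched as below. The data layout (identifier mapped to a TF/IDF vector and a tuple of CVSS scores), the function names, and the strict comparisons are assumptions; the identifiers in the example are placeholders.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def find_parallel_pairs(vulns, sim_threshold, dist_threshold):
    """vulns maps an identifier to (tfidf_vector, cvss_scores)."""
    parallel = []
    for (id_a, (vec_a, cvss_a)), (id_b, (vec_b, cvss_b)) in combinations(vulns.items(), 2):
        # Cheap check first: the CVSS scores must be close enough.
        if math.dist(cvss_a, cvss_b) >= dist_threshold:
            continue
        # Only then compare the description vectors.
        if cosine_similarity(vec_a, vec_b) > sim_threshold:
            parallel.append((id_a, id_b))
    return parallel

# Placeholder records, not real vulnerability data.
vulns = {
    "V1": ([1.0, 0.0], (5.0,)),
    "V2": ([0.9, 0.1], (5.0,)),
    "V3": ([0.0, 1.0], (5.0,)),
    "V4": ([1.0, 0.0], (9.0,)),
}
print(find_parallel_pairs(vulns, sim_threshold=0.5, dist_threshold=1.0))
```

Checking the score distance before the similarity keeps the more expensive vector comparison off pairs that would be rejected anyway.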
Following the approach and implementation steps of the parallel vulnerability mining method above, the prototype system for parallel vulnerability mining based on open source libraries and text mining found 2013 parallel vulnerability combinations. Manual verification based on practical experience showed that 192 of the combinations found by the system were not valid parallel vulnerabilities, giving an accuracy of about 90.5%.
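The accuracy figure follows directly from the two counts reported above, taking accuracy as the share of found combinations that passed manual verification:

```latex
\[
\text{accuracy} = \frac{2013 - 192}{2013} = \frac{1821}{2013} \approx 0.9046 \approx 90.5\%
\]
```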
For the performance test of the prototype system, the System.currentTimeMillis() method was used to obtain time differences and calculate the corresponding speed, giving the system response time when the data volume reaches tens of thousands of records and tens of thousands of extractions are needed. For example, for threshold calculation, processing the training set of 29 groups of data, drawn from CAPEC attack patterns that include two or more related vulnerabilities, took 40 ms; for parallel vulnerability mining, processing 62407 vulnerabilities in 400 attack patterns took 35 s. In light of practical engineering experience, the system has acceptable performance.
In addition, a sample result of the prototype system was analyzed. For the attack pattern CAPEC-157: Sniffing Attacks, its attack steps, associated weaknesses (CWE), preconditions, postconditions, related techniques, applicable technical environment, and other important information were analyzed. Following the method steps above, the cosine similarity of the vulnerabilities was calculated to be about 0.26 and the Euclidean distance between the CVSS assessment scores to be 0. Compared against the thresholds for cosine similarity and CVSS Euclidean distance obtained in training, both values fall within the threshold ranges, so the vulnerabilities CVE-2009-1466 and CVE-2008-1567 form a pair of parallel vulnerabilities.
The experimental analysis above shows that the method can mine the corresponding parallel vulnerabilities fairly accurately and with good system performance.
This work addresses the problem of association mining for information security vulnerabilities. Based on open source libraries such as CAPEC, CWE, and CVE, it introduces the idea of parallel vulnerability mining and proposes a parallel vulnerability mining method based on open source libraries and text mining. First, based on the open source information libraries, the CWE weaknesses associated with the same attack pattern are extracted, and the associations from those CWE weaknesses to CVE vulnerabilities are followed, which makes it easier to analyze potential parallel relationships between vulnerabilities. Then the text descriptions of the vulnerabilities are vectorized so that a software system can process vulnerability records intelligently. Unlike queries based on keyword matching, the similarity between vulnerabilities is evaluated against thresholds obtained from a training set, the parallel relationships between vulnerabilities are calculated, and parallel vulnerabilities are mined. Finally, experiments and practice confirm the correctness and effectiveness of the method. With the results obtained by the method, the other parallel vulnerabilities can be remedied quickly when one vulnerability is found to be exploited, which remedies the weaknesses of the whole network and gives the method high practical value for network defense.
Claims (1)
1. A parallel vulnerability mining method based on open source libraries and text mining, characterized in that the steps comprise:
1) data acquisition and preprocessing;
A) obtaining raw attack pattern information, weakness information, and vulnerability information from the open source libraries;
B) cleaning the obtained open source library data, including unifying the data format, purifying the data, filling in missing attributes, and removing noisy data;
C) building data sets from the cleaned data and forming mappings from identifiers to the unstructured text information, giving three sets Pattern{p1, p2, …, pn}, Weak{w1, w2, …, wm}, and Vulnerability{v1, v2, …, vs}, where pi represents an attack pattern, wi represents a weakness, and vi represents a vulnerability;
2) extracting mapping relations and vulnerability sets;
parsing the mapping from attack pattern pi to weakness wj and the mapping from weakness wj to vulnerability vk, then forming the mapping from attack pattern pi to vulnerability vk, so that the huge vulnerability library is grouped into different vulnerability sets according to attack pattern;
3) mathematically modelling the text descriptions of the vulnerabilities;
D) extracting the text description of each vulnerability, tokenizing it, extracting word stems, and vectorizing the text description according to a feature dictionary;
E) calculating the tf/idf value, that is, the term frequency / inverse document frequency value, of each word in the document from the vectorized description;
4) screening the vulnerabilities in the vulnerability sets extracted in step 2) according to the preconditions and postconditions of the attack pattern and the thresholds, obtaining the set of parallel vulnerability combinations;
F) from a manually labelled training set of parallel vulnerabilities, calculating the threshold for the cosine similarity of parallel vulnerabilities and the threshold for the Euclidean distance between CVSS assessment scores;
H) within the vulnerability sets obtained in step 2), and using the vulnerability vectors obtained in step 3), calculating the cosine similarity of each pair of vulnerabilities for the preconditions and postconditions of the corresponding attack pattern;
G) based on the cosine similarity of the two vulnerabilities and the Euclidean distance between their CVSS assessment scores, judging two vulnerabilities whose values fall within the respective threshold ranges to be parallel vulnerabilities.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410332588.0A CN104166680B (en) | 2014-07-12 | 2014-07-12 | Parallel vulnerability mining method based on open source library and text mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410332588.0A CN104166680B (en) | 2014-07-12 | 2014-07-12 | Parallel vulnerability mining method based on open source library and text mining |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104166680A (en) | 2014-11-26 |
CN104166680B (en) | 2017-05-17 |
Family
ID=51910493
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201410332588.0A | Active | CN104166680B (en) | 2014-07-12 | 2014-07-12 | Parallel vulnerability mining method based on open source library and text mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104166680B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615542B (en) * | 2015-02-11 | 2017-12-01 | 中国科学院软件研究所 | A kind of method of the fragility association analysis auxiliary bug excavation based on function call |
CN107871078A (en) * | 2016-09-27 | 2018-04-03 | 北京计算机技术及应用研究所 | The method that vulnerability information is extracted in non-structured text |
CN109347801B (en) * | 2018-09-17 | 2021-03-16 | 武汉大学 | Vulnerability exploitation risk assessment method based on multi-source word embedding and knowledge graph |
CN111914259B (en) * | 2019-05-09 | 2024-07-02 | 阿里巴巴集团控股有限公司 | Data processing method and computing device |
CN110378126B (en) * | 2019-07-26 | 2021-03-26 | 北京中科微澜科技有限公司 | Vulnerability detection method and system |
CN110855676B (en) * | 2019-11-15 | 2021-08-31 | 腾讯科技(深圳)有限公司 | Network attack processing method and device and storage medium |
CN111258968B (en) * | 2019-12-30 | 2020-09-11 | 广州博士信息技术研究院有限公司 | Enterprise redundant data cleaning method and device and big data platform |
CN112528294A (en) * | 2020-12-21 | 2021-03-19 | 网神信息技术(北京)股份有限公司 | Vulnerability matching method and device, computer equipment and readable storage medium |
CN112765031B (en) * | 2021-01-25 | 2022-08-19 | 北京航空航天大学 | Decomposition method of crowd-sourcing vulnerability mining task |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102446254A (en) * | 2011-12-30 | 2012-05-09 | 中国信息安全测评中心 | Similar loophole inquiry method based on text mining |
CN103268279A (en) * | 2013-02-27 | 2013-08-28 | 中国信息安全测评中心 | Compound poisson process-based software reliability prediction method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8789192B2 (en) * | 2011-05-23 | 2014-07-22 | Lockheed Martin Corporation | Enterprise vulnerability management |
Non-Patent Citations (3)
Title |
---|
Ranking Attacks Based on Vulnerability Analysis;Ju An Wang et al;《Proceedings of the 43rd Hawaii International Conference on System Sciences》;20101231;第1-10页 * |
基于关联性的漏洞评估方法;张凤荔等;《计算机应用研究》;20140331;第31卷(第3期);第811-814页 * |
漏洞威胁的关联评估方法;谢丽霞等;《计算机应用》;20120301;第32卷(第3期);第679-682页 * |
Also Published As
Publication number | Publication date |
---|---|
CN104166680A (en) | 2014-11-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104166680B (en) | Parallel vulnerability mining method based on open source library and text mining | |
CN110233849B (en) | Method and system for analyzing network security situation | |
CN103559235B (en) | A kind of online social networks malicious web pages detection recognition methods | |
CN105072089B (en) | A kind of WEB malice scanning behavior method for detecting abnormality and system | |
CN109347801A (en) | A kind of vulnerability exploit methods of risk assessment based on multi-source word insertion and knowledge mapping | |
CN103780614B (en) | A kind of SQL injection loophole method for digging based on simulated strike extension | |
Perry et al. | No-doubt: Attack attribution based on threat intelligence reports | |
CN110765459A (en) | Malicious script detection method and device and storage medium | |
CN105072214A (en) | C&C domain name identification method based on domain name feature | |
CN107437026B (en) | Malicious webpage advertisement detection method based on advertisement network topology | |
CN105718795B (en) | Malicious code evidence collecting method and system under Linux based on condition code | |
CN112492059A (en) | DGA domain name detection model training method, DGA domain name detection device and storage medium | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
CN105138916A (en) | Multi-track malicious program feature detecting method based on data mining | |
CN108985061A (en) | A kind of webshell detection method based on Model Fusion | |
CN116405246A (en) | Vulnerability exploitation chain construction technology based on attack and defense combination | |
CN112784279B (en) | Software product safety risk assessment method based on dependency library version information | |
CN112817877B (en) | Abnormal script detection method and device, computer equipment and storage medium | |
CN105045715A (en) | Programming mode and mode matching based bug clustering method | |
CN103455754B (en) | A kind of malicious searches keyword recognition methods based on regular expression | |
Bai et al. | Dynamic k-gram based software birthmark | |
CN108171057B (en) | Android platform malicious software detection method based on feature matching | |
CN117785993A (en) | Graph mode mining method and device | |
CN111898126A (en) | Android repackaging application detection method based on dynamically acquired user interface | |
CN116471098A (en) | Method, device and storage medium for reconstructing vulnerability exploitation process based on traceability graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |