CN101853277A

CN101853277A - Vulnerability data mining method based on classification and association analysis

Info

Publication number: CN101853277A
Application number: CN201010173796A
Authority: CN
Inventors: 毕硕本; 朱斌; 乔文文; 梁静涛; 王启富
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2010-05-14
Filing date: 2010-05-14
Publication date: 2010-10-06

Abstract

The invention relates to a vulnerability data mining method based on classification and association analysis. The method automatically converts the latest vulnerability information, published in HTML format in web announcements, into regular vulnerability records stored in a database; establishes a vulnerability information management system; and performs transaction operations on the vulnerability records through a database (DB) interface. Vulnerability feature vectors are extracted from the records in the vulnerability database. Using the extracted multi-dimensional feature vectors, the vulnerability document models are classified automatically with the K-nearest neighbor (KNN) text classification algorithm, and data mining and knowledge discovery are performed on each class of vulnerability information. The classification results of the classification model are evaluated or explained by several indicators and presented to the user visually. The Apriori association rule mining algorithm extracts keywords from each vulnerability class to form frequent itemsets, from which association rules among the keywords are generated; finally, the implicit associations among the vulnerability data are found.

Description

A vulnerability data mining method based on classification and association analysis

Technical field

The invention belongs to the field of vulnerability data mining. To discover the hidden rules underlying vulnerabilities, and thereby classify and predict factors such as where new vulnerabilities arise, their type, and how long their harm lasts, the invention proposes using data mining technology to process and analyze vulnerability information. Web crawler technology is combined with a DB interface to build a database system of vulnerability details; the KNN classification method is then used to process the vulnerability information and build a classification model; and the Apriori association mining method is used to discover inherent regularities such as the cause, time, and harm of vulnerabilities. The discovered rules are stored in a vulnerability knowledge base, which helps with early warning of and protection against vulnerability harm.

Background technology

At present, the number of system vulnerabilities keeps growing, and certain regularities are implicit in them. Data mining and knowledge discovery can be applied to many kinds of information: the vulnerability information base, information related to vulnerability disclosure (disclosure source, reposts, bulletins, exchanges, discussions), and vulnerability exploitation information (viruses and Trojans that spread by exploiting vulnerabilities, the development of exploits, and the damage vulnerabilities cause). Processing this information yields the rules implicit in existing vulnerability information, so that known vulnerabilities can be counted and classified by attributes such as effective time, region, and hazard level, and so that the time and location of new vulnerabilities can be predicted or the harm type of newly appearing vulnerabilities classified accurately.

In addition, vulnerabilities are related to one another by implication, and discovering these relationships in a timely and effective way is the focus and difficulty of future vulnerability data mining. For example, if 4 of the 10 programs written by a programmer at a company are later found to contain security vulnerabilities, this information suggests that, in the short term, a newly finished program by that programmer has a 40% chance of containing a vulnerability. If 3 of those 4 vulnerable programs contain buffer overflow vulnerabilities, vulnerability testers can be directed to concentrate on buffer overflow detection for that company's products. Mining such information to find the hidden rules by which vulnerabilities arise makes it possible to predict factors such as where new vulnerabilities appear, their type, and the duration of their harm, which is of great significance for early warning and prevention.
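The arithmetic behind this example can be written out as simple relative frequencies. The notation below (V for "the program contains a vulnerability", O for "the vulnerability is a buffer overflow") is introduced here for illustration only; the 75% figure follows directly from the counts given in the example.

```latex
P(V) \approx \frac{4}{10} = 40\%, \qquad
P(O \mid V) \approx \frac{3}{4} = 75\%
```

These two ratios have the same form as the support and confidence measures used later in the association analysis.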

Knowledge discovery is therefore the process of using data mining algorithms to extract, from large amounts of data in a database, rules or patterns that are implicit, novel, valid, and understandable to people. These rules or patterns are what is usually called knowledge. They describe characteristics of the data or relationships among the data, and constitute deeper information, obtained by processing the data, that can support decision making.

The Web holds massive amounts of vulnerability exploitation information, such as vulnerability descriptions, viruses built to exploit vulnerabilities, and the region and time of vulnerability harm. How to process and analyze such data is the research focus of vulnerability data mining. The present system therefore uses an automatic collection mechanism to gather data from the Web and generate a vulnerability information base, and applies text mining methods to mine that base. Finally, the classification results of the classification model are evaluated or explained by several indicators and presented to the user visually, so that the user can browse the vulnerability classification results clearly.

Summary of the invention

The object of the invention is to provide a vulnerability data mining system based on classification and association analysis that addresses the large amount of disorganized vulnerability information on the Web. The KNN classification method and the Apriori association mining method are used to discover inherent regularities such as the cause, time, and harm of vulnerabilities, and the discovered rules are stored in a vulnerability knowledge base. The invention provides good classification and association analysis capability for vulnerability data mining.

To achieve the above object, the invention adopts the following technical solution:

The vulnerability data mining method based on classification and association analysis of the invention is as follows:

1. A vulnerability information collection system automatically collects and processes the vulnerability information published by security knowledge websites. This is the web crawler mining method: the massive, scattered information on the Internet is downloaded locally for data processing, and an original vulnerability information database is established;

2. A vulnerability data management system manages the existing original vulnerability database through the DB interface, including vulnerability query, modification, deletion, import, and update, and uses vulnerability crawler technology to monitor in real time whether new vulnerabilities have been published and to update the vulnerability information immediately;

3. A vulnerability data mining system builds, from the vulnerability information recorded in the vulnerability information database, a structured training document set composed of vulnerability document models; extracts the vulnerability feature vectors; classifies the vulnerability document models automatically with the KNN classification algorithm to obtain multiple groups of vulnerability classification models; and performs data mining and knowledge discovery on each class. The classification results are evaluated or explained by several indicators and presented to the user visually. The Apriori association rule mining algorithm extracts keywords from each vulnerability class to form frequent itemsets, from which the association rules among keywords are generated, so that each class of vulnerability document model corresponds to one group of association rules. Finally, each data item of the vulnerability records is analyzed to find the associations implicit in the vulnerability data, and these associations are stored in the vulnerability knowledge base.

Preferably, information collection by the vulnerability information collection system comprises the following steps:

A. Predefine the collection rules for vulnerability web pages, including the start link address of the parent vulnerability list page, the navigation keyword, the range and increment of the pages to collect, and the collection tag that identifies the addresses of vulnerability detail sub-pages within the list page;

B. Predefine the collection field rules related to vulnerability information, including key information such as field name, type, front identifier, and back identifier;

C. Create a socket object for network communication, connect to the target server, send an HTTP download request to the server, and receive the web page content as a data stream;

D. Using the tag keyword of the vulnerability detail sub-pages as an index, quickly locate the link addresses of the vulnerability details, and add the addresses of all sub-pages on the page to the waiting queue;

E. Using multithreading, extract the data from the vulnerability detail sub-pages, filter the information with keywords from a user-defined keyword library, and locate the main body of the vulnerability information by the front and back identifiers;

F. After the field information required by the user has been collected, import each field into the vulnerability information database.
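The field rules of step B and the identifier-based extraction of step E can be illustrated with a minimal sketch. The class name InfoRule follows the data structure named later in the description, but its attribute names, the helper extract_fields, and the sample page and rules are assumptions made for illustration; the patent does not specify them.

```python
from dataclasses import dataclass


@dataclass
class InfoRule:
    """One collection-field rule: field name, type, and the markers around its value."""
    name: str
    type: str
    front_id: str  # text immediately before the field value (assumed attribute name)
    back_id: str   # text immediately after the field value (assumed attribute name)


def extract_fields(page, rules):
    """Return the text found between each rule's front and back identifiers."""
    record = {}
    for rule in rules:
        start = page.find(rule.front_id)
        if start == -1:
            continue  # field absent on this page
        start += len(rule.front_id)
        end = page.find(rule.back_id, start)
        if end == -1:
            continue  # closing identifier missing, skip the field
        record[rule.name] = page[start:end].strip()
    return record


# Hypothetical page fragment and rules; the CVE value is a placeholder, not a real entry.
sample_page = "<h1>Example overflow</h1><span id='cve'>CVE-0000-0000</span>"
sample_rules = [
    InfoRule("title", "text", "<h1>", "</h1>"),
    InfoRule("cve_number", "text", "<span id='cve'>", "</span>"),
]
sample_record = extract_fields(sample_page, sample_rules)
```

Keeping the identifiers in data rather than in code means a new source site needs only a new rule set, which appears to be the motivation for predefining the rules.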

Preferably, data management by the vulnerability data management system comprises the following steps:

1) Through DB interface middleware, perform transaction operations for fuzzy query and retrieval on the vulnerability information database;

2) Perform transaction operations that modify records in the vulnerability information database, and write the changes back to the database;

3) Perform transaction operations that delete records from the vulnerability information database, and write the changes back to the database;

4) Using the predefined collection rules for vulnerability web pages, update the vulnerability information in a timely manner and store the latest vulnerability records in the local vulnerability information database. The collection rules include the start link address of the parent vulnerability list page, the navigation keyword, the range and increment of the pages to collect, and the collection tag that identifies the addresses of vulnerability detail sub-pages within the list page.
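The query, modify, and delete operations above can be sketched with Python's built-in sqlite3 module standing in for the DB interface middleware. The table name, column names, and helper functions are hypothetical; the patent does not name a database product or schema.

```python
import sqlite3

# In-memory database so the sketch is self-contained; schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vulnerability ("
    "id INTEGER PRIMARY KEY, title TEXT, vuln_type TEXT, risk_level TEXT)"
)


def add_record(title, vuln_type, risk_level):
    with conn:  # the connection context manager commits, or rolls back on error
        cur = conn.execute(
            "INSERT INTO vulnerability (title, vuln_type, risk_level) VALUES (?, ?, ?)",
            (title, vuln_type, risk_level),
        )
    return cur.lastrowid


def fuzzy_query(keyword):
    """Fuzzy retrieval: match the keyword anywhere in the title."""
    cur = conn.execute(
        "SELECT id, title FROM vulnerability WHERE title LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return cur.fetchall()


def update_risk(record_id, risk_level):
    with conn:
        conn.execute(
            "UPDATE vulnerability SET risk_level = ? WHERE id = ?",
            (risk_level, record_id),
        )


def delete_record(record_id):
    with conn:
        conn.execute("DELETE FROM vulnerability WHERE id = ?", (record_id,))


first = add_record("Example buffer overflow", "Denial of service attack", "high")
second = add_record("Example script injection", "Embedded malicious code", "medium")
matches = fuzzy_query("overflow")
update_risk(first, "medium")
delete_record(second)
remaining = conn.execute("SELECT title, risk_level FROM vulnerability").fetchall()
```

Parameterized statements are used throughout so that text scraped from web pages is never interpolated into SQL.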

Preferably, data mining by the vulnerability data mining system comprises the following steps:

a. Build a structured training document set from the records in the vulnerability information database, and extract vulnerability feature vectors using evaluation functions and statistical methods;

b. Using the extracted multi-dimensional feature vectors, classify the vulnerability document models automatically with the KNN text classification algorithm, perform data mining and knowledge discovery on each class of vulnerability information, and store the results in the vulnerability knowledge base;

c. Evaluate or explain the classification results of the classification model by several indicators and present them to the user visually, so that the user can browse the vulnerability classification results clearly;

d. Apply the association rule mining algorithm to each vulnerability class to form an association rule base, and store it in the vulnerability knowledge base;

e. When a new individual piece of vulnerability information appears, classify it according to the existing vulnerability classification model and output the result.

Preferably, step c specifically comprises:

c1. Compare the classification with the result information by inspecting the classification results of the vulnerability information in each category, including the record name, the current class, and the class to which the record should belong;

c2. Inspect the performance of the vulnerability classification model and present it to the user graphically, specifically the precision and recall of each category and a combined indicator of the two.
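The per-category indicators named in step c2 can be computed as in the following minimal sketch. The patent does not define the combined indicator; the F1 score (harmonic mean of precision and recall) is used here as an assumption, and the function name and label lists are illustrative.

```python
def precision_recall(predicted, actual, category):
    """Per-category precision, recall and F1 from two parallel label lists."""
    assigned = sum(p == category for p in predicted)   # texts the classifier put in the category
    belongs = sum(a == category for a in actual)       # texts the manual result puts in the category
    correct = sum(p == a == category for p, a in zip(predicted, actual))
    precision = correct / assigned if assigned else 0.0
    recall = correct / belongs if belongs else 0.0
    # F1 is an assumed choice for the "combined indicator".
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative labels only: classifier output versus manual classification.
predicted = ["dos", "dos", "code", "dos", "code"]
actual = ["dos", "code", "code", "dos", "dos"]
p, r, f1 = precision_recall(predicted, actual, "dos")
```

In this example the classifier assigns three texts to "dos", two of them correctly, and three texts truly belong to "dos", so precision and recall are both 2/3.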

Preferably, step d specifically comprises:

d1. Use the Apriori association rule mining algorithm to extract keywords from each vulnerability class and form frequent itemsets, then generate the association rules among the keywords, so that each class of documents corresponds to one group of association rules;

d2. Analyze each data item of the vulnerability records to find the associations implicit in the vulnerability data;

d3. Store the association analysis results in the vulnerability knowledge base.

On the one hand, the vulnerability data mining method based on classification and association analysis uses the KNN text classification algorithm to extract from vulnerability information the patterns that are implicit, novel, valid, and understandable to people, that is, what is usually called knowledge. On the other hand, the Apriori association mining algorithm yields characteristics of the data and rules or relationships among the data, which constitute deeper information, obtained by processing the vulnerability information, that can support decision making. Compared with simple vulnerability information retrieval and data mining systems, the present system mines vulnerability knowledge more effectively and presents the mining performance to the user visually, which helps with early warning of and protection against vulnerability harm.

Description of drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a flowchart of the module that automatically collects vulnerabilities from the Web;

Fig. 3 is a functional diagram of the vulnerability information management module;

Fig. 4 is a flowchart of constructing the vulnerability document model;

Fig. 5 is a flowchart of the KNN classification mining method;

Fig. 6 is a flowchart of the Apriori association mining algorithm;

Fig. 7 is a schematic diagram of the system of the invention.

Embodiment

The technical solution of the invention is described in detail below with reference to the accompanying drawings:

2. A vulnerability data management system manages the existing original vulnerability database through the DB interface, including vulnerability query, modification, deletion, and import, and uses vulnerability crawler technology to monitor in real time whether new vulnerabilities have been published and to update the vulnerability information immediately;

3. A vulnerability data mining system builds, from the vulnerability information recorded in the vulnerability information database, a structured training document set composed of vulnerability document models; extracts the vulnerability feature vectors; classifies the vulnerability document models automatically with the KNN classification algorithm to obtain multiple groups of vulnerability classification models; and performs data mining and knowledge discovery on each class. The classification results are evaluated or explained by several indicators and presented to the user visually. The Apriori association rule mining algorithm extracts keywords from each vulnerability class to form frequent itemsets, from which the association rules among keywords are generated, so that each class of vulnerability document model corresponds to one group of association rules. Finally, each data item of the vulnerability records is analyzed to find the associations implicit in the vulnerability data, and these associations are stored in the vulnerability knowledge base (as shown in Fig. 7).

The invention first automatically converts the latest vulnerability information, published in HTML format in online announcements, into regular vulnerability records in a database and establishes the vulnerability information management system. A structured training document set is then built, and the K-nearest neighbor (KNN) text classification algorithm classifies the vulnerability document models automatically, with data mining and knowledge discovery performed on each class of vulnerability information. Next, the Apriori association rule mining algorithm extracts keywords from each vulnerability class to form frequent itemsets, from which the association rules among keywords are generated (each class of documents corresponds to one group of association rules). Finally, the classification model and the association rules are evaluated, and the evaluation results are stored in the knowledge base.

As shown in Fig. 1, the vulnerability data mining system based on classification and association analysis comprises the following steps:

Step 10: The vulnerability information published by security knowledge websites (NSFOCUS, www.nsfocus.net, is used as the example) is collected and processed automatically. This is what is usually called the web crawler mining method: the massive, scattered information on the Internet is downloaded locally for data processing, and an original vulnerability information database is established, so that the user can view comprehensive and timely vulnerability information.

As shown in Fig. 2, the automatic vulnerability collection program proceeds as follows:

Step 101: Predefine the collection rule object TaskRule for vulnerability web pages. It includes the start link address (StartLink) of the parent vulnerability list page, the navigation keyword (NavigationTag), the range (Range) and increment (Rise) of the pages to collect, where the type tag denotes the type to which a vulnerability belongs, and the collection tag (CollectionTag) for the addresses of the vulnerability detail sub-pages within the list page. Use these to initialize the navigation link URL queue;

Step 101a: Determine whether the collection addresses URL(j) have been extracted from each address URL(i) in the navigation link URL queue; once the initial queue has been fully dequeued, go to step 106;

Step 102: Dequeue the head of the navigation URL queue; using multithreading, create one thread to process each address;

Step 103: Create a socket object for network communication, connect to the target server, send an HTTP download request to the server, and receive the content of the page pointed to by URL(i) as a data stream;

Step 104: Using the tag keyword of the vulnerability detail sub-pages as an index, quickly locate the link addresses URL(j) of the vulnerability details;

Step 105: Add the addresses URL(j) of all sub-pages on the page to the waiting queue, and return to step 101a;

Step 106: Predefine the collection field rules related to vulnerability information, including key information such as field name (name), type (type), front identifier (frontID), and back identifier (backID), as listed in Table 1, and save the defined rules in the data structure InfoRule;

Step 106a: Determine whether the vulnerability information has been extracted from each address URL(j) in the navigation link URL queue; once the initial queue has been fully dequeued, the procedure ends;

Step 107: Dequeue the head of the navigation URL queue and process it using multithreading;

Step 108: Create a socket object for network communication, connect to the target server, send an HTTP download request to the server, and receive the content of the page pointed to by URL(j) as a data stream;

Step 109: Extract the data from the vulnerability detail sub-page, filter the information with keywords from the user-defined keyword library, and locate the main body of the vulnerability information by the front and back identifiers;

Step 20: After the vulnerability field information required by the user has been collected, import each field into the vulnerability information database;
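The two-stage queue traversal of steps 101a to 109 can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: the function crawl, the regular expression used to find links, and the injected fetch callable are all hypothetical, and a stub dictionary replaces the socket-based HTTP download and multithreading that the description calls for so that the sketch runs offline.

```python
from collections import deque
import re


def crawl(start_links, fetch, collection_tag):
    """Visit list pages first, then the detail pages they link to."""
    nav_queue = deque(start_links)   # navigation link URL queue, URL(i)
    detail_queue = deque()           # waiting queue of detail-page addresses, URL(j)
    seen = set()
    while nav_queue:
        html = fetch(nav_queue.popleft())
        # Assumption: a link is a detail page if its href contains the collection tag.
        for href in re.findall(r'href="([^"]+)"', html):
            if collection_tag in href and href not in seen:
                seen.add(href)
                detail_queue.append(href)
    pages = {}
    while detail_queue:
        url = detail_queue.popleft()
        pages[url] = fetch(url)      # field extraction would happen here
    return pages


# Stub "web" with placeholder addresses, standing in for real HTTP downloads.
fake_web = {
    "list/1": '<a href="vuln/101">a</a> <a href="about">x</a>',
    "list/2": '<a href="vuln/102">b</a> <a href="vuln/101">a</a>',
    "vuln/101": "detail 101",
    "vuln/102": "detail 102",
}
pages = crawl(["list/1", "list/2"], fake_web.__getitem__, "vuln/")
```

Passing the download function as a parameter keeps the queue logic separate from networking, so a threaded or socket-based downloader could be substituted without changing the traversal.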

Table 1 Vulnerability information fields

Field name	Description
Sequence number	Unique number of the vulnerability within its vulnerability type.
Title	Chinese name of the vulnerability.
Vulnerability type	Name of the type to which the vulnerability belongs.
Release date and update date	Development stage of the vulnerability.
BUGTRAQ number	Entry for the vulnerability in the SecurityFocus Vulnerabilities database.
CVE number	Internationally unified number of the vulnerability; its unique number in the CVE vulnerability database.
Risk level	Danger coefficient of the vulnerability.
Vulnerability description	Detailed description of the vulnerability, compiled with reference to various vulnerability bulletins.
Test method	A test method provided for reference so that the user can understand the vulnerability further.
Suggestion	Suggestions, for reference, on preventing and avoiding the vulnerability.
Affected systems	Operating system versions affected by the vulnerability.
Affected software	Names and versions of the software affected by the vulnerability.
Patch download	Download address of the patch, if the software vendor provides one.

As shown in Fig. 3, the vulnerability information management program comprises the following parts: vulnerability query, modification, deletion, and so on, together with the use of vulnerability crawler technology to monitor in real time whether new vulnerabilities have been published and to update the vulnerability information immediately.

Step 30: Convert the records in the vulnerability information database into corresponding text data, with each text file named after the title of the corresponding vulnerability. This example has 6 classes of vulnerability text data, and the corresponding file directories are created according to Table 2.

Table 2 Vulnerability types

Class number	Vulnerability type	Description
1	Remote system entry	The attacker directly obtains administrator privileges on the remote system.
2	Denial of service attack	The attacker directly carries out a denial of service attack.
3	WEB data interface	The attacker exploits a server vulnerability to obtain ordinary user access to the system.
4	Embedded malicious code	The attacker can exploit the vulnerability to embed dangerous code in the system.
5	Local unauthorized access	Reading files in the system for which the reader has no permission.
6	Other types	Other vulnerabilities whose type is relatively ambiguous.

Step 40: Preprocessing mainly comprises two parts. First, stop words that carry little meaning, for example "though", "the", and "as", are removed from the documents according to a stop-word set. Second, word segmentation is performed with a feature lexicon (comprising a general set and a specialist set); a word not found in the lexicon is treated as a single word in its entirety and is recorded for later manual segmentation.
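A minimal sketch of this preprocessing is given below. It assumes whitespace-delimited English text for simplicity, whereas the original system segments Chinese text with a lexicon; the lexicon contents, the tokenizing regular expression, and the function name preprocess are assumptions for illustration.

```python
import re

STOP_WORDS = {"though", "the", "as"}                     # stop words cited in the text
LEXICON = {"buffer", "overflow", "remote", "attacker"}   # hypothetical feature lexicon


def preprocess(text):
    """Return (tokens, unknown) after stop-word removal and lexicon lookup."""
    tokens, unknown = [], []
    for word in re.findall(r"[a-z0-9_]+", text.lower()):
        if word in STOP_WORDS:
            continue                 # drop stop words
        tokens.append(word)          # out-of-lexicon words are kept whole as one token
        if word not in LEXICON:
            unknown.append(word)     # recorded for later manual segmentation
    return tokens, unknown


tokens, unknown = preprocess("The attacker triggers a buffer overflow as root")
```

The unknown list gives a reviewer the words that may need to be added to the lexicon, which matches the manual follow-up the step describes.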

Step 50: Before knowledge is mined with the KNN classification algorithm, the vulnerability document model must be built; the vulnerability document information is represented with the vector space model.

As shown in Fig. 4, the vulnerability document model is constructed as follows:

Step 501: Map words to the same concept according to a concept set; for example, two synonymous words for "computer" are both mapped to the single concept "computer". For an unregistered word, the word with the highest co-occurrence rate with it is selected as its concept;

Step 502: Extract a general feature set of higher-frequency terms according to how often feature words occur in a given document;

Step 503: Use five feature evaluation functions (information gain, expected cross entropy, weight of evidence for text, odds ratio, and word frequency) to reduce the general feature set, and store the reduced result in the document vector library;

Step 504: Use statistics-based automatic generation, whose basic idea is to pick out the sentences in the text most closely related to the class. Such sentences are usually located in special positions or contain more feature terms, and a sentence weighting function is generally used as the evaluation criterion;

Step 505: Construct the vulnerability document model from the extracted information;
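The vector space representation can be sketched as follows. Only the word-frequency criterion from step 503 is used for feature selection here, and raw term counts serve as the weights; both are simplifying assumptions, as are the function names and sample documents. The other four evaluation functions are not shown.

```python
from collections import Counter


def select_features(docs, top_n):
    """Pick the top_n most frequent terms across all documents (word-frequency method)."""
    totals = Counter()
    for doc_tokens in docs:
        totals.update(doc_tokens)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def to_vector(doc_tokens, features):
    """Represent one document as term counts over the selected feature terms."""
    counts = Counter(doc_tokens)
    return [counts[term] for term in features]


# Illustrative tokenized documents.
docs = [
    ["buffer", "overflow", "remote", "overflow"],
    ["denial", "service", "remote"],
]
features = select_features(docs, top_n=3)
vectors = [to_vector(d, features) for d in docs]
```

Each document becomes a fixed-length vector over the same feature terms, which is the form the similarity formula of the KNN step operates on.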

Step 60: Classify the vulnerability document models automatically with the KNN classification mining method, and perform data mining and knowledge discovery on each class of vulnerability information;

As shown in Fig. 5, the KNN classification mining algorithm proceeds as follows:

Step 601: Re-describe the training vulnerability text vectors according to the feature term set;

Step 602: When a new vulnerability arrives, segment the new text according to the feature words and determine its vector representation;

Step 603: Select from the training vulnerability text set the K vulnerability texts most similar to the new vulnerability; the similarity is computed as

Sim(d_{i}, d_{j}) = \frac{\sum_{k=1}^{m} W_{ik} \times W_{jk}}{\sqrt{\left(\sum_{k=1}^{m} W_{ik}^{2}\right) \left(\sum_{k=1}^{m} W_{jk}^{2}\right)}}

Here an initial value of K is fixed first and then adjusted according to experimental results; the initial value is generally set between a few hundred and a few thousand.

Step 604: Compute in turn the weight of each class among the K neighbours of the new vulnerability, using the formula

p(\vec{x}, C_{j}) = \sum_{\vec{d}_{i} \in KNN} Sim(\vec{x}, \vec{d}_{i}) \, y(\vec{d}_{i}, C_{j})

where \vec{x} is the feature vector of the new vulnerability; Sim(\vec{x}, \vec{d}_{i}) is the similarity, computed with the same formula as in the previous step; and y(\vec{d}_{i}, C_{j}) is the category attribute function: if \vec{d}_{i} belongs to class C_{j} its value is 1, otherwise it is 0.

Step 605: Compare the class weights and assign the vulnerability document to the class with the largest weight.
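Steps 603 to 605 can be sketched as follows: cosine similarity selects the K nearest training vectors, each neighbour adds its similarity to the weight of its own class, and the class with the largest weight wins. The function names, the toy vectors, the class labels, and the small value of K are illustrative assumptions.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Sim(d_i, d_j): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(x, training, k):
    """training is a list of (vector, label); returns (best label, class weights)."""
    neighbours = sorted(training, key=lambda item: cosine(x, item[0]), reverse=True)[:k]
    weights = defaultdict(float)
    for vec, label in neighbours:
        # y(d_i, C_j) is 1 only for the neighbour's own class, so each
        # neighbour contributes its similarity to exactly one class weight.
        weights[label] += cosine(x, vec)
    return max(weights, key=weights.get), dict(weights)


# Toy training set; labels are placeholders for the classes of Table 2.
training = [
    ([2, 1, 0], "denial_of_service"),
    ([3, 0, 0], "denial_of_service"),
    ([0, 1, 3], "malicious_code"),
    ([0, 0, 2], "malicious_code"),
]
label, weights = knn_classify([1, 0, 0], training, k=3)
```

Weighting by similarity rather than counting votes means that distant neighbours, which a large K inevitably includes, have little influence on the result.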

Step 70: Use the Apriori association rule mining algorithm to extract keywords from each vulnerability class and form frequent itemsets, then generate the association rules among the keywords;

As shown in Fig. 6, the Apriori association rule mining algorithm proceeds as follows:

Step 701: Perform initialization, which mainly comprises traversing the vulnerability transaction database D (here each vulnerability file is treated as a transaction) and setting the minimum support threshold min_sup.

Step 702: Find all itemsets whose support is greater than the minimum support; these itemsets are called frequent itemsets. Let I = {I1, I2, ..., Im}, and let A \subset I, B \subset I, and A \cap B = \emptyset; then A \Rightarrow B is the desired association rule.

Step 703: Pruning is an important part of the rule mining process. If at least one subset of a candidate itemset is not a frequent itemset, the candidate itemset is deleted;

Step 704: Compute the support and the confidence with the following formulas:

Support

support(A \Rightarrow B) = P(A \cup B)

Confidence

confidence(A \Rightarrow B) = P(B \mid A) = \frac{support(A \cup B)}{support(A)}

For each frequent itemset A, if B \subset A, B \neq \emptyset, and confidence(B \Rightarrow (A - B)) \geq minconf, then the association rule B \Rightarrow (A - B) is formed;

Step 704a: Determine whether the support reaches the minimum support threshold min_sup; if not, return to step 702; if so, end;
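The frequent itemset search, pruning, and rule generation of steps 701 to 704 can be sketched as follows. Each transaction is the keyword set of one vulnerability file. The function names, thresholds, and keyword sets are illustrative assumptions, and support is compared with "greater than or equal to", which is one reading of the threshold test.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        level = {c: s for c in current if (s := support(c)) >= min_sup}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop a candidate if any (k-1)-subset is not frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_conf):
    """Return (B, A - B, confidence) for frequent A and non-empty proper subset B."""
    out = []
    for a, sup_a in frequent.items():
        for r in range(1, len(a)):
            for b in map(frozenset, combinations(a, r)):
                conf = sup_a / frequent[b]   # support(A) / support(B)
                if conf >= min_conf:
                    out.append((b, a - b, conf))
    return out


# Illustrative keyword sets, one per vulnerability file.
transactions = [
    frozenset({"buffer", "overflow", "remote"}),
    frozenset({"buffer", "overflow"}),
    frozenset({"remote", "denial"}),
    frozenset({"buffer", "overflow", "denial"}),
]
frequent = apriori(transactions, min_sup=0.5)
found = rules(frequent, min_conf=0.9)
```

With these four transactions the only frequent pair is {buffer, overflow}, with support 0.75, and it yields the two rules buffer ⇒ overflow and overflow ⇒ buffer, each with confidence 1.0.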

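Step 80: Evaluate the mining effect of the vulnerability classification. The indicators for KNN classification are precision and recall. Precision is the proportion of all vulnerability texts assigned by the classifier whose assignment agrees with the manual classification result. Its formula is

precision = \frac{\text{number of texts classified correctly}}{\text{number of texts actually classified}}

Recall is the proportion of the vulnerability texts that should be assigned according to the manual classification result for which the classification system agrees. Its formula is

recall = \frac{\text{number of texts classified correctly}}{\text{number of texts that should have been classified}}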

Step 90: Store in the vulnerability knowledge base the feature vectors of each vulnerability class extracted during classification, together with the frequent itemsets and association rules extracted during association mining.

Claims

1. A vulnerability data mining method based on classification and association analysis, characterized in that the method comprises the following:

2. The vulnerability data mining method based on classification and association analysis according to claim 1, characterized in that information collection by the vulnerability information collection system comprises the following steps:

3. The vulnerability data mining method based on classification and association analysis according to claim 1, characterized in that data management by the vulnerability data management system comprises the following steps:

4. The vulnerability data mining method based on classification and association analysis according to claim 1, characterized in that data mining by the vulnerability data mining system comprises the following steps:

5. The vulnerability data mining method based on classification and association analysis according to claim 4, characterized in that step c specifically comprises:

6. The vulnerability data mining method based on classification and association analysis according to claim 4, characterized in that step d specifically comprises:

d3. Store the association analysis results in the vulnerability knowledge base.