CN101814098B

CN101814098B - Method for obtaining software security defects based on vertical search and semantic annotation

Info

Publication number: CN101814098B
Application number: CN2010101688044A
Authority: CN
Inventors: 李晓红; 刘丰煦; 杜洪伟; 许光全; 徐超
Original assignee: Tianjin University
Current assignee: Jiangsu Yongda power telecommunication installation engineering Co., Ltd
Priority date: 2010-05-11
Filing date: 2010-05-11
Publication date: 2012-05-02
Anticipated expiration: 2030-05-11
Also published as: CN101814098A

Abstract

The invention belongs to the field of trusted computing and relates to a method for obtaining software security defects based on vertical search and semantic annotation. The method comprises: first, using a domain-specific search crawler to crawl web pages of security defect information published on the World Wide Web, with a security-defect domain filter trainer providing the filtering support the crawler needs for this task; second, semantically annotating the downloaded web pages so that they carry semantic information that a machine can understand; third, using a purpose-built annotation parsing tool to extract information from the annotations; and fourth, providing an interface through which a software security knowledge base and software security vulnerability analysis can use the extracted information. The invention can supply a large amount of data and strong support for building a software security knowledge base and for software security vulnerability analysis.

Description

Method for obtaining software security defects based on vertical search and semantic annotation

Technical field

The invention belongs to the field of trusted computing and relates to a method for obtaining software security defects.

Background art

With computers developing rapidly, security is no longer merely an additional attribute of computer software but an intrinsic one. The rapid growth of networks has further raised the requirements on software reliability and security: networks make computer applications and computer systems highly interconnected, and while this trend brings remarkable opportunities to the IT industry, it also means that highly interconnected computers are more likely to be attacked. Computer systems are damaged or paralyzed, leaks of key information cause huge economic losses and other intangible losses, and software maintenance and fault recovery become more and more expensive. The losses caused by attacks on computers are therefore severe.

The security problems facing computers are increasingly serious, and people pay more attention to security and demand more of it. On January 15, 2002, Bill Gates, chief architect of Microsoft, proposed the concept of "Trusted Computing". In summary it is a high-level strategy: how to provide users with safer and more reliable computer systems, such that people do not notice the devices even as they continuously provide convenience and services. The proposal of this concept meant that Microsoft had raised security, as an important measure of a software product, to a prominent position. To provide users with safer and more reliable applications and systems, Microsoft carried out a series of security activities under its Windows security initiative, intended to teach developers to analyze, design, develop, test, document, release and maintain secure software, that is, to design and develop robust, secure code that withstands malicious attacks, rather than merely code that implements security features.

At present, approaches to classifying security defects differ from one research group to another, both in China and abroad. A defect classification has to be formulated according to an organization's own development goals and environment. As software engineering has developed, and particularly since academia began studying the software development process, defects are no longer confined to programs and code: the development process and the management structure of the organization both have an important influence on software quality. Researchers are paying increasing attention to how the software development process affects defects. Organizations and companies that undertake large, complex software development need a higher level of software capability maturity. To meet the needs of software development organizations to implement defect prevention, improve the software process and raise their software capability maturity, an urgent and important task is to build a software defect repository that dynamically collects and manages software defects. The repository should contain the defects produced in each stage of the software development life cycle (SDLC) together with their classification, attack patterns studied from the attacker's perspective, and mitigation schemes that mitigate the defects or attacks.

Research results at home and abroad keep emerging, and new software security defects are constantly being discovered. With the rapid development of networks, many organizations and companies publish the software security defects they find on the World Wide Web. How can this software security defect information, scattered in fragments across the Web, be collected? How can data mining and information extraction be performed on security defect information that is so broad and dispersed? These are the key problems the present invention attempts to solve.

Summary of the invention

In view of the above problems in the prior art, the present invention proposes a method for obtaining software security defects that have already been published. The invention applies research results on vertical search and on semantic-annotation-based information extraction to the field of trusted computing: vertical search is used to obtain software security defect information from the World Wide Web, and semantic annotation is then used to extract information from it. The method can be used to search the Internet for published software security defects and mine them, including the classification of defects produced in each SDLC stage, attack patterns studied from the attacker's perspective, and mitigation schemes that mitigate the defects or attacks. To this end, the invention adopts the following technical solution:

A method for obtaining software security defects based on vertical search and semantic annotation, comprising the following steps:

1) using a search crawler based on vertical search technology to crawl one or more groups of web pages from pages related to security defect information, the crawled pages including a sufficient number of security-defect-related pages; then dividing these pages into two classes, relevant and irrelevant to the domain, to obtain two training page sets: a security-domain-relevant training page set and a security-domain-irrelevant training page set;

2) selecting potential keywords from the security-domain-relevant training page set, adding further potential keywords on the advice of domain experts, and choosing keywords according to the following probability ratio formula:

P_{w} = \log \frac{p(w \mid c)\,\bigl(1 - p(w \mid \bar{c})\bigr)}{p(w \mid \bar{c})\,\bigl(1 - p(w \mid c)\bigr)}

where, in the formula, w denotes a potential keyword, c denotes the security-domain-relevant training page set, and \bar{c} denotes the security-domain-irrelevant training page set; a threshold for selecting keywords is set, and words whose P_w is positive and greater than the set threshold are selected as keywords, a larger value being given a larger weight;

3) building a security-defect domain filter trainer from the selected keywords;

4) using the search crawler based on vertical search technology to automatically download web pages from other pages on the Internet related to security defect information;

5) using the security-defect domain filter trainer to filter web pages by the following keyword-weight-based page filtering method: each page is divided into two parts, <title> and <body>, which are assigned different weights, titleweight and bodyweight respectively; keywords are extracted separately from the <title> and <body> parts, T_value being the weight of a keyword occurring in the <title> part and B_value the weight of a keyword occurring in the <body> part; the keywords occurring in the two parts are combined in a weighted sum according to the formula page weight = titleweight * ∑T_value + bodyweight * ∑B_value to obtain the page weight; if this value is greater than a preset threshold, the page is considered domain-relevant, otherwise the page is filtered out;

6) semantically annotating the filtered web pages;

7) parsing the annotations and extracting the information related to security defects.

The invention aims to search for and obtain software security defects already published on the network, laying a foundation for building a software security defect knowledge base and for software security vulnerability analysis. Guided by the idea of mining security defects in order to develop secure and reliable software, and drawing on research results in vertical search and semantic annotation, it provides a method that vertically searches the network for security defect information, mines the data and extracts information, so as to obtain the fragmentary security defect information scattered across the network and to provide a large amount of data and strong support for building a software security defect knowledge base and for software security vulnerability analysis. The invention has the following beneficial effects:

1. Vertical search technology is used to automatically find and download software security defect pages published on the World Wide Web, and pages irrelevant to the security domain are effectively filtered out.

2. Keywords of the security defect domain are trained automatically and their weights computed, providing accurate base data for the page filtering algorithm.

3. Newly published software security defects are discovered automatically and the download is updated, so that software security defects are obtained incrementally.

4. Semantic annotation of security defect information is completed semi-automatically, enabling further information extraction and data mining.

5. A large amount of formatted software security defect information, including defect classifications, attack patterns and mitigation schemes, is obtained automatically, providing data support for building a security defect knowledge base and for software security vulnerability analysis.

Brief description of the drawings

Figure 1: System schematic diagram.

Figure 2: Download process of the vertical search crawler.

Figure 3: Generation process of the domain filter keywords.

Figure 4: Structure of semantic-annotation-based information extraction.

Detailed description of the embodiments

As shown in Figure 1, the overall flow of the technical solution is as follows. First, a domain-specific search crawler crawls security defect information pages published on the World Wide Web, with the security-defect domain filter trainer providing the filtering support the crawler needs for this task. Second, the downloaded pages are semantically annotated so that they carry semantic information a machine can understand. Third, a purpose-built annotation parsing tool extracts information from the annotations. Finally, an interface is provided through which the software security defect knowledge base and software security vulnerability analysis can use this information as data support. The invention is described in detail below.

1. Vertical search

Figure 2 shows the download process of the vertical search crawler. Vertical search is specialized search, that is, search dedicated to a particular industry or topic; the goal of this vertical search crawler is to obtain software security defect information pages. First, the administrator specifies the initial URLs, which are added to a blocking queue of pending downloads. Next, multiple threads download the pages for the queued URLs; each downloaded page is filtered, pages belonging to the domain are stored, and pages irrelevant to the domain are deleted. The downloaded page is then analyzed, and the other links it contains are extracted and added to the blocking queue. The crawler keeps running as long as the blocking queue contains URLs, and it can also be interrupted manually. Finally, when the crawler finishes or is interrupted, it records the pages that have been downloaded and those not yet downloaded and writes them to a log.
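The crawl loop described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the names `crawl`, `fetch`, `is_relevant`, `LINK_RE` and the in-memory `FAKE_WEB` are hypothetical, the download function is a stand-in so the sketch runs without network access, links are followed only from pages that pass the filter (the text leaves this open), and logging and manual interruption are omitted.

```python
import queue
import re
import threading

# Hypothetical in-memory "web" standing in for real HTTP downloads.
FAKE_WEB = {
    "http://a.example/": 'buffer overflow advisory '
                         '<a href="http://a.example/vuln1">v</a> '
                         '<a href="http://a.example/news">n</a>',
    "http://a.example/vuln1": 'overflow exploit <a href="http://a.example/">home</a>',
    "http://a.example/news": "sports and weather",
}
LINK_RE = re.compile(r'href="([^"]+)"')


def fetch(url):
    """Stand-in downloader: look the page up in FAKE_WEB."""
    return FAKE_WEB.get(url, "")


def is_relevant(html):
    """Stand-in domain filter; the real one is the keyword-weight filter."""
    return "overflow" in html


def crawl(seed_urls, fetch, is_relevant, num_workers=4):
    """Crawl from seed_urls; return (stored pages, discarded URLs)."""
    pending = queue.Queue()      # blocking queue of URLs still to download
    seen = set(seed_urls)        # URLs already queued, to avoid repeats
    lock = threading.Lock()      # guards seen, stored and discarded
    stored, discarded = {}, []
    for url in seed_urls:
        pending.put(url)

    def worker():
        while True:
            url = pending.get()  # blocks until a URL or a sentinel arrives
            try:
                if url is None:  # sentinel: shut this worker down
                    return
                html = fetch(url)
                if not is_relevant(html):
                    with lock:
                        discarded.append(url)
                    continue
                with lock:
                    stored[url] = html
                for link in LINK_RE.findall(html):
                    with lock:
                        is_new = link not in seen
                        seen.add(link)
                    if is_new:
                        pending.put(link)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.join()               # returns once every queued URL is processed
    for _ in threads:
        pending.put(None)        # one sentinel per worker
    for t in threads:
        t.join()
    return stored, discarded
```

New links are queued before the current URL is marked done, so the queue's unfinished-task count cannot reach zero while work remains; that is what lets `pending.join()` serve as the "queue is empty" stopping condition.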

2. Building the security-domain training page sets

In the search crawler designed above, the filtering mechanism is not introduced at first. Starting from initial URLs that belong to the security defect domain, a group of pages is downloaded; this group of documents contains a sufficient number of relevant pages. The documents are then divided into two classes, relevant and irrelevant to the domain, which yields the training page sets: a security-domain-relevant training page set and a security-domain-irrelevant training page set. The irrelevant training page set can also be assembled in other ways, for example by directly downloading some news pages.
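As a rough illustration, the two training sets could be assembled from manually labelled pages as below. Everything here is an assumption made for the sketch: the labels are supplied by hand, each page is reduced to the set of lowercase words it contains (anticipating the tag-stripping pre-processing described in the next section), and the names `page_words` and `build_training_sets` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of a page, dropping the HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_words(html):
    """Return the set of lowercase words (two or more characters) in a page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return set(re.findall(r"[a-z][a-z0-9_-]+", text))


def build_training_sets(labelled_pages):
    """Split (html, is_security_related) pairs into two lists of word sets."""
    relevant, irrelevant = [], []
    for html, is_security_related in labelled_pages:
        target = relevant if is_security_related else irrelevant
        target.append(page_words(html))
    return relevant, irrelevant
```

Representing each page as a word set is a simplification that suits document-frequency statistics; a real system would likely also keep expert-supplied multi-word phrases.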

3. Selecting filter keywords

A series of page pre-processing steps, such as removing HTML tags, is applied first, and potential keywords are then selected. Potential security defect keywords come from two sources: first, every word that occurs in the security-domain-relevant training page set; second, as many phrases as possible supplied on the advice of domain experts. The probability ratio (odds ratio) method is then used to select the real keywords from these potential keywords and to remove potential keywords that have no discriminating power. The probability ratio is the logarithm of the ratio between the odds that a word or phrase occurs in a given domain and the odds that it occurs outside that domain, and it describes well how strongly a word discriminates the domain. The method is particularly suitable for binary classifiers, where the aim is to identify as many positive instances as possible, as accurately as possible, without caring about identifying negative instances. The probability ratio is computed as follows:

P_{w} = \log \frac{p(w \mid c)\,\bigl(1 - p(w \mid \bar{c})\bigr)}{p(w \mid \bar{c})\,\bigl(1 - p(w \mid c)\bigr)}

Formula (3-2)

where p(w \mid c) and p(w \mid \bar{c}) are given by formula (3-3).

In the two formulas above, w denotes a potential keyword, c denotes the security-domain-relevant training page set, and \bar{c} denotes the security-domain-irrelevant training page set.

The larger |P_w| is, the more strongly word w distinguishes the two training page sets. When P_w > 0 and is large, most pages containing w belong to the domain, and w should be selected as a keyword; when P_w < 0, most pages containing w do not belong to the domain, and w is not used as a keyword; the smaller |P_w| is, the more weakly w distinguishes domain pages, and w cannot serve as a keyword. A threshold for selecting keywords is therefore set: words whose P_w is positive and greater than the threshold are selected as keywords, and the larger the value, the larger the weight assigned.
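A minimal sketch of this selection rule follows. The way p(w | c) is estimated is an assumption of the sketch, since the text here does not spell it out: it is taken as the smoothed fraction of pages in a set that contain w, with the smoothing added only to keep the logarithm finite. Using P_w itself as the keyword weight is likewise one possible reading of "the larger the value, the larger the weight". The function names are hypothetical.

```python
import math


def doc_probability(word, docs, smoothing=0.5):
    """Assumed estimate of p(w | set): smoothed share of pages containing w.

    Each element of docs is the set of words found in one page.
    """
    hits = sum(1 for words in docs if word in words)
    return (hits + smoothing) / (len(docs) + 2 * smoothing)


def probability_ratio(word, relevant_docs, irrelevant_docs):
    """P_w = log[ p(w|c)(1 - p(w|c_bar)) / (p(w|c_bar)(1 - p(w|c))) ]."""
    p_c = doc_probability(word, relevant_docs)
    p_not_c = doc_probability(word, irrelevant_docs)
    return math.log(p_c * (1 - p_not_c) / (p_not_c * (1 - p_c)))


def select_keywords(candidates, relevant_docs, irrelevant_docs, threshold):
    """Keep candidates whose P_w exceeds a positive threshold.

    Returns {keyword: weight}, where the weight is P_w itself (an assumption).
    """
    selected = {}
    for word in candidates:
        score = probability_ratio(word, relevant_docs, irrelevant_docs)
        if score > threshold:
            selected[word] = score
    return selected
```

With two relevant pages that both contain "overflow" and two irrelevant pages that do not, the smoothed estimates are 2.5/3 and 0.5/3, so P_w = log 25; a word that is equally common in both sets scores 0 and is dropped by any positive threshold.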

4. Building the domain filter trainer

Figure 3 shows the generation process of the domain filter keywords. The goal is to supply the vertical search crawler with security-defect domain keywords and their weights. Page filtering is implemented by analyzing page content against the domain keyword weights. A keyword weight consists of a keyword or key phrase that represents a characteristic of the domain together with a value that expresses how representative it is, that is, how well it separates the domain from non-domain content; the two form a <keyword, value> pair. To compute the weight of a page, the keywords it contains are looked up and their values are added together. Because different parts of a page differ in importance, the page filtering algorithm of the invention divides a page into two parts, the <title> part and the <body> part, and assigns them different importance levels, titleweight and bodyweight. The page weight then becomes the sum of the values T_value of the keywords occurring in the title multiplied by titleweight, plus the sum of the values B_value of the keywords occurring in the body multiplied by bodyweight, as in the following formula:

page weight = titleweight * ∑T_value + bodyweight * ∑B_value

The downloaded page text and the <keyword, value> pair file are read in. For each keyword, if it occurs in the title, its value is added to the title sum ∑T_value; if it occurs in the body, its value is added to the body sum ∑B_value. The page weight is computed with the formula above and compared with a predetermined threshold filterValue obtained from experience or experiment. If the page weight is greater than the threshold, the page is considered relevant to the domain; otherwise it is considered irrelevant.
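The filtering rule above could be sketched as follows. The regular-expression split into title and body, the case-insensitive substring matching and the default values of `title_weight` and `body_weight` are assumptions chosen for the illustration; the text only fixes the weighted-sum formula and the comparison with filterValue. Each keyword is counted once per part, following the description.

```python
import re


def split_title_body(html):
    """Return the text inside <title> and <body>, or '' if a part is missing."""
    title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    body = re.search(r"<body[^>]*>(.*?)</body>", html, re.I | re.S)
    return (title.group(1) if title else "", body.group(1) if body else "")


def page_weight(html, keyword_values, title_weight=3.0, body_weight=1.0):
    """titleweight * sum(T_value) + bodyweight * sum(B_value).

    keyword_values maps lowercase keywords or phrases to their weights.
    """
    title, body = (part.lower() for part in split_title_body(html))
    t_sum = sum(v for k, v in keyword_values.items() if k in title)
    b_sum = sum(v for k, v in keyword_values.items() if k in body)
    return title_weight * t_sum + body_weight * b_sum


def is_domain_relevant(html, keyword_values, filter_value, **part_weights):
    """A page is kept only if its weight exceeds the threshold filterValue."""
    return page_weight(html, keyword_values, **part_weights) > filter_value
```

For example, with the pairs <buffer overflow, 3.0> and <exploit, 2.0>, a page whose title contains "buffer overflow" and whose body contains both keywords gets 3.0 * 3.0 + 1.0 * (3.0 + 2.0) = 14.0 under the assumed part weights.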

5. Downloading security defect information pages with the search crawler

The vertical search crawler designed in step 1 is used to crawl pages related to the security defect domain on the Internet, and the security-defect domain filter trainer built in the previous step is then applied to all downloaded pages in order to filter out those irrelevant to the security defect domain.

6. Semantic annotation

Figure 4 shows the structure of semantic-annotation-based information extraction. To meet the requirements of the software security defect knowledge base and of software security vulnerability analysis, information must be extracted, via semantic annotation, from the pages downloaded by the vertical search crawler. This module extracts information from these pages, which have no fixed structure and which a machine cannot understand. Semantic annotation means marking up raw data (text or symbols) so that it carries semantic information that both people and machines can understand. The module is based on the GATE annotation tool: an annotation rule base and gazetteer lists that GATE can interpret are designed and applied to security defect information pages according to GATE's annotation principles. GATE annotation is carried out with ANNIE, a rule-based English information extraction system. ANNIE processes one document or a group of documents in a pipeline, annotating in a strictly defined order: English tokenization, English gazetteer lookup, English sentence splitting, English part-of-speech tagging, English annotation rule definition, English named entity recognition and English coreference resolution, after which information extraction over the whole document or group of documents is complete. The gazetteer is used to build a set of typed lists, each list containing the instances of its type. An annotation rule, that is, a JAPE rule, has a left-hand side and a right-hand side: the left-hand side is the pattern to be matched in the text, and the right-hand side specifies the annotation type to assign.

The gazetteer is used to look up the meaningful words and phrases needed when processing a document. This module uses the attributes of the components of a software security defect, namely the attributes of attack patterns, defect category instances and mitigation schemes, as the gazetteer content. The specific attributes of these software security defects are given, and their meanings described, in the definition and description of the security defect domain. JAPE (Java Annotation Patterns Engine) is used to build the rule base: rules written as regular expressions match information in the text, enabling more accurate tokenization, sentence splitting and named entity annotation. The rules built with JAPE form a set of rule grammar files, and JAPE rules of different types are designed for the different types of security defect information to be found.
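The idea of gazetteer lookup followed by typed annotation can be illustrated with a small stand-alone Python sketch. It does not use GATE, ANNIE or JAPE and is not equivalent to them; it only mimics the "match a phrase, assign an annotation type" step with regular expressions. The gazetteer entries, the element names and the function `annotate` are all hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer: annotation type -> instances of that type.
GAZETTEER = {
    "AttackPattern": ["sql injection", "cross-site scripting"],
    "Mitigation": ["input validation", "parameterized queries"],
}


def annotate(text, gazetteer):
    """Wrap every gazetteer match in an XML element named after its type."""
    entries = [(phrase, kind)
               for kind, phrases in gazetteer.items()
               for phrase in phrases]
    if not entries:
        return "<Document>" + escape(text) + "</Document>"
    # Longest phrases first, so a longer entry wins over a shorter prefix.
    entries.sort(key=lambda entry: -len(entry[0]))
    type_of = dict(entries)
    pattern = re.compile("|".join(re.escape(p) for p, _ in entries), re.I)

    pieces, position = [], 0
    for match in pattern.finditer(text):
        pieces.append(escape(text[position:match.start()]))
        tag = type_of[match.group(0).lower()]
        pieces.append("<%s>%s</%s>" % (tag, escape(match.group(0)), tag))
        position = match.end()
    pieces.append(escape(text[position:]))
    return "<Document>" + "".join(pieces) + "</Document>"
```

The output keeps the unannotated page text alongside the typed elements, which matches the observation that the annotated XML still carries irrelevant information that a later parsing step has to discard.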

The semantic annotation step produces an XML file that contains the semantic information together with other, irrelevant information from the web page.

7. Information extraction

Annotation parsing uses the DOM support in JAXP to parse the XML document generated in the previous step and completes the extraction of security defect information, providing noise-free formatted data for subsequent use. Finally, an interface for using the security defect information is provided, whose purpose is to store the resulting formatted data according to the storage structure of a database.
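A comparable DOM-based extraction can be sketched in Python with `xml.dom.minidom` in place of JAXP, which is a Java API. The annotation element names, the table layout and the function names below are hypothetical; the text specifies only that the annotated XML is parsed with DOM and that the formatted result is stored in a database.

```python
import sqlite3
from xml.dom import minidom

# Hypothetical annotation element names produced by the annotation step.
ANNOTATION_TYPES = ("DefectCategory", "AttackPattern", "Mitigation")


def extract_annotations(xml_text, types=ANNOTATION_TYPES):
    """Return (type, text) records for the annotated elements only."""
    dom = minidom.parseString(xml_text)
    records = []
    for tag in types:
        for node in dom.getElementsByTagName(tag):
            text = "".join(child.data for child in node.childNodes
                           if child.nodeType == child.TEXT_NODE)
            records.append((tag, text.strip()))
    return records


def store(records, connection):
    """Persist the records in a simple two-column table (assumed schema)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS defect_info (kind TEXT, value TEXT)")
    connection.executemany("INSERT INTO defect_info VALUES (?, ?)", records)
    connection.commit()
```

Because only elements with known annotation types are read, navigation text and other page noise that surround them in the XML are dropped without any extra cleaning step.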

Claims

1. A method for obtaining software security defects based on vertical search and semantic annotation, comprising the following steps:

1) using a search crawler based on vertical search technology to crawl one or more groups of web pages from pages related to security defect information, the crawled pages including a sufficient number of security-defect-related pages; then dividing these pages into two classes, relevant and irrelevant to the security domain, to obtain two training page sets: a security-domain-relevant training page set and a security-domain-irrelevant training page set;

P_{w} = \log \frac{p(w \mid c)\,\bigl(1 - p(w \mid \bar{c})\bigr)}{p(w \mid \bar{c})\,\bigl(1 - p(w \mid c)\bigr)}

where, in the formula, w denotes a potential keyword, c denotes the security-domain-relevant training page set, and \bar{c} denotes the security-domain-irrelevant training page set; a threshold for selecting keywords is set, and words whose P_w is positive and greater than the set threshold are selected as keywords, a larger value being given a larger weight;

5) using the security-defect domain filter trainer to filter web pages by the following keyword-weight-based page filtering method: each page is divided into two parts, title and body, which are assigned different weights, titleweight and bodyweight respectively; keywords are extracted separately from the title and body parts, T_value being the weight of a keyword occurring in the title part and B_value the weight of a keyword occurring in the body part; the keywords occurring in the two parts are combined in a weighted sum according to the formula page weight = titleweight * ∑T_value + bodyweight * ∑B_value to obtain the page weight; if this value is greater than a preset threshold, the page is considered relevant to the security domain, otherwise the page is filtered out;

6) semantically annotating the filtered web pages;

7) parsing the annotations and extracting the information related to security defects.