CN101814098B - Method for obtaining software security defects based on vertical search and semantic annotation - Google Patents

Info

Publication number
CN101814098B
Authority
CN
China
Prior art keywords
webpage
keyword
value
safety defect
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2010101688044A
Other languages
Chinese (zh)
Other versions
CN101814098A (en)
Inventor
李晓红
刘丰煦
杜洪伟
许光全
徐超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Yongda power telecommunication installation engineering Co., Ltd
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN2010101688044A
Publication of CN101814098A
Application granted
Publication of CN101814098B
Legal status: Active
Anticipated expiration legal-status Critical

Abstract

The invention belongs to the field of trusted computing and relates to a method for obtaining software security defects based on vertical search and semantic annotation. The method comprises the following steps: first, a domain-specific search crawler crawls security defect information pages published on the World Wide Web, with a security defect domain filter trainer providing strong filtering support for the crawler; second, the downloaded pages are semantically annotated so that they carry semantic information and can be understood by machines; third, an annotation parsing tool is designed to extract further information from the annotations; and fourth, an interface is provided so that a software security knowledge base and software security vulnerability analysis can use this information. The invention can supply a large amount of data and strong support for a software security knowledge base and for software security vulnerability analysis.

Description

Method for obtaining software security defects based on vertical search and semantic annotation
Technical field
The invention belongs to the field of trusted computing and relates to a method for obtaining software security defects.
Background art
With the rapid development of computing, security is no longer merely an additional attribute of computer software but an intrinsic one. As networks grow rapidly, the demands on software reliability and security keep rising, because networks make applications and computer systems highly interconnected. While this trend brings remarkable opportunities to the IT industry, it also means that these highly interconnected computers are more likely to be attacked. Computer systems are damaged or paralyzed, leaks of key information cause huge economic and other intangible losses, and software maintenance and fault recovery become ever more expensive; the losses caused by attacks on computers are clearly severe.
The security problems facing computers are becoming more serious, and people pay increasing attention to security and demand more of it. On January 15, 2002, Bill Gates, chief architect of Microsoft, proposed the concept of "Trusted Computing". Broadly, it is a high-level strategy: to provide users with safer and more reliable computer systems that continually offer convenience and services while remaining unobtrusive. The concept signalled that Microsoft had raised security, as an important measure of software products, to a prominent position. To provide users with more secure and reliable applications and systems, Microsoft carried out a series of Windows security initiatives intended to teach developers how to analyze, design, develop, test, document, release, and maintain secure software: that is, to design and develop robust, secure code that withstands malicious attacks, rather than code that merely implements security features.
At present, approaches to classifying security defects differ across research groups at home and abroad. A defect classification should be formulated according to the development goals and environment of the organization itself. As software engineering has developed, and particularly since academia began studying the software development process, defects are no longer confined to programs and code: the development process and the management structure of the organization both have an important effect on software quality. Researchers are paying increasing attention to how the software development process affects defects. Organizations and companies that undertake large, complex software development need a higher level of the Capability Maturity Model for Software. To meet the need of software development organizations to implement defect prevention, improve the software process, and raise their software capability maturity, an urgent and important task is to build a software defect repository that dynamically collects and manages software defects. The repository should contain the defects produced in each stage of the software development life cycle (SDLC) together with their classification, attack patterns studied from the attacker's perspective, and mitigation schemes that mitigate defects or attacks.
Research results at home and abroad keep emerging, and new software security defects are continually being discovered. With the rapid growth of networks, many organizations and companies publish the software security defects they find on the World Wide Web. How can the software security defect information scattered across the Web be collected? How can data mining and information extraction be carried out on such broad and dispersed security defect information? These are the key problems the present invention attempts to solve.
Summary of the invention
In view of the problems in the prior art described above, the present invention proposes a method for obtaining published software security defects. The invention applies research results in vertical search and in semantic-annotation-based information extraction to the field of trusted computing: vertical search technology obtains software security defect information from the World Wide Web, and semantic annotation is then used to extract information from it. The method can be used to search the Internet for and mine published software security defects, including classifications of the defects produced in each stage of the software development life cycle (SDLC), attack patterns studied from the attacker's perspective, and mitigation schemes that mitigate defects or attacks. To this end, the invention adopts the following technical scheme:
A method for obtaining software security defects based on vertical search and semantic annotation, comprising the following steps:
1) Using a search crawler based on vertical search technology, crawl one or more groups of web pages from pages related to security defect information, such that they include a sufficient number of security-defect-related pages; then divide these pages into two classes, relevant and irrelevant to the domain, to obtain two training page sets: the security-domain-relevant training page set and the security-domain-irrelevant training page set;
2) Select potential keywords from the security-domain-relevant training page set, add further potential keywords on expert advice, and choose keywords according to the following probability ratio formula:
$$P_w = \log \frac{p(w \mid c)\,\bigl(1 - p(w \mid \bar{c})\bigr)}{p(w \mid \bar{c})\,\bigl(1 - p(w \mid c)\bigr)}$$
where
[Formula image GDA0000021233300000022: not recoverable from the source text]
In the formula, $w$ denotes a potential keyword, $c$ denotes the security-domain-relevant training page set, and $\bar{c}$ denotes the security-domain-irrelevant training page set. A threshold for selecting keywords is set; words whose $P_w$ is positive and greater than the threshold are selected as keywords, and the larger the value, the larger the weight assigned;
3) Use the selected keywords to build the security defect domain filter trainer;
4) Using the search crawler based on vertical search technology, automatically download pages from other security-defect-related pages on the Internet;
5) Use the security defect domain filter trainer to filter pages with the following keyword-weight-based method: divide each page into a <title> part and a <body> part and assign them different weights, titleweight and bodyweight; extract the keywords from each part, where $T_{value}$ is the weight of a keyword occurring in the <title> part and $B_{value}$ is the weight of a keyword occurring in the <body> part; compute the weighted sum over the keywords occurring in both parts according to the formula $\text{webpage weight} = \text{titleweight} \cdot \sum T_{value} + \text{bodyweight} \cdot \sum B_{value}$ to obtain the page weight; if this value is greater than a preset threshold, the page is considered domain-relevant, otherwise it is filtered out;
6) Semantically annotate the filtered pages;
7) Parse the annotations and extract the security defect information.
The goal of the invention is to search for and obtain software security defects that have been published on the network, laying the foundation for building a software security defect knowledge base and for software security vulnerability analysis. Guided by the idea of mining security defects in order to develop secure and reliable software, and drawing on research results in vertical search and semantic annotation, the invention provides a way to vertically search the network for security defect information, mine the data, and extract information, so that the scattered security defect information on the network is collected and a large amount of data and strong support are supplied for building a software security defect knowledge base and for software security vulnerability analysis. The invention has the following beneficial effects:
1. Vertical search technology automatically finds and downloads software security defect pages published on the World Wide Web and effectively filters out pages irrelevant to the security domain.
2. Security defect domain keywords are trained automatically and their weights computed, giving the page filtering algorithm accurate base data.
3. Newly published software security defects are discovered automatically and downloaded as updates, so software security defects are obtained incrementally.
4. Semantic annotation of security defect information is completed semi-automatically, enabling further information extraction and data mining.
5. A large amount of formatted software security defect information, including defect classifications, attack patterns, and mitigation schemes, is obtained automatically, providing data support for building the security defect knowledge base and for software security vulnerability analysis.
Brief description of the drawings
Figure 1: System schematic diagram.
Figure 2: Download flow of the vertical search crawler.
Figure 3: Generation of the domain filter keywords.
Figure 4: Structure of semantic annotation and information extraction.
Detailed description of the embodiments
As shown in Figure 1, the overall technical flow of the invention is as follows. First, a domain search crawler crawls security defect information pages published on the World Wide Web, while the security defect domain filter trainer provides strong filtering support for the crawler. Second, the downloaded pages are semantically annotated so that they carry semantic information that machines can understand. Third, an annotation parsing tool is designed to extract further information from the annotations. Finally, an interface is provided so that the software security defect knowledge base and software security vulnerability analysis can use this information, supplying them with data. The invention is described in detail below.
1. Vertical search
Figure 2 shows the download flow of the vertical search crawler. Vertical search is specialized search targeted at a particular industry or topic; the goal of this crawler is to obtain pages containing software security defect information. First, the administrator specifies the initial URLs, which are added to the blocking queue of pages to be downloaded. Next, multiple threads download the pages for the queued URLs; the downloaded pages are filtered so that pages belonging to the domain are stored and irrelevant pages are deleted. The downloaded pages are then analyzed, and the other links they contain are extracted and added to the blocking queue. The crawler stops when the blocking queue holds no more URLs, and it can also be interrupted manually. Finally, when the crawler finishes or is interrupted, it records which pages were downloaded and which were not, and writes them to a log.
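The crawl loop described above can be sketched in Python roughly as follows. This is a minimal single-threaded illustration, not the patent's implementation: the names `crawl`, `fetch`, `is_relevant`, and `max_pages` are hypothetical, the download function is injected so the loop can be exercised without network access, and it is assumed that links are extracted from every downloaded page, whether or not the page is kept.

```python
import queue
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, is_relevant, max_pages=100):
    """Sketch of the crawl loop.

    fetch(url) returns the page HTML or None on failure;
    is_relevant(html) is the domain filter. Both are hypothetical
    callables supplied by the caller.
    """
    pending = queue.Queue()  # stands in for the blocking download queue
    seen = set()
    for url in seed_urls:
        pending.put(url)
        seen.add(url)

    stored, log = {}, []
    while not pending.empty() and len(log) < max_pages:
        url = pending.get()
        html = fetch(url)
        if html is None:
            log.append((url, "failed"))
            continue
        if is_relevant(html):
            stored[url] = html          # keep domain-relevant pages
            log.append((url, "stored"))
        else:
            log.append((url, "discarded"))
        # Assumption: links are followed from every downloaded page.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                pending.put(link)
    return stored, log
```

A multithreaded variant would start several worker threads that share the same `queue.Queue`, which is thread-safe; the single-threaded form is used here only to keep the control flow easy to follow.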
2. Building the security-domain training page sets
In the search engine crawler designed above, the filtering mechanism is not introduced at first. Starting from initial URLs that belong to the security defect domain, a group of pages is downloaded; this group contains a sufficient number of relevant pages. The documents are then divided into two classes, relevant and irrelevant to the domain, which yields the training page sets: the security-domain-relevant training page set and the security-domain-irrelevant training page set. The irrelevant set can also be built in other ways, for example by directly downloading news pages.
3. Selecting the filter keywords
After a series of page preprocessing steps, such as removing HTML tags, potential keywords are selected. Potential security defect keywords come from two sources: first, every word that occurs in the security-domain-relevant training page set; second, as many phrases as possible supplied on expert advice. The probability ratio (P) method is then used to select the real keywords from these candidates and to remove candidates with no discriminating power. The probability ratio is the logarithm of the ratio between the probability that a word occurs in a given domain and the probability that it occurs outside that domain; it describes well how strongly a word discriminates the domain. The method is particularly suited to binary classifiers, where the aim is to identify as many positive instances as possible, as accurately as possible, without regard to identifying the negative class. The probability ratio method is as follows:
$$P_w = \log \frac{p(w \mid c)\,\bigl(1 - p(w \mid \bar{c})\bigr)}{p(w \mid \bar{c})\,\bigl(1 - p(w \mid c)\bigr)} \qquad \text{(3-2)}$$
where
[Formula (3-3), image GDA0000021233300000042: not recoverable from the source text]
In the two formulas above, $w$ denotes a potential keyword, $c$ denotes the security-domain-relevant training page set, and $\bar{c}$ denotes the security-domain-irrelevant training page set.
The larger $|P_w|$ is, the stronger the ability of word $w$ to distinguish the two training page sets. When $P_w > 0$ and large, most pages containing $w$ belong to the domain, so $w$ should be selected as a keyword. When $P_w < 0$, most pages containing $w$ do not belong to the domain, so $w$ is not taken as a keyword. The smaller $|P_w|$ is, the weaker the ability of $w$ to distinguish domain pages, so it cannot serve as a keyword. A threshold for selecting keywords is therefore set: words whose $P_w$ is positive and greater than the threshold are selected as keywords, and the larger the value, the larger the weight assigned.
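The selection rule can be illustrated with a short Python sketch. Because formula (3-3) is not available in the text, the estimator for $p(w \mid c)$ used here is an assumption: the fraction of pages in a set that contain the word, with add-one smoothing so that the logarithm stays finite. Using $P_w$ itself as the keyword weight is likewise an assumption; the source only says that larger values receive larger weights. All function names are hypothetical.

```python
import math


def doc_probability(word, pages):
    """Assumed estimator: smoothed fraction of pages containing `word`.

    `pages` is a list of token sets, one set per training page.
    """
    hits = sum(1 for tokens in pages if word in tokens)
    return (hits + 1) / (len(pages) + 2)


def probability_ratio(word, relevant_pages, irrelevant_pages):
    """P_w = log[ p(w|c)(1 - p(w|c_bar)) / (p(w|c_bar)(1 - p(w|c))) ]."""
    p_c = doc_probability(word, relevant_pages)
    p_not_c = doc_probability(word, irrelevant_pages)
    return math.log((p_c * (1 - p_not_c)) / (p_not_c * (1 - p_c)))


def select_keywords(candidates, relevant_pages, irrelevant_pages, threshold):
    """Keep candidates whose P_w is positive and above `threshold`.

    Returns a dict mapping each selected keyword to P_w, used here as its weight.
    """
    scores = {
        w: probability_ratio(w, relevant_pages, irrelevant_pages)
        for w in candidates
    }
    return {w: s for w, s in scores.items() if s > 0 and s > threshold}
```

With three relevant pages of which two contain "overflow" and three irrelevant pages of which none do, the smoothed probabilities are 0.6 and 0.2, so the ratio is (0.6 × 0.8) / (0.2 × 0.4) = 6 and $P_w = \log 6$. A word that is equally frequent in both sets scores 0 and is rejected.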
4. Building the domain filter trainer
Figure 3 shows how the domain filter keywords are generated. The goal is to supply the vertical search crawler with security defect domain keywords and their weights. Page filtering analyzes page content using the weights of the domain keywords. A keyword weight pairs a keyword or key phrase that represents the domain with a value expressing how representative it is, that is, how well it separates the domain from non-domain content; the two form a <keyword, value> pair. To compute a page's weight, the keywords it contains are looked up and their values are added together. Because different parts of a page differ in importance, the filtering algorithm divides each page into two parts, the <title> part and the <body> part, and assigns them different importance levels, titleweight and bodyweight. The page weight then becomes the sum of the keyword values found in the title, $\sum T_{value}$, multiplied by titleweight, plus the sum of the keyword values found in the body, $\sum B_{value}$, multiplied by bodyweight, as in the following formula:
$$\text{webpage weight} = \text{titleweight} \cdot \sum T_{value} + \text{bodyweight} \cdot \sum B_{value}$$
The downloaded page text and the file of <keyword, value> pairs are read in. For each keyword, if it occurs in the title, its value is added to the title sum $\sum T_{value}$; if it occurs in the body, its value is added to the body sum $\sum B_{value}$. The page weight is computed with the formula above and compared with a threshold, filterValue, set in advance from experience or experiment. If the page weight is greater than the threshold, the page is considered domain-relevant; otherwise it is considered irrelevant to the domain.
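A minimal Python sketch of this weighting step follows. The function names, the regular-expression splitting of the page, and the default values of `titleweight` and `bodyweight` are illustrative assumptions; the source gives no concrete weights. Each keyword is counted once per section, matching the description, and keywords are assumed to be lowercase strings matched by substring.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def _section_text(pattern, html):
    """Return the lowercased text of the first matching section, tags removed."""
    match = pattern.search(html)
    return TAG_RE.sub(" ", match.group(1)).lower() if match else ""


def page_weight(html, keyword_values, titleweight=2.0, bodyweight=1.0):
    """titleweight * sum(T_value) + bodyweight * sum(B_value).

    keyword_values maps lowercase keywords to their values.
    The default section weights are placeholders, not values from the source.
    """
    title = _section_text(TITLE_RE, html)
    body = _section_text(BODY_RE, html)
    t_sum = sum(v for k, v in keyword_values.items() if k in title)
    b_sum = sum(v for k, v in keyword_values.items() if k in body)
    return titleweight * t_sum + bodyweight * b_sum


def is_domain_relevant(html, keyword_values, filter_value, **weights):
    """A page is kept when its weight is greater than the threshold filterValue."""
    return page_weight(html, keyword_values, **weights) > filter_value
```

For a page titled "Buffer overflow advisory" whose body mentions "overflow" and "exploit", with values 1.5 and 1.0 respectively, the weight under the placeholder section weights is 2.0 × 1.5 + 1.0 × (1.5 + 1.0) = 5.5.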
5. Downloading security-defect-related pages with the search crawler
The vertical search crawler designed in step 1 crawls the Internet for pages related to the security defect domain. The security defect domain filter trainer built in the previous step then filters all downloaded pages, so that pages irrelevant to the security defect domain are filtered out.
6. Semantic annotation
Figure 4 shows the structure of semantic annotation and information extraction. To obtain information that meets the needs of the software security defect knowledge base and software security vulnerability analysis, the pages downloaded by the vertical search crawler must undergo semantic annotation and information extraction. This module extracts information from pages that have no fixed structure and that machines cannot understand. Semantic annotation marks up raw data (text or symbols) so that it carries semantic information that both people and machines can understand. The module is based on the GATE annotation tool: a rule base and gazetteer lists that GATE can interpret are designed and applied, following GATE's annotation principles, to pages containing security defect information. GATE annotation is implemented with ANNIE, a rule-based English information extraction system. ANNIE processes one document or a group of documents through a pipeline, annotating in a strictly defined order: English tokenization, English gazetteer lookup, English sentence splitting, English part-of-speech tagging, English annotation rule definition, English named entity recognition, and English coreference resolution. After these steps, information has been extracted from the whole document or group of documents. The gazetteer holds a set of typed lists, each containing the instances of its type. An annotation rule, that is, a JAPE rule, has a left-hand and a right-hand side: the left side matches the relevant part of the text, and the right side gives the concrete annotation type.
The gazetteer is used to look up the meaningful phrases needed to process the documents. This module uses the attributes of the components of a software security defect, namely the attributes of attack patterns, defect classification instances, and mitigation schemes, as gazetteer content. The meaning of these attributes is given in the definitions and descriptions of the security defect domain. JAPE (Java Annotation Patterns Engine) is used to build the rule base: its rules match information in the text with regular expressions and allow more accurate tokenization, sentence splitting, and named entity annotation. The rules built with JAPE are a set of grammar files, and JAPE rules of various types are designed according to the types of security defect information to be found.
The result of semantic annotation is an XML document that contains the semantic information together with other, irrelevant information from the page.
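To make the idea of gazetteer-driven annotation concrete, the following Python sketch wraps phrases from typed lists in XML elements. It is a deliberately simplified stand-in, not GATE, ANNIE, or JAPE: the `GAZETTEER` contents, the element names, and the `annotate` function are all hypothetical, and none of the linguistic pipeline stages are reproduced.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical typed lists; a real gazetteer would be far larger.
GAZETTEER = {
    "AttackPattern": ["sql injection", "buffer overflow"],
    "Mitigation": ["input validation", "bounds checking"],
}


def annotate(text, gazetteer=GAZETTEER):
    """Wrap every gazetteer phrase found in `text` in an element named after its type."""
    spans = []
    for label, phrases in gazetteer.items():
        for phrase in phrases:
            for m in re.finditer(re.escape(phrase), text, re.IGNORECASE):
                spans.append((m.start(), m.end(), label))
    spans.sort()

    out, pos = [], 0
    for start, end, label in spans:
        if start < pos:  # skip matches that overlap an earlier annotation
            continue
        out.append(escape(text[pos:start]))
        out.append(f"<{label}>{escape(text[start:end])}</{label}>")
        pos = end
    out.append(escape(text[pos:]))
    return "<document>" + "".join(out) + "</document>"
```

As in the description, the output mixes annotated spans with unannotated page text, which is why a separate parsing step is needed afterwards.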
7. Information extraction
Annotation parsing uses the DOM support in JAXP to parse the XML document generated in the previous step and complete the extraction of security defect information, providing noise-free, formatted data for later use. Finally, an interface for using the security defect information is provided; its goal is to store the formatted data according to the storage structure of the database.
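The parsing step can be sketched with a DOM parser. The source uses the DOM API of JAXP in Java; the sketch below uses Python's `xml.dom.minidom` as an analogous DOM implementation. The element names `AttackPattern` and `Mitigation` and the function `extract_records` are hypothetical, chosen only to mirror the kinds of information the description mentions.

```python
from xml.dom import minidom


def extract_records(xml_text, labels=("AttackPattern", "Mitigation")):
    """Collect the text of each annotated element, discarding everything else.

    `labels` lists the (hypothetical) annotation element names to extract.
    """
    dom = minidom.parseString(xml_text)
    records = {label: [] for label in labels}
    for label in labels:
        for node in dom.getElementsByTagName(label):
            text = "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
            records[label].append(text.strip())
    return records
```

The returned dictionary holds only the annotated values, which is the noise-free, formatted form that could then be written to database tables.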

Claims (1)

1. A method for obtaining software security defects based on vertical search and semantic annotation, comprising the following steps:
1) Using a search crawler based on vertical search technology, crawl one or more groups of web pages from pages related to security defect information, such that they include a sufficient number of security-defect-related pages; then divide these pages into two classes, relevant and irrelevant to the security domain, to obtain two training page sets: the security-domain-relevant training page set and the security-domain-irrelevant training page set;
2) Select potential keywords from the security-domain-relevant training page set, add further potential keywords on expert advice, and choose keywords according to the following probability ratio formula:
$$P_w = \log \frac{p(w \mid c)\,\bigl(1 - p(w \mid \bar{c})\bigr)}{p(w \mid \bar{c})\,\bigl(1 - p(w \mid c)\bigr)}$$
where
[Formula image FDA0000097454250000012: not recoverable from the source text]
In the formula, $w$ denotes a potential keyword, $c$ denotes the security-domain-relevant training page set, and $\bar{c}$ denotes the security-domain-irrelevant training page set. A threshold for selecting keywords is set; words whose $P_w$ is positive and greater than the threshold are selected as keywords, and the larger the value, the larger the weight assigned;
3) Use the selected keywords to build the security defect domain filter trainer;
4) Using the search crawler based on vertical search technology, automatically download pages from other security-defect-related pages on the Internet;
5) Use the security defect domain filter trainer to filter pages with the following keyword-weight-based method: divide each page into a title part and a body part and assign them different weights, titleweight and bodyweight; extract the keywords from each part, where $T_{value}$ is the weight of a keyword occurring in the title part and $B_{value}$ is the weight of a keyword occurring in the body part; compute the weighted sum over the keywords occurring in both parts according to the formula $\text{webpage weight} = \text{titleweight} \cdot \sum T_{value} + \text{bodyweight} \cdot \sum B_{value}$ to obtain the page weight; if this value is greater than a preset threshold, the page is considered relevant to the security domain, otherwise it is filtered out;
6) Semantically annotate the filtered pages;
7) Parse the annotations and extract the security defect information.
CN2010101688044A 2010-05-11 2010-05-11 Method for obtaining software security defects based on vertical search and semantic annotation Active CN101814098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101688044A CN101814098B (en) 2010-05-11 2010-05-11 Method for obtaining software security defects based on vertical search and semantic annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101688044A CN101814098B (en) 2010-05-11 2010-05-11 Method for obtaining software security defects based on vertical search and semantic annotation

Publications (2)

Publication Number Publication Date
CN101814098A (en) 2010-08-25
CN101814098B (en) 2012-05-02

Family

ID=42621350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101688044A Active CN101814098B (en) 2010-05-11 2010-05-11 Method for obtaining software security defects based on vertical search and semantic annotation

Country Status (1)

Country Link
CN (1) CN101814098B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104636404B (en) * 2013-11-14 2019-02-19 华为技术有限公司 Large-scale data generation method and device for test
CN105938532B (en) * 2015-11-25 2018-03-16 北京匡恩网络科技有限责任公司 It is a kind of to firmware sample on a large scale sampling and leak analysis method
CN105608232B (en) * 2016-02-17 2019-01-15 扬州大学 A kind of bug knowledge modeling method based on graphic data base
CN109299381B (en) * 2018-10-31 2020-04-24 哈尔滨工程大学 Software defect retrieval and analysis system and method based on semantic concept

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271464A (en) * 2007-11-26 2008-09-24 北京九城网络软件有限公司 Search method of internet search engine
CN101625641A (en) * 2009-08-05 2010-01-13 天津大学 Trusted software development method based on security defect knowledge base

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6311194B1 (en) * 2000-03-15 2001-10-30 Taalee, Inc. System and method for creating a semantic web and its applications in browsing, searching, profiling, personalization and advertising

Also Published As

Publication number Publication date
CN101814098A (en) 2010-08-25

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20201204

Address after: No.150 Pingdong Avenue, Pingchao Town, Tongzhou District, Nantong City, Jiangsu Province

Patentee after: Jiangsu Yongda power telecommunication installation engineering Co., Ltd

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92, Tianjin University

Patentee before: Tianjin University