CN103455754A

CN103455754A - Regular expression-based malicious search keyword recognition method

Info

Publication number: CN103455754A
Application number: CN201310401159XA
Authority: CN
Inventors: 邹福泰; 白巍; 潘道欣; 易平
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2013-09-05
Filing date: 2013-09-05
Publication date: 2013-12-18
Anticipated expiration: 2033-09-05
Also published as: CN103455754B

Abstract

The invention discloses a regular expression-based method for recognizing malicious search keywords. The method comprises: extracting characteristic fragments from a set of known malicious search keywords by means of a classifier, a generalized suffix tree and the CSS (Color Set Size) algorithm; building a keyword tree according to the occurrence frequency of the extracted characteristic fragments, wherein the characteristic fragments along each path of the keyword tree form one regular expression; screening and reducing the regular expressions to obtain a regular expression output set; building a filter that uses the regular expression output set as its threshold; and identifying and extracting new malicious search keywords with the filter. By using regular expressions to identify malicious search keywords, the disclosed method is fast and has a low false-positive rate and a low false-negative rate. Newly identified malicious search keywords make it possible to discover recently popular website vulnerabilities in time, and the websites returned for those keywords reveal which sites contain the potential vulnerability and where their web security weaknesses lie.

Description

A regular expression-based method for recognizing malicious search keywords

Technical field

The present invention relates to a method for recognizing malicious search keywords, and in particular to a method for recognizing malicious search keywords based on regular expressions.

Background art

Analysis and tracking of hundreds of network security incidents show that website intrusions are usually accompanied by comprehensive, targeted information gathering and vulnerability discovery against the attack target. Once hackers have found a vulnerability in a system or network, they keep producing new attack methods for it. To test new vulnerabilities and attack methods, hackers often use search engines to look for websites on the Internet that may have a particular vulnerability and then attack them. Some hackers also write tools that scan for and automatically exploit a particular vulnerability, and use search engines to scan and intrude, on a large scale, into every website on the Internet that may have it. In recent years, attacks that rely on public search engines such as Google and Baidu have become an important means of attack.

Most attackers have mastered this technique. If the keywords used by attackers can be analyzed in time and the corresponding websites found, the security weaknesses of websites and the vulnerable target websites can be discovered promptly. Mining these data also makes it possible to predict the direction of attack in different time periods and to discover newly exploited website vulnerabilities.

For example, suppose the keywords "inurl:index.action", "inurl:(.action) site:.edu.cn", "inurl:edu.cn filetype:action", "inurl:index.action" and "allinurl:+index.action" are seen for the first time in a search engine's keyword list, and all of them lead to the same few websites. Analysis shows that this is the early-stage information gathering for an attempted attack on a vulnerability in the Apache Struts2 framework, which a large number of developers currently use when building Web applications on J2EE. Screening the malicious keywords submitted to search engines is therefore valuable for security protection.

Those skilled in the art are therefore working to develop a regular expression-based method for recognizing malicious search keywords that identifies both known and unknown malicious search keywords. Starting from the known malicious search keywords, the method continually identifies new ones and adds them to the set of known malicious search keywords, keeping the set in step with the latest hacking techniques.

Summary of the invention

In view of the above defects of the prior art, the technical problem to be solved by the present invention is to provide a regular expression-based method for recognizing malicious search keywords.

To achieve the above object, the present invention provides a regular expression-based method for recognizing malicious search keywords, characterized by comprising the following steps:

Step (101), extract characteristic fragments: from a set of known malicious search keywords, extract characteristic fragments using a classifier, a generalized suffix tree and the CSS (Color Set Size) algorithm;

Step (102), build a keyword tree: arrange and connect the extracted characteristic fragments to build a keyword tree, wherein the characteristic fragments along each path of the keyword tree are joined into one regular expression;

Step (103), build a filter: screen and reduce all of the regular expressions to obtain the final regular expression output set, and build the filter using the regular expression output set as the threshold of the filter;

Step (104), identify and extract malicious search keywords: use the filter to perform regular expression matching on the keywords contained in search engine requests, which are identified in network traffic by the HTTP Referer, so as to detect malicious search attacks and extract new malicious search keywords, and add the new malicious search keywords to the set of known malicious search keywords;

Step (105), end.

Further, the classifier classifies the set of known malicious search keywords according to the purpose of the search attack.

Further, the generalized suffix tree and the CSS algorithm extract the characteristic fragments according to keyword occurrence frequency.

Further, the keyword tree has exactly one root node, and the characteristic fragments are the child nodes of the keyword tree.

Further, the child nodes of the keyword tree are arranged by the occurrence frequency of the characteristic fragments: the more frequently a characteristic fragment occurs, the closer it is to the root node; the less frequently it occurs, the farther it is from the root node.

Further, the screening and reduction of the regular expressions in step (103) is accomplished by an entropy-based evaluation of the regular expressions.

Further, the entropy-based evaluation comprises: calculating the probability that the regular expression matches a random string; setting a judgment threshold; comparing the probability with the judgment threshold; and selecting the regular expressions whose probability is less than the judgment threshold as the regular expression output set.

Further, the judgment threshold lies between 0 and 1.

By using regular expressions to identify malicious search keywords, the present invention is fast and has a low false-positive rate and a low false-negative rate. The newly identified malicious search keywords reveal the latest website vulnerabilities being attacked, and the results returned for these keywords reveal the vulnerable target websites and their security weaknesses. The newly identified malicious search keywords are continually added to the library of known malicious search keywords, so that the library is updated and expanded in real time, which further safeguards network security.

The concept, specific structure and technical effects of the present invention are further described below with reference to the accompanying drawings, so that its objects, features and effects can be fully understood.

Brief description of the drawings

Fig. 1 is a flowchart of the regular expression-based method for recognizing malicious search keywords of the present invention;

Fig. 2 is a schematic diagram of the keyword tree in a preferred embodiment of the regular expression-based method for recognizing malicious search keywords of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below with reference to the accompanying drawings. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and a specific operating procedure are given, but the scope of protection of the present invention is not limited to the following embodiment.

The regular expression-based method for recognizing malicious search keywords of the present invention extracts regular expressions with malicious characteristics from a library of known malicious search keywords. As shown in Fig. 1, it comprises the following steps:

Step 101, extract characteristic string fragments: the present invention uses a generalized suffix tree and the CSS (Color Set Size) algorithm to extract characteristic string fragments.

First, a classifier classifies the main components of the set of known malicious search keywords. In the present embodiment, the classifier processes the malicious search keywords "inurl:index.action", "inurl:(.action) site:.edu.cn", "inurl:edu.cn filetype:action", "inurl:index.action" and "allinurl:+index.action". The classifier classifies these malicious search keywords by attack purpose into three classes, "inurl", "site" and "filetype": the keywords "index.action", ".action", "edu.cn" and "index.action" belong to class "inurl"; the keyword "edu.cn" belongs to class "site"; and the keyword "action" belongs to class "filetype".
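The classification step can be illustrated with a minimal Python sketch. It assumes that the "attack purpose" of a keyword is given by its search operator prefix, as in the embodiment's example. The operator list, the `ALIASES` mapping that folds "allinurl" into "inurl", the "plain" label and the punctuation stripping are all assumptions made for illustration; the source does not specify how the classifier is implemented.

```python
import re
from collections import defaultdict

# Hypothetical operator list; the embodiment only names inurl, site and filetype.
OPERATORS = ("allinurl", "inurl", "site", "filetype")
# Assumption: "allinurl" queries are counted in the "inurl" class.
ALIASES = {"allinurl": "inurl"}
# One token is an optional "operator:" prefix followed by a run of non-spaces.
TOKEN = re.compile(r"(?:(" + "|".join(OPERATORS) + r"):)?(\S+)")


def classify(queries):
    """Group the arguments of each search operator by operator class."""
    classes = defaultdict(list)
    for query in queries:
        for operator, argument in TOKEN.findall(query):
            # Tokens without a known operator go into a catch-all class.
            label = ALIASES.get(operator, operator) or "plain"
            # Drop the grouping punctuation seen in the example queries.
            classes[label].append(argument.strip("()+"))
    return dict(classes)


known = [
    "inurl:index.action",
    "inurl:(.action) site:.edu.cn",
    "inurl:edu.cn filetype:action",
    "inurl:index.action",
    "allinurl:+index.action",
]
for label, arguments in classify(known).items():
    print(label, arguments)
```

In this sketch the leading dot of "site:.edu.cn" is kept, whereas the embodiment lists the class member as "edu.cn"; any further normalization would be an additional design choice.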

Second, the classified known malicious search keywords (each a string) are processed. Operating on the strings directly would repeat a great deal of lookup work. A generalized suffix tree is a data structure that stores the suffixes of strings and supports many string operations quickly. A generalized suffix tree can be built for any string; it has exactly one root node, every edge is labeled with a substring of the input string, and concatenating the edge labels along a path from the root node to a leaf node yields a suffix of the input string, so that all root-to-leaf paths together give all suffixes. The generalized suffix tree thus retains the useful information during string processing and avoids repeated lookups, which is why the present invention adopts it. Finally, the CSS algorithm is used to extract the frequently occurring characteristic fragments. In the present embodiment, "inurl" has three characteristic fragments, "action", "index" and "edu.cn"; "site" has one characteristic fragment, "edu.cn"; and "filetype" likewise has only one characteristic fragment, "action".
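The quantity that drives fragment extraction is, for each substring, the number of keywords that contain it. The following Python sketch computes that count by brute-force substring enumeration, which is a deliberately simple stand-in and not the generalized suffix tree plus CSS computation the invention uses. The function name `frequent_fragments`, the `min_len` and `min_support` parameters and the rule of keeping only the longest fragment among equally frequent nested ones are assumptions for illustration.

```python
from collections import Counter


def frequent_fragments(strings, min_len=3, min_support=2):
    """Return (fragment, keyword_count) pairs, most frequent first.

    Brute force: quadratic in the length of each string. A generalized
    suffix tree with the Color Set Size computation yields the same
    per-substring keyword counts far more efficiently.
    """
    doc_freq = Counter()
    for s in strings:
        # A set, so a substring is counted once per keyword.
        seen = {
            s[i:j]
            for i in range(len(s))
            for j in range(i + min_len, len(s) + 1)
        }
        doc_freq.update(seen)

    frequent = {f: n for f, n in doc_freq.items() if n >= min_support}
    # Assumption: drop a fragment when a longer fragment containing it
    # occurs in exactly as many keywords.
    maximal = {
        f: n
        for f, n in frequent.items()
        if not any(f != g and f in g and frequent[g] == n for g in frequent)
    }
    return sorted(maximal.items(), key=lambda item: (-item[1], -len(item[0]), item[0]))


inurl_class = ["index.action", ".action", "edu.cn", "index.action", "index.action"]
print(frequent_fragments(inurl_class))
```

With these assumed parameters the sketch does not reproduce the embodiment's fragment list exactly: "edu.cn" occurs in only one keyword of this small sample and falls below `min_support`, and the fragments are not split at the dot.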

Step 102, build the keyword tree:

The extracted characteristic string fragments are arranged and connected to build a keyword tree; the characteristic fragments along each path of the tree can be joined into one regular expression.

The keyword tree is built according to the occurrence frequency of the characteristic fragments. Each of its nodes is a characteristic fragment. The more frequently a characteristic fragment occurs, the closer it is to the root node; the less frequently it occurs, the closer it is to the leaf nodes. The tree classifies the characteristic fragments level by level according to their frequency and content, and all characteristic fragments on the path from the root node to any non-root node of the finished keyword tree can form a regular expression.

The keyword tree is built as follows. First, create a root node. Second, compute the occurrence frequency of the extracted characteristic fragments, select the fragment with the highest frequency and create a child node whose content is the current characteristic fragment together with all malicious search keywords it matches; a characteristic fragment that has already been matched cannot be matched again further along the same path. Then, continue creating child nodes for the remaining malicious search keywords until every malicious search keyword contained in the parent node is matched by a child node of this level. Next, create the child nodes of the following level, until no malicious search keyword of any node has a characteristic fragment left to extend with. Fig. 2 shows the keyword tree built from the characteristic fragments of "inurl", where u1 is index.action, u2 is action and u3 is edu.cn. The paths from the root node to each non-root node form several regular expressions, such as index(.*)action and edu.cn.
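The construction procedure can be sketched in Python as follows. The `Node` class and the `build` and `path_regexes` functions are hypothetical names. The text does not say how the frequency order of a path is reconciled with the order in which fragments appear inside a keyword, so this sketch assumes that the fragments of a path are reordered by their position in one sample keyword before being joined with ".*".

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    fragment: Optional[str]      # None for the root node
    keywords: List[str]          # keywords matched at this node
    children: List["Node"] = field(default_factory=list)


def build(node, fragments):
    """Greedily add children, most frequent fragment first, then recurse."""
    remaining = list(node.keywords)
    unused = list(fragments)
    while remaining and unused:
        best = max(unused, key=lambda f: sum(f in k for k in remaining))
        matched = [k for k in remaining if best in k]
        if not matched:
            break                # no fragment extends this node any further
        unused.remove(best)
        node.children.append(Node(best, matched))
        remaining = [k for k in remaining if best not in k]
    for child in node.children:
        # A fragment used on a path is not reused further down that path.
        build(child, [f for f in fragments if f != child.fragment])
    return node


def path_regexes(node, path=()):
    """One regular expression per path from the root to a non-root node."""
    out = []
    for child in node.children:
        fragments = path + (child.fragment,)
        sample = child.keywords[0]
        # Assumption: order fragments as they appear in a sample keyword.
        ordered = sorted(fragments, key=sample.find)
        out.append(".*".join(re.escape(f) for f in ordered))
        out.extend(path_regexes(child, fragments))
    return out


keywords = ["index.action", ".action", "edu.cn", "index.action", "index.action"]
tree = build(Node(None, keywords), ["action", "index", "edu.cn"])
print(path_regexes(tree))
```

On the "inurl" fragments of the embodiment this produces the patterns action, index.*action and edu\.cn, which agrees with the index(.*)action and edu.cn examples given for Fig. 2.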

Step 103, build a filter:

First, an entropy-based evaluation is performed on all regular expressions formed by the keywords along the paths from the root node to each leaf node of the keyword tree. For a regular expression e, E(u) is defined as the number of bits needed to form a random string u using this regular expression, and B(u) as the number of bits needed to form a random string u without using this regular expression. The entropy reduction d(e) equals the difference B(u) - E(u), so the probability that the regular expression matches a random string is:

$$P(e) = \frac{2^{E(u)}}{2^{B(u)}} = \frac{1}{2^{B(u)-E(u)}} = \frac{1}{2^{d(e)}};$$

Second, all regular expressions on the keyword tree are screened: the probability P(e) that a regular expression matches a random string is compared with a judgment threshold γ, whose value lies between 0 and 1 and is generally chosen empirically for the case at hand. The regular expressions whose probability is less than the threshold γ are selected; regular expressions screened in this way are sufficiently precise;

Finally, all regular expressions that pass the screening form the regular expression output set, and the filter is built using the regular expression output set as its threshold.
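The screening rule can be made concrete with a small Python sketch. The text does not say how E(u) and B(u) are computed, so the sketch rests on a strong simplifying assumption: every character of a random string costs log2(alphabet_size) bits, and each literal character fixed by the regular expression saves exactly that many bits, so that d(e) = B(u) - E(u) is the number of literal characters times the bits per character. The function names, the default `alphabet_size` of 256 and the representation of a regular expression as its list of literal fragments are assumptions for illustration.

```python
import math


def match_probability(fragments, alphabet_size=256):
    """Estimate P(e) = 2 ** -d(e) for a regex made of literal fragments.

    Assumption: d(e) is the number of literal characters the regular
    expression fixes, times the bits per character of a random string.
    """
    bits_per_char = math.log2(alphabet_size)
    d = bits_per_char * sum(len(f) for f in fragments)
    return 2.0 ** (-d)


def select(candidates, gamma):
    """Keep the regular expressions whose match probability is below gamma."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie between 0 and 1")
    return [frags for frags in candidates if match_probability(frags) < gamma]


candidates = [["index", "action"], ["action"], ["cn"]]
# Hypothetical threshold; the text says gamma is chosen empirically.
print(select(candidates, gamma=2.0 ** -40))
```

Under this model a longer literal part gives a smaller P(e), so short, unspecific patterns such as "cn" are the ones rejected by the threshold.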

Step 104, identify and extract malicious search keywords: the filter performs regular expression matching on the keywords contained in search engine requests, which are identified in network traffic by the HTTP Referer (the HTTP source address, a field of the HTTP header that indicates the page from which the current page was linked), so as to detect malicious search attacks and extract new malicious search keywords, and the new malicious search keywords are added to the set of known malicious search keywords.

Step 105, the recognition method ends.

The specific embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation under the concept of the present invention shall fall within the scope of protection defined by the claims.
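A minimal Python sketch of this step, using only the standard library, is shown here. The `SEARCH_PARAMS` table (query parameter "q" for Google, "wd" for Baidu), the host-label test and the function names are assumptions for illustration; the source only states that search requests are recognized from the HTTP Referer and matched against the regular expression output set.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumed mapping from a search engine host label to its query parameter.
SEARCH_PARAMS = {"google": "q", "baidu": "wd"}


def extract_query(referer):
    """Return the search keywords carried by a Referer URL, or None."""
    parts = urlparse(referer)
    labels = (parts.hostname or "").split(".")
    for engine, param in SEARCH_PARAMS.items():
        if engine in labels:
            values = parse_qs(parts.query).get(param)
            return values[0] if values else None
    return None


def scan(referers, patterns, known):
    """Add every newly matched search keyword to the known set."""
    for referer in referers:
        query = extract_query(referer)
        if query and query not in known and any(p.search(query) for p in patterns):
            known.add(query)
    return known


patterns = [re.compile(r"index.*action"), re.compile(r"edu\.cn")]
known = {"inurl:index.action"}
traffic = [
    "https://www.google.com/search?q=allinurl%3A%2Bindex.action",
    "http://www.baidu.com/s?wd=weather",
    "https://example.com/?q=index.action",
]
print(sorted(scan(traffic, patterns, known)))
```

Only the first Referer contributes a new keyword: the second does not match any pattern, and the third does not come from a recognized search engine.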

Claims

1. A regular expression-based method for recognizing malicious search keywords, characterized by comprising the following steps:

Step (101), extract characteristic fragments: from a set of known malicious search keywords, extract characteristic fragments using a classifier, a generalized suffix tree and the CSS algorithm;

Step (105), end.

2. The regular expression-based method for recognizing malicious search keywords as claimed in claim 1, wherein the classifier classifies the set of known malicious search keywords according to the purpose of the search attack.

3. The regular expression-based method for recognizing malicious search keywords as claimed in claim 1, wherein the generalized suffix tree and the CSS algorithm extract the characteristic fragments according to keyword occurrence frequency.

4. The regular expression-based method for recognizing malicious search keywords as claimed in claim 1, wherein the keyword tree has exactly one root node, and the characteristic fragments are the child nodes of the keyword tree.

5. The regular expression-based method for recognizing malicious search keywords as claimed in claim 4, wherein the child nodes of the keyword tree are arranged by the occurrence frequency of the characteristic fragments: the more frequently a characteristic fragment occurs, the closer it is to the root node; the less frequently it occurs, the farther it is from the root node.

6. The regular expression-based method for recognizing malicious search keywords as claimed in claim 1, wherein the screening and reduction of the regular expressions in step (103) is accomplished by an entropy-based evaluation of the regular expressions.

7. The regular expression-based method for recognizing malicious search keywords as claimed in claim 6, wherein the entropy-based evaluation comprises: calculating the probability that the regular expression matches a random string; setting a judgment threshold; comparing the probability with the judgment threshold; and selecting the regular expressions whose probability is less than the judgment threshold as the regular expression output set.

8. The regular expression-based method for recognizing malicious search keywords as claimed in claim 7, wherein the judgment threshold lies between 0 and 1.