CN103455754A - Regular expression-based malicious search keyword recognition method - Google Patents

Publication number
CN103455754A
Authority
CN
China
Prior art keywords
regular expression
keyword
malicious searches
characteristic fragment
malicious
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310401159XA
Other languages
Chinese (zh)
Other versions
CN103455754B (en)
Inventor
邹福泰
白巍
潘道欣
易平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201310401159.XA
Publication of CN103455754A
Application granted
Publication of CN103455754B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a regular expression-based malicious search keyword recognition method, comprising the steps of: extracting characteristic fragments from a known malicious search keyword set by using a classifier, a generalized suffix tree and the CSS (Color Set Size) algorithm; building a keyword tree according to the occurrence frequency of the extracted characteristic fragments, wherein the characteristic fragments along each path of the keyword tree form one regular expression; screening and reducing the regular expressions to obtain a regular expression output set; building a filter that takes the regular expression output set as its threshold; and identifying and extracting new malicious search keywords with the filter. The method disclosed by the invention identifies malicious search keywords with regular expressions and has the advantages of high speed, a low false positive rate and a low false negative rate. Newly popular website vulnerabilities can be discovered in time through the newly identified malicious search keywords, and websites that contain potential vulnerabilities, together with their web security weaknesses, can be found from the sites returned for the malicious search keywords.

Description

A regular expression-based malicious search keyword recognition method
Technical field
The present invention relates to a method for recognizing malicious search keywords, and in particular to a regular expression-based method for recognizing malicious search keywords.
Background
Analysis and tracking of hundreds of network security incidents shows that website intrusions are usually accompanied by comprehensive, targeted information gathering and vulnerability discovery against the target. Once attackers have found a vulnerability in a system or network, they keep developing new attack methods for it. To test new vulnerabilities and attack methods, attackers often use a search engine to find websites on the Internet that may have a given vulnerability, and then attack them. Some attackers also write tools that scan for a particular vulnerability and exploit it automatically, using a search engine to scan and compromise, on a large scale, every website on the Internet that may be affected. In recent years, attacks that rely on public search engines such as Google and Baidu have become an important attack technique.
This technique is well mastered by most attackers. If the keywords that attackers use are analyzed promptly and the corresponding websites are located, the security weaknesses of those websites and the vulnerable target sites can be found in time. Mining these data also makes it possible to predict the direction of attacks in different time periods and to discover the new website vulnerabilities being attacked.
For example, suppose that "inurl:index.action", "inurl:(.action) site:.edu.cn", "inurl:edu.cn filetype:action", "inurl:index.action" and "allinurl:+index.action" are observed for the first time in the keyword list of a search engine, and that they all lead to the same small set of websites. Analysis shows that these newly observed keywords are the early-stage information gathering of an attack attempt against an Apache Struts2 framework vulnerability; many developers currently use this framework when building Web applications with J2EE. Screening the malicious keywords submitted to search engines is therefore valuable for security protection.
Therefore, those skilled in the art are working to develop a regular expression-based malicious search keyword recognition method that identifies both known and unknown malicious search keywords. Starting from the known malicious search keywords, the method continually identifies new malicious search keywords and adds them to the known malicious search keyword set, keeping the set in step with the latest attack techniques.
Summary of the invention
In view of the above-mentioned shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a regular expression-based malicious search keyword recognition method.
To achieve the above object, the present invention provides a regular expression-based malicious search keyword recognition method, characterized in that it comprises the following steps:
Step (101), extract characteristic fragments: from a known malicious search keyword set, extract characteristic fragments by using a classifier, a generalized suffix tree and the CSS (Color Set Size) algorithm;
Step (102), build a keyword tree: order and connect the extracted characteristic fragments to build a keyword tree, wherein the characteristic fragments along each path of the keyword tree are joined into one regular expression;
Step (103), build a filter: screen and reduce all of the regular expressions to obtain the final regular expression output set, and build the filter with the regular expression output set as the threshold of the filter;
Step (104), identify and extract malicious search keywords: use the filter to perform regular expression matching on the keywords contained in search engine search requests, which are identified in network traffic according to the HTTP Referer, so as to discover malicious search attack behavior and extract new malicious search keywords, and add the new malicious search keywords to the known malicious search keyword set;
Step (105), end.
Further, the classifier classifies the known malicious search keyword set according to the purpose of the search attack.
Further, the generalized suffix tree and the CSS algorithm extract the characteristic fragments according to the occurrence frequency of the keywords.
Further, the keyword tree has exactly one root node, and the characteristic fragments serve as the child nodes of the keyword tree.
Further, the child nodes of the keyword tree are arranged according to the occurrence frequency of the characteristic fragments: the more frequently a characteristic fragment occurs, the closer it is to the root node; the less frequently it occurs, the farther it is from the root node.
Further, the screening and reduction of the regular expressions in step (103) is accomplished by an entropy-based evaluation of the regular expressions.
Further, the entropy-based evaluation comprises: calculating the probability that the regular expression matches a random string; setting a judgment threshold; comparing the probability with the judgment threshold; and selecting the regular expressions whose probability is less than the judgment threshold as the regular expression output set.
Further, the judgment threshold lies in the range 0 to 1.
Identifying malicious search keywords with regular expressions, as the present invention does, has the advantages of high speed, a low false positive rate and a low false negative rate. The newly identified malicious search keywords reveal the latest vulnerabilities used in website attacks, and the results returned for these malicious search keywords reveal the vulnerable target sites and the security weaknesses of those websites. The newly identified malicious search keywords are also continually added to the known malicious search keyword library, so that the library is updated and expanded in real time, which further safeguards network security.
The concept, specific structure and technical effects of the present invention are further described below with reference to the accompanying drawings, so that the purpose, features and effects of the present invention can be fully understood.
Brief description of the drawings
Fig. 1 is a flowchart of the regular expression-based malicious search keyword recognition method of the present invention;
Fig. 2 is a schematic diagram of the keyword tree in a preferred embodiment of the regular expression-based malicious search keyword recognition method of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation and a concrete operating procedure are given, but the scope of protection of the present invention is not limited to the following embodiment.
The regular expression-based malicious search keyword recognition method of the present invention extracts regular expressions with malicious characteristics from a library of known malicious search keywords. As shown in Fig. 1, it comprises the following steps:
Step 101, extract characteristic string fragments: the present invention uses a generalized suffix tree and the CSS (Color Set Size) algorithm to extract characteristic string fragments.
First, a classifier is used to classify the main components of the known malicious search keyword set. In the present embodiment, the classifier processes the malicious search keywords "inurl:index.action", "inurl:(.action) site:.edu.cn", "inurl:edu.cn filetype:action", "inurl:index.action" and "allinurl:+index.action". The classifier sorts these malicious search keywords by attack purpose into three classes, "inurl", "site" and "filetype": the keywords "index.action", ".action", "edu.cn" and "index.action" belong to the class "inurl"; the keyword "edu.cn" belongs to the class "site"; and the keyword "action" belongs to the class "filetype".
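The classification step can be illustrated with a minimal Python sketch. The patent does not say how the classifier is implemented, so everything here is an assumption: the operator list, the regular expression `TOKEN`, the helper name `classify`, the choice to treat `allinurl` as a variant of `inurl`, and the normalization that strips parentheses, plus signs and leading dots.

```python
import re
from collections import defaultdict

# Hypothetical operator list; the text only names inurl, site and filetype.
OPERATORS = ("allinurl", "inurl", "site", "filetype")
TOKEN = re.compile(r"(%s):(\S+)" % "|".join(OPERATORS))


def classify(keywords):
    """Group the operator arguments of each search keyword by operator type."""
    classes = defaultdict(list)
    for keyword in keywords:
        for op, value in TOKEN.findall(keyword):
            # Assumption: allinurl is counted under the inurl class.
            if op == "allinurl":
                op = "inurl"
            # Assumption: drop decoration such as "(", ")", "+" and leading dots.
            classes[op].append(value.strip("()+").lstrip("."))
    return dict(classes)


known = [
    "inurl:index.action",
    "inurl:(.action) site:.edu.cn",
    "inurl:edu.cn filetype:action",
    "inurl:index.action",
    "allinurl:+index.action",
]
classes = classify(known)
```

Because of the normalization assumed here, the grouping differs slightly from the one listed in the embodiment (for example, ".action" is stored as "action"); a real classifier would need the exact rules the inventors used.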
Next, the known malicious search keywords classified by the classifier (each of which is a string) are processed. Operating on the strings directly would repeat a great deal of lookup work. A generalized suffix tree is a data structure that stores the suffixes of strings and supports many string operations quickly. A generalized suffix tree can be built for any string. It has a single root node, each edge is labeled with a substring of the input string, and concatenating the edge labels along the paths from the root node to the leaf nodes yields all suffixes of the input string. When strings are processed in this way, the generalized suffix tree retains the useful information and avoids repeated lookups during string operations, which is why the present invention adopts it. Finally, the CSS algorithm is used to extract the characteristic fragments that occur frequently. In the present embodiment, "inurl" has three characteristic fragments, "action", "index" and "edu.cn"; "site" has one characteristic fragment, "edu.cn"; and "filetype" likewise has only one characteristic fragment, "action".
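The quantity that the generalized suffix tree and the CSS algorithm compute, namely how many distinct input strings contain each substring, can be shown with a small Python sketch. This sketch deliberately swaps in brute-force substring enumeration for the suffix tree, so it is far less efficient and is not the patented procedure; the function names and the `min_len` and `min_count` parameters are hypothetical.

```python
from collections import defaultdict


def fragment_counts(strings, min_len=3):
    """Map each substring to the number of distinct input strings containing it."""
    owners = defaultdict(set)
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + min_len, len(s) + 1):
                owners[s[i:j]].add(idx)
    return {frag: len(ids) for frag, ids in owners.items()}


def frequent_fragments(strings, min_count=2, min_len=3):
    """Return maximal substrings shared by at least min_count strings."""
    counts = fragment_counts(strings, min_len)
    frequent = {f: c for f, c in counts.items() if c >= min_count}
    # Drop a fragment if a longer fragment with the same count contains it.
    maximal = [
        f for f in frequent
        if not any(f != g and f in g and frequent[g] == frequent[f] for g in frequent)
    ]
    return sorted(maximal, key=lambda f: (-frequent[f], f))


inurl_values = ["index.action", "action", "edu.cn", "index.action", "index.action"]
fragments = frequent_fragments(inurl_values)
```

Which fragments come out depends on the assumed thresholds, so the result need not coincide with the three fragments listed in the embodiment.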
Step 102, build the keyword tree:
The extracted characteristic string fragments are ordered and connected to build a keyword tree; the characteristic fragments along each path of the tree can be joined into a regular expression.
The keyword tree is built according to the occurrence frequency of the characteristic fragments. Each of its nodes is a characteristic fragment. The more frequently a characteristic fragment occurs, the closer it is to the root node; the less frequently it occurs, the closer it is to the leaf nodes. The tree classifies the characteristic fragments layer by layer according to their frequency and content, and all characteristic fragments on the path from the root node to any non-root node of the finished tree can form a regular expression.
The keyword tree is built as follows. First, a root node is created. Second, the occurrence frequency of each extracted characteristic fragment is computed, the fragment with the highest frequency is selected, and a child node is created whose content is that characteristic fragment together with all malicious search keywords it matches; a characteristic fragment that has been matched cannot be matched again further along the same path. Then child nodes continue to be created for the remaining malicious search keywords until every malicious search keyword contained in the parent node is matched by a child node of this layer. Next, the child nodes of the following layer are created in the same way, until the malicious search keywords of every node have no characteristic fragment left with which to extend the path. Fig. 2 shows the keyword tree built for the characteristic fragments of "inurl", where u1 is index.action, u2 is action and u3 is edu.cn. The paths from the root node to each non-root node form several regular expressions, such as index(.*)action and edu.cn.
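The construction just described can be sketched in Python as a greedy, frequency-ordered recursion. This is only one possible reading of the text: the `Node` class and the `build` and `path_regexes` helpers are hypothetical, fragments are assumed not to overlap, and joining the fragments of a path in the order in which they appear in a matched keyword is an assumption, since the text does not state how the order inside the regular expression is chosen.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    fragment: str                      # characteristic fragment held by this node
    keywords: List[str]                # keywords matched by the fragment
    children: List["Node"] = field(default_factory=list)


def build(keywords, fragments, used=()):
    """Return the child nodes for `keywords`, most frequent fragment first."""
    children, remaining = [], list(keywords)
    candidates = [f for f in fragments if f not in used]
    while remaining and candidates:
        # Pick the fragment contained in the most remaining keywords.
        best = max(candidates, key=lambda f: sum(f in k for k in remaining))
        matched = [k for k in remaining if best in k]
        if not matched:
            break
        node = Node(best, matched)
        # A fragment already on this path is not reused further down.
        node.children = build(matched, fragments, used + (best,))
        children.append(node)
        remaining = [k for k in remaining if best not in k]
        candidates.remove(best)
    return children


def path_regexes(children, path=()):
    """Yield one regular expression per path from the root to a non-root node."""
    for node in children:
        frags = path + (node.fragment,)
        sample = node.keywords[0]
        ordered = sorted(frags, key=sample.find)  # order of appearance (assumed)
        yield ".*".join(re.escape(f) for f in ordered)
        yield from path_regexes(node.children, frags)


values = ["index.action", "action", "edu.cn", "index.action", "index.action"]
tree = build(values, ["action", "index", "edu.cn"])
regexes = list(path_regexes(tree))
```

With these inputs the root gets two children, one for "action" (with "index" below it) and one for "edu.cn", which gives expressions of the same shape as the index(.*)action and edu.cn examples in the text.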
Step 103, build a filter:
First, all regular expressions formed by the paths of the keyword tree from the root node to each leaf node are evaluated on the basis of entropy. For a regular expression e, define E(u) as the number of bits needed to encode a random string u when the regular expression is used, and B(u) as the number of bits needed to encode a random string u when the regular expression is not used. The entropy reduction d(e) equals the difference B(u) - E(u), so the probability that the regular expression matches a random string is:
$$P(e) = \frac{2^{E(u)}}{2^{B(u)}} = \frac{1}{2^{B(u)-E(u)}} = \frac{1}{2^{d(e)}}$$
Second, all regular expressions on the keyword tree are screened: the probability P(e) that a regular expression matches a random string is compared with a judgment threshold γ, whose value lies between 0 and 1 and is generally chosen empirically for the situation at hand. The regular expressions whose probability is less than the threshold γ are selected; regular expressions screened in this way are sufficiently precise.
Finally, all selected regular expressions form the regular expression output set, which serves as the threshold of the filter, and the filter is built.
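The entropy-based screening can be illustrated with a short Python sketch. The text does not say how E(u) and B(u) are obtained, so the sketch assumes a deliberately simple model: random strings have a fixed length `LENGTH` over an alphabet of `ALPHABET_SIZE` symbols, and every literal character fixed by the expression saves log2(ALPHABET_SIZE) bits. Both constants, the helper names and the example threshold are hypothetical.

```python
import math

ALPHABET_SIZE = 64   # assumed number of distinct characters
LENGTH = 20          # assumed length of a random string


def literal_length(regex):
    """Count fixed characters; '.*' is a wildcard gap, '\\x' is one literal."""
    return len(regex.replace(".*", "").replace("\\", ""))


def match_probability(regex, n=LENGTH, alphabet=ALPHABET_SIZE):
    """P(e) = 1 / 2**d(e), with d(e) = B(u) - E(u), under the assumed model."""
    bits_per_char = math.log2(alphabet)
    b = n * bits_per_char                                   # bits without e
    e = max(n - literal_length(regex), 0) * bits_per_char   # bits with e
    return 2.0 ** -(b - e)


def select(regexes, gamma):
    """Keep the expressions whose match probability is below the threshold."""
    return [r for r in regexes if match_probability(r) < gamma]


candidates = ["action", "index.*action", r"edu\.cn"]
output_set = select(candidates, gamma=1e-12)
```

Under this model a longer literal part gives a larger d(e) and a smaller P(e), so more specific expressions pass the threshold while short, generic ones are discarded.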
Step 104, identify and extract malicious search keywords: the filter performs regular expression matching on the keywords contained in search engine search requests, which are identified in network traffic according to the HTTP Referer (a field of the HTTP header that gives the source address, that is, the page from which the current page was linked). This discovers malicious search attack behavior and extracts new malicious search keywords, which are added to the known malicious search keyword set.
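A minimal Python sketch of this matching step follows, using only the standard library. The `ENGINES` table (host suffix to query parameter), the helper names and the example pattern are assumptions made for illustration; the text does not list which search engines or parameters are recognized.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from search engine domain to its query parameter.
ENGINES = {"google.com": "q", "baidu.com": "wd"}


def search_keyword(referer):
    """Return the search keyword carried in a Referer URL, or None."""
    url = urlparse(referer)
    host = url.hostname or ""
    for domain, param in ENGINES.items():
        if host == domain or host.endswith("." + domain):
            values = parse_qs(url.query).get(param)
            return values[0] if values else None
    return None


def is_malicious(referer, patterns):
    """True if the Referer comes from a known engine and a pattern matches."""
    keyword = search_keyword(referer)
    return keyword is not None and any(p.search(keyword) for p in patterns)


patterns = [re.compile(r"index.*action")]  # stand-in for the regex output set
hit = is_malicious("https://www.google.com/search?q=inurl%3Aindex.action", patterns)
miss = is_malicious("https://www.google.com/search?q=weather", patterns)
```

A keyword that matches would then be appended to the known malicious search keyword set, so that later runs of the extraction step can learn from it.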
Step 105, the recognition method ends.
Specific embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (8)

1. A regular expression-based malicious search keyword recognition method, characterized in that it comprises the following steps:
Step (101), extract characteristic fragments: from a known malicious search keyword set, extract characteristic fragments by using a classifier, a generalized suffix tree and the CSS algorithm;
Step (102), build a keyword tree: order and connect the extracted characteristic fragments to build a keyword tree, wherein the characteristic fragments along each path of the keyword tree are joined into one regular expression;
Step (103), build a filter: screen and reduce all of the regular expressions to obtain the final regular expression output set, and build the filter with the regular expression output set as the threshold of the filter;
Step (104), identify and extract malicious search keywords: use the filter to perform regular expression matching on the keywords contained in search engine search requests, which are identified in network traffic according to the HTTP Referer, so as to discover malicious search attack behavior and extract new malicious search keywords, and add the new malicious search keywords to the known malicious search keyword set;
Step (105), end.
2. The regular expression-based malicious search keyword recognition method according to claim 1, wherein the classifier classifies the known malicious search keyword set according to the purpose of the search attack.
3. The regular expression-based malicious search keyword recognition method according to claim 1, wherein the generalized suffix tree and the CSS algorithm extract the characteristic fragments according to the occurrence frequency of the keywords.
4. The regular expression-based malicious search keyword recognition method according to claim 1, wherein the keyword tree has exactly one root node, and the characteristic fragments serve as the child nodes of the keyword tree.
5. The regular expression-based malicious search keyword recognition method according to claim 4, wherein the child nodes of the keyword tree are arranged according to the occurrence frequency of the characteristic fragments: the more frequently a characteristic fragment occurs, the closer it is to the root node; the less frequently it occurs, the farther it is from the root node.
6. The regular expression-based malicious search keyword recognition method according to claim 1, wherein the screening and reduction of the regular expressions in step (103) is accomplished by an entropy-based evaluation of the regular expressions.
7. The regular expression-based malicious search keyword recognition method according to claim 6, wherein the entropy-based evaluation comprises: calculating the probability that the regular expression matches a random string; setting a judgment threshold; comparing the probability with the judgment threshold; and selecting the regular expressions whose probability is less than the judgment threshold as the regular expression output set.
8. The regular expression-based malicious search keyword recognition method according to claim 7, wherein the judgment threshold lies in the range 0 to 1.
CN201310401159.XA 2013-09-05 2013-09-05 Regular expression-based malicious search keyword recognition method Expired - Fee Related CN103455754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310401159.XA CN103455754B (en) 2013-09-05 2013-09-05 Regular expression-based malicious search keyword recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310401159.XA CN103455754B (en) 2013-09-05 2013-09-05 Regular expression-based malicious search keyword recognition method

Publications (2)

Publication Number Publication Date
CN103455754A (en) 2013-12-18
CN103455754B (en) 2016-05-04

Family

ID=49738104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310401159.XA Expired - Fee Related CN103455754B (en) 2013-09-05 2013-09-05 Regular expression-based malicious search keyword recognition method

Country Status (1)

Country Link
CN (1) CN103455754B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101784022A (en) * 2009-01-16 2010-07-21 北京炎黄新星网络科技有限公司 Method and system for filtering and classifying short messages
CN102142009A (en) * 2010-12-09 2011-08-03 华为技术有限公司 Method and device for matching regular expressions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DAOXIN PAN,WEI BAI,SIYU ZHANG,FUTAI ZOU: "Detecting Malicious Queries From Search Engine Traffic", 《THE 8TH INTERNATIONAL CONFERENCE ON COMMUNICATION,NETWORK AND MOBILE COMPUTING》, 23 September 2012 (2012-09-23) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038608A (en) * 2017-04-21 2017-08-11 北京恒冠网络数据处理有限公司 A kind of big data analysis system
CN107247783A (en) * 2017-06-14 2017-10-13 上海思依暄机器人科技股份有限公司 A kind of method and device of phonetic search music
CN107992481A (en) * 2017-12-25 2018-05-04 中科鼎富(北京)科技发展有限公司 A kind of matching regular expressions method, apparatus and system based on multiway tree
CN107992481B (en) * 2017-12-25 2021-05-04 鼎富智能科技有限公司 Regular expression matching method, device and system based on multi-way tree
CN110059725A (en) * 2019-03-21 2019-07-26 中国科学院计算技术研究所 A kind of detection malicious searches system and method based on search key
CN110059725B (en) * 2019-03-21 2021-07-09 中国科学院计算技术研究所 Malicious search detection system and method based on search keywords
CN114757267A (en) * 2022-03-25 2022-07-15 北京爱奇艺科技有限公司 Method and device for identifying noise query, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN103455754B (en) 2016-05-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160504

Termination date: 20180905
