CN105072089A

CN105072089A - WEB malicious scanning behavior anomaly detection method and system

Info

Publication number: CN105072089A
Application number: CN201510404406.0A
Authority: CN
Inventors: 杨婧; 罗熙; 刘艇; 吴再龙
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-07-10
Filing date: 2015-07-10
Publication date: 2015-11-18
Anticipated expiration: 2035-07-10
Also published as: CN105072089B

Abstract

The invention discloses a method and a system for anomaly detection of malicious WEB scanning behavior. The method comprises the following steps: 1) extracting keyword features and statistical features of each accessing user from the access history records, and building a keyword vector and a statistical feature vector for each user; 2) traversing the keyword vectors of all users, counting the number of users corresponding to each keyword, and building a global keyword table; 3) calculating the uncommon degree of each keyword according to the global keyword table, calculating the original anomaly score of each accessing user from the uncommon degrees of that user's keywords, and then correcting the original anomaly score according to the user's statistical feature vector to obtain the user's final anomaly score; 4) finding the jump point of the sorted sequence of final anomaly scores of all accessing users, and taking the final anomaly score at the jump point as the threshold; and 5) comparing the final anomaly score of each accessing user with the threshold, and identifying the user as a malicious scanning user if the score is greater than the threshold. The method can discover unknown attack behavior and does not rely on normal historical data.

Description

A WEB malicious scanning behavior anomaly detection method and system

Technical field

The present invention relates to a method and a system for anomaly detection of malicious WEB scanning behavior, and belongs to the field of WEB security.

Background art

WEB scanning is a common type of WEB access behavior; it generally refers to a web crawler fetching the content of a target website according to certain rules. Malicious WEB scanning differs from normal scanning in its goal: the former aims to discover website vulnerabilities, sensitive information, authorization entry points and similar information through scanning, whereas the latter aims to obtain the content that the website normally provides, such as HTML pages, pictures and CSS files. Because the two goals are inherently different, their access behavior also differs markedly:

First, the access requests of malicious WEB scanning differ markedly in semantics from those of normal WEB scanning. For example, a malicious scan may send "/robots.txt/1.php" to judge whether the website has a file type parsing vulnerability, send "/web1.rar" to detect whether an improperly stored backup file exists, or send "/servlet/ContentServer?pagename=<script>alert('Vulnerable')</script>" to judge whether the website has a cross-site scripting vulnerability. A normal WEB scan, by contrast, sends requests that follow the website's links, such as "/abc/def/201309/201309_3629160.html" or "/images/xian06.gif".

Second, the proportion of error response codes (e.g., 401, 404) among the requests sent by a malicious WEB scan is generally higher, whereas a normal WEB scan aims to obtain the resources the website provides, so the proportion of 200 responses among its requests is generally higher. However, some malicious WEB scans first crawl pages normally and only carry out malicious content scanning after determining the website structure, which lowers the proportion of error response codes among their requests.

Third, normal WEB scanning generally uses the GET method, and a small number of crawlers use the HEAD method. Some malicious WEB scans, however, use a large number of HEAD requests to judge quickly whether a target file exists on the website, or use methods such as PUT and DELETE to test whether the website content can be modified.

Malicious WEB scanning detection is generally treated as part of WEB attack detection: each WEB request is usually first checked individually to determine whether it is an attack, and the overall behavior of the attacker is then judged to determine whether it constitutes malicious scanning. Typically, if the number of attacks or the attack duration exceeds a certain threshold, the attack is considered malicious WEB scanning. Common WEB attack detection methods fall into two classes:

One class is rule-based detection, which identifies WEB attacks against website vulnerabilities, such as SQL injection and cross-site scripting attacks, by matching attack rules. Rule-based detection can detect and intercept malicious WEB scanning behavior in real time, but it can only detect known attack requests and cannot discover unknown ones.

The other class is anomaly detection based on normal data: by learning from the website's normal access traffic, an access whitelist or a positive access model is established, and only requests that are on the whitelist or conform to the positive access model are allowed. The accuracy of such methods depends heavily on normal historical data; if there is no normal historical data, if the normal historical data covers too few kinds of normal access behavior, or if attack data is mixed into the historical data, both the false positive rate and the false negative rate of these methods rise sharply. Most existing anomaly detection techniques belong to this class. For example, "A web intrusion prevention method and system applied to the application layer" (CN201110117191) discloses a web intrusion prevention method applied to the application layer, which scores visitor behavior against preset dangerous behaviors and defends against access behavior based on the accumulated threat value; the preset dangerous behaviors are obtained from the parameters of a positive model built by learning from a large amount of historical normal behavior. "Method for protecting against WEB attacks" (CN201410737526) discloses a method of protecting against Web attacks that combines a blacklist with a whitelist, where the whitelist is likewise obtained by learning from normal behavior.

Summary of the invention

In view of the technical problems in the prior art, the object of the present invention is to provide a novel method and system for anomaly detection of malicious WEB scanning behavior, which does not rely on normal access data and can identify malicious scanning users among a massive number of website users.

The technical solution of the method extracts keyword features and statistical features for the accessing users in the WEB access history records (such as WEB access logs). It exploits the fact that normal user access behavior is mutually similar and accounts for the majority: first, a semantic anomaly value of each user's behavior is calculated from the keyword features; then the semantic anomaly value is corrected with the statistical features according to a heuristic algorithm, giving the user's access anomaly score; finally, a threshold on the user anomaly scores is calculated, and users whose anomaly score exceeds the threshold are identified as malicious scanning users.

A WEB malicious scanning behavior anomaly detection method, comprising the following steps:

1) preprocessing the WEB access history records: parsing the records, marking well-known search engine crawler users, performing word segmentation on the WEB request character strings of each user to extract a keyword vector, and at the same time counting each user's access behavior in four respects, namely total number of accesses, accesses per request method, accesses per page type and accesses per response code, to obtain a statistical feature vector;

2) traversing the keyword vectors of all users, counting the number of users corresponding to each keyword, and building a global keyword table that records each keyword and its corresponding number of users;

3) calculating the uncommon degree of each keyword according to the global keyword table, traversing each user's keyword vector, calculating the original anomaly score of each user from the uncommon degrees of its keywords, and then correcting the original anomaly score according to the user's statistical feature vector to obtain the final anomaly score;

4) sorting the final anomaly scores of the users in ascending order to form an anomaly value sequence, calculating the jump point (catastrophe point) of the anomaly value sequence, and taking the final anomaly score corresponding to the jump point as the threshold; if there is no jump point, no threshold exists;

5) if a threshold exists, judging whether the final anomaly score of each user is greater than the threshold, and if so, identifying the user as a malicious scanning user.

Further, the WEB request character string may be all or part of the content of the main fields of a WEB request, such as the request URL, the request user agent information USER-AGENT, and the request BODY content.

Further, the word segmentation method is as follows:

The WEB request character string is converted to lowercase and split using a specified set of delimiter characters (stop characters); the number of occurrences of each word is recorded, and the user's keyword vector is built.

Further, the specified set of delimiter characters includes, but is not limited to, characters such as "/", ".", "?", "=", "&" and ",".

Further, the statistical feature vector of user access behavior comprises the user's total number of accesses, the number of error response codes received, the number of accesses to auxiliary elements, and the number of requests using the PUT and DELETE methods of the HTTP protocol, where auxiliary elements are files such as CSS, pictures, audio, Office documents and PDF.

Further, the user keyword vector is $K_u = \{(k_i, c_{k_i}) \mid 0 \le i \le m-1\}$, where $m$ is the number of keywords of the user, $k_i$ is the $i$-th keyword, and $c_{k_i}$ is the number of times keyword $k_i$ occurs in all WEB request character strings of the user.

Further, the global keyword table is $GK = \{(k_i, uc_{k_i}) \mid 0 \le i \le N-1,\ 1 \le uc_{k_i} \le N_u\}$, where $N$ is the total number of distinct keywords of all users, $uc_{k_i}$ is the number of users in whose WEB request character strings keyword $k_i$ occurs, and $N_u$ is the total number of users.

Further, the uncommon degree of a keyword is calculated as follows:

For the user keyword vector $K_u = \{(k_i, c_{k_i}) \mid 0 \le i \le m-1\}$ and the global keyword table $GK = \{(k_i, uc_{k_i}) \mid 0 \le i \le N-1\}$, the uncommon degree of $k_i$ is $P_{k_i} = \mathrm{Log}(c_{k_i}) \cdot \mathrm{Log}(N_u / uc_{k_i} \cdot uc_{k_i})$, where $\mathrm{Log}(x)$ is the natural logarithm function.

Further, the original anomaly score of a user is calculated as follows:

If the user is marked as a well-known search engine crawler, its original anomaly score is 0; otherwise, the original anomaly score of the user is obtained by summing the uncommon degrees of all keywords in the user's keyword vector.

Further, the final anomaly score of a user is obtained by the following correction:

Correction parameters $w_1$, $w_2$ and $w_3$ are chosen, and the final anomaly score is calculated as: final anomaly score = original anomaly score × Exp($w_1$, number of error response codes / total number of accesses) × Exp($w_2$, number of auxiliary element accesses / total number of accesses) + $w_3$ × (number of PUT and DELETE method requests / total number of accesses), where Exp(a, b) is the exponential function with base a and exponent b.
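As an illustration of the correction formula, the following worked example uses purely hypothetical values that do not come from the source: an original anomaly score of 10, $w_1 = 4$, $w_2 = 0.25$, $w_3 = 5$, an error-response ratio of 0.5, an auxiliary-element ratio of 0.5, and a PUT/DELETE ratio of 0.1. The shorthand $A'$ (original score), $A$ (final score), $r_E$, $r_A$ and $r_P$ (the three ratios) is introduced here only for brevity.

```latex
\begin{aligned}
A &= A' \cdot w_1^{\,r_E} \cdot w_2^{\,r_A} + w_3 \cdot r_P \\
  &= 10 \cdot 4^{0.5} \cdot 0.25^{0.5} + 5 \cdot 0.1 \\
  &= 10 \cdot 2 \cdot 0.5 + 0.5 \\
  &= 10.5
\end{aligned}
```

With these illustrative values, a base greater than 1 ($w_1$) amplifies the score as the error ratio grows, while a base below 1 ($w_2$) damps it as the share of auxiliary-element requests grows; the source itself does not state which ranges the parameters should take.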

Further, the jump point (catastrophe point) of the anomaly value sequence is calculated as follows:

A mutation parameter $T$ is selected. For the sequence $SA = \{\alpha_i \mid 0 \le i \le N_u-1,\ \alpha_i \le \alpha_{i+1}\}$, the sequence $SB = \{\beta_j \mid 0 \le j \le N_u-2\}$ is calculated, where $\beta_j = \alpha_{j+1} - \alpha_j$ if $\alpha_{j+1} - \alpha_j > T$, and $\beta_j = 0$ otherwise. For the sequence $SB$, the sequence $SC = \{\gamma_k \mid 0 \le k \le N_u-3\}$ is calculated, where $\gamma_k = \beta_{k+1} - \beta_k$. In the sequence $SC$, $k^*$ is sought as the smallest $k$ satisfying $\gamma_k > T$ or $\gamma_k < 0$. If $k^*$ exists, the jump point of the anomaly value sequence is $\alpha_{k^*+1}$; otherwise, the anomaly value sequence has no jump point.

A WEB malicious scanning behavior detection system, comprising a configuration reading module, a data preprocessing module, an anomaly detection module and a data storage module.

The configuration reading module is responsible for reading configuration parameter information from the data storage module, including the access configuration of the WEB access history records, the IP list of well-known search engine crawlers, the correction parameters $w_1$, $w_2$, $w_3$ and the mutation parameter $T$;

The data preprocessing module is responsible for reading and parsing the original WEB access history records from the data storage module, marking whether each user is a crawler according to the IP list of well-known search engine crawlers, calculating the keyword vector and the statistical feature vector of each user, and storing the preprocessing results in the data storage module;

The anomaly detection module is responsible for reading the preprocessing results from the data storage module, performing anomaly detection according to the keyword vectors and statistical feature vectors of the users, outputting the malicious scanning user detection results and storing them in the data storage module;

The data storage module is responsible for storing the system configuration information, the original WEB access history records, the data preprocessing results and the malicious scanning user detection results.

Beneficial effects of the present invention:

The method of the present invention can automatically detect and identify malicious WEB scanning behavior from WEB access history records, including WEB attacks such as SQL injection and XSS, WEBSHELL probing access behavior, and scanning of registration or management interfaces. It relies neither on detection rules nor on normal historical data, and can also discover unknown attacks.

Brief description of the drawings

Fig. 1 is a schematic overview of the principle of the WEB malicious scanning behavior anomaly detection system of the present invention.

Fig. 2 is a schematic diagram of the module composition in an embodiment of the WEB malicious scanning behavior anomaly detection system of the present invention.

Fig. 3 is a schematic flow diagram of the data preprocessing module in an embodiment of the WEB malicious scanning behavior anomaly detection system of the present invention.

Fig. 4 is a schematic flow diagram of the anomaly detection module in an embodiment of the WEB malicious scanning behavior anomaly detection system of the present invention.

Detailed description of the embodiments

Fig. 1 is a schematic overview of the principle of the WEB malicious scanning behavior anomaly detection system of the present invention. The system performs data preprocessing and anomaly analysis on the input WEB access history records and discovers malicious scanning users from them.

Fig. 2 is a schematic diagram of the module composition in an embodiment of the WEB malicious scanning anomaly detection system of the present invention.

In the present embodiment, the WEB malicious scanning behavior detection system consists of a configuration reading module, a data preprocessing module, an anomaly detection module and a data storage module.

The data storage module is responsible for storing the system configuration information, the original WEB access history records, the data preprocessing results and the malicious scanning user detection results. It can be implemented with a relational database, a non-relational database (e.g., an Elasticsearch system) or text files. The present embodiment uses a relational database.

The configuration reading module reads configuration parameter information from the data storage module, including the database access configuration information, the name of the database table holding the WEB access history records to be analyzed, the IP list of well-known search engine crawlers, the correction parameters $w_1$, $w_2$, $w_3$ and the mutation parameter $T$.

Using the database access configuration information and the database table name of the WEB access history records to be analyzed, both provided by the configuration reading module, the data preprocessing module reads and parses the original WEB access history records from the data storage module, marks whether each user is a crawler according to the IP list of well-known search engine crawlers, calculates the keyword vector and the statistical feature vector of each user, and stores the preprocessing results in the data storage module.

Using the database access configuration information provided by the configuration reading module, the anomaly detection module reads the preprocessing results from the data storage module, performs anomaly detection according to the keyword vectors and statistical feature vectors of the users, outputs the malicious scanning user detection results and stores them in the data storage module.

In the present embodiment, the WEB access history records to be analyzed are the WEB access logs recorded by the WEB server of a website, different users are distinguished by access IP, and the WEB request character string is the request URL recorded in the WEB log.

A typical WEB access log record is as follows:

192.168.1.1 - - [15/May/2015:15:55:29 +0800] "GET /abc/font/webfont.php?id=1 HTTP/1.1" 200 43841 "http://www.test.com/abc/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0"

Here 192.168.1.1 is the access IP, which uniquely identifies the user; GET is the request method; 200 is the response code; and "/abc/font/webfont.php?id=1" is the request URL. After word segmentation, the keywords obtained from this request URL are "abc", "font", "webfont", "php", "id" and "1".
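The word segmentation just described can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the names `DELIMITERS`, `tokenize` and `keyword_vector` are hypothetical, and the delimiter set is limited to the six characters named in the text.

```python
import re
from collections import Counter

# Delimiter characters named in the text: / . ? = & ,
DELIMITERS = "/.?=&,"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(request_string):
    """Lowercase the request string and split it on the delimiter set."""
    return [tok for tok in _SPLIT_RE.split(request_string.lower()) if tok]


def keyword_vector(request_strings):
    """Count how often each keyword occurs across one user's request strings."""
    counts = Counter()
    for s in request_strings:
        counts.update(tokenize(s))
    return counts


print(tokenize("/abc/font/webfont.php?id=1"))
# ['abc', 'font', 'webfont', 'php', 'id', '1']
```

A `Counter` maps each keyword $k_i$ to its occurrence count $c_{k_i}$, which matches the shape of the keyword vector $K_u$.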

Without loss of generality, the WEB access history records may also be records of HTTP protocol communication captured by other intermediate network devices; users may also be distinguished by the combination of the access IP and the user agent information USER-AGENT in the request, or by the session identifier in the request; and the WEB request character string to be analyzed may also include the request user agent information USER-AGENT and the request BODY content.

The specific detection process of the present embodiment is as follows:

1. The system starts, and the configuration reading module reads the configuration information;

2. The data preprocessing module preprocesses the WEB logs, marks whether each user is a search engine crawler, calculates the keyword vector and the statistical feature vector, and outputs the user feature set $\{F_u\}_{u \in U}$;

3. The anomaly detection module builds the global keyword table from the user feature set $\{F_u\}_{u \in U}$, calculates the anomaly scores of the users and the threshold on the user anomaly scores, identifies the malicious scanning users, and outputs the detection results.

Fig. 3 shows the schematic flow diagram of the data preprocessing module, which is executed as follows:

P-1. Traverse the WEB logs, obtain the list of access IPs, and form the user set $U$;

P-2. For each user $u$ in the user set $U$, calculate the user feature $F_u = [IR_u, K_u, S_u]$ and output the user feature set $\{F_u\}_{u \in U}$. The calculation steps are as follows:

P-2-1. Judge whether the user is a search engine crawler according to the IP list of well-known search engine crawlers; if so, mark $IR_u = \mathrm{TRUE}$, otherwise $IR_u = \mathrm{FALSE}$;

P-2-2. Obtain all access logs of this access IP from the WEB logs;

P-2-3. For each log entry, extract the request URL, split the request URL using the delimiter characters "/", ".", "?", "=", "&" and ",", count the number of occurrences of each keyword, and calculate the keyword vector $K_u = \{(k_i, c_{k_i}) \mid 0 \le i \le m-1\}$, where $m$ is the number of members of $K_u$, $k_i$ is the $i$-th keyword, and $c_{k_i}$ is the number of times keyword $k_i$ occurs;

P-2-4. For all access logs of this access IP, calculate the statistical feature vector $S_u = [TH_u, EH_u, AH_u, PH_u]$, where $TH_u$ is the total number of accesses, $EH_u$ is the number of log entries whose response code is greater than or equal to 400, $AH_u$ is the number of log entries whose request URL is an auxiliary element, and $PH_u$ is the number of log entries whose request method is PUT or DELETE.
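Step P-2-4 can be sketched as follows. This is an illustrative sketch under stated assumptions: `StatFeatures`, `stat_features` and `AUX_EXTENSIONS` are hypothetical names, each log entry is assumed to have already been parsed into a `(method, url, status)` tuple, and the extension list is only one possible reading of "auxiliary elements" (the source names categories such as CSS, pictures, audio, Office documents and PDF, not concrete extensions).

```python
from dataclasses import dataclass

# ASSUMPTION: concrete extensions chosen for illustration only.
AUX_EXTENSIONS = (".css", ".png", ".jpg", ".gif", ".mp3", ".doc", ".docx", ".pdf")


@dataclass
class StatFeatures:
    th: int = 0  # TH_u: total number of accesses
    eh: int = 0  # EH_u: entries with response code >= 400
    ah: int = 0  # AH_u: entries whose URL is an auxiliary element
    ph: int = 0  # PH_u: entries using PUT or DELETE


def stat_features(records):
    """Compute S_u from one user's (method, url, status) tuples."""
    s = StatFeatures()
    for method, url, status in records:
        s.th += 1
        if status >= 400:
            s.eh += 1
        # Ignore the query string when checking the file extension.
        path = url.split("?", 1)[0].lower()
        if path.endswith(AUX_EXTENSIONS):
            s.ah += 1
        if method.upper() in ("PUT", "DELETE"):
            s.ph += 1
    return s


demo = stat_features([
    ("GET", "/index.html", 200),
    ("GET", "/style.css?v=2", 200),
    ("HEAD", "/web1.rar", 404),
    ("PUT", "/x.txt", 403),
])
print(demo)
```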

Fig. 4 shows the schematic flow diagram of the anomaly detection module, which proceeds as follows:

D-1. Read the user feature set $\{F_u\}_{u \in U}$ and the parameters $w_1$, $w_2$, $w_3$ and $T$;

D-2. Traverse the keyword vectors $\{K_u\}_{u \in U}$ of all users, count the number of users corresponding to each keyword, and build the global keyword table $GK$;

D-3. Traverse each user's keyword vector, calculate the uncommon degree of each keyword according to the global keyword table, and calculate the user anomaly scores $\{A_u\}_{u \in U}$;

D-4. Sort the user anomaly scores in ascending order to form the anomaly value sequence, and calculate the jump point (catastrophe point) of the anomaly value sequence as the threshold $A^*$ of the anomaly score; if there is no jump point, the threshold is $-1$;

D-5. If $A^*$ is greater than 0, judge whether the final anomaly score of each user is greater than $A^*$, and if so, identify the user as a malicious scanning user.
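The decision in step D-5 amounts to a simple filter over the final scores. The sketch below is illustrative only; `malicious_users` is a hypothetical name, and the final scores are assumed to be held in a dictionary keyed by user identifier (for example, the access IP).

```python
def malicious_users(final_scores, threshold):
    """Return the users whose final anomaly score exceeds the threshold A*.

    A threshold that is not greater than 0 (such as the -1 sentinel used
    when no jump point exists) means nobody is flagged.
    """
    if threshold <= 0:
        return []
    return sorted(u for u, score in final_scores.items() if score > threshold)


scores = {"10.0.0.1": 1.2, "10.0.0.2": 1.3, "10.0.0.3": 9.0}
print(malicious_users(scores, 1.3))   # ['10.0.0.3']
print(malicious_users(scores, -1))    # []
```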

The detailed process of building the global keyword table is as follows:

D-2-1. Initialize the global keyword table $GK = \Phi$;

D-2-2. Take each user $u \in U$ in turn and traverse its keyword vector $K_u$. For each keyword $k_i$ in $K_u$: if $k_i$ does not occur in $GK$, set $uc_{k_i} = 1$ and add $(k_i, uc_{k_i})$ to $GK$; if $k_i$ already occurs in $GK$, obtain the $uc_{k_i}$ corresponding to $k_i$ from $GK$, set $uc_{k_i} = uc_{k_i} + 1$, and update $GK$.
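Steps D-2-1 and D-2-2 can be sketched as a single pass over the per-user keyword vectors. This is an illustration, not the patent's code: `build_global_keyword_table` is a hypothetical name, and each user's keyword vector is assumed to be a dictionary from keyword to occurrence count.

```python
from collections import Counter


def build_global_keyword_table(user_keyword_vectors):
    """Map each keyword k_i to uc_ki, the number of users whose requests contain it."""
    gk = Counter()  # D-2-1: start from an empty table
    for kv in user_keyword_vectors.values():
        # D-2-2: every distinct keyword of a user adds exactly 1,
        # regardless of how often that user sent it.
        gk.update(kv.keys())
    return gk


vectors = {
    "10.0.0.1": {"index": 5, "html": 5},
    "10.0.0.2": {"index": 1, "html": 1, "images": 2},
    "10.0.0.3": {"web1": 1, "rar": 1},
}
gk = build_global_keyword_table(vectors)
print(gk["index"], gk["images"], gk["rar"])  # 2 1 1
```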

The detailed process of calculating the anomaly score $A_u$ of user $u$ is as follows:

D-3-1. For user $u$, if $IR_u = \mathrm{TRUE}$ in $F_u$, then $A_u = 0$; otherwise proceed to D-3-2;

D-3-2. Traverse the keyword vector $K_u$ of user $u$. For each tuple $(k_i, c_{k_i})$ in $K_u$, look up the corresponding $uc_{k_i}$ in the global keyword table $GK$ and calculate the uncommon degree of $k_i$ as $a_{k_i} = \mathrm{Log}(c_{k_i}) \cdot \mathrm{Log}(N_u / uc_{k_i} \cdot uc_{k_i})$, where $\mathrm{Log}(x)$ is the natural logarithm function and $N_u$ is the number of members of the user set $U$;

D-3-3. Calculate the original anomaly score of user $u$: $A_u' = \sum a_{k_i}$;

D-3-4. Calculate the final anomaly score of the user: $A_u = A_u' \cdot \mathrm{Exp}(w_1, EH_u/TH_u) \cdot \mathrm{Exp}(w_2, AH_u/TH_u) + w_3 \cdot PH_u/TH_u$, where Exp(a, b) is the exponential function with base a and exponent b.
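Steps D-3-1 to D-3-4 can be sketched as below. Two assumptions are made and should be read as such: the function names are hypothetical, and the argument of the second logarithm is interpreted as $N_u / (uc_{k_i} \cdot uc_{k_i})$. The source text does not show the grouping; read strictly left to right, the expression would reduce to $N_u$ and carry no per-keyword information, which is why this sketch picks the other reading. The sketch also assumes $TH_u > 0$, since every user in $U$ has at least one log entry.

```python
import math


def uncommon_degree(c_ki, uc_ki, n_users):
    """a_ki for one keyword, using the natural logarithm.

    ASSUMPTION: the grouping N_u / (uc_ki * uc_ki) is a guess at the
    intended formula; the source leaves it ambiguous.
    """
    return math.log(c_ki) * math.log(n_users / (uc_ki * uc_ki))


def final_score(kv, gk, n_users, th, eh, ah, ph, w1, w2, w3, is_crawler=False):
    """Final anomaly score A_u for one user.

    kv: keyword -> c_ki for this user; gk: keyword -> uc_ki (global table);
    th, eh, ah, ph: the statistical features TH_u, EH_u, AH_u, PH_u.
    """
    if is_crawler:            # D-3-1: marked crawlers get A_u = 0
        return 0.0
    raw = sum(uncommon_degree(c, gk[k], n_users) for k, c in kv.items())  # D-3-3
    # D-3-4: Exp(a, b) in the text is a ** b
    return raw * (w1 ** (eh / th)) * (w2 ** (ah / th)) + w3 * ph / th


score = final_score({"admin": 3}, {"admin": 1}, n_users=100,
                    th=2, eh=1, ah=1, ph=0, w1=4.0, w2=0.25, w3=5.0)
print(score)
```

Note that under either reading a keyword that a user sends only once contributes nothing, because $\mathrm{Log}(1) = 0$.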

The detailed process of calculating the threshold $A^*$ of the user anomaly scores is as follows:

D-4-1. Arrange the final anomaly scores $\{A_u\}_{u \in U}$ of all users in the user set $U$ in ascending order to form the sequence $SA = \{\alpha_i \mid 0 \le i \le N_u-1,\ \alpha_i \le \alpha_{i+1}\}$;

D-4-2. According to the mutation parameter $T$, calculate the sequence $SB = \{\beta_j \mid 0 \le j \le N_u-2\}$, where $\beta_j = \alpha_{j+1} - \alpha_j$ if $\alpha_{j+1} - \alpha_j > T$, and $\beta_j = 0$ otherwise;

D-4-3. According to the sequence $SB$, calculate the sequence $SC = \{\gamma_k \mid 0 \le k \le N_u-3\}$, where $\gamma_k = \beta_{k+1} - \beta_k$;

D-4-4. In the sequence $SC$, find $k^*$, the smallest $k$ satisfying $\gamma_k > T$ or $\gamma_k < 0$. If $k^*$ exists, the jump point of the anomaly value sequence is $\alpha_{k^*+1}$ and the output is $A^* = \alpha_{k^*+1}$; otherwise the anomaly value sequence has no jump point and the output is $A^* = -1$.
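Steps D-4-1 to D-4-4 translate fairly directly into code. The sketch below is an illustration with a hypothetical function name, `jump_point_threshold`; it follows the index ranges $0 \le j \le N_u-2$ for $SB$ and $0 \le k \le N_u-3$ for $SC$, and the sample scores and $T = 2$ are made up for the demonstration.

```python
def jump_point_threshold(scores, t):
    """Return A*, the score at the jump point, or -1.0 if none exists."""
    sa = sorted(scores)                                   # D-4-1: ascending SA
    # D-4-2: keep only the gaps between neighbours that exceed T
    sb = [(b - a) if (b - a) > t else 0.0 for a, b in zip(sa, sa[1:])]
    # D-4-3: first differences of SB
    sc = [y - x for x, y in zip(sb, sb[1:])]
    # D-4-4: smallest k with gamma_k > T or gamma_k < 0
    for k, gamma in enumerate(sc):
        if gamma > t or gamma < 0:
            return sa[k + 1]
    return -1.0


print(jump_point_threshold([1.0, 1.2, 1.1, 1.3, 9.0, 9.5], t=2))  # 1.3
print(jump_point_threshold([1.0, 1.1, 1.2], t=2))                 # -1.0
```

In the first call the only gap larger than $T$ lies between 1.3 and 9.0, so the threshold is 1.3 and the two users scoring 9.0 and 9.5 would be flagged in step D-5.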

Claims

1. A WEB malicious scanning behavior anomaly detection method, comprising the steps of:

1) preprocessing the WEB access history records of the accessing users, and marking the users that belong to the set search engine crawler users; extracting the keyword features and statistical features of each accessing user from the WEB access history records, and building the keyword vector and the statistical feature vector of the user respectively;

2) traversing the keyword vectors of all users, counting the number of users corresponding to each keyword, and building a global keyword table;

3) calculating the uncommon degree of each keyword according to the global keyword table, traversing the keyword vector of each accessing user, calculating the original anomaly score of the accessing user according to the uncommon degrees of the keywords, and then correcting the original anomaly score according to the statistical feature vector of the accessing user to obtain the final anomaly score of the accessing user;

4) sorting the final anomaly scores of all accessing users to form an anomaly value sequence, calculating the jump point (catastrophe point) of the anomaly value sequence, and taking the final anomaly score corresponding to it as the threshold; if there is no jump point, no threshold exists;

5) if a threshold exists, comparing the final anomaly score of each accessing user with the threshold, and if the score is greater than the threshold, identifying the accessing user as a malicious scanning user.

2. the method for claim 1, it is characterized in that, add up total visit capacity of each calling party, the visit capacity of different requesting method, the visit capacity of different page type, the visit capacity of different answer code, build the described statistical nature vector of this calling party.

3. the method for claim 1, is characterized in that, the keyword vector K of calling party u _u={ (k _i, ck _i) | 0≤i≤m-1}; Wherein, m is the keyword quantity of calling party u, k _ibe i-th keyword, c _kifor keyword k _ithe number of times occurred in character string is asked at all WEB of this calling party u.

4. method as claimed in claim 3, is characterized in that, described overall antistop list GK={ (k _i, uc _ki) | 0≤i≤N-1,1≤uc _ki≤ N _u; Wherein, N is the sum of the different keyword of all calling parties, uc _kifor keyword k _icorresponding number of users, i.e. keyword k _iat uc _kithe WEB of individual calling party asks to occur in character string, N _ufor total number of users.

5. method as claimed in claim 4, it is characterized in that, the method calculating described uncommon degree is: keyword k _iuncommon degree P _ki=Log (c _ki) * Log (N _u/ uc _ki* uc _ki).

6. the method for claim 1, it is characterized in that, the computational methods of described original anomaly score value are: if calling party is step 1) labeled calling party, then its original anomaly score value is 0, otherwise is added up by the uncommon degree of all keywords in the keyword vector of this calling party and obtain the original anomaly score value of this calling party.

7. the method as described in claim 1 or 6, is characterized in that, the method obtaining described final abnormal score is: choose corrected parameter w ₁, w ₂with w ₃, calculate final abnormal score=original anomaly score value * Exp (w ₁, the access times of errored response synchronous codes number/total) and * Exp (w ₂, the access times of auxiliary element access times/total) and+w ₃* the access times of the number of times/total of PUT and DELETE method.

8. the method for claim 1, is characterized in that, the computational methods of described catastrophe point are: select a mutation parameter T, according to exceptional value sequence SA={ α _i| 0≤i≤N _u-1, α _i≤ α _i+1calculate a sequence SB={ β _j| 0≤j≤N _u-2}, meets a _j+1-a _j>T, then β _j=α _j+1-α _j, otherwise β _j=0; Then a sequence SC={ γ is calculated according to sequence SB _k| 0≤k≤N _u-3}, wherein γ _k=β _k+1-β _k; Then from this sequence SC, the γ that satisfies condition is searched _k>T or γ _kthe minimum k of <0, is designated as k*, if k* exists, the catastrophe point of exceptional value sequence SA is a _k*+1, otherwise there is not catastrophe point in exceptional value sequence; Wherein, α _ifor the final abnormal score of i-th calling party in exceptional value sequence SA, N _ufor total calling party number.

9. A WEB malicious scanning behavior anomaly detection system, characterized by comprising a configuration reading module, a data preprocessing module, an anomaly detection module and a data storage module, wherein:

the configuration reading module is responsible for reading configuration parameter information from the data storage module, including the access configuration of the WEB access history records and the set search engine crawler IP list;

the data preprocessing module is responsible for reading and parsing the original WEB access history records from the data storage module, marking whether each accessing user is a crawler according to the search engine crawler IP list, calculating the keyword vector and the statistical feature vector of each accessing user, and storing the preprocessing results in the data storage module;

the anomaly detection module is responsible for reading the preprocessing results from the data storage module, performing anomaly detection according to the keyword vectors and statistical feature vectors of the accessing users, outputting the malicious scanning user detection results and storing them in the data storage module.

10. The system of claim 9, characterized in that the anomaly detection module traverses the keyword vectors of all users, counts the number of users corresponding to each keyword, and builds a global keyword table; then calculates the uncommon degree of each keyword according to the global keyword table, traverses the keyword vector of each accessing user, calculates the original anomaly score of the accessing user according to the uncommon degrees of the keywords, and corrects the original anomaly score according to the statistical feature vector of the accessing user to obtain the final anomaly score of the accessing user; then sorts the final anomaly scores of all accessing users to form an anomaly value sequence, calculates the jump point (catastrophe point) of the anomaly value sequence, and takes the final anomaly score corresponding to it as the threshold; if there is no jump point, no threshold exists; if a threshold exists, the final anomaly score of each accessing user is compared with the threshold, and if it is greater than the threshold, the accessing user is identified as a malicious scanning user; the malicious scanning user detection results are output and stored in the data storage module.