CN103559235B

CN103559235B - Method for detecting and identifying malicious web pages in online social networks

Info

Publication number: CN103559235B
Application number: CN201310507897.2A
Authority: CN
Inventors: 李沁蕾; 王蕊; 贾晓启; 张道娟
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2013-10-24
Filing date: 2013-10-24
Publication date: 2016-08-17
Anticipated expiration: 2033-10-24
Also published as: CN103559235A

Abstract

The present invention relates to a method for detecting and identifying malicious web pages in online social networks. The steps are: 1) for any web page to be detected in an online social network, count the frequency of occurrence of all keywords in the page; according to the page source code, divide the page into one or more sets of different types among an HTML tag set, a JavaScript set, and a URL set; 2) from each of these sets that is non-empty, extract static features that reveal obfuscation, and obtain the association features of the web page; 3) create an association information database, update the association features of web pages in the database in real time, and extract the web page propagation speed from the association features; 4) detect and identify malicious web pages according to the propagation speed combined with the features obtained above. The invention has good generality, describes the features of malicious web pages in online social networks accurately, and detects and identifies malicious web pages more accurately and more efficiently, at a lower analysis cost.

Description

A method for detecting and identifying malicious web pages in online social networks

Technical field

The invention belongs to the technical field of network security and relates to a method for identifying malicious web pages in online social networks, in particular to such a method based on the extraction of malicious web page features.

Background art

With the rapid growth of online social networks (Online Social Network, OSN), the major online social networking platforms have very large user bases. Combined with the private user information they hold and the potential economic gain, this has made them a focus for a growing number of network attackers. Among attacks on online social networks, cross-site scripting (Cross-site Scripting, XSS) is one of the common and destructive attack methods. Network worms that exploit cross-site scripting vulnerabilities can infect a large number of users in a short time and can even affect the normal operation of servers. Extracting effective web page features to improve the identification of malicious web pages in online social networks is therefore a problem that urgently needs to be solved.

Existing analysis of malicious web pages in online social networks mostly relies on complex static analysis methods. In general, the source code of a web page contains elements such as HTML, CSS, URIs, and JavaScript. Malicious HTML, CSS, URIs, or JavaScript in a page may cause the page to behave maliciously when it is loaded in the browser, for example by stealing cookies or opening a phishing site. In an online social network, users can freely enter content of a certain length in the text boxes of a web page, including HTML, CSS, URI, and JavaScript code. To guard against malicious code in user input, the content of an input box must be statically analyzed when it is submitted. The analysis can be carried out separately for HTML, CSS, URIs, and JavaScript, using formal methods to judge whether the structure and content of these elements could produce malicious behavior.

Among malicious web pages, malicious code based on XSS vulnerabilities is one of the most common kinds of web virus, and many mature analysis techniques exist for it. In the analysis of web pages outside online social networks (for example portal sites and forum sites), the problem is approached from the angle of obfuscated code: features of obfuscated code are extracted from a page to judge whether the page contains suspicious malicious code. The extracted features mainly include keywords, JavaScript features (including length, number of characters, and so on), and URL features.

Among the existing methods for analyzing, detecting, and identifying malicious web pages in online social networks, static analysis methods mostly require complex analysis steps, take a long time to run, and have poor timeliness. The low time cost that static analysis should offer compared with dynamic analysis is not fully realized, and the delay that complex analysis and computation add to web page requests also harms network applications. Proposing a simple and effective feature extraction method for malicious web pages in online social networks, and thereby lowering the analysis cost, is therefore a problem in urgent need of study and solution.

Summary of the invention

To address the problem of detecting and identifying malicious web pages in online social networks, the object of the invention is to propose a detection and identification method based on the extraction of malicious web page features specific to online social networks. After the web pages of an online social network are analyzed, quantifiable features are extracted from the following aspects: keywords, JavaScript, HTML, URLs, and the characteristics of the online social network itself. The extracted features are then used to identify malicious web pages in the online social network that carry malicious code exploiting XSS vulnerabilities.

The technical solution of the invention is as follows. A method for detecting and identifying malicious web pages in online social networks comprises the steps of:

1) for any web page to be detected in an online social network, counting the frequency of occurrence of all keywords in the page; and, according to the source code of the page, dividing the page into one or more sets of different types among an HTML tag set, a JavaScript set, and a URL set;

2) from each of the above sets that is non-empty, extracting static web page features that reveal obfuscation to obtain suspicious fields, and combining them with the time at which the suspicious fields appear to obtain the association features of the web page;

3) creating an association information database, storing the association features of the web page in it, and updating the association features of web pages in the database in real time; and extracting the web page propagation speed from the association features;

4) detecting and identifying malicious web pages according to the web page propagation speed, combined with one or more of the following features: the counted keyword frequencies, the detected suspicious JavaScript scripts, the suspicious HTML tags, and the suspicious URLs.

Further, the code segments that conform to the HTML tag format are taken out of the web page source code and collected into the HTML tag set. An HTML tag consists of a start tag and/or an end tag; the start tag is an element name enclosed in angle brackets, and the end tag is a slash followed by the element name, enclosed in angle brackets.

Further, a JavaScript script appears in the web page source code either between <script></script> tags or after "javascript:". The JavaScript scripts are taken out according to these positions and collected into a set.

Further, the URL set is obtained by searching the web page source code for valid strings that begin with the protocol name HTTP, HTTPS, or FTP and separating out the URLs.

Further, the static features that reveal obfuscation are extracted from the HTML tag set as follows:

The information of all tags in the HTML tag set is collected, and the maximum tag length, the number of long tags in the set, and the ratio of JavaScript strings contained in tags are extracted as measures of the degree of HTML tag obfuscation.

Further, the static features that reveal obfuscation are extracted from the JavaScript set as follows:

The information of all scripts in the JavaScript set is collected, and the maximum length of a script string in the set, the ratio of encoded characters in the script strings, and the number of string concatenations occurring in the set are extracted as measures of the degree of JavaScript script obfuscation.

Further, the static features that reveal obfuscation are extracted from the URL set as follows:

The information of all URLs in the set is collected, and the maximum URL length, the number of long URLs in the set, and the ratio of encoded characters in the URLs are extracted as measures of the degree of URL obfuscation.

Further, the keywords are JavaScript functions or HTML tags whose frequency of occurrence differs between benign scripts and malicious scripts, including: eval, document.write, unescape, fromCharCode, createElement, createTextNode.

Further, the propagation speed is the frequency with which suspected malicious code appears in web pages per unit time. The propagation speed of suspicious strings in web pages is calculated as follows:

1) collect the <time at which the suspicious field appeared, suspicious string content> records of the web page in the association information database;

2) for each record, query the database for the number of records that have the same string content and whose time falls within the hour ending at time t, and take this count as the propagation speed of the string content in that record;

3) record the maximum of all the propagation speeds as the propagation speed of the web page.

Beneficial effects of the invention:

1. The invention extracts a set of features of malicious web pages in online social networks that has good generality.

2. Based on the structural characteristics of web pages, the invention preprocesses each page, extracts page features by element type, and at the same time builds an association information database across web pages.

3. The invention takes full account of how malicious code propagates in online social networks and, based on this propagation behavior, extracts a set of features targeted specifically at online social networks.

In summary, the method proposed by the invention describes the features of malicious web pages in online social networks accurately, detects and identifies malicious web pages more accurately and more efficiently, and lowers the analysis cost.

Brief description of the drawings

Fig. 1 is a flow diagram of the method for detecting and identifying malicious web pages in online social networks.

Detailed description of the invention

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawing. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those skilled in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

One specific embodiment of the invention is as follows. The method for detecting and identifying malicious web pages in online social networks comprises the steps of:

1) for any web page to be detected in an online social network, counting the frequency of occurrence of all given keywords in the page;

2) analyzing the structure of the web page and dividing it into sets of different types according to its source code, so that the source code is decomposed into an HTML set, a JavaScript set, and a URL set;

3) extracting, from the HTML set, the JavaScript set, and the URL set, static web page features that reveal obfuscation to obtain suspicious fields, and combining them with the time at which the suspicious fields appear to obtain the association features of the web page;

4) storing the association features of the web page and updating the association information database so that it holds the latest association features; and extracting the association features of the web page from the database that records the association features of online social network pages;

5) extracting the web page propagation speed from the association information and combining it with the given keywords, suspicious JavaScript, suspicious HTML, and suspicious URLs, five features in total, to obtain the feature vector of the web page;

6) detecting and identifying malicious web pages according to the feature vector.

In one embodiment of the invention, keywords are certain JavaScript functions or HTML tags whose frequency of occurrence differs between benign scripts and malicious scripts. Some of these occur rarely in benign web pages but frequently in malicious ones. Such fields can serve as the keywords of a web page, so the frequency with which keywords occur in a page can be used to judge whether the page is malicious.

In one embodiment of the invention, the components of a web page are obtained by analyzing its structure, and the page is preprocessed with the goal of dividing it into sets of different types. Because the extracted features all come from HTML tags, JavaScript scripts, and URLs, preprocessing divides the page source code into an HTML tag set, a JavaScript script set, and a URL set. In later steps, the relevant information only needs to be extracted from each of these three sets, which avoids the long processing times that result from handling a large amount of data each time. Analyzing the classified sets also makes the extracted features more accurate.

In one embodiment of the invention, the static features of a web page are extracted on the basis of the obfuscation present in the page. For the three sets, the extraction methods are:

(1) For the HTML tag set, collect the information of all tags in the set and extract the maximum tag length, the number of long tags in the set, and the ratio of JavaScript strings contained in tags. These quantified statistics serve as measures of the degree of HTML tag obfuscation.

(2) For the JavaScript script set, collect the information of all scripts in the set and extract the maximum length of a script string in the set, the ratio of encoded characters in the script strings, and the number of string concatenations occurring in the set. These quantified statistics serve as measures of the degree of JavaScript script obfuscation.

(3) For the URL set, collect the information of all URLs in the set and extract the maximum URL length, the number of long URLs in the set, and the ratio of encoded characters in the URLs. These quantified statistics serve as measures of the degree of URL obfuscation.

In one embodiment of the invention, account is taken of the fact that malicious code propagates differently in a social network than in an ordinary network. The most visible difference is that the high clustering of the social network topology and its short average shortest path length cause malicious code to spread far faster in a social network than in an ordinary network. To quantify propagation speed, the invention defines it as the frequency with which suspected malicious code appears in web pages per unit time. That is, the speed statistic depends on the number of times the suspected malicious code appeared in the web pages sent by the server during the most recent hour. Extracting this feature requires knowledge of all web pages detected during the past unit of time. An association information database is therefore created, and the association features of online social network pages in it are updated in real time, so that the required feature can be computed from the database.

In one embodiment of the invention, the association information database must be updated continuously and must hold the association information of all web pages. After the association features of a web page are obtained, its association information is therefore saved to the database, updating it. Only the association information from the most recent hour needs to be consulted; information older than one hour is not actually used as a reference. To improve database access efficiency, the database entries are maintained once every ten minutes, and all information older than one hour is deleted.

Fig. 1 shows the flow of the method for detecting and identifying malicious web pages in online social networks, which includes the following steps:

1. Extract the first group of web page features, which mainly consists of keyword features.

When a malicious web page is loaded in the client browser, it carries out attack behaviors, and these behaviors are realized by executing a combination of functions. When keywords are analyzed statically, the number of occurrences of each keyword is used as its feature, in place of the keyword execution order used in dynamic analysis. Statistical data show that some script functions may appear in any web page, yet the frequency with which they are used differs widely. The keywords may include, but are not limited to: eval, document.write, unescape, fromCharCode, createElement, and createTextNode. Those skilled in the art understand how to extract keywords for malicious web page vulnerabilities, so the keywords are not limited to the types listed above. Take the string execution function eval, which executes code supplied as a string. It is a legitimate function that is present in all kinds of web pages, but it usually occurs infrequently in a page. In malicious web pages, however, eval occurs more often than usual. The features of keywords of this kind can therefore be extracted and used as one indicator for identifying malicious web pages.
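The keyword counting in step 1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the keyword list comes from the description, while the function name `keyword_frequencies` and the word-boundary rule (so that a word such as "medieval" is not counted as `eval`) are assumptions.

```python
import re
from collections import Counter

# Keyword list from the description; the patent notes it is not exhaustive.
KEYWORDS = ["eval", "document.write", "unescape",
            "fromCharCode", "createElement", "createTextNode"]

def keyword_frequencies(source: str) -> Counter:
    """Count how many times each keyword occurs in the page source."""
    counts = Counter()
    for kw in KEYWORDS:
        # Assumed boundary rule: the keyword must not be embedded in a
        # longer identifier (no word character directly before or after).
        pattern = r"(?<!\w)" + re.escape(kw) + r"(?!\w)"
        counts[kw] = len(re.findall(pattern, source))
    return counts

page = ('<script>eval(unescape("%61")); '
        'document.write(String.fromCharCode(97)); eval("1")</script> medieval')
print(keyword_frequencies(page))
```

The resulting counts form the keyword part of the feature vector assembled in step 8.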

2. Preprocess the web page by classifying its source code according to element type.

The source code of a web page contains many kinds of elements, the most basic of which are HTML tags, JavaScript scripts, and URLs. One of the starting points of the invention is to look for traces of malicious code in HTML tags, JavaScript scripts, and URLs. To make the method easier to implement, the web page is preprocessed before the other groups of features are extracted, producing sets of these three kinds of elements.

The preprocessing is as follows:

1) HTML tags have a regular format. A tag consists of a start tag and an end tag: the start tag is an element name enclosed in angle brackets, and the end tag is a slash followed by the element name, enclosed in angle brackets. Some tags may have no end tag, such as <br/>. The code segments that conform to the HTML tag format are taken out of the web page source code and collected into a set.

2) In a web page, JavaScript scripts generally appear between <script></script> tags or after "javascript:". The web page source code is analyzed according to these positions, and the JavaScript scripts are taken out and collected into a set.

3) A URL is the address of a resource on the Internet and follows a unified standard. A web page may contain resources from its own domain or from other domains. A URL begins with a protocol name, and the protocols commonly used on the Internet are limited in number, including HTTP, HTTPS, and FTP. When collecting the URL set, it is therefore enough to search the web page source code for valid strings that begin with a protocol name and separate out the URLs.
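The three preprocessing rules above can be sketched with regular expressions. This is only an approximation under stated assumptions: the patterns and the function name `split_page` are hypothetical, and a production system would more likely use a real HTML parser, since regular expressions do not handle every valid page.

```python
import re

# Assumed patterns, one per rule in the preprocessing description.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")                  # start or end tag
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script\s*>", re.I | re.S)
JS_URI_RE = re.compile(r"javascript:([^\"'>]*)", re.I)      # after "javascript:"
URL_RE = re.compile(r"(?:https?|ftp)://[^\s\"'<>]+", re.I)  # HTTP, HTTPS, FTP

def split_page(source: str) -> dict:
    """Divide page source into HTML tag, JavaScript, and URL sets."""
    scripts = [s for s in SCRIPT_RE.findall(source) if s.strip()]
    scripts += [s for s in JS_URI_RE.findall(source) if s.strip()]
    return {
        "html_tags": TAG_RE.findall(source),
        "javascript": scripts,
        "urls": URL_RE.findall(source),
    }

page = ('<p><a href="javascript:alert(1)">x</a><br/>'
        '<script>var u = "http://example.com/a";</script>'
        '<img src="https://example.org/i.png"></p>')
for name, items in split_page(page).items():
    print(name, items)
```

Any of the three lists may be empty for a given page, which matches the method's requirement that features are extracted only from the non-empty sets.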

3. Extract the second group of features, which mainly consists of HTML tag features.

HTML tags make up the structure of a web page. Tags can be added and deleted dynamically by scripts, attributes in a tag can be modified dynamically by scripts (for example value), and some attributes are executed automatically (for example src). HTML tags are therefore a good place to hide malicious scripts. HTML tags are generally of limited length; if a malicious script is hidden in an HTML tag, the tag is likely to be longer than the tags of a benign web page.
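A minimal sketch of the three HTML tag statistics named in the method (maximum tag length, number of long tags, ratio of JavaScript strings in tags) might look like this. The description does not say what counts as a "long" tag or how the JavaScript ratio is measured, so the `long_threshold` default and the character-based ratio are assumptions, as is the function name.

```python
import re

# Assumed: JavaScript inside a tag is the text from "javascript:" up to the
# closing quote or angle bracket.
JS_IN_TAG_RE = re.compile(r"javascript:[^\"'>]*", re.I)

def html_tag_features(tags, long_threshold=100):
    """Obfuscation measures for an HTML tag set (threshold is hypothetical)."""
    total_chars = sum(len(t) for t in tags)
    js_chars = sum(len(m) for t in tags for m in JS_IN_TAG_RE.findall(t))
    return {
        "max_tag_length": max((len(t) for t in tags), default=0),
        "long_tag_count": sum(1 for t in tags if len(t) > long_threshold),
        "js_ratio": js_chars / total_chars if total_chars else 0.0,
    }

print(html_tag_features(['<p>', '<a href="javascript:abcd">'], long_threshold=10))
```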

4. Extract the third group of features, which mainly consists of JavaScript script features.

XSS malicious code is usually written as JavaScript. Apart from making the code aggressive, the author of malicious code often applies obfuscation to the script in order to mislead the victim, reducing the readability of the program so that the victim does not notice it. One common obfuscation technique is to encode the malicious code. An encoded script is noticeably longer, and the ratio of encoded characters in its strings also increases.
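The JavaScript measures (maximum script length, ratio of encoded characters, number of string concatenations) can be sketched as below. The description does not define which encodings count or how concatenations are detected; the patterns here (percent, `\x`, `\u`, and numeric HTML entity escapes, and a `+` adjacent to a string literal) are assumptions chosen for illustration.

```python
import re

# Assumed encodings: %XX, \xXX, \uXXXX, and &#NNN; escapes.
ENCODED_RE = re.compile(
    r"%[0-9A-Fa-f]{2}|\\x[0-9A-Fa-f]{2}|\\u[0-9A-Fa-f]{4}|&#\d+;")
# Assumed concatenation: a '+' operator next to a quoted string literal.
CONCAT_RE = re.compile(r"[\"']\s*\+|\+\s*[\"']")

def javascript_features(scripts):
    """Obfuscation measures for a JavaScript script set."""
    total_chars = sum(len(s) for s in scripts)
    encoded_chars = sum(len(m) for s in scripts for m in ENCODED_RE.findall(s))
    return {
        "max_script_length": max((len(s) for s in scripts), default=0),
        "encoded_ratio": encoded_chars / total_chars if total_chars else 0.0,
        "concat_count": sum(len(CONCAT_RE.findall(s)) for s in scripts),
    }

print(javascript_features(['eval(unescape("%61%6c"))', 'var s = "a" + "b" + c;']))
```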

5. Extract the fourth group of features, which mainly consists of URL features.

When a web page contains a reflected XSS vulnerability, its source code contains the malicious URLs that trigger the XSS, and these URLs carry malicious scripts. To lure users into clicking a malicious URL, the author of the malicious code deliberately processes and distorts the URL so that users cannot make out the content of its parameters, tricking them into clicking the URL in the page without suspicion.
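The URL measures (maximum URL length, number of long URLs, ratio of encoded characters) follow the same pattern. In this sketch, "encoded characters" are taken to be percent-encoded triplets and the `long_threshold` default is a placeholder; neither is specified in the description.

```python
import re

PERCENT_RE = re.compile(r"%[0-9A-Fa-f]{2}")  # assumed: percent-encoding only

def url_features(urls, long_threshold=100):
    """Obfuscation measures for a URL set (threshold is hypothetical)."""
    total_chars = sum(len(u) for u in urls)
    # Each percent-encoded triplet contributes three characters.
    encoded_chars = 3 * sum(len(PERCENT_RE.findall(u)) for u in urls)
    return {
        "max_url_length": max((len(u) for u in urls), default=0),
        "long_url_count": sum(1 for u in urls if len(u) > long_threshold),
        "encoded_ratio": encoded_chars / total_chars if total_chars else 0.0,
    }

urls = ["http://a.example/?q=%3Cscript%3E", "https://b.example/"]
print(url_features(urls, long_threshold=20))
```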

6. Store the association information of the web page and update the association information database.

An association information database is set up to hold the association information of web pages. Association information means the suspicious fields in a web page (such as suspicious JavaScript script strings, suspicious URLs, and suspicious HTML tags) together with the time at which the suspicious fields appeared. Because the propagation speed must be computed, and speed is directly tied to time, time is an important field in the database.

When features are extracted from a web page, the preceding five steps yield a group of suspicious strings for that page. To make it easier to calculate the propagation speed of web pages that appear later in the network flow, the association information that links the suspicious strings of this page to later pages must be saved. A group of <time at which the suspicious field appeared, suspicious string content> records for the page is inserted into the database. In addition, to keep the database efficient, its contents are cleaned up once every ten minutes: all records older than one hour are deleted, which reduces the size of the database and keeps the association information database up to date and maintained.
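One way to picture the association information database is a single table of <time, content> records with a periodic cleanup. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the patent does not name a database engine, and the table name `assoc`, the column names, and the helper functions are all hypothetical.

```python
import sqlite3

ONE_HOUR = 3600  # seconds; records older than this are no longer consulted

def open_db(path=":memory:"):
    """Create the (hypothetical) association table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS assoc "
               "(seen_at REAL NOT NULL, content TEXT NOT NULL)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_assoc "
               "ON assoc (content, seen_at)")
    return db

def store_suspicious(db, seen_at, contents):
    """Insert one <time, suspicious string> record per suspicious field."""
    db.executemany("INSERT INTO assoc (seen_at, content) VALUES (?, ?)",
                   [(seen_at, c) for c in contents])
    db.commit()

def prune(db, now):
    """Delete records older than one hour; meant to run every ten minutes."""
    cur = db.execute("DELETE FROM assoc WHERE seen_at < ?", (now - ONE_HOUR,))
    db.commit()
    return cur.rowcount

db = open_db()
store_suspicious(db, 0.0, ["alert(1)"])
store_suspicious(db, 3000.0, ["alert(1)", "http://evil.example/?x=%3C"])
print(prune(db, 4000.0))  # number of expired records removed
```

Scheduling `prune` every ten minutes is left to the caller, for example a timer thread or an external job scheduler.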

7. Extract the fifth group of features, which mainly consists of web page association features.

In social networks, malicious code of this kind based on XSS vulnerabilities spreads quickly across web pages, and propagation speed is an effective feature for identifying malicious web pages. A method is therefore needed to extract, from the UTF-8 encoded web page source code, a quantitative feature that reflects propagation speed.

By analogy with the simple definition of scalar speed (the distance an object travels per unit time), propagation speed is defined as the number of times a string appears in web pages per unit time. The propagation speed of suspicious strings in web pages is calculated as follows:

1) collect the <time t, string content C> records of the web page from step 6;

2) compute the propagation speed of the string content in each record: query the database for the number of records that have the same string content and whose time falls within the hour ending at time t; this count is the propagation speed of the record;

3) record the maximum of all the propagation speeds as the propagation speed of the web page.
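The three calculation steps can be sketched over an in-memory list of records. The function name and the record layout are assumptions; the sketch also assumes that the page's own records have already been stored in step 6, so they are included in the count.

```python
ONE_HOUR = 3600  # seconds; the unit time used by the method

def propagation_speed(records, page_strings, t):
    """Return the page's propagation speed at time t.

    records      -- iterable of (time, content) pairs from the database
    page_strings -- suspicious strings extracted from the page under test
    t            -- time at which the page's suspicious fields appeared
    """
    speeds = []
    for content in page_strings:
        # Step 2: count records with the same content in the hour ending at t.
        n = sum(1 for ts, c in records
                if c == content and t - ONE_HOUR <= ts <= t)
        speeds.append(n)
    # Step 3: the page's speed is the maximum over its suspicious strings.
    return max(speeds, default=0)

records = [(0, "x"), (1000, "x"), (3000, "x"), (3500, "y"), (5000, "x")]
print(propagation_speed(records, ["x", "y"], 4000))
```

In this example the window covers times 400 to 4000, so "x" is counted twice and "y" once, and the page's propagation speed is 2.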

8. Combine the five groups of features obtained in steps 1, 3, 4, 5, and 7; the merged result is the feature vector of the web page.
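Merging the five groups can be as simple as concatenating them in a fixed order. The sketch below is illustrative only: the field names and their order are hypothetical, and the description does not specify which classifier consumes the vector, so none is shown.

```python
# Hypothetical field names and ordering for each feature group.
KEYWORDS = ["eval", "document.write", "unescape",
            "fromCharCode", "createElement", "createTextNode"]
HTML_FIELDS = ["max_tag_length", "long_tag_count", "js_ratio"]
JS_FIELDS = ["max_script_length", "encoded_ratio", "concat_count"]
URL_FIELDS = ["max_url_length", "long_url_count", "encoded_ratio"]

def build_feature_vector(keyword_counts, html_feats, js_feats, url_feats, speed):
    """Concatenate the five feature groups into one flat numeric vector."""
    vector = [float(keyword_counts.get(k, 0)) for k in KEYWORDS]
    vector += [float(html_feats.get(f, 0)) for f in HTML_FIELDS]
    vector += [float(js_feats.get(f, 0)) for f in JS_FIELDS]
    vector += [float(url_feats.get(f, 0)) for f in URL_FIELDS]
    vector.append(float(speed))  # propagation speed from step 7
    return vector

v = build_feature_vector({"eval": 2}, {"max_tag_length": 26},
                         {"concat_count": 2}, {}, 3)
print(len(v), v)
```

Missing values default to zero here, which corresponds to a page whose HTML, JavaScript, or URL set is empty.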

Online social networks differ from ordinary network applications. XSS malicious code propagates differently in a social network than in an ordinary network. The most visible difference is that the high clustering of the social network topology and its short average shortest path length cause XSS malicious code to spread far faster in a social network than in an ordinary network. As a real-world example, the 2003 computer virus Blaster infected 336,000 hosts in 20 hours, whereas the social network XSS worm Samy infected 1,000,000 users in 20 hours. This comparison shows that the number of users infected by malicious code per unit time in an online social network is about 3 times that in an ordinary network. If propagation speed can be expressed as a feature, malicious web pages in the network flow can therefore be distinguished more effectively.

Experimental data:

Type	Malicious	Benign
Sample number	11,761	18,302
Precision	87.1%	96.1%
Recall	94.3%	91.1%
F-Measure	90.6%	93.5%

The detection results in the table above show that detecting web pages with the features proposed in the invention achieves a recognition rate of about 90% on average, which is a good detection result. This indicates that the "propagation speed" feature proposed in the invention plays an important role in identifying malicious web pages in online social networks.
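The F-Measure row of the table is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short check below assumes that this standard definition is the one used; the description does not state the formula explicitly.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall values taken from the experimental data table.
print(round(100 * f_measure(0.871, 0.943), 1))  # malicious class
print(round(100 * f_measure(0.961, 0.911), 1))  # benign class
```

Both values reproduce the reported F-Measure figures of 90.6% and 93.5%.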

Claims

1. A method for detecting and identifying malicious web pages in online social networks, comprising the steps of:

1) for any web page to be detected in an online social network, counting the frequency of occurrence of all keywords in the page; and, according to the source code of the page, dividing the page into one or more sets of different types among an HTML tag set, a JavaScript set, and a URL set;

2) from each of the above sets that is non-empty, extracting static web page features that reveal obfuscation to obtain suspicious fields, and combining them with the time at which the suspicious fields appear to obtain the association features of the web page;

3) creating an association information database, storing the association features of the web page in it, and updating the association features of web pages in the database in real time; and extracting the web page propagation speed from the association features;

4) detecting and identifying malicious web pages according to the web page propagation speed, combined with one or more of the following features: the counted keyword frequencies, the detected suspicious JavaScript scripts, the suspicious HTML tags, and the suspicious URLs.

2. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that the code segments that conform to the HTML tag format are taken out of the web page source code and collected into the HTML tag set; an HTML tag consists of a start tag and/or an end tag; the start tag is an element name enclosed in angle brackets, and the end tag is a slash followed by the element name, enclosed in angle brackets.

3. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that a JavaScript script appears in the web page source code either between <script></script> tags or after "javascript:"; the JavaScript scripts are taken out according to these positions and collected into a set.

4. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that the URL set is obtained by searching the web page source code for valid strings that begin with the protocol name HTTP, HTTPS, or FTP and separating out the URLs.

5. The method for detecting and identifying malicious web pages in online social networks according to claim 1 or 2, characterized in that the static web page features that reveal obfuscation are extracted from the HTML tag set as follows:

6. The method for detecting and identifying malicious web pages in online social networks according to claim 1 or 3, characterized in that the static web page features that reveal obfuscation are extracted from the JavaScript set as follows:

7. The method for detecting and identifying malicious web pages in online social networks according to claim 1 or 4, characterized in that the static web page features that reveal obfuscation are extracted from the URL set as follows:

8. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that the keywords are JavaScript functions or HTML tags whose frequency of occurrence differs between benign scripts and malicious scripts, including: eval, document.write, unescape, fromCharCode, createElement, createTextNode.

9. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that the propagation speed is the frequency with which suspected malicious code appears in web pages per unit time, and the propagation speed of suspicious strings in web pages is calculated as follows:

2) for each record, query the database for the number of records that have the same string content and whose time falls within the hour ending at the time t at which the suspicious field appeared, and take this count as the propagation speed of the string content in that record;

3) record the maximum of all the propagation speeds as the propagation speed of the web page.