CN103559235A

CN103559235A - Online social network malicious webpage detection and identification method

Info

Publication number: CN103559235A
Application number: CN201310507897.2A
Authority: CN
Inventors: 李沁蕾; 王蕊; 贾晓启; 张道娟
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2013-10-24
Filing date: 2013-10-24
Publication date: 2014-02-05
Anticipated expiration: 2033-10-24
Also published as: CN103559235B

Abstract

The invention relates to a method for detecting and identifying malicious web pages in online social networks. The method comprises the steps of: 1) for any web page to be detected in an online social network, counting the frequency of occurrence of all keywords in the page, and dividing the page, based on its source code, into one or more sets of different types: an HTML (Hypertext Markup Language) tag set, a JavaScript set, or a URL (Uniform Resource Locator) set; 2) extracting, from each set that is not empty, static features that indicate obfuscation, so as to obtain the association features of the page; 3) establishing an association information database, updating the association features of web pages in the database in real time, and deriving the page propagation speed from the association features; 4) identifying malicious web pages based on the page propagation speed in combination with the features obtained through the statistics. The method is broadly applicable and describes the characteristics of malicious web pages in online social networks accurately; it also detects and identifies malicious pages more precisely, more efficiently, and at lower analysis cost.

Description

A method for detecting and identifying malicious web pages in online social networks

Technical field

The invention belongs to the field of network security technology and relates to a method for identifying malicious web pages in online social networks, in particular a method based on malicious web page feature extraction.

Background art

With the rapid growth of online social networks (OSNs), each of the major online social network platforms has acquired a huge user base. The private user information these platforms hold and the potential economic gain have made them a focus for a growing number of network attackers. Among attacks on online social networks, cross-site scripting (XSS) is one of the common and destructive attack methods: a network worm that exploits a cross-site scripting vulnerability can infect a large number of users in a short time and can even affect the normal operation of servers. Extracting effective web page features to improve the identification of malicious web pages in online social networks is therefore a problem in urgent need of a solution.

Existing analysis of malicious web pages in online social networks mostly relies on complex static analysis. The source code of a web page usually contains elements such as HTML, CSS, URIs, and JavaScript, and malicious HTML, CSS, URIs, or JavaScript may cause the page to behave maliciously when it loads in the browser, for example by stealing cookies or opening a phishing website. In an online social network, users can freely enter content of a certain length into text boxes on a page, including HTML, CSS, URI, and JavaScript code. To guard against malicious code in user input, the content of the input box must be statically analyzed when it is submitted: formal methods can be used to judge, separately for HTML, CSS, URIs, and JavaScript, whether the structure and content of these elements could produce malicious behavior.

Among malicious web pages, malicious code based on XSS vulnerabilities is the most common kind, and many mature analysis techniques exist for it. When analyzing web pages outside online social networks (for example portal sites and forum sites), the analysis starts from code obfuscation: features of obfuscated code are extracted from the page to judge whether the page contains suspicious malicious code. The extracted features mainly include keywords, JavaScript features (length, character counts, and so on), and URL features.

In the existing methods for analyzing, detecting, and identifying malicious web pages in online social networks, static analysis mostly requires complicated analysis steps, takes a long time, and is not timely. The low time cost that static analysis should have compared with dynamic analysis is not fully realized, and the web request latency caused by complex analysis and computation also harms network applications. Proposing a simple and effective feature extraction method for malicious web pages in online social networks that lowers analysis cost is therefore a problem that urgently needs to be studied and solved.

Summary of the invention

To address the problem of detecting and identifying malicious web pages in online social networks, the object of the invention is to propose a detection and identification method based on feature extraction from such pages. After a web page of an online social network is parsed, quantifiable features are extracted from five perspectives: keywords, JavaScript, HTML, URLs, and the characteristics of the online social network itself. These extracted features are then used to identify web pages in the online social network that carry malicious code exploiting XSS vulnerabilities.

The technical solution of the invention is as follows: a method for detecting and identifying malicious web pages in online social networks, comprising the steps of:

1) for any web page to be detected in the online social network, counting the frequency of occurrence of all keywords in the page; and, according to the source code of the page, dividing the page into one or more sets of different types: an HTML tag set, a JavaScript set, or a URL set;

2) from each of the above sets that is not empty, extracting static web page features that indicate obfuscation to obtain suspicious fields, and combining them with the times at which the suspicious fields appear to obtain the association features of the page;

3) creating an association information database to store the association features of the page and updating the association features of web pages in the database in real time; and deriving the page propagation speed from the association features;

4) identifying malicious web pages according to the page propagation speed, combined with one or more of the following features: the keyword frequencies obtained by counting, the detected suspicious JavaScript scripts, the suspicious HTML tags, and the suspicious URLs.

Further, code segments that conform to the HTML tag format are taken from the source code of the web page and gathered into the HTML tag set. An HTML tag consists of a start tag and/or an end tag; the start tag is an element name enclosed in angle brackets, and the end tag is a slash and an element name enclosed in angle brackets.

Further, JavaScript scripts appear in the source code of the web page at the following positions: between <script></script> tags, or after "javascript:". JavaScript scripts are taken out according to these positions and gathered into a set.

Further, URLs are separated and extracted from the source code of the web page by searching for valid strings that begin with a protocol name such as HTTP, HTTPS, or FTP, yielding the URL set.

Further, the static features that indicate obfuscation are extracted from the HTML tag set as follows:

The information of all tags in the HTML tag set is collected, and the maximum tag length in the set, the number of long tags, and the proportion of JavaScript strings contained in tags are extracted as measures of the degree of HTML tag obfuscation.

Further, the static features that indicate obfuscation are extracted from the JavaScript set as follows:

The information of all scripts in the JavaScript set is collected, and the maximum length of a script string in the set, the proportion of encoded characters in the script strings, and the number of string concatenations in the set are extracted as measures of the degree of JavaScript obfuscation.

Further, the static features that indicate obfuscation are extracted from the URL set as follows:

The information of all URLs in the set is collected, and the maximum URL length in the set, the number of long URLs, and the proportion of encoded characters in the URLs are extracted as measures of the degree of URL obfuscation.

Further, the keywords are JavaScript functions or HTML tags whose frequency of occurrence differs between benign scripts and malicious scripts, including: eval, document.write, unescape, fromCharCode, createElement, and createTextNode.

Further, the propagation speed is the frequency with which suspicious malicious code appears in web pages per unit time. The propagation speed of suspicious strings in a web page is calculated as follows:

1) collect the page's <time at which the suspicious field appears, suspicious string content> records from the association information database;

2) for each record, query the database for the number of records whose string content is identical and whose time falls within the hour preceding t; this count is the propagation speed of the string content in that record;

3) record the maximum of all propagation speeds and take it as the propagation speed of the web page.

Beneficial effects of the invention:

1. The invention extracts a set of features of malicious web pages in online social networks that is broadly applicable.

2. Based on the structural features of web pages, the invention preprocesses each page, extracts page features by element type, and at the same time builds an association information database across pages.

3. The invention takes full account of how malicious code propagates in online social networks and, based on this propagation behavior, extracts a set of features specific to online social networks.

In summary, the proposed method describes the features of malicious web pages in online social networks more accurately and detects and identifies malicious web pages more accurately, more efficiently, and at lower analysis cost.

Brief description of the drawing

Fig. 1 is a schematic flow chart of the method for detecting and identifying malicious web pages in online social networks.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawing. It should be understood that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

One embodiment of the invention is a method for detecting and identifying malicious web pages in online social networks, with the following steps:

1) for any web page to be detected in the online social network, count the frequency of occurrence of all given keywords in the page;

2) analyze the page structure and divide the page into sets of different types according to its source code, decomposing the source code into an HTML set, a JavaScript set, and a URL set;

3) from the HTML set, JavaScript set, and URL set, extract static page features that indicate obfuscation as suspicious fields, and combine them with the times at which the suspicious fields appear to obtain the association features of the page;

4) store the association features of the page and update the association information database so that it holds the latest association features; extract the association features of the page from the database that records the association features of online social network pages;

5) derive the page propagation speed from the association information and combine it with the given keywords, suspicious JavaScript, suspicious HTML, and suspicious URLs, five features in total, to obtain the feature vector of the page;

6) detect and identify malicious web pages according to the feature vector.

In one embodiment of the invention, keywords are certain JavaScript functions or HTML tags whose frequency of occurrence differs between benign scripts and malicious scripts. Such fields appear rarely in benign web pages and more often in malicious ones, so they are treated as the keywords of a page, and the frequency with which keywords appear in a page can be used to judge whether the page is malicious.

In one embodiment of the invention, the components of a web page are obtained by analyzing its structure. The page is preprocessed with the goal of dividing it into sets of different types. Because all the extracted features come from HTML tags, JavaScript scripts, and URLs, preprocessing divides the source code of the page into an HTML tag set, a JavaScript script set, and a URL set. The subsequent steps then only need to extract the relevant information from each of these three sets, which avoids the long processing time of handling a large amount of data at every step. Analyzing the classified sets and extracting features from them also makes the features more accurate.

In one embodiment of the invention, static page features are extracted on the basis of obfuscation characteristics in the page. For the three sets, the extraction methods are:

(1) For the HTML tag set, the information of all tags in the set is collected, and the maximum tag length in the set, the number of long tags, and the proportion of JavaScript strings contained in tags are extracted. These quantified statistics serve as measures of the degree of HTML tag obfuscation.

(2) For the JavaScript script set, the information of all scripts in the set is collected, and the maximum length of a script string in the set, the proportion of encoded characters in the script strings, and the number of string concatenations in the set are extracted. These quantified statistics serve as measures of the degree of JavaScript obfuscation.

(3) For the URL set, the information of all URLs in the set is collected, and the maximum URL length in the set, the number of long URLs, and the proportion of encoded characters in the URLs are extracted. These quantified statistics serve as measures of the degree of URL obfuscation.
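The three groups of statistics above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patent does not say what counts as a "long" tag or URL, which characters count as "encoded", or how string concatenation is detected, so the `long_len` threshold, the escape patterns in `ENCODED_RE`, the `CONCAT_RE` heuristic, and the reading of "proportion of JavaScript strings in tags" as the share of tags carrying inline JavaScript are all assumptions.

```python
import re

# Assumption: "encoded characters" means percent-escapes, \xNN, \uNNNN and numeric HTML entities.
ENCODED_RE = re.compile(r"%[0-9A-Fa-f]{2}|\\x[0-9A-Fa-f]{2}|\\u[0-9A-Fa-f]{4}|&#x?[0-9A-Fa-f]+;")
# Assumption: a string concatenation is a '+' directly next to a quote character.
CONCAT_RE = re.compile(r"[\"']\s*\+|\+\s*[\"']")
# Assumption: a tag "contains JavaScript" if it has a javascript: URI or an on* event attribute.
INLINE_JS_RE = re.compile(r"javascript:|\son\w+\s*=", re.I)


def encoded_ratio(text):
    """Fraction of characters in text that belong to an escape sequence."""
    if not text:
        return 0.0
    encoded = sum(len(m.group(0)) for m in ENCODED_RE.finditer(text))
    return encoded / len(text)


def html_features(tags, long_len=100):
    """(max tag length, number of long tags, share of tags with inline JavaScript)."""
    if not tags:
        return (0, 0, 0.0)
    with_js = sum(1 for t in tags if INLINE_JS_RE.search(t))
    return (max(map(len, tags)), sum(len(t) > long_len for t in tags), with_js / len(tags))


def js_features(scripts):
    """(max script length, encoded-character ratio, number of string concatenations)."""
    if not scripts:
        return (0, 0.0, 0)
    concats = sum(len(CONCAT_RE.findall(s)) for s in scripts)
    return (max(map(len, scripts)), encoded_ratio("".join(scripts)), concats)


def url_features(urls, long_len=100):
    """(max URL length, number of long URLs, encoded-character ratio)."""
    if not urls:
        return (0, 0, 0.0)
    return (max(map(len, urls)), sum(len(u) > long_len for u in urls), encoded_ratio("".join(urls)))
```

Each function returns zeros for an empty set, which matches the summary's requirement that features are only extracted from sets that are not empty.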

In one embodiment of the invention, the characteristics of online social networks are taken into account: malicious code propagates differently in a social network than in a general network. Most directly, the high clustering of the social network topology and its small average shortest path length make malicious code spread far faster in a social network than in a general network. To quantify the propagation speed, the invention defines it as the frequency with which suspicious malicious code appears in web pages per unit time; computing it depends on the number of times the suspicious malicious code has appeared in the pages sent by the server in the most recent hour. Extracting this feature requires knowledge of all pages examined in the past unit of time, so an association information database is created and the association features of online social network pages in it are updated in real time; the required features can then be computed from the database.

In one embodiment of the invention, the association information database must be updated continuously and must hold the association information of all web pages. Therefore, once the association features of a page are obtained, its association information is saved to the database, updating it. Only the association information from the most recent hour needs to be consulted; information older than one hour is no longer used as a reference. To keep database access efficient, a maintenance pass runs every ten minutes and deletes all information older than one hour.

Fig. 1 shows the schematic flow chart of the method for detecting and identifying malicious web pages in online social networks, which comprises the following steps:

1. Extract the first group of web page features, which mainly comprises keyword features.

When a malicious web page loads in the client browser, it can carry out attack behaviors, which are implemented by executing a combination of functions. In static keyword analysis, the number of times each keyword appears serves as the keyword feature, standing in for the keyword execution order observed in dynamic analysis. Statistics show that some script functions may appear in all web pages, yet the frequency with which they are used differs widely. Keywords may include, but are not limited to, eval, document.write, unescape, fromCharCode, createElement, and createTextNode; those skilled in the art understand how to extract keywords for malicious web page vulnerabilities, so the keywords listed above are not limiting. Take eval, which executes code supplied in string form. It is a legitimate function present in many kinds of web pages, but it generally appears there with low frequency. In malicious web pages, eval appears more often than usual, so the features of keywords of this kind can be extracted and used as an indicator for identifying malicious web pages.
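A minimal sketch of the keyword count, assuming the keyword list given in the description and assuming that a keyword should only match as a whole identifier (so that `evaluate` is not counted as `eval`). The function name `keyword_counts` and the whole-identifier rule are illustrative choices, not taken from the patent.

```python
import re
from collections import Counter

# Keywords named in the description; the patent states that this list is not limiting.
KEYWORDS = ["eval", "document.write", "unescape", "fromCharCode",
            "createElement", "createTextNode"]


def keyword_counts(source):
    """Count occurrences of each keyword in the page source.

    Assumption: a keyword only counts when it is not glued to other
    identifier characters on either side.
    """
    counts = Counter()
    for kw in KEYWORDS:
        pattern = r"(?<![\w$])" + re.escape(kw) + r"(?![\w$])"
        counts[kw] = len(re.findall(pattern, source))
    return counts
```

The raw counts could be divided by the page length if a normalized frequency is preferred; the patent text does not specify which form is used.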

2. Preprocess the web page by classifying its source code according to element type.

The source code of a web page contains many kinds of elements, the most basic being HTML tags, JavaScript scripts, and URLs. One of the starting points of the invention is to look for traces of malicious code in HTML tags, JavaScript scripts, and URLs. To simplify the implementation, the page is preprocessed before the remaining feature groups are extracted, yielding sets of the three kinds of elements.

The preprocessing procedure is as follows:

1) HTML tags have a standard format. A tag consists of a start tag and an end tag: the start tag is an element name enclosed in angle brackets, and the end tag is a slash and an element name enclosed in angle brackets. Some tags may have no end tag, such as <br/>. Code segments that conform to the HTML tag format are taken from the page source code and gathered into a set.

2) In a web page, JavaScript scripts usually appear between <script></script> tags or after "javascript:". The page source code is analyzed according to these positions, and the JavaScript scripts are taken out and gathered into a set.

3) A URL is the address of a resource on the Internet and follows a uniform standard; a web page may reference resources from its own domain or from other domains. A URL begins with a protocol name, and the protocols commonly used on the Internet are limited in number, including HTTP, HTTPS, and FTP. When collecting the URL set, it is therefore sufficient to search the page source code for valid strings that begin with a protocol name in order to separate and extract the URLs.
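The preprocessing steps above can be sketched with regular expressions. This is a simplified illustration under stated assumptions: the patterns below are hypothetical approximations of "code segments that conform to the HTML tag format" and "valid strings that begin with a protocol name", and a production system would more likely use a real HTML parser. The function name `split_page` is not from the patent.

```python
import re

# Start tags, end tags, and self-closing tags such as <br/>.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
# Script bodies between <script ...> and </script>.
SCRIPT_BLOCK_RE = re.compile(r"<script\b[^>]*>(.*?)</script\s*>", re.I | re.S)
# Script text following "javascript:" up to the end of the attribute value.
JS_URI_RE = re.compile(r"javascript:([^\"'>]*)", re.I)
# Strings that begin with one of the protocol names listed in the description.
URL_RE = re.compile(r"(?:https?|ftp)://[^\s\"'<>]+", re.I)


def split_page(source):
    """Return (html_tags, javascript_scripts, urls) extracted from page source."""
    tags = TAG_RE.findall(source)
    scripts = [s for s in SCRIPT_BLOCK_RE.findall(source) if s.strip()]
    scripts += JS_URI_RE.findall(source)
    urls = URL_RE.findall(source)
    return tags, scripts, urls
```

Keeping the three sets separate means that each later feature-extraction step only scans the elements it cares about, which is the efficiency argument the description makes for preprocessing.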

3. Extract the second group of features, which mainly comprises HTML tag features.

HTML tags form the structure of a web page. Tags can be added and deleted dynamically by scripts; in addition, tag attributes can be modified dynamically by scripts (for example, value), and some are executed automatically (for example, src). HTML tags are therefore a good place to hide malicious scripts. Ordinary HTML tags are limited in length, so if a malicious script is hidden in an HTML tag, the tag may be longer than the tags of benign web pages.

4. Extract the third group of features, which mainly comprises JavaScript script features.

XSS malicious code is generally written in JavaScript. Apart from making the code aggressive, the authors of malicious code often apply obfuscation to the script to mislead victims, reducing the readability of the program so that victims do not notice it. A common obfuscation technique is to encode the malicious code. An encoded script is noticeably longer, and the proportion of encoded characters in its strings also increases.

5. Extract the fourth group of features, which mainly comprises URL features.

When a reflected XSS vulnerability exists in a web page, the page source code can contain the malicious URLs that trigger the XSS; these URLs carry malicious scripts. To trick users into clicking a malicious URL, the author of the malicious code deliberately distorts the URL so that users cannot make out the content of its parameter part, deceiving them into clicking the URL in the page while off guard.

6. Store the web page association information and update the association information database.

An association information database is set up to hold the association information of web pages. Association information consists of the suspicious fields in a web page (such as suspicious JavaScript script strings, suspicious URLs, and suspicious HTML tags) together with the times at which those fields appear. Because the propagation speed must be computed and speed is directly tied to time, the time is an important field in the database.

When features are extracted from a web page, the preceding five steps yield a set of suspicious strings in the page. So that the propagation speed of pages appearing later in the network traffic can be computed easily, the association information of these suspicious strings must be saved: a set of records of the form <time at which the suspicious field appears, suspicious string content> is inserted into the database for the page. In addition, to keep the database efficient, redundant content is cleaned up every ten minutes by deleting all records older than one hour, which reduces the size of the database and keeps the association information database updated and maintained in a timely manner.
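The storage and cleanup described above could look like the following sketch, which uses Python's built-in `sqlite3` module. The patent does not name a database engine or schema, so the table name `assoc`, the column names, and the helper functions are hypothetical; the one-hour window comes from the description, and the ten-minute schedule is left to the caller.

```python
import sqlite3

WINDOW_SECONDS = 3600  # records older than one hour are discarded (from the description)


def open_db(path=":memory:"):
    """Create the association table of <time, suspicious string content> records."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assoc (seen_at REAL NOT NULL, content TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_assoc ON assoc (content, seen_at)")
    return conn


def store_records(conn, seen_at, suspicious_strings):
    """Insert one record per suspicious string found in a page seen at time seen_at."""
    conn.executemany(
        "INSERT INTO assoc (seen_at, content) VALUES (?, ?)",
        [(seen_at, s) for s in suspicious_strings],
    )
    conn.commit()


def purge_old(conn, now):
    """Delete records older than one hour; intended to be called every ten minutes."""
    cur = conn.execute("DELETE FROM assoc WHERE seen_at < ?", (now - WINDOW_SECONDS,))
    conn.commit()
    return cur.rowcount
```

Indexing on `(content, seen_at)` is a design choice aimed at the later lookup, which filters by identical string content and a time range.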

7. Extract the fifth group of features, which mainly comprises web page association features.

In social networks, malicious code based on XSS vulnerabilities spreads very quickly across web pages, and propagation speed is an effective feature for identifying malicious web pages. A method is therefore needed to extract, from the UTF-8 encoded page source code, a quantitative feature that reflects propagation speed.

By analogy with the simple definition of scalar speed (the distance an object travels per unit time), the propagation speed is defined as the number of times a string appears in web pages per unit time. The propagation speed of suspicious strings in a web page is calculated as follows:

1) collect the <time t, string content C> records of the web page from step 6;

2) compute the propagation speed of the string content in each record: query the database for the number of records whose string content is identical and whose time falls within the hour preceding t; this count is the propagation speed of the record;

3) record the maximum of all propagation speeds and take it as the propagation speed of the web page.
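The calculation can be sketched in a few lines. For self-containment the stored records are passed in as a plain list rather than queried from a database, and the page's speed is taken as the maximum over its records, as the summary of the invention states. The function name `propagation_speed` and the choice to treat the one-hour window as closed at both ends are assumptions.

```python
def propagation_speed(page_records, stored_records, window=3600.0):
    """Propagation speed of a page.

    page_records:   list of (t, content) records for the page being checked.
    stored_records: list of (t, content) records already in the association database.
    For each page record, count the stored records with identical content whose
    time lies within the hour before t; the page speed is the largest such count.
    """
    best = 0
    for t, content in page_records:
        count = sum(
            1
            for t2, c2 in stored_records
            if c2 == content and t - window <= t2 <= t
        )
        best = max(best, count)
    return best
```

With a relational store, the inner count corresponds to a single query that filters on identical content and a time range.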

8. Merge the five feature groups obtained in steps 1, 3, 4, 5, and 7 to obtain the feature vector of the web page.
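One simple way to merge the five groups is plain concatenation into a flat numeric vector, sketched below. The patent does not specify the layout of the vector or the classifier that consumes it, so the ordering, the three-value shape of each static feature group, and the function name `feature_vector` are assumptions made for illustration.

```python
# Keyword order fixes the position of each keyword count in the vector.
KEYWORDS = ["eval", "document.write", "unescape", "fromCharCode",
            "createElement", "createTextNode"]


def feature_vector(keyword_counts, html_feats, js_feats, url_feats, speed):
    """Concatenate keyword counts, the three static feature groups, and the speed."""
    vec = [float(keyword_counts.get(k, 0)) for k in KEYWORDS]
    vec.extend(float(x) for x in html_feats)
    vec.extend(float(x) for x in js_feats)
    vec.extend(float(x) for x in url_feats)
    vec.append(float(speed))
    return vec
```

With six keywords and three statistics per static group, this layout yields a 16-dimensional vector.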

Online social networks are distinctive compared with general network applications. XSS malicious code propagates differently in a social network than in a general network: most directly, the high clustering of the social network topology and its small average shortest path length make XSS malicious code spread far faster in a social network than in a general network. As a real example, the 2003 computer virus Blaster produced 336,000 infections in 20 hours, whereas the social network XSS worm Samy infected 1,000,000 users in 20 hours. This comparison shows that the average number of users infected by malicious code per unit time in an online social network is about 3 times that in a general network (1,000,000 / 336,000 ≈ 2.98). Therefore, if the propagation speed can be captured, malicious web pages in network traffic can be distinguished more effectively.

Experimental data:

Type	Malicious	Benign
Sample number	11,761	18,302
Precision	87.1%	96.1%
Recall	94.3%	91.1%
F-Measure	90.6%	93.5%

The detection results in the table above show that the recognition rate achieved when detecting web pages with the features proposed in the invention averages about 90%, a good detection result. This indicates that the "propagation speed" feature proposed in the invention plays an important role in identifying malicious web pages in online social networks.
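As a consistency check on the table, the F-Measure values can be recomputed from the reported precision and recall. The sketch assumes that "F-Measure" means the balanced F1 score (the harmonic mean of precision and recall); the patent does not state the formula, but the reported values agree with this reading.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the experimental data table.
malicious_f = f_measure(0.871, 0.943)
benign_f = f_measure(0.961, 0.911)
print(round(malicious_f * 100, 1), round(benign_f * 100, 1))
```

Rounded to one decimal place, the recomputed values match the 90.6% and 93.5% shown in the table.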

Claims

1. A method for detecting and identifying malicious web pages in online social networks, comprising the steps of:

2. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that code segments that conform to the HTML tag format are taken from the source code of the web page and gathered into the HTML tag set; an HTML tag consists of a start tag and/or an end tag, the start tag being an element name enclosed in angle brackets and the end tag being a slash and an element name enclosed in angle brackets.

3. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that JavaScript scripts appear in the source code of the web page between <script></script> tags or after "javascript:"; JavaScript scripts are taken out according to these positions and gathered into a set.

4. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that URLs are separated and extracted from the source code of the web page by searching for valid strings that begin with a protocol name such as HTTP, HTTPS, or FTP, yielding the URL set.

5. The method for detecting and identifying malicious web pages in online social networks according to claim 1 or 2, characterized in that the static features that indicate obfuscation are extracted from the HTML tag set as follows:

6. The method for detecting and identifying malicious web pages in online social networks according to claim 1 or 3, characterized in that the static features that indicate obfuscation are extracted from the JavaScript set as follows:

7. The method for detecting and identifying malicious web pages in online social networks according to claim 1 or 4, characterized in that the static features that indicate obfuscation are extracted from the URL set as follows:

8. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that the keywords are JavaScript functions or HTML tags whose frequency of occurrence differs between benign scripts and malicious scripts, including: eval, document.write, unescape, fromCharCode, createElement, and createTextNode.

9. The method for detecting and identifying malicious web pages in online social networks according to claim 1, characterized in that the propagation speed is the frequency with which suspicious malicious code appears in web pages per unit time, and the propagation speed of suspicious strings in a web page is calculated as follows: