CN113609246A

CN113609246A - Webpage similarity detection method and system

Info

Publication number: CN113609246A
Application number: CN202110891633.6A
Authority: CN
Inventors: 陈业炫; 奉轶; 徐文博; 张燕; 陆亦恬; 朱璋颖; 唐祝寿
Original assignee: Shanghai Benzhong Information Technology Co ltd
Current assignee: Shanghai Benzhong Information Technology Co ltd
Priority date: 2021-08-04
Filing date: 2021-08-04
Publication date: 2021-11-05
Anticipated expiration: 2041-08-04

Abstract

The invention provides a webpage similarity detection method and system. The method comprises: obtaining a CSS file and a JS file in a dynamic rendering page; performing lexical analysis on the CSS file and the JS file to obtain the fuzzy hash value similarity between the dynamic rendering page and other web pages; performing syntactic analysis on the CSS file and the JS file to obtain the page feature vector similarity between the dynamic rendering page and other web pages; and detecting the similarity between the dynamic rendering page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. By statically analyzing the JS code and CSS styles of a web page, the invention obtains the features of the dynamically rendered page and calculates webpage similarity from those features, which can greatly improve the efficiency of detecting similar web pages and overcomes the inability of the prior art to calculate the similarity of dynamically rendered pages statically.

Description

Webpage similarity detection method and system

Technical Field

The invention belongs to the technical field of webpage detection, and particularly relates to a webpage similarity detection method and system.

Background

With the vigorous development of the internet, various malicious web applications, such as fraud and gambling sites, have proliferated, and various detection technologies have been developed to find them in time. The main processing objects of these detection technologies are web application pages; besides analyzing the content of a page, similarity comparison among multiple pages is needed to screen out more malicious web applications. Web application pages are obtained by automatic crawling, but as web development technology has matured, the crawled pages include not only static pages but also a large number of dynamic rendering pages. A static page is an HTML file in which the page data and DOM structure are stored directly. A dynamic rendering page is a page without a real DOM structure, which must be generated by dynamic rendering through JS and CSS, such as a single page web application (SPA). For static pages, similarity comparison is currently based on web page content similarity and web page structure similarity.

Web page content similarity means that although different web application pages are different in layout, the same text content is reprinted. At this time, the technology for calculating content similarity usually adopts a vector space model to identify webpage text information, specifically, a word segmentation is performed on a webpage text, then a certain weight is given to the word through calculation (such as a TF-IDF algorithm), finally, a webpage is represented as a high-dimensional vector, and the similarity between the webpages is measured through distance calculation (such as euclidean distance).

The web page structural similarity means that text content, pictures, colors and the like of different web application pages are different, but the page layouts are very similar. The method for calculating the similarity of the webpage structures mainly comprises the following steps: 1) based on a webpage DOM (document object model) tree, calculating the similarity of the webpage structures through the DOM structure according to the tree editing distance, a simple tree matching algorithm or tree path matching; 2) based on the visual information of the webpage structure, DOM visual block information is obtained through a webpage DOM tree, each visual block appearing in the page is subjected to differential cutting and division in the aspects of position center, area and aspect ratio, different representation sequences are given to information of different levels, and finally the obtained representation sequences are used as identity information of the page to carry out similarity calculation.

It can be seen that the prior art has the following problem. The internet contains a large number of malicious applications, such as gambling and fraud sites, built as dynamic rendering pages. Because the JS code and CSS styles in a dynamic rendering page fetched by a crawler are not executed, the page contains neither the real page data nor the DOM structure, so the similarity calculation techniques for static pages cannot be applied to it. No method is currently available for quickly finding similar dynamic rendering pages without executing the JS and CSS code; that is, the similarity of dynamic rendering pages cannot currently be calculated.

Disclosure of Invention

The invention aims to provide a method and a system for detecting webpage similarity, and aims to solve the problem that the similarity calculation technology of a static page in the prior art cannot be applied to dynamic rendering of the page.

In order to achieve the purpose, the invention adopts the technical scheme that:

a webpage similarity detection method comprises the following steps:

step 1: acquiring a CSS file and a JS file in a dynamic rendering page;

step 2: performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;

step 3: obtaining the fuzzy hash value similarity of the dynamic rendering page and other web pages according to the CSS file token sequence and the JS file token sequence;

step 4: carrying out syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;

step 5: obtaining the similarity of the feature vectors of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;

step 6: and detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.

Preferably, the step 1: obtaining a CSS file and a JS file in a dynamically rendered page, comprising:

step 1.1: analyzing an HTML (hypertext markup language) label of the dynamic rendering page to obtain an original file with a suffix name of CSS (cascading style sheets) and an original file with a suffix name of JS;

step 1.2: acquiring the code lengths of the original file of the CSS and the original file of the JS and setting a length threshold value;

step 1.3: filtering out every original CSS file and original JS file whose code length is larger than the length threshold, to obtain the CSS file and the JS file.

Preferably, the step 3: obtaining the fuzzy hash value similarity of the dynamic rendering page and other web pages according to the CSS file token sequence and the JS file token sequence, including:

step 3.1: splicing the token types in the CSS file token sequence into a first character string;

step 3.2: splicing the token types in the JS file token sequence into a second character string;

step 3.3: performing fuzzy hash operation on the first character string and the second character string respectively to obtain a CSS file page hash value characteristic and a JS file page hash value characteristic;

step 3.4: and obtaining the fuzzy hash value similarity of the dynamically rendered page and other web pages according to the CSS file page hash value characteristics and the JS file page hash value characteristics.

Preferably, the step 3.4: obtaining the fuzzy hash value similarity of the dynamically rendered page and other webpages according to the CSS file page hash value characteristic and the JS file page hash value characteristic, comprising:

step 3.4.1: calculating the similarity between the CSS file page hash value characteristics and corresponding CSS file page hash value characteristics of other webpages by using a fuzzy hash algorithm to generate a first similarity;

step 3.4.2: calculating the similarity of the JS file page hash value characteristic and the corresponding JS file page hash value characteristics of other webpages by using a fuzzy hash algorithm to generate a second similarity;

step 3.4.3: and taking the maximum value between the first similarity and the second similarity as the fuzzy hash value similarity of the dynamic rendering page and other web pages.

Preferably, the step 5: obtaining the similarity of the feature vectors of the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file, including:

step 5.1: dividing all nodes in the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file into different feature units;

step 5.2: obtaining a high-dimensional feature vector according to the feature unit;

step 5.3: determining the weight value of the feature unit in the abstract syntax tree according to a preset weight calculation rule;

step 5.4: reducing the dimension of the high-dimensional feature vector according to the weight value to obtain a code feature vector of JS and a code feature vector of CSS in the dynamic rendering page;

step 5.5: and obtaining the similarity of the page feature vectors of the dynamic rendering page and other web pages according to the code feature vector of the JS and the code feature vector of the CSS.

Preferably, the step 5.3: determining the weight value of the feature unit in the abstract syntax tree according to a preset weight calculation rule, wherein the weight value comprises the following steps:

and determining the weight value of each feature unit in the corresponding abstract syntax tree according to the rule that the weight value decreases with the depth of the feature unit in the corresponding abstract syntax tree.

Preferably, the step 5.5: obtaining the similarity of the page feature vectors of the dynamic rendering page and other web pages according to the code feature vector of the JS and the code feature vector of the CSS, and the similarity comprises the following steps:

step 5.5.1: calculating a first distance between the code feature vector of the JS and the code feature vector of the corresponding JS of other web pages;

step 5.5.2: calculating a second distance between the code feature vector of the CSS and the code feature vector of the corresponding CSS of the other web page;

step 5.5.3: and taking the minimum value between the first distance and the second distance as the page feature vector similarity of the dynamic rendering page and other web pages.

Preferably, the step 6: detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity, including:

and performing descending arrangement on the fuzzy hash value similarity, and performing ascending arrangement on the page feature vector similarity to detect the similarity between the dynamically rendered page and other webpages.

The invention also provides a webpage similarity detection system, which comprises:

the CSS file and JS file acquisition module is used for acquiring a CSS file and a JS file in the dynamic rendering page;

the token sequence generating module is used for performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;

the fuzzy hash value similarity calculation module is used for obtaining the fuzzy hash value similarity of the dynamic rendering page and other webpages according to the CSS file token sequence and the JS file token sequence;

the abstract syntax tree extraction module is used for carrying out syntax analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;

the page feature vector similarity calculation module is used for obtaining the page feature vector similarity of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;

and the similarity detection module is used for detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.

The webpage similarity detection method and the webpage similarity detection system have the advantages that: compared with the prior art, the webpage similarity detection method comprises the steps of obtaining a CSS file and a JS file in a dynamic rendering page; performing lexical analysis on the CSS file and the JS file to obtain the similarity of the fuzzy hash value of the dynamic rendering page and other web pages; carrying out syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file; obtaining the similarity of the feature vectors of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file; and detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. According to the method and the device, the webpage JS and CSS styles are statically analyzed to obtain the characteristics of the dynamically rendered webpage and the similarity of the webpage is calculated based on the characteristics, so that the detection efficiency of the similar webpage can be greatly improved, and the defect that the similarity of the dynamically rendered webpage cannot be statically calculated in the prior art is overcome.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a schematic diagram of a method for detecting web page similarity according to an embodiment of the present invention.

Fig. 2 is a flowchart of a method for detecting web page similarity according to an embodiment of the present invention.

Fig. 3 is a schematic diagram illustrating a principle of a web page similarity detection apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

a webpage similarity detection method comprises the following steps:

step 1: acquiring a CSS file and a JS file in a dynamic rendering page; wherein, step 1 specifically includes:

step 3: obtaining the fuzzy hash value similarity of the dynamic rendering page and other web pages according to the CSS file token sequence and the JS file token sequence; wherein, step 3 specifically includes:

step 3.4: and obtaining the fuzzy hash value similarity of the dynamically rendered page and other web pages according to the CSS file page hash value characteristics and the JS file page hash value characteristics. Specifically, the method comprises the following steps:

step 3.4.1: calculating the similarity between the CSS file page hash value characteristics and corresponding CSS file page hash value characteristics of other webpages by using a fuzzy hash algorithm to generate a first similarity; it should be noted that the page hash value characteristics of the corresponding CSS file/JS file of other web pages are also obtained based on the web page similarity detection method in the present invention.

The following describes the process of calculating the fuzzy hash value similarity between a dynamically rendered page and other web pages in the present invention with reference to specific embodiments:

Fig. 1 is a schematic diagram of a method for detecting web page similarity according to an embodiment of the present invention. As shown in Fig. 1, the method for detecting web page similarity according to the embodiment of the present invention includes:

step 101: respectively calculating the characteristics of the page according to the JS code and the CSS style of the page;

step 102: and performing webpage similarity calculation according to the characteristics of the pages, and detecting.

Through the processing, the similarity of the dynamically rendered pages can be calculated, and when a certain web malicious application is found, all the malicious application pages similar to the web malicious application can be quickly found by detecting the application pages with similar characteristics.

Fig. 2 is a flowchart of a method for detecting web page similarity according to the present invention, as shown in fig. 2, including the following processing steps:

step 201: acquiring the CSS and JS files in a dynamic rendering page. Files with the suffix names CSS and JS are obtained by parsing the href and type attributes of the link and style tags in the HTML; for example, the corresponding CSS file is obtained by parsing an HTML tag such as < link href = cs/mobile-motion-vue.0915736c. Meanwhile, the length threshold is set to 10000, and files whose CSS or JS code length is larger than the threshold are filtered out.
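The acquisition and filtering in step 201 can be sketched as follows. This is a minimal illustration and not the patented implementation: the class name `AssetLinkParser` and the helper `keep_short_files` are hypothetical, reading JS URLs from the `src` attribute of `script` tags is an assumption (the text above names only the link and style tags), and downloading the files is left out.

```python
from html.parser import HTMLParser

LENGTH_THRESHOLD = 10000  # threshold value stated in the embodiment


class AssetLinkParser(HTMLParser):
    """Collects URLs that end in .css or .js from link and script tags."""

    def __init__(self):
        super().__init__()
        self.css_urls = []
        self.js_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            url = attrs.get("href") or ""
        elif tag == "script":  # assumption: JS files are referenced via <script src>
            url = attrs.get("src") or ""
        else:
            return
        path = url.split("?", 1)[0]  # ignore any query string when checking the suffix
        if path.endswith(".css"):
            self.css_urls.append(url)
        elif path.endswith(".js"):
            self.js_urls.append(url)


def keep_short_files(sources, threshold=LENGTH_THRESHOLD):
    """Drop every file whose code length is larger than the threshold."""
    return {name: code for name, code in sources.items() if len(code) <= threshold}
```

Here `sources` is assumed to be a mapping from file name to downloaded source text; how the crawler fetches each URL is outside the scope of the sketch.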

Step 202 (1): and performing lexical analysis on the acquired CSS and JS files. The method is specifically realized by performing lexical analysis on JS and CSS respectively based on the existing open source toolkit. For example, the Esprima parser is used for performing lexical analysis on the JS file, so that the token sequence of the JS file can be quickly obtained. Similarly, the token sequence of the CSS file can be rapidly obtained by performing lexical analysis on the CSS file by using the tinycss2 toolkit.
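To make the notion of a token sequence concrete, the sketch below tokenizes a small JS fragment with a toy regular-expression lexer. It is only a stand-in for a real parser such as Esprima: the keyword set is deliberately tiny, string escapes, comments, and regular-expression literals are not handled, and the names `TOKEN_RE`, `KEYWORDS`, and `tokenize` are hypothetical. The type labels follow Esprima's naming style.

```python
import re

# Deliberately small keyword set; a real lexer knows the full JS grammar.
KEYWORDS = {"var", "let", "const", "function", "return", "if", "else", "for", "while", "new"}

TOKEN_RE = re.compile(r"""
    (?P<String>"[^"]*"|'[^']*')
  | (?P<Numeric>\d+(?:\.\d+)?)
  | (?P<Word>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<Punctuator>===|!==|==|!=|<=|>=|&&|\|\||=>|[{}()\[\];,.<>+\-*/%=!&|?:])
  | (?P<Skip>\s+)
""", re.VERBOSE)


def tokenize(src):
    """Return a list of (type, text) pairs for a JS source string."""
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            pos += 1  # skip characters this toy lexer does not understand
            continue
        pos = m.end()
        kind = m.lastgroup
        if kind == "Skip":
            continue
        text = m.group()
        if kind == "Word":
            kind = "Keyword" if text in KEYWORDS else "Identifier"
        tokens.append((kind, text))
    return tokens
```

For the input `var x = 1;` the sketch yields the types Keyword, Identifier, Punctuator, Numeric, Punctuator, which is the kind of sequence the following steps consume.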

Step 203 (1): after the CSS and JS files are analyzed in a lexical mode, the analysis results are spliced into character strings, and hash values are calculated for the character strings to obtain page characteristics.

In practical application, the method comprises the following processing procedures: firstly, sequentially traversing the token sequence of the output result of the lexical analysis of each JS file to obtain the type of each token. For example, only a few types are defined in the Esprima parser, such as "keyword", "string", etc. Splicing the type of each token into a character string; splicing the character strings obtained by each JS file into a final character string; and finally, carrying out fuzzy hash operation on the character string to obtain a final hash value as one of the page hash value characteristics. The CSS file is processed according to the flow, and the obtained hash value is used as another page hash value characteristic.
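The splicing described above can be sketched as follows, assuming each file's token sequence is already available as a list of (type, text) pairs. The function names `type_string` and `page_type_string` are hypothetical, and joining the types with no separator is an assumption; the resulting string is what would then be passed to the fuzzy hash operation.

```python
def type_string(tokens):
    """Concatenate the type of every token from one file, in order."""
    return "".join(tok_type for tok_type, _text in tokens)


def page_type_string(files_tokens):
    """Concatenate the per-file strings of all retained files into one string."""
    return "".join(type_string(tokens) for tokens in files_tokens)
```

Only the token types are used, so two files that differ in identifier names or literal values but share the same code shape produce the same string.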

Step 204 (1): calculating the similarity of the web pages according to the obtained hash values. Different fuzzy hash algorithms each provide their own distance or similarity calculation between hash values. For example, TLSH calculates a distance between hash values, and SSdeep calculates a degree of matching between hash values. The method uses the SSdeep fuzzy hash algorithm: after the matching degree of the JS hash values and the matching degree of the CSS hash values of different pages are calculated, the higher of the two is selected as the fuzzy hash value similarity of the pages. Specifically, let the matching degree of the JS hash values of two dynamic rendering pages be D1 and the matching degree of their CSS hash values be D2; if D1 > D2, D1 is chosen as the fuzzy hash value similarity of the two pages, and otherwise D2 is chosen.
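The selection of the larger matching degree can be sketched as below. No fuzzy hash library is used here: `match_score` is a loudly labeled stand-in built on `difflib.SequenceMatcher` that compares two strings directly and returns a value from 0 to 100, mimicking the range of an SSdeep match score. In a real pipeline the `compare` argument would be replaced by the comparison function of the chosen fuzzy hash library. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def match_score(a, b):
    """Stand-in for a fuzzy-hash comparison: 0 (unrelated) to 100 (identical)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_hash_similarity(js_a, js_b, css_a, css_b, compare=match_score):
    """Return the larger of the JS matching degree D1 and the CSS matching degree D2."""
    d1 = compare(js_a, js_b)    # matching degree of the JS hash values
    d2 = compare(css_a, css_b)  # matching degree of the CSS hash values
    return max(d1, d2)
```

Taking the maximum means two pages count as similar when either their scripts or their styles match closely.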

step 5: obtaining the similarity of the feature vectors of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file; wherein, step 5 specifically includes:

step 5.3: determining the weight value of the feature unit in the abstract syntax tree according to a preset weight calculation rule; specifically, step 5.3 includes: determining the weight value of each feature unit in the corresponding abstract syntax tree according to the rule that the weight value decreases with the depth of the feature unit in the corresponding abstract syntax tree;

step 5.5: and obtaining the similarity of the page feature vectors of the dynamic rendering page and other web pages according to the code feature vector of the JS and the code feature vector of the CSS. Wherein, the step 5.5 specifically comprises:

The following describes the process of calculating the similarity between the dynamically rendered page and the page feature vectors of other web pages in the present invention with reference to specific embodiments:

step 202 (2): carrying out syntactic analysis on the acquired CSS and JS files. Specifically, the JS and CSS are parsed with existing open source toolkits. For example, parsing the JS file with the Esprima parser quickly yields the abstract syntax tree of the JS file. Similarly, parsing the CSS file with the tinycss2 toolkit quickly yields the abstract syntax tree of the CSS file.

Step 203 (2): and constructing an abstract syntax tree and extracting page feature vectors.

In practical application, the method specifically comprises the following processing procedures. First, the nodes in the abstract syntax tree are divided into different feature units. Then, a high-dimensional feature vector is obtained from the feature units by calculating the dimension to which each feature unit is mapped: the node type of the feature unit is taken as a character string, a hash operation (such as md5, sha1, sha128 and the like) is applied to the character string to obtain a positive integer value, and this positive integer value is taken as the dimension to which the feature unit is mapped in the high-dimensional vector. Finally, the weight value of the feature unit in the abstract syntax tree is calculated, and the real value of the feature unit on that dimension of the high-dimensional feature vector is determined according to the weight value.

The following is an example of an abstract syntax tree constructed from the JS code of a web page. The tree has a root node Program; Program has one child node, ExpressionStatement; ExpressionStatement has one child node, CallExpression; CallExpression has two child nodes, ArrayExpression and MemberExpression; and MemberExpression has one child node, Identifier.

The mapping process of the invention is explained in detail using the JS code example (CSS styles are processed in the same way as JS). The nodes in the constructed abstract syntax tree are divided into different feature units, and the node type of each feature unit is regarded as a character string; for example, the feature unit of the second-layer node is "ExpressionStatement". An md5 hash operation is then applied to the character string of the feature unit, with the result md5("ExpressionStatement") = 64556525. It can thus be determined that the feature unit "ExpressionStatement" is mapped to the 64556525th dimension of the high-dimensional feature vector.
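The mapping from a node-type string to a dimension index can be sketched as follows. How the hash digest is turned into a positive integer is not specified in the text, so interpreting the full hexadecimal md5 digest as one integer, and optionally reducing it with a modulus, are assumptions; the function name `feature_dimension` is hypothetical, and the sketch does not attempt to reproduce the figure quoted in the example above.

```python
import hashlib


def feature_dimension(node_type, modulus=None):
    """Map a node-type string to a non-negative integer dimension index."""
    digest = hashlib.md5(node_type.encode("utf-8")).hexdigest()
    value = int(digest, 16)  # assumption: read the whole digest as one integer
    return value if modulus is None else value % modulus
```

Because the mapping depends only on the node type, every occurrence of the same kind of node lands on the same dimension, which is what allows weights to accumulate there.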

It should be noted that, the information of the feature unit is a node type character string after parsing, and in practical applications, the relevant types include, but are not limited to, the types appearing in the above example.

After the dimension to which a feature unit is mapped in the high-dimensional feature vector is determined, the weight value of the feature unit in the syntax tree is determined according to a weight calculation rule and used as the value of the corresponding dimension of the high-dimensional vector. Specifically, the weight value represents the importance of the corresponding feature unit in the web page (that is, in the syntax tree of the web page). A feature unit is given a weight each time it appears in the syntax tree, and its final weight value is the sum of the weights given for all of its appearances. The invention mainly calculates the weight value of a feature unit according to two rules. First, the weight value decreases with the depth of the feature unit in the syntax tree. Second, the weight value decreases with repetition of the feature unit among sibling nodes (i.e., child nodes under the same parent node). In addition, when a feature unit lies too deep in the abstract syntax tree, it is ignored.

It should be noted that, in an abstract syntax tree constructed from CSS styles or JS code, differences in the information represented by feature units at deeper layers have little effect on the web page as a whole. In practical application, the weight of a feature unit can therefore be made to decrease with depth, and feature units whose depth in the abstract syntax tree is greater than 10 are ignored.

In practical applications, the weight of each occurrence of a feature unit is determined by several factors mentioned above.

For example, in the above example the feature unit "ExpressionStatement" is located at the second layer, so its weight value should be greater than that of feature units at deeper layers. The weight value may be preset to 1.0; since the feature unit is at layer 2 of the abstract syntax tree, the final weight value is the preset value multiplied by the attenuation factor (set to 0.5) raised to the power of 2, that is, 1.0 × 0.5² = 0.25.
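The weighting rules can be sketched as a recursive walk over the tree. The base weight 1.0, the attenuation factor 0.5, and the depth limit 10 come from the text; the sibling repetition factor `SIBLING_DECAY` is a hypothetical value, since the text states the rule without giving a number. Representing a node as a dict with `type` and `children` keys, counting the root as layer 1, and the function name `accumulate_weights` are also assumptions.

```python
from collections import defaultdict

BASE_WEIGHT = 1.0    # preset weight from the example
DEPTH_DECAY = 0.5    # attenuation factor from the example
MAX_DEPTH = 10       # feature units deeper than this are ignored
SIBLING_DECAY = 0.5  # hypothetical: the text gives no value for this rule


def accumulate_weights(node, depth=1, weights=None, repeat=0):
    """Sum a depth- and repetition-attenuated weight per node type."""
    if weights is None:
        weights = defaultdict(float)
    if depth > MAX_DEPTH:
        return weights  # node and its subtree are too deep to matter
    weights[node["type"]] += BASE_WEIGHT * DEPTH_DECAY ** depth * SIBLING_DECAY ** repeat
    seen = defaultdict(int)  # how often each type has appeared among these siblings
    for child in node.get("children", []):
        accumulate_weights(child, depth + 1, weights, seen[child["type"]])
        seen[child["type"]] += 1
    return weights
```

Applied to the example tree, the sketch assigns 0.25 to ExpressionStatement, matching the worked value of 1.0 multiplied by 0.5 to the power of 2.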

It should be noted that the weight value of the feature vector of the web page is a floating point number type.

In the above example, the feature unit "ExpressionStatement" is mapped to dimension 64556525 of the high-dimensional feature vector, and its weight value is affected by its depth in the abstract syntax tree. The weight value obtained by the final calculation is then used as the real value on the corresponding dimension of the high-dimensional feature vector; that is, the real value on the 64556525th dimension is determined. In practical applications, each feature unit is processed in this way, and the weight value of each feature unit in the abstract syntax tree is used as the real value of the corresponding dimension of the high-dimensional feature vector.

After the high-dimensional feature vector of the page is obtained, it needs to be compressed into a low-dimensional vector. In practical application, the integer obtained by hashing the information of a feature unit is large, so the high-dimensional vector has a very large number of dimensions, and a feature vector with a small number of dimensions is needed to ensure calculation efficiency. Therefore, after the high-dimensional feature vector of the web page abstract syntax tree is determined, it is compressed to a low dimension. The embodiment of the invention compresses by taking each dimension modulo the target dimension and simply summing the weights, which maintains the accuracy of the webpage similarity calculated from the compressed feature vectors.

The following describes the process of compressing the high-dimensional feature vector to the low-dimensional feature vector in detail:

assuming that the dimension of the high-dimensional feature vector is M, the M-dimensional high-dimensional feature vector needs to be compressed into an N-dimensional low-dimensional feature vector, where N is a natural number greater than or equal to 1 and less than M, and the following processing needs to be performed:

1. dividing each dimension index of the M-dimensional high-dimensional vector by N (1 ≤ N < M) to obtain the corresponding remainder;

2. taking the dimension of the corresponding high-dimensional feature vector with the same remainder as one dimension of the compressed N-dimensional feature vector;

3. and overlapping the weighted values of the corresponding high-dimensional feature vectors with the same remainder, wherein the overlapped real numerical value is used as the weighted value of the corresponding dimension in the N-dimensional feature vector.

For example, suppose the high-dimensional feature vector has 50000 dimensions and a 128-dimensional feature vector is desired, so the high-dimensional feature vector needs to be compressed. Assuming the high-dimensional feature vector is [b1, b2, b3, …, b50000], the weight value of the first dimension of the compressed feature vector is b1 + b129 + b257 + … + b49921, and so on for the other dimensions, which realizes the compression from the high-dimensional vector to the low-dimensional vector.
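The three compression steps can be sketched as below. The high-dimensional vector is assumed to be stored sparsely as a mapping from dimension index to weight, since most of its dimensions are empty; the function name `compress` is hypothetical. The slot for remainder r is stored at list position r, so the "first dimension" of the example above (indices 1, 129, 257, …) corresponds to position 1 of the returned list.

```python
def compress(high, n):
    """Fold a sparse high-dimensional vector into n dimensions.

    high: dict mapping dimension index -> weight.
    Returns a list of length n in which position r holds the sum of the
    weights of all dimensions whose index leaves remainder r when divided by n.
    """
    low = [0.0] * n
    for dim, weight in high.items():
        low[dim % n] += weight
    return low
```

Different feature units can collide in the same low dimension, in which case their weights are simply added, as described in step 3.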

Thus, the code feature vector of the JS and the style feature vector of the CSS in the dynamically rendered page can be obtained.

Step 204 (2): after the webpage feature vectors are obtained, calculating the similarity of the web pages according to a preset algorithm. The distance calculation algorithm may be the Euclidean distance, Jaccard distance, Hamming distance, cosine distance, and the like. Specifically, the Euclidean distance is used to calculate the distance D3 between the JS code feature vectors and the distance D4 between the CSS style feature vectors of different pages. A smaller distance indicates that the vectors are more similar, so the smaller of D3 and D4 is taken as the page feature vector similarity.
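A minimal sketch of this step, assuming the compressed feature vectors are plain sequences of floats of equal length and using `math.dist` (available from Python 3.8) for the Euclidean distance. The function name `feature_vector_similarity` is hypothetical.

```python
import math


def feature_vector_similarity(js_a, js_b, css_a, css_b):
    """Return the smaller of the JS distance D3 and the CSS distance D4."""
    d3 = math.dist(js_a, js_b)    # Euclidean distance between JS code feature vectors
    d4 = math.dist(css_a, css_b)  # Euclidean distance between CSS style feature vectors
    return min(d3, d4)
```

Note that this value is a distance, so unlike the fuzzy hash matching degree, a smaller result means the pages are more alike.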

Step 6: and detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. Wherein, step 6 specifically includes:

and performing descending order arrangement on the fuzzy hash value similarity, and performing ascending order arrangement on the page feature vector similarity to detect the similarity between the dynamic rendering page and other web pages, wherein the page with more similar features is arranged in front.

It should be noted that, finally, the web pages are sorted in an ascending order according to the vector similarity between the web pages, and sorted in a descending order according to the fuzzy hash similarity, so as to quickly search the web pages with similar characteristics. The method is mainly applied to the condition that a given webpage needs to search for the webpage with the similar characteristics to the webpage, and can greatly improve the efficiency of searching the similar webpage.
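The ordering can be sketched as below. The text does not say how the two orderings are combined into a single result list, so using the fuzzy hash similarity as the primary key (descending) and the feature vector distance as the tie-breaker (ascending) is an assumption, as are the `Candidate` type and the `rank` function.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    url: str
    hash_similarity: int    # fuzzy hash matching degree, higher is more similar
    vector_distance: float  # feature vector distance, lower is more similar


def rank(candidates):
    """Order candidates so that the pages most similar to the query come first."""
    return sorted(candidates, key=lambda c: (-c.hash_similarity, c.vector_distance))
```

Keeping the two scores as separate fields also makes it easy to produce the two orderings independently if a combined list is not wanted.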

Fig. 3 is a schematic diagram illustrating a principle of a web page similarity detection apparatus according to an embodiment of the present invention, and as shown in fig. 3, the web page similarity detection apparatus according to the present invention includes: a feature extraction module 30 and a similarity calculation module 31. The enhanced web page similarity calculation apparatus according to the embodiment of the present invention is described below.

Specifically, the feature extraction module 30 is configured to extract the code features of the page according to the JS and CSS codes of the web page, and specifically includes: a code extraction module 300, a lexical analysis module 301(1), and a fuzzy hash module 302(1); and a syntax analysis module 301(2) and a feature vector calculation module (comprising a dimension calculation module 302(2), a weight calculation module 303(2), and a dimension reduction module 304(2)).

The code extraction module 300 is configured to identify the JS and CSS file links in a page and download the linked files, and to filter files whose code length is greater than a threshold, the threshold being set to 10000.
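The link identification performed by the code extraction module can be sketched with the Python standard library. This is a minimal illustration under assumptions: JS files are taken to be referenced by `script` tags with a `src` attribute and CSS files by `link` tags with `rel="stylesheet"`; the class and function names are hypothetical, and downloading and the length filter are left out.

```python
from html.parser import HTMLParser


class AssetLinkParser(HTMLParser):
    """Collect external JS and CSS links found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.js_links = []
        self.css_links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script" and attributes.get("src"):
            self.js_links.append(attributes["src"])
        elif tag == "link" and attributes.get("href"):
            rel = (attributes.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.css_links.append(attributes["href"])


def extract_asset_links(html):
    """Return (js_links, css_links) for the given HTML text."""
    parser = AssetLinkParser()
    parser.feed(html)
    parser.close()
    return parser.js_links, parser.css_links


# A typical single page application shell: almost no DOM, only asset links.
page = (
    '<html><head><link rel="stylesheet" href="/static/app.css">'
    '<script src="/static/app.js"></script><script>var inline = 1;</script>'
    '</head><body><div id="app"></div></body></html>'
)
js_links, css_links = extract_asset_links(page)
print(js_links)   # ['/static/app.js']
print(css_links)  # ['/static/app.css']
```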

The lexical analysis module 301(1) is configured to, after the code extraction module 300 acquires the JS and CSS files, perform lexical analysis on the acquired JS and CSS files, thereby acquiring a token sequence.

The fuzzy hash module 302(1) is configured to, after the lexical analysis module 301(1) performs lexical analysis on the JS and CSS files, first splice a token sequence parsed by each JS file into a character string; then, sequentially concatenating character strings obtained after analyzing all reserved JS files in the page into an integral character string; and finally, carrying out fuzzy hash operation on the whole character string to obtain a final hash value as the hash value characteristic of the JS file on the page. The CSS file is analyzed like a JS file, and the finally obtained CSS file hash value is used as the page CSS file hash value characteristic.
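The splicing performed by the fuzzy hash module can be sketched as below. The separator used when joining tokens and the function names are assumptions, and the fuzzy hash itself is passed in as a callable so that any fuzzy hashing implementation can be substituted; the sketch does not implement a fuzzy hash.

```python
def page_token_string(files_tokens, sep=" "):
    """Join each file's token sequence into a string, then concatenate the
    per-file strings in order into one overall string for the page."""
    per_file = [sep.join(tokens) for tokens in files_tokens]
    return sep.join(per_file)


def page_hash_feature(files_tokens, fuzzy_hash):
    """Apply the supplied fuzzy hash callable to the overall string."""
    return fuzzy_hash(page_token_string(files_tokens))


# Token sequences of two retained JS files (hypothetical lexer output).
js_files = [
    ["var", "x", "=", "1", ";"],
    ["x", "++", ";"],
]
overall = page_token_string(js_files)
print(overall)  # var x = 1 ; x ++ ;
```

The same two functions apply unchanged to the CSS token sequences, which yields the CSS hash value feature of the page.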

The syntax analysis module 301(2) is configured to, after the code extraction module 300 acquires the JS and CSS files, perform syntax analysis on the acquired JS and CSS files, thereby constructing an abstract syntax tree.

After the syntax analysis module 301(2) parses the JS and CSS files, the feature vector calculation module is configured to calculate feature vectors according to the abstract syntax tree.

Specifically, the feature vector calculation module includes a dimension calculation module 302(2), a weight calculation module 303(2), and a dimension reduction module 304(2), wherein the dimension calculation module 302(2) is configured to calculate the dimension of the high-dimensional vector onto which each feature unit is mapped; the weight calculation module 303(2) is configured to calculate the weight values of the feature units in the abstract syntax tree; and the dimension reduction module 304(2) is configured to compress the high-dimensional vector into a low-dimensional feature vector, thereby obtaining the final code feature vector and the final style feature vector, respectively.

The similarity calculation module 31 is configured to calculate the similarity between web pages after the feature extraction module 30 extracts the page features of the web pages. Specifically, the similarity calculation module 31 includes: a fuzzy hash value similarity calculation module 310(1), a vector similarity calculation module 310(2), and a sorting module 311.

The fuzzy hash value similarity calculation module 310(1) calculates the JS page hash value similarity D1 and the CSS page hash value similarity D2 between different pages according to the finally obtained page hash features, and takes the larger of the two as the hash similarity between the web pages; for example, D1 is selected if D1 > D2.

The vector similarity calculation module 310(2) calculates the distance D3 between the JS code feature vectors and the distance D4 between the CSS style feature vectors of different pages according to the finally obtained webpage feature vectors, and takes the smaller of the two as the vector similarity between the web pages; for example, D3 is selected if D3 < D4.

The sorting module 311 sorts the similarities calculated by the fuzzy hash value similarity calculation module 310(1) in descending order and the similarities calculated by the vector similarity calculation module 310(2) in ascending order, so that pages with more similar features are ranked first and web pages with similar features can be found.

It should be noted that the values obtained by the fuzzy hash value similarity calculation module 310(1) and by the vector similarity calculation module 310(2) have opposite meanings: the larger the value obtained by the fuzzy hash value similarity calculation module 310(1), the more similar the web pages are; the smaller the value obtained by the vector similarity calculation module 310(2), the more similar the web pages are.

In summary, by calculating the similarity of the features of dynamically rendered web pages, the technical solution of the present invention overcomes the defect that the similarity of dynamically rendered web pages cannot be statically calculated in the prior art, and enables similar dynamically rendered web pages to be found quickly. When a certain malicious web application is found, all malicious web applications with similar page features can be found by searching for application pages with similar code feature vectors.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A webpage similarity detection method is characterized by comprising the following steps:

step 1: acquiring a CSS file and a JS file in a dynamic rendering page;

2. The method for detecting web page similarity according to claim 1, wherein the step 1: obtaining a CSS file and a JS file in a dynamically rendered page, comprising:

3. The method for detecting web page similarity according to claim 1, wherein the step 3: obtaining the fuzzy hash value similarity of the dynamic rendering page and other web pages according to the CSS file token sequence and the JS file token sequence, including:

4. The method for detecting webpage similarity according to claim 3, wherein the step 3.4: obtaining the fuzzy hash value similarity of the dynamically rendered page and other webpages according to the CSS file page hash value characteristic and the JS file page hash value characteristic, comprising:

5. The method for detecting web page similarity according to claim 1, wherein the step 5: obtaining the similarity of the feature vectors of the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file, including:

6. The method for detecting webpage similarity according to claim 5, wherein the step 5.3: determining the weight value of the feature unit in the abstract syntax tree according to a preset weight calculation rule, comprising:

7. The method for detecting webpage similarity according to claim 5, wherein the step 5.5: obtaining the similarity of the page feature vectors of the dynamic rendering page and other web pages according to the code feature vector of the JS and the code feature vector of the CSS, comprising:

8. The method for detecting web page similarity according to claim 1, wherein the step 6: detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity, including:

9. A web page similarity detection system, comprising: