CN113609246A - Webpage similarity detection method and system - Google Patents


Info

Publication number
CN113609246A
Authority
CN
China
Prior art keywords
file
similarity
page
css
hash value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110891633.6A
Other languages
Chinese (zh)
Other versions
CN113609246B (en)
Inventor
陈业炫
奉轶
徐文博
张燕
陆亦恬
朱璋颖
唐祝寿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Benzhong Information Technology Co ltd
Original Assignee
Shanghai Benzhong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Benzhong Information Technology Co ltd
Priority to CN202110891633.6A
Priority claimed from CN202110891633.6A
Publication of CN113609246A
Application granted
Publication of CN113609246B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/75 Structural analysis for program understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Abstract

The invention provides a webpage similarity detection method and system. The method comprises: obtaining the CSS file and the JS file of a dynamically rendered page; performing lexical analysis on the CSS file and the JS file to obtain the fuzzy hash value similarity between the dynamically rendered page and other web pages; performing syntactic analysis on the CSS file and the JS file to obtain the feature vector similarity between the dynamically rendered page and other web pages; and detecting the similarity between the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. By statically analyzing the JS code and CSS styles of a web page to obtain the features of the dynamically rendered page, and calculating web page similarity from those features, the invention can greatly improve the efficiency of detecting similar web pages and overcomes the inability of the prior art to calculate the similarity of dynamically rendered pages statically.

Description

Webpage similarity detection method and system
Technical Field
The invention belongs to the technical field of webpage detection, and particularly relates to a webpage similarity detection method and system.
Background
With the rapid development of the internet, malicious web applications such as fraud and gambling sites have proliferated, and various detection technologies have been developed to find them in time. These detection technologies mainly operate on web application pages; besides analyzing the content of a page, they compare the similarity of multiple pages in order to screen out more malicious web applications. Web application pages are obtained by automatic crawling, but as web development technology has matured, the crawled pages include not only static pages but also a large number of dynamically rendered pages. A static page is an HTML file that directly stores the page data and the DOM structure. A dynamically rendered page has no real DOM structure and must be generated by dynamic rendering with JS and CSS, as in a single-page web application (SPA). For static pages, similarity is currently compared in terms of web page content similarity and web page structure similarity.
Web page content similarity covers the case where different web application pages differ in layout but reproduce the same text content. Content similarity is usually calculated with a vector space model over the web page text: the text is segmented into words, each word is assigned a weight (for example by the TF-IDF algorithm), the web page is represented as a high-dimensional vector, and the similarity between web pages is measured by a distance calculation (for example the Euclidean distance).
Web page structure similarity covers the case where the text content, pictures, colors and the like of different web application pages differ but the page layouts are very similar. Structure similarity is mainly calculated in two ways: 1) based on the web page DOM (Document Object Model) tree, the similarity of the DOM structures is calculated by tree edit distance, a simple tree matching algorithm or tree path matching; 2) based on the visual information of the web page structure, DOM visual block information is obtained from the DOM tree, each visual block in the page is partitioned according to its center position, area and aspect ratio, information at different levels is given different representation sequences, and the resulting representation sequences are used as the identity information of the page for similarity calculation.
The prior art therefore has the following problem. A large number of malicious applications on the internet, such as gambling and fraud sites, are built as dynamically rendered pages. Because the JS code and CSS styles in a crawled dynamically rendered page have not been run, the page contains no real web page data or DOM structure, so the similarity calculation techniques for static pages cannot be applied to it. No method is currently available for quickly finding similar dynamically rendered pages without executing their JS and CSS code; that is, the similarity of dynamically rendered pages cannot currently be calculated.
Disclosure of Invention
The invention aims to provide a webpage similarity detection method and system, so as to solve the problem that the prior-art similarity calculation techniques for static pages cannot be applied to dynamically rendered pages.
To achieve this purpose, the invention adopts the following technical solution:
A webpage similarity detection method comprises the following steps:
Step 1: acquiring a CSS file and a JS file in a dynamically rendered page;
Step 2: performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;
Step 3: obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file token sequence and the JS file token sequence;
Step 4: performing syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;
Step 5: obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;
Step 6: detecting the similarity between the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.
Preferably, step 1, acquiring a CSS file and a JS file in a dynamically rendered page, comprises:
Step 1.1: parsing the HTML (hypertext markup language) tags of the dynamically rendered page to obtain original files with the suffix CSS (cascading style sheets) and original files with the suffix JS;
Step 1.2: obtaining the code lengths of the original CSS files and the original JS files, and setting a length threshold;
Step 1.3: filtering, from the original CSS files and the original JS files, all original files whose code length is larger than the length threshold, to obtain the CSS file and the JS file.
Preferably, step 3, obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file token sequence and the JS file token sequence, comprises:
Step 3.1: concatenating the token types in the CSS file token sequence into a first character string;
Step 3.2: concatenating the token types in the JS file token sequence into a second character string;
Step 3.3: performing a fuzzy hash operation on the first character string and the second character string respectively to obtain a CSS file page hash value feature and a JS file page hash value feature;
Step 3.4: obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file page hash value feature and the JS file page hash value feature.
Preferably, step 3.4, obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file page hash value feature and the JS file page hash value feature, comprises:
Step 3.4.1: calculating, with a fuzzy hash algorithm, the similarity between the CSS file page hash value feature and the corresponding CSS file page hash value feature of another web page to generate a first similarity;
Step 3.4.2: calculating, with a fuzzy hash algorithm, the similarity between the JS file page hash value feature and the corresponding JS file page hash value feature of the other web page to generate a second similarity;
Step 3.4.3: taking the larger of the first similarity and the second similarity as the fuzzy hash value similarity between the dynamically rendered page and the other web page.
Preferably, step 5, obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file, comprises:
Step 5.1: dividing all nodes in the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file into different feature units;
Step 5.2: obtaining a high-dimensional feature vector from the feature units;
Step 5.3: determining the weight value of each feature unit in the abstract syntax tree according to a preset weight calculation rule;
Step 5.4: reducing the dimension of the high-dimensional feature vector according to the weight values to obtain the JS code feature vector and the CSS code feature vector of the dynamically rendered page;
Step 5.5: obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the JS code feature vector and the CSS code feature vector.
Preferably, step 5.3, determining the weight value of each feature unit in the abstract syntax tree according to a preset weight calculation rule, comprises:
determining the weight value of each feature unit in the corresponding abstract syntax tree according to the rule that the weight value decreases as the depth of the feature unit in that abstract syntax tree increases.
Preferably, step 5.5, obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the JS code feature vector and the CSS code feature vector, comprises:
Step 5.5.1: calculating a first distance between the JS code feature vector and the corresponding JS code feature vector of another web page;
Step 5.5.2: calculating a second distance between the CSS code feature vector and the corresponding CSS code feature vector of the other web page;
Step 5.5.3: taking the smaller of the first distance and the second distance as the page feature vector similarity between the dynamically rendered page and the other web page.
Preferably, step 6, detecting the similarity between the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity, comprises:
arranging the fuzzy hash value similarities in descending order and the page feature vector similarities in ascending order to detect the similarity between the dynamically rendered page and other web pages.
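The two orderings of step 6 can be illustrated with a minimal Python sketch. The record fields `hash_similarity` and `vector_distance` and the function name `rank_candidates` are hypothetical names introduced only for this illustration; how the two ranked lists are finally combined is not specified here, so the sketch simply returns both.

```python
def rank_candidates(candidates):
    """Return two rankings of candidate pages against one target page.

    Each candidate is a dict with (hypothetical) keys:
      "hash_similarity": fuzzy hash matching degree, higher means more similar
      "vector_distance": feature vector distance, lower means more similar
    """
    # Descending order: the best fuzzy hash matches come first.
    by_hash = sorted(candidates, key=lambda c: c["hash_similarity"], reverse=True)
    # Ascending order: the closest feature vectors come first.
    by_vector = sorted(candidates, key=lambda c: c["vector_distance"])
    return by_hash, by_vector


pages = [
    {"url": "a", "hash_similarity": 40, "vector_distance": 2.0},
    {"url": "b", "hash_similarity": 90, "vector_distance": 0.5},
    {"url": "c", "hash_similarity": 10, "vector_distance": 1.0},
]
by_hash, by_vector = rank_candidates(pages)
```

A page that appears near the top of both lists, such as "b" in this made-up data, would be a natural candidate for a similar page.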
The invention also provides a webpage similarity detection system, which comprises:
a CSS file and JS file acquisition module, used for acquiring a CSS file and a JS file in a dynamically rendered page;
a token sequence generation module, used for performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;
a fuzzy hash value similarity calculation module, used for obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file token sequence and the JS file token sequence;
an abstract syntax tree extraction module, used for performing syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;
a page feature vector similarity calculation module, used for obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;
a similarity detection module, used for detecting the similarity between the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.
The webpage similarity detection method and system provided by the invention have the following advantages over the prior art. The method comprises: obtaining a CSS file and a JS file in a dynamically rendered page; performing lexical analysis on the CSS file and the JS file to obtain the fuzzy hash value similarity between the dynamically rendered page and other web pages; performing syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file; obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the two abstract syntax trees; and detecting the similarity between the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. By statically analyzing the JS code and CSS styles of a web page to obtain the features of the dynamically rendered page, and calculating web page similarity from those features, the invention can greatly improve the efficiency of detecting similar web pages and overcomes the inability of the prior art to calculate the similarity of dynamically rendered pages statically.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a method for detecting web page similarity according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method for detecting web page similarity according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a principle of a web page similarity detection apparatus according to an embodiment of the present invention.
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
The invention aims to provide a webpage similarity detection method and system, so as to solve the problem that the prior-art similarity calculation techniques for static pages cannot be applied to dynamically rendered pages.
To achieve this purpose, the invention adopts the following technical solution:
A webpage similarity detection method comprises the following steps:
Step 1: acquiring a CSS file and a JS file in a dynamically rendered page. Step 1 specifically comprises:
Step 1.1: parsing the HTML (hypertext markup language) tags of the dynamically rendered page to obtain original files with the suffix CSS (cascading style sheets) and original files with the suffix JS;
Step 1.2: obtaining the code lengths of the original CSS files and the original JS files, and setting a length threshold;
Step 1.3: filtering, from the original CSS files and the original JS files, all original files whose code length is larger than the length threshold, to obtain the CSS file and the JS file.
Step 2: performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;
Step 3: obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file token sequence and the JS file token sequence. Step 3 specifically comprises:
Step 3.1: concatenating the token types in the CSS file token sequence into a first character string;
Step 3.2: concatenating the token types in the JS file token sequence into a second character string;
Step 3.3: performing a fuzzy hash operation on the first character string and the second character string respectively to obtain a CSS file page hash value feature and a JS file page hash value feature;
Step 3.4: obtaining the fuzzy hash value similarity between the dynamically rendered page and other web pages according to the CSS file page hash value feature and the JS file page hash value feature. Specifically:
Step 3.4.1: calculating, with a fuzzy hash algorithm, the similarity between the CSS file page hash value feature and the corresponding CSS file page hash value feature of another web page to generate a first similarity. It should be noted that the page hash value features of the corresponding CSS file and JS file of the other web page are also obtained by the webpage similarity detection method of the present invention.
Step 3.4.2: calculating, with a fuzzy hash algorithm, the similarity between the JS file page hash value feature and the corresponding JS file page hash value feature of the other web page to generate a second similarity;
Step 3.4.3: taking the larger of the first similarity and the second similarity as the fuzzy hash value similarity between the dynamically rendered page and the other web page.
The process of calculating the fuzzy hash value similarity between a dynamically rendered page and other web pages is described below with reference to specific embodiments.
Fig. 1 is a schematic diagram of a webpage similarity detection method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step 101: calculating the features of the page from the JS code and the CSS styles of the page respectively;
Step 102: calculating web page similarity from the features of the pages and performing detection.
Through this processing, the similarity of dynamically rendered pages can be calculated, and when a malicious web application is found, all malicious application pages similar to it can be found quickly by detecting application pages with similar features.
Fig. 2 is a flowchart of a webpage similarity detection method according to the present invention. As shown in Fig. 2, the method comprises the following processing steps:
Step 201: acquiring the CSS and JS files in the dynamically rendered page. Files with the suffixes CSS and JS are obtained by parsing the href and type attributes of the link and style tags in the HTML; for example, the corresponding CSS file is obtained by parsing the HTML tag <link href=cs/mobile-motion-vue.0915736c. Meanwhile, the length threshold is set to 10000, and files whose CSS or JS code length is larger than the length threshold are filtered.
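Step 201 can be sketched in Python with the standard-library HTML parser. This is only an illustration under stated assumptions: the class and function names are hypothetical, the sketch reads both `href` and `src` attributes (the `src` attribute is an added assumption so that script references are also found), and "filtering" is interpreted here as discarding files longer than the threshold.

```python
from html.parser import HTMLParser

LENGTH_THRESHOLD = 10000  # threshold value given in the embodiment


class AssetCollector(HTMLParser):
    """Collects references to files with a .css or .js suffix (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.css, self.js = [], []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Assumption: a reference may sit in href (link) or src (script).
        ref = attr_map.get("href") or attr_map.get("src") or ""
        path = ref.split("?", 1)[0].lower()  # ignore any query string
        if path.endswith(".css"):
            self.css.append(ref)
        elif path.endswith(".js"):
            self.js.append(ref)


def collect_assets(html):
    """Return (css_refs, js_refs) found in the HTML text."""
    collector = AssetCollector()
    collector.feed(html)
    return collector.css, collector.js


def filter_by_length(sources, threshold=LENGTH_THRESHOLD):
    """Assumption: files whose code length exceeds the threshold are discarded."""
    return {name: code for name, code in sources.items() if len(code) <= threshold}
```

Downloading the referenced files is left out of the sketch; `filter_by_length` expects a mapping from file name to source text that has already been fetched.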
Step 202 (1): performing lexical analysis on the acquired CSS and JS files. Specifically, lexical analysis is performed on the JS and the CSS separately using existing open-source toolkits. For example, performing lexical analysis on a JS file with the Esprima parser quickly yields the token sequence of the JS file. Similarly, performing lexical analysis on a CSS file with the tinycss2 toolkit quickly yields the token sequence of the CSS file.
Step 203 (1): after lexical analysis of the CSS and JS files, the analysis results are concatenated into character strings, and hash values are calculated for the character strings to obtain the page features.
In practical application, the processing is as follows. First, the token sequence output by the lexical analysis of each JS file is traversed in order to obtain the type of each token; the Esprima parser defines only a few types, such as "keyword" and "string". The types of the tokens are concatenated into a character string, and the character strings obtained from the individual JS files are concatenated into a final character string. Finally, a fuzzy hash operation is performed on the final character string, and the resulting hash value is used as one of the page hash value features. The CSS files are processed by the same flow, and the resulting hash value is used as the other page hash value feature.
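The token-type concatenation can be illustrated with a small self-contained Python sketch. A toy regular-expression tokenizer stands in for the Esprima parser here, so the token classes and their names are assumptions made for illustration and do not reproduce Esprima's actual output. The fuzzy hash operation itself is not shown.

```python
import re

# Toy token classes (assumed for illustration; a real lexer is far more complete).
TOKEN_SPEC = [
    ("String", r'"[^"]*"|' + r"'[^']*'"),
    ("Numeric", r"\d+(?:\.\d+)?"),
    ("Keyword", r"\b(?:var|let|const|function|return|if|else|for|while|new)\b"),
    ("Identifier", r"[A-Za-z_$][\w$]*"),
    ("Punctuator", r"[{}()\[\];,.=+\-*/<>!&|?:]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def token_types(code):
    """Return the type of each token in order, discarding the token values."""
    return [match.lastgroup for match in TOKEN_RE.finditer(code)]


def type_string(files):
    """Concatenate the token types of every file into one final character string."""
    return " ".join(" ".join(token_types(code)) for code in files)


# Renamed variables and changed literals produce the same type string.
same = type_string(["var a = 1;"]) == type_string(["var total = 42;"])
```

Because only token types are kept, two scripts that differ in identifier names or literal values yield identical strings, which is what lets the subsequent fuzzy hash compare code structure rather than surface text.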
Step 204 (1): calculating the web page similarity from the obtained hash values. Each fuzzy hash algorithm provides its own distance or similarity calculation between hash values. For example, TLSH calculates the distance between hash values, whereas SSdeep calculates the degree of matching between hash values. The invention uses the SSdeep fuzzy hash algorithm: after the matching degree of the JS hash values and the matching degree of the CSS hash values of different pages are calculated, the higher of the two is selected as one of the page similarities. Specifically, let the matching degree of the JS hash values of different dynamically rendered pages be D1 and the matching degree of the CSS hash values be D2; if D1 > D2, D1 is chosen as the fuzzy hash value similarity of the pages, and otherwise D2 is chosen.
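The selection of the larger matching degree can be sketched as follows. To keep the example dependency-free, `difflib.SequenceMatcher` from the Python standard library is used as a stand-in for the SSdeep comparison; it is not a fuzzy hash, and the function names and the 0 to 100 scale are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def match_degree(a, b):
    """Stand-in for a fuzzy hash comparison: returns a score from 0 to 100.

    Assumption: SequenceMatcher replaces the SSdeep matching degree so that
    the sketch runs without third-party packages.
    """
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_similarity(page, other):
    """Take the larger of the JS matching degree D1 and the CSS matching degree D2."""
    d1 = match_degree(page["js"], other["js"])
    d2 = match_degree(page["css"], other["css"])
    return max(d1, d2)


page_a = {"js": "Keyword Identifier Punctuator", "css": "ident colon ident"}
page_b = {"js": "Keyword Identifier Punctuator", "css": "#### $$$$ %%%%"}
score = fuzzy_similarity(page_a, page_b)
```

Taking the maximum means that two pages sharing the same scripts are still reported as similar even when their style sheets differ completely, and vice versa.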
Step 4: performing syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;
Step 5: obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file. Step 5 specifically comprises:
Step 5.1: dividing all nodes in the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file into different feature units;
Step 5.2: obtaining a high-dimensional feature vector from the feature units;
Step 5.3: determining the weight value of each feature unit in the abstract syntax tree according to a preset weight calculation rule. Specifically, step 5.3 comprises: determining the weight value of each feature unit in the corresponding abstract syntax tree according to the rule that the weight value decreases as the depth of the feature unit in that abstract syntax tree increases;
Step 5.4: reducing the dimension of the high-dimensional feature vector according to the weight values to obtain the JS code feature vector and the CSS code feature vector of the dynamically rendered page;
Step 5.5: obtaining the page feature vector similarity between the dynamically rendered page and other web pages according to the JS code feature vector and the CSS code feature vector. Step 5.5 specifically comprises:
Step 5.5.1: calculating a first distance between the JS code feature vector and the corresponding JS code feature vector of another web page;
Step 5.5.2: calculating a second distance between the CSS code feature vector and the corresponding CSS code feature vector of the other web page;
Step 5.5.3: taking the smaller of the first distance and the second distance as the page feature vector similarity between the dynamically rendered page and the other web page.
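Steps 5.5.1 to 5.5.3 can be sketched in Python as below. The distance metric is not fixed by the steps above, so the Euclidean distance is used here purely as an assumption; the sparse dictionary representation (dimension index to value) and the function names are likewise hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as {dimension: value}."""
    dims = set(u) | set(v)
    return math.sqrt(sum((u.get(d, 0.0) - v.get(d, 0.0)) ** 2 for d in dims))


def page_vector_distance(page, other):
    """Take the smaller of the JS distance and the CSS distance (step 5.5.3)."""
    first = euclidean(page["js"], other["js"])      # step 5.5.1
    second = euclidean(page["css"], other["css"])   # step 5.5.2
    return min(first, second)


page_a = {"js": {0: 3.0}, "css": {7: 1.0}}
page_b = {"js": {1: 4.0}, "css": {7: 1.0}}
distance = page_vector_distance(page_a, page_b)
```

A smaller value indicates more similar pages, which is why these values are later arranged in ascending order.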
The process of calculating the page feature vector similarity between a dynamically rendered page and other web pages is described below with reference to specific embodiments.
Step 202 (2): performing syntactic analysis on the acquired CSS and JS files. Specifically, syntactic analysis is performed on the JS and the CSS using existing open-source toolkits. For example, parsing a JS file with the Esprima parser quickly yields the abstract syntax tree of the JS file. Similarly, parsing a CSS file with the tinycss2 toolkit quickly yields the abstract syntax tree of the CSS file.
Step 203 (2): constructing the abstract syntax tree and extracting the page feature vectors.
In practical application, the processing is as follows. The nodes in the abstract syntax tree are divided into different feature units, and a high-dimensional feature vector is obtained from the feature units. To calculate the dimension to which a feature unit is mapped in the high-dimensional feature vector, the node type of the feature unit is first taken as a character string; a hash operation (such as md5, sha1, sha128 and the like) is then applied to the character string to obtain a positive integer value, and this value is taken as the dimension to which the feature unit is mapped. Finally, the weight value of the feature unit in the abstract syntax tree is calculated, and the real value of the feature unit on that dimension of the high-dimensional feature vector is determined from the weight value.
The following is an example of an abstract syntax tree constructed from the JS code of a web page. The root node Program has one child node, ExpressionStatement; ExpressionStatement has one child node, CallExpression; CallExpression has two child nodes, ArrayExpression and MemberExpression; and MemberExpression has one child node, Identifier.
<Program>
  <ExpressionStatement>
    <CallExpression>
      <ArrayExpression>
      <MemberExpression>
        <Identifier>
The mapping process of the invention is explained in detail using the JS code example (CSS styles are processed in the same way as JS). Different nodes in the constructed abstract syntax tree are divided into different feature units, and the node type of each feature unit is regarded as a character string; for example, the feature unit of the second-layer node is "ExpressionStatement". An md5 hash operation is then applied to the character string of the feature unit, with the result md5("ExpressionStatement") = 64556525, so the feature unit "ExpressionStatement" is mapped to the 64556525th dimension of the high-dimensional feature vector.
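The mapping from a node type to a dimension can be sketched with Python's `hashlib`. How the 128-bit md5 digest is reduced to a single positive integer is not stated in the text, so taking the digest modulo a dimension count `n_dims` is an assumption of this sketch, as are the function names; the sketch therefore does not attempt to reproduce the specific value quoted above.

```python
import hashlib


def dimension_of(node_type, n_dims=2**32):
    """Map a node-type string to a dimension index via md5.

    Assumption: the digest is read as an integer and reduced modulo n_dims.
    """
    digest = hashlib.md5(node_type.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_dims


def build_vector(weighted_units, n_dims=2**32):
    """Accumulate (node_type, weight) pairs into a sparse {dimension: value} vector."""
    vector = {}
    for node_type, weight in weighted_units:
        dim = dimension_of(node_type, n_dims)
        vector[dim] = vector.get(dim, 0.0) + weight
    return vector


vec = build_vector([("Identifier", 1.0), ("Identifier", 0.5)])
```

Because the dimension depends only on the node-type string, the same node type always lands on the same dimension in every page, which makes vectors from different pages directly comparable.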
It should be noted that, the information of the feature unit is a node type character string after parsing, and in practical applications, the relevant types include, but are not limited to, the types appearing in the above example.
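The mapping from a node type string to a dimension index can be sketched as follows. This is a minimal illustration only: the function name `feature_dimension` and the optional `modulus` parameter are assumptions introduced here, and because the description does not state how the md5 digest is reduced to the integer quoted in the example, the sketch is not expected to reproduce that particular value.

```python
import hashlib
from typing import Optional


def feature_dimension(node_type: str, modulus: Optional[int] = None) -> int:
    """Map an AST node type string to a non-negative integer dimension index."""
    # Hash the node type string with md5, one of the hash operations named in the text.
    digest = hashlib.md5(node_type.encode("utf-8")).hexdigest()
    # Interpret the full 128-bit digest as an integer.
    value = int(digest, 16)
    # Optional reduction to a bounded range (assumption: not specified in the text).
    return value % modulus if modulus is not None else value


dimension = feature_dimension("ExpressionStatement")
```

The same node type always maps to the same dimension, which is what allows vectors from different pages to be compared dimension by dimension.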
After the dimension of the high-dimensional feature vector to which the feature unit is mapped is determined, the weight value of the feature unit in the syntax tree is determined according to a weight calculation rule and is used as the value of the feature unit on the corresponding dimension of the high-dimensional vector. Specifically, the weight value represents the importance of the corresponding feature unit in the web page (that is, in the syntax tree of the web page); the feature unit is given a weight each time it appears in the syntax tree, and the final weight value of the feature unit is the sum of the weights given for all of its occurrences. The invention mainly calculates the weight value of a feature unit according to two rules. Rule one: the weight value of a feature unit decreases with the depth of the feature unit in the syntax tree. Rule two: the weight value of a feature unit decreases with repetition of the feature unit among sibling nodes (i.e., child nodes under the same parent node). In addition, when a feature unit is too deep in the abstract syntax tree, the feature unit is ignored.
It should be noted that, in an abstract syntax tree constructed from CSS styles or JS code, differences in the information represented by feature units in deeper layers have no obvious effect on the web page as a whole. Therefore, in practical application, the weight of a feature unit can be calculated in a decreasing manner, and feature units whose depth in the abstract syntax tree is greater than 10 are ignored.
In practical applications, the weight of each occurrence of a feature unit is determined by several factors mentioned above.
For example, in the above example, the feature unit "ExpressionStatement" is located at the second layer, so its weight value should be greater than that of feature units in deeper layers. The initial weight value of the feature unit may be preset to 1.0; since the feature unit is at layer 2 of the abstract syntax tree, the final weight value is obtained by multiplying the initial value by the attenuation factor (set to 0.5) raised to the power of 2, that is, 1.0 × 0.5^2 = 0.25.
It should be noted that the weight value of the feature vector of the web page is a floating point number type.
In the above example, it is determined that the feature unit "ExpressionStatement" is mapped to dimension 64556525 of the high-dimensional feature vector, and the weight value of the feature unit is affected by its depth in the abstract syntax tree. The weight value finally obtained is then used as the real value on the corresponding dimension of the high-dimensional feature vector; that is, the real value on the 64556525th dimension of the high-dimensional feature vector is determined. In practical applications, each feature unit needs to be processed as described above, and the weight value of each feature unit in the abstract syntax tree is used as the real value of the corresponding dimension of the feature unit on the high-dimensional feature vector.
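The weight rules described above can be sketched as follows, using the example syntax tree. The tree representation as `(node_type, children)` tuples and the function name `accumulate_weights` are assumptions made for illustration. The depth attenuation factor 0.5, the initial weight 1.0 and the depth limit of 10 come from the description; the sibling repetition factor `SIBLING_DECAY` is an assumed value, since the description gives no figure for it.

```python
from collections import defaultdict

DECAY = 0.5          # depth attenuation factor from the example in the text
MAX_DEPTH = 10       # feature units deeper than this are ignored, per the text
SIBLING_DECAY = 0.5  # assumption: the text gives no value for sibling repetition


def accumulate_weights(node, depth=1, weights=None, repeat=0):
    """Sum per-node-type weights over an AST given as (node_type, children)."""
    if weights is None:
        weights = defaultdict(float)
    if depth > MAX_DEPTH:
        return weights  # too deep: this node and its subtree are ignored
    node_type, children = node
    # Each occurrence contributes 1.0, attenuated by depth and by how many
    # earlier siblings of the same type precede it under the same parent.
    weights[node_type] += 1.0 * (DECAY ** depth) * (SIBLING_DECAY ** repeat)
    seen = defaultdict(int)
    for child in children:
        accumulate_weights(child, depth + 1, weights, seen[child[0]])
        seen[child[0]] += 1
    return weights


tree = ("Program", [("ExpressionStatement", [("CallExpression", [
    ("ArrayExpression", []),
    ("MemberExpression", [("Identifier", [])]),
])])])
weights = accumulate_weights(tree)
```

With the root counted as layer 1, "ExpressionStatement" sits at layer 2 and receives 1.0 × 0.5^2 = 0.25, in line with the example in the description.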
After the high-dimensional feature vector of the page is obtained, it needs to be compressed into a low-dimensional vector. In practical application, the integer obtained by performing the hash operation on the information of a feature unit is large, so the resulting vector has a very large number of dimensions, whereas a feature vector with fewer dimensions is needed to ensure calculation efficiency. Therefore, after the high-dimensional feature vector of the web page abstract syntax tree is determined, it needs to be compressed to a low dimension. In the embodiment of the invention, compression is performed by taking each dimension index modulo the target dimension and simply adding up the corresponding weights, which ensures the accuracy of calculating web page similarity with the compressed web page feature vectors.
The following describes the process of compressing the high-dimensional feature vector to the low-dimensional feature vector in detail:
Assuming that the dimension of the high-dimensional feature vector is M, and that the M-dimensional feature vector needs to be compressed into an N-dimensional low-dimensional feature vector, where N is a natural number greater than or equal to 1 and less than M, the following processing is performed:
1. dividing the index of each dimension of the M-dimensional vector by N (1 ≤ N < M) to obtain the corresponding remainder;
2. taking the dimensions of the high-dimensional feature vector that have the same remainder as one dimension of the compressed N-dimensional feature vector;
3. adding up the weight values on the dimensions of the high-dimensional feature vector that have the same remainder, the resulting sum being used as the weight value of the corresponding dimension in the N-dimensional feature vector.
For example, suppose the high-dimensional feature vector has 50000 dimensions and a 128-dimensional feature vector is desired, so the high-dimensional feature vector needs to be compressed. Assuming that the high-dimensional feature vector is [b1, b2, b3, …, b50000], the weight value of the first dimension of the compressed feature vector is b1 + b129 + b257 + … + b49921, and so on for the other dimensions, thereby realizing the compression from the high-dimensional vector to the low-dimensional vector.
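The three compression steps can be sketched as follows. The high-dimensional vector is assumed here to be stored sparsely as a mapping from dimension index to weight, which is an illustrative choice rather than something stated in the description; the function name `compress` is likewise hypothetical.

```python
from typing import Dict, List


def compress(high_dim: Dict[int, float], n: int) -> List[float]:
    """Fold a sparse high-dimensional vector into n dimensions by index modulo n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    low = [0.0] * n
    for index, weight in high_dim.items():
        # Dimensions whose indices share the same remainder are added together.
        low[index % n] += weight
    return low


# Indices 1, 129 and 257 all have remainder 1 when divided by 128.
low_dim = compress({1: 1.0, 129: 2.0, 257: 3.0, 2: 0.5}, 128)
```

Storing only the non-zero dimensions avoids allocating a vector as long as the largest hash value before compression.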
Thus, the code feature vector of the JS and the style feature vector of the CSS in the dynamically rendered page can be obtained.
Step 204(2): after the web page feature vectors are obtained, the similarity of the web pages is calculated according to a preset distance algorithm. The distance calculation algorithm may be the Euclidean distance, Jaccard distance, Hamming distance, cosine distance, and the like. Specifically, the Euclidean distance is used to calculate the distance D3 between the JS code feature vectors and the distance D4 between the CSS style feature vectors of different pages. A smaller distance indicates that the vectors are more similar, so the smaller of D3 and D4 is taken as the page feature vector similarity.
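The distance calculation of this step can be sketched as follows, using the Euclidean distance named in the description. The function names and the way the four vectors are passed in are assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def page_vector_similarity(js_a, js_b, css_a, css_b) -> float:
    """Return the smaller of D3 (JS distance) and D4 (CSS distance)."""
    d3 = euclidean(js_a, js_b)
    d4 = euclidean(css_a, css_b)
    return min(d3, d4)
```

Because the result is a distance, a smaller return value means the two pages are more similar.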
Step 6: and detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. Wherein, step 6 specifically includes:
sorting the fuzzy hash value similarities in descending order and the page feature vector similarities in ascending order to detect the similarity between the dynamically rendered page and other web pages, wherein pages with more similar features are ranked first.
It should be noted that the web pages are finally sorted in ascending order of the vector similarity (distance) between web pages and in descending order of the fuzzy hash similarity, so that web pages with similar features can be found quickly. The method is mainly applied where, given a web page, web pages with features similar to it need to be found, and it can greatly improve the efficiency of searching for similar web pages.
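The two orderings of step 6 can be sketched as follows. The candidate tuples, page names and score values are hypothetical, and the sketch simply returns the two rankings side by side, since the description does not state how they are merged into a single list.

```python
def rank_candidates(candidates):
    """Rank candidate pages given (page, hash_similarity, vector_distance) tuples."""
    # Fuzzy hash similarity: larger means more similar, so sort in descending order.
    by_hash = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Feature vector distance: smaller means more similar, so sort in ascending order.
    by_vector = sorted(candidates, key=lambda c: c[2])
    return by_hash, by_vector


# Hypothetical candidates compared against one given page.
candidates = [("page-a", 80, 0.9), ("page-b", 95, 0.2), ("page-c", 40, 0.5)]
by_hash, by_vector = rank_candidates(candidates)
```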
The invention also provides a webpage similarity detection system, which comprises:
the CSS file and JS file acquisition module is used for acquiring a CSS file and a JS file in the dynamic rendering page;
the token sequence generating module is used for performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;
the fuzzy hash value similarity calculation module is used for obtaining the fuzzy hash value similarity of the dynamic rendering page and other webpages according to the CSS file token sequence and the JS file token sequence;
the abstract syntax tree extraction module is used for carrying out syntax analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;
the page feature vector similarity calculation module is used for obtaining the page feature vector similarity of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;
and the similarity detection module is used for detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.
Fig. 3 is a schematic diagram illustrating a principle of a web page similarity detection apparatus according to an embodiment of the present invention, and as shown in fig. 3, the web page similarity detection apparatus according to the present invention includes: a feature extraction module 30 and a similarity calculation module 31. The enhanced web page similarity calculation apparatus according to the embodiment of the present invention is described below.
Specifically, the feature extraction module 30 is configured to extract the code features of the page according to the JS and CSS code of the web page, and specifically includes: a code extraction module 300; a lexical analysis module 301(1) and a fuzzy hash module 302(1); and a syntax analysis module 301(2) and a feature vector calculation module (comprising a dimension calculation module 302(2), a weight calculation module 303(2), and a dimension reduction module 304(2)).
The code extraction module 300 is configured to identify the links to JS and CSS files in a page and download those files, and to filter out files whose code length is greater than a threshold, the threshold being set to 10000.
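The behaviour of the code extraction module can be sketched as follows. The class `AssetLinkParser` and the function `keep_short_files` are hypothetical names; downloading is omitted; and the sketch assumes that "code length" is measured in characters and that files above the threshold are discarded, which is one reading of the description.

```python
from html.parser import HTMLParser

LENGTH_THRESHOLD = 10000  # threshold value from the text; unit assumed to be characters


class AssetLinkParser(HTMLParser):
    """Collect links to .js and .css files referenced by a page."""

    def __init__(self):
        super().__init__()
        self.js_links = []
        self.css_links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        href = attributes.get("href") or ""
        if tag == "script" and src.endswith(".js"):
            self.js_links.append(src)
        elif tag == "link" and href.endswith(".css"):
            self.css_links.append(href)


def keep_short_files(files, threshold=LENGTH_THRESHOLD):
    """Drop files whose code length exceeds the threshold."""
    return {name: code for name, code in files.items() if len(code) <= threshold}


parser = AssetLinkParser()
parser.feed('<head><link rel="stylesheet" href="a.css">'
            '<script src="b.js"></script><script>var x;</script></head>')
```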
The lexical analysis module 301(1) is configured to, after the code extraction module 300 acquires the JS and CSS files, perform lexical analysis on the acquired JS and CSS files, thereby acquiring a token sequence.
The fuzzy hash module 302(1) is configured to, after the lexical analysis module 301(1) performs lexical analysis on the JS and CSS files, first splice the token sequence parsed from each JS file into a character string; then sequentially concatenate the character strings obtained from all retained JS files in the page into one overall character string; and finally perform a fuzzy hash operation on the overall character string to obtain a final hash value, which is used as the JS file hash value feature of the page. CSS files are processed in the same way as JS files, and the finally obtained CSS file hash value is used as the CSS file hash value feature of the page.
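The splicing of token types into one overall character string can be sketched as follows. The regular-expression tokenizer below is a deliberately simplified stand-in for a real JS lexer, and its token type names, keyword list and function names are assumptions made for illustration. The fuzzy hash operation itself is not shown; the overall string would be passed to whichever fuzzy hash implementation is used.

```python
import re

# Simplified token classes; a real JS lexer distinguishes many more cases.
TOKEN_SPEC = [
    ("String", r"\"[^\"\\]*(?:\\.[^\"\\]*)*\"|'[^'\\]*(?:\\.[^'\\]*)*'"),
    ("Numeric", r"\d+(?:\.\d+)?"),
    ("Identifier", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("Punctuator", r"[{}()\[\];,.<>+\-*/%=!&|^~?:]"),
]
KEYWORDS = {"var", "let", "const", "function", "return", "if", "else", "for", "while"}
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % (name, pattern)
                               for name, pattern in TOKEN_SPEC))


def token_types(code):
    """Return the sequence of token type names for a piece of JS code."""
    types = []
    for match in TOKEN_RE.finditer(code):
        kind = match.lastgroup
        if kind == "Identifier" and match.group() in KEYWORDS:
            kind = "Keyword"
        types.append(kind)
    return types


def page_token_string(files):
    """Splice the token types of each file, then concatenate the files in order."""
    return "".join("".join(token_types(code)) for code in files)


overall = page_token_string(["var x = 1;", "f()"])
```

Hashing token types rather than raw source text means that renaming variables or changing literal values leaves the overall string, and hence the hash value feature, unchanged.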
The syntax analysis module 301(2) is configured to, after the code extraction module 300 acquires the JS and CSS files, perform syntax analysis on the acquired JS and CSS files, thereby constructing an abstract syntax tree.
After the syntax analysis module 301(2) parses the JS and CSS files, the feature vector calculation module is configured to calculate feature vectors according to the abstract syntax tree.
Specifically, the feature vector calculation module includes a dimension calculation module 302(2), a weight calculation module 303(2), and a dimension reduction module 304(2), wherein the dimension calculation module 302(2) is configured to calculate the dimension of the high-dimensional vector to which each feature unit is mapped; the weight calculation module 303(2) is configured to calculate the weight values of the feature units in the abstract syntax tree; and the dimension reduction module 304(2) is configured to compress the high-dimensional vector to obtain a low-dimensional feature vector, thereby obtaining the final code feature vector and the final style feature vector, respectively.
The similarity calculation module 31 is configured to calculate the similarity between web pages after the feature extraction module 30 extracts the page features of the web page. Specifically, the similarity calculation module 31 includes: a fuzzy hash value similarity calculation module 310(1), a vector similarity calculation module 310(2), and a sorting module 311.
The fuzzy hash value similarity calculation module 310(1) calculates the JS hash value similarity D1 and the CSS hash value similarity D2 between different pages according to the finally obtained page hash features, and takes the larger of the two as the hash similarity of the web pages; for example, D1 is selected if D1 > D2.
The vector similarity calculation module 310(2) calculates the distance D3 between the JS code feature vectors and the distance D4 between the CSS style feature vectors of different pages according to the finally obtained web page feature vectors, and takes the smaller of the two as the vector similarity of the web pages; for example, D3 is selected if D3 < D4.
The sorting module 311 sorts the similarities calculated by the fuzzy hash value similarity calculation module 310(1) in descending order and the similarities calculated by the vector similarity calculation module 310(2) in ascending order, so that pages with more similar features are ranked first, in order to search for web pages with similar features.
It should be noted that the meanings represented by the web page similarities obtained by the fuzzy hash value similarity calculation module 310(1) and the vector similarity calculation module 310(2) are different, and the larger the web page similarity obtained by the fuzzy hash value similarity calculation module 310(1), the higher the web page similarity represents; the smaller the similarity of the web pages obtained in the vector similarity calculation module 310(2), the higher the similarity of the web pages.
In summary, with the aid of the technical solution of the present invention, by calculating the similarity of the features of the dynamically rendered web page, the defect that the similarity of the dynamically rendered web page cannot be statically calculated in the prior art is overcome, and the fast calculation of the similar dynamically rendered web page is realized. When a certain malicious web application is found, all the malicious web applications with similar page features can be found by searching application pages with similar code feature vectors.
The webpage similarity detection method and the webpage similarity detection system have the advantages that: compared with the prior art, the webpage similarity detection method comprises the steps of obtaining a CSS file and a JS file in a dynamic rendering page; performing lexical analysis on the CSS file and the JS file to obtain the similarity of the fuzzy hash value of the dynamic rendering page and other web pages; carrying out syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file; obtaining the similarity of the feature vectors of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file; and detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity. According to the method and the device, the webpage JS and CSS styles are statically analyzed to obtain the characteristics of the dynamically rendered webpage and the similarity of the webpage is calculated based on the characteristics, so that the detection efficiency of the similar webpage can be greatly improved, and the defect that the similarity of the dynamically rendered webpage cannot be statically calculated in the prior art is overcome.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (9)

1. A webpage similarity detection method is characterized by comprising the following steps:
step 1: acquiring a CSS file and a JS file in a dynamic rendering page;
step 2: performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;
step 3: obtaining the fuzzy hash value similarity of the dynamic rendering page and other web pages according to the CSS file token sequence and the JS file token sequence;
step 4: carrying out syntactic analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;
step 5: obtaining the similarity of the feature vectors of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;
step 6: and detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.
2. The method for detecting web page similarity according to claim 1, wherein the step 1: obtaining a CSS file and a JS file in a dynamically rendered page, comprising:
step 1.1: analyzing an HTML (hypertext markup language) label of the dynamic rendering page to obtain an original file with a suffix name of CSS (cascading style sheets) and an original file with a suffix name of JS;
step 1.2: acquiring the code lengths of the original file of the CSS and the original file of the JS and setting a length threshold value;
step 1.3: and filtering all the corresponding original files with the code lengths larger than the length threshold value in the CSS original file and the JS original file to obtain the CSS file and the JS file.
3. The method for detecting web page similarity according to claim 1, wherein the step 3: obtaining the fuzzy hash value similarity of the dynamic rendering page and other web pages according to the CSS file token sequence and the JS file token sequence, including:
step 3.1: splicing the token types in the CSS file token sequence into a first character string;
step 3.2: splicing the token types in the JS file token sequence into a second character string;
step 3.3: performing fuzzy hash operation on the first character string and the second character string respectively to obtain a CSS file page hash value characteristic and a JS file page hash value characteristic;
step 3.4: and obtaining the fuzzy hash value similarity of the dynamically rendered page and other web pages according to the CSS file page hash value characteristics and the JS file page hash value characteristics.
4. The method for detecting webpage similarity according to claim 3, wherein the step 3.4: obtaining the fuzzy hash value similarity of the dynamically rendered page and other webpages according to the CSS file page hash value characteristic and the JS file page hash value characteristic, comprising:
step 3.4.1: calculating the similarity between the CSS file page hash value characteristics and corresponding CSS file page hash value characteristics of other webpages by using a fuzzy hash algorithm to generate a first similarity;
step 3.4.2: calculating the similarity of the JS file page hash value characteristic and the corresponding JS file page hash value characteristics of other webpages by using a fuzzy hash algorithm to generate a second similarity;
step 3.4.3: and taking the maximum value between the first similarity and the second similarity as the fuzzy hash value similarity of the dynamic rendering page and other web pages.
5. The method for detecting web page similarity according to claim 1, wherein the step 5: obtaining the similarity of the feature vectors of the dynamically rendered page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file, including:
step 5.1: dividing all nodes in the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file into different feature units;
step 5.2: obtaining a high-dimensional feature vector according to the feature unit;
step 5.3: determining the weight value of the feature unit in the abstract syntax tree according to a preset weight calculation rule;
step 5.4: reducing the dimension of the high-dimensional feature vector according to the weight value to obtain a code feature vector of JS and a code feature vector of CSS in the dynamic rendering page;
step 5.5: and obtaining the similarity of the page feature vectors of the dynamic rendering page and other web pages according to the code feature vector of the JS and the code feature vector of the CSS.
6. The method for detecting webpage similarity according to claim 5, wherein the step 5.3: determining the weight value of the feature unit in the abstract syntax tree according to a preset weight calculation rule, wherein the weight value comprises the following steps:
and determining the weight value of each feature unit in the corresponding abstract syntax tree according to the rule that the weight value decreases with the depth of the feature unit in the corresponding abstract syntax tree.
7. The method for detecting webpage similarity according to claim 5, wherein the step 5.5: obtaining the similarity of the page feature vectors of the dynamic rendering page and other web pages according to the code feature vector of the JS and the code feature vector of the CSS, and the similarity comprises the following steps:
step 5.5.1: calculating a first distance between the code feature vector of the JS and the code feature vector of the corresponding JS of the other web page;
step 5.5.2: calculating a second distance between the code feature vector of the CSS and the code feature vector of the corresponding CSS of the other web page;
step 5.5.3: and taking the minimum value between the first distance and the second distance as the page feature vector similarity of the dynamic rendering page and other web pages.
8. The method for detecting web page similarity according to claim 1, wherein the step 6: detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity, including:
and performing descending arrangement on the fuzzy hash value similarity, and performing ascending arrangement on the page feature vector similarity to detect the similarity between the dynamically rendered page and other webpages.
9. A web page similarity detection system, comprising:
the CSS file and JS file acquisition module is used for acquiring a CSS file and a JS file in the dynamic rendering page;
the token sequence generating module is used for performing lexical analysis on the CSS file and the JS file to generate a CSS file token sequence and a JS file token sequence;
the fuzzy hash value similarity calculation module is used for obtaining the fuzzy hash value similarity of the dynamic rendering page and other webpages according to the CSS file token sequence and the JS file token sequence;
the abstract syntax tree extraction module is used for carrying out syntax analysis on the CSS file and the JS file to construct an abstract syntax tree of the CSS file and an abstract syntax tree of the JS file;
the page feature vector similarity calculation module is used for obtaining the page feature vector similarity of the dynamic rendering page and other web pages according to the abstract syntax tree of the CSS file and the abstract syntax tree of the JS file;
and the similarity detection module is used for detecting the similarity of the dynamically rendered page and other web pages according to the fuzzy hash value similarity and the page feature vector similarity.
CN202110891633.6A 2021-08-04 Webpage similarity detection method and system Active CN113609246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110891633.6A CN113609246B (en) 2021-08-04 Webpage similarity detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110891633.6A CN113609246B (en) 2021-08-04 Webpage similarity detection method and system

Publications (2)

Publication Number Publication Date
CN113609246A true CN113609246A (en) 2021-11-05
CN113609246B CN113609246B (en) 2024-04-12

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115687736A (en) * 2022-12-30 2023-02-03 北京长亭未来科技有限公司 Web application searching method and device and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102811213A (en) * 2011-11-23 2012-12-05 北京安天电子设备有限公司 Fuzzy hashing algorithm-based malicious code detection system and method
CN103761483A (en) * 2014-01-27 2014-04-30 百度在线网络技术(北京)有限公司 Method and device for detecting malicious codes
CN107204960A (en) * 2016-03-16 2017-09-26 阿里巴巴集团控股有限公司 Web page identification method and device, server
CN109445834A (en) * 2018-10-30 2019-03-08 北京计算机技术及应用研究所 The quick comparative approach of program code similitude based on abstract syntax tree
CN110502897A (en) * 2018-05-16 2019-11-26 南京大学 A kind of identification of webpage malicious JavaScript code and antialiasing method based on hybrid analysis
CN110515838A (en) * 2019-07-31 2019-11-29 华东计算技术研究所(中国电子科技集团公司第三十二研究所) Method and system for detecting software defects based on topic model
CN110795731A (en) * 2019-10-09 2020-02-14 新华三信息安全技术有限公司 Page detection method and device
CN111290784A (en) * 2020-01-21 2020-06-16 北京航空航天大学 Program source code similarity detection method suitable for large-scale samples
CN112507337A (en) * 2020-12-18 2021-03-16 四川长虹电器股份有限公司 Implementation method of malicious JavaScript code detection model based on semantic analysis
CN112596708A (en) * 2020-12-16 2021-04-02 平安普惠企业管理有限公司 Webpage generating method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 201100 floor 3, building 3, No. 2555, Hechuan Road, Minhang District, Shanghai

Applicant after: Qi'an Pangu (Shanghai) Information Technology Co.,Ltd.

Address before: 201103 3rd floor, building 3, 2555 Hechuan Road, Minhang District, Shanghai

Applicant before: SHANGHAI BENZHONG INFORMATION TECHNOLOGY Co.,Ltd.

Country or region before: China

GR01 Patent grant