CN113806667A

CN113806667A - Method and system for supporting webpage classification

Info

Publication number: CN113806667A
Application number: CN202111129758.1A
Authority: CN
Inventors: 陈超凡; 王轶骏
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2021-12-17
Anticipated expiration: 2041-09-26
Also published as: CN113806667B

Abstract

The invention relates to a method and a system for supporting webpage classification. HTML (HyperText Markup Language) files and JS (JavaScript) files of the webpages in a data set are obtained; a feature vector is calculated from the DOM tree of each HTML file; a feature vector is calculated from the control flow graph (CFG) of each JS file; the feature vectors of the HTML file and the JS file are combined to obtain a webpage feature vector; and the webpage feature vectors are used as input to train a neural network. The feature vector of a webpage to be detected is obtained by the same method and input into the neural network to obtain the output classification. Compared with the prior art, the method improves the identification accuracy for dynamically loaded webpages, supports large-scale webpage classification detection, and overcomes drawbacks of content-based webpage classification such as language differences.

Description

Method and system for supporting webpage classification

Technical Field

The invention relates to the technical field of internet communication, in particular to a method and a system for supporting webpage classification.

Background

Web pages have been central to the development of the internet. They first appeared with the World Wide Web, which began as a means of sharing information. In the course of that sharing, various forms of information theft, typified by copying web page source code, emerged, so that similar patterns are embedded in similar websites. Today, with the second wave of the mobile internet and the ease of generating web pages at scale, the number of web pages has grown explosively.

Existing approaches to classifying these web pages mainly process the page content. Some identify a web page by its text content, some by displayed content such as images and pictures, and some by a screenshot of the page. As web development technology continues to evolve, static web pages are becoming less common, and more websites load their pages dynamically.

A static web page is one whose page data and DOM tree structure are stored directly in the HTML file. Dynamic web page loading is a programming technique that enhances static pages: the DOM tree and the rendered page are adjusted dynamically by the JS code in the page, so the real data cannot be obtained by directly crawling the page source code with a crawler.

Different types of web pages often have different structures, so the structure of a web page itself carries category information, and structure-based classification is unaffected by factors such as page localization and content filling. However, dynamically loaded pages currently cannot be identified well by simple computation over the HTML DOM tree, and identification accuracy is low.

Disclosure of Invention

The object of the present invention is to provide a method and a system for supporting webpage classification that overcome the above shortcomings of the prior art.

The purpose of the invention can be realized by the following technical scheme:

A first aspect of the present invention provides a method for supporting webpage classification, which comprises the following steps:

A resource obtaining step:

acquiring an HTML file and a JS file of a dynamically loaded webpage;

An HTML file feature vector calculation step:

calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;

A JS file feature vector calculation step:

acquiring a control flow graph, containing attribute parameters, of the JS file, converting the basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;

A webpage feature vector calculation step:

combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the webpage;

A neural network training step:

converting labeled webpages into feature vectors according to the above steps, and using the feature vectors as input to train the neural network;

A neural network identification step:

converting the webpage to be detected into a feature vector according to the above steps, inputting the feature vector into the trained neural network, and obtaining a classification result.

Further, the resource obtaining step specifically includes:

Crawling and parsing: crawling the HTML file of the dynamically loaded webpage and parsing its HTML tags to obtain the files with the JS suffix, the HTML files in frame tags, and the file paths of those files;

Classified downloading: treating the HTML files obtained from frame tags as embedded HTML files, dividing the JS files into source JS files and embedded JS files according to whether the file path of the embedded HTML file belongs to the source domain name, and downloading the JS files.

Further, the HTML file feature vector calculation step specifically includes:

Tag identification: identifying the HTML tags in the DOM tree and converting them into tag units;

Tag unit merging: merging tag units with the same attributes to obtain a merged set of tag units;

Vector calculation: calculating the weight value of each tag from the merged set of tag units, and constructing the feature vector of the HTML file from the tag weight values.

Further, the JS file feature vector calculation step specifically includes:

File parsing: parsing each function of the JS file to obtain its control flow graph;

Function vector calculation: converting the basic blocks of the control flow graph into fixed-length vectors according to preset attribute parameter features, and calculating the vector of each function;

Vector calculation: constructing the JS file feature vector from the function vectors.

Another aspect of the present invention provides a system for supporting webpage classification, including:

A resource acquisition module: acquiring an HTML file and a JS file of a dynamically loaded webpage;

An HTML file feature vector calculation module: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;

A JS file feature vector calculation module: acquiring a control flow graph, containing attribute parameters, of the JS file, converting the basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;

A webpage feature vector calculation module: combining the feature vector corresponding to the HTML file and the feature vector of the JS file to obtain the feature vector of the dynamically loaded webpage;

A neural network training module: converting labeled webpages into vectors in the above manner, and using the vectors as input to train the neural network;

A neural network identification module: converting the webpage to be detected into a vector by means of the feature vector calculation modules, inputting the vector into the trained neural network, and obtaining a classification result.

Compared with the prior art, the method and the system for supporting webpage classification provided by the invention at least have the following beneficial effects:

1) Compared with text-based techniques, the invention is more adaptable because it uses statistical information about HTML tags as part of the webpage features. For example, a text-based classifier built on a Chinese bag-of-words model cannot process English web pages, whereas an approach based on tag statistics is insensitive to the language of the text. It therefore removes the influence of inconsistent text languages and overcomes drawbacks of text-based webpage classification such as language differences.

2) The invention uses JS file information as part of the webpage features, which makes it more adaptable than classification techniques based on content such as text and images. For example, the text and images of a dynamically loaded webpage cannot be obtained directly from the page source code, and the dynamically loaded code usually resides in JS files. Extracting features from the JS files therefore overcomes the prior art's inability to classify dynamically loaded webpages well.

3) The invention distinguishes source files from embedded files and takes link information into account, which can improve classification accuracy. For example, among video websites, certain types of pages use embedded frames to embed HTML files and dedicated JS files in order to evade conventional classification techniques and supervision, whereas a regular video website generally stores its video resources on a specific server, does not embed new HTML files, and uses common JS files. Distinguishing source files from embedded files and introducing link information amplifies the difference between the two types of websites, which improves classification accuracy and allows finer-grained website classification.

4) The invention statically analyzes the webpage structure and the JS files, so dynamically loaded webpages can be classified with a neural network without executing the JS code, which supports large-scale webpage classification detection.

Drawings

FIG. 1 is a schematic diagram of the method for supporting webpage classification according to an embodiment;

FIG. 2 is a schematic flow chart of an implementation of the method for supporting webpage classification in the embodiment.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art without inventive effort on the basis of these embodiments fall within the scope of protection of the present invention.

Examples

The invention relates to a method for supporting webpage classification, which comprises the following steps:

A resource obtaining step: acquiring the HTML file and the JS file of a dynamically loaded webpage.

An HTML file feature vector calculation step: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file.

A JS file feature vector calculation step: parsing the JS file to obtain a control flow graph containing attribute parameters, converting the basic blocks of the control flow graph into vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks.

A webpage feature vector calculation step: combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the webpage.

A neural network training step: converting labeled webpages into vectors in the above manner, and using the vectors as input to train the neural network.

A neural network identification step: converting the webpage to be detected into a vector in the above manner, and inputting the vector into the trained neural network to obtain a classification result.

Specifically, the resource obtaining step is as follows:

Crawling and parsing: crawling the HTML file of the dynamically loaded webpage and parsing its HTML tags to obtain the files with the JS suffix, the HTML files in frame tags, and the file paths of those files;

Classified downloading: treating the acquired HTML files as embedded HTML files, dividing the JS files into source JS files and embedded JS files according to whether the file path belongs to the source domain name, and downloading them.

Specifically, the HTML file feature vector calculation step includes:

Tag identification: identifying the HTML tags in the DOM tree and converting them into tag attribute units according to a rule, namely according to the type and position of each node in the DOM tree;

Tag unit merging: merging tag units with the same attributes;

Vector calculation: calculating the weight value of each tag from the merged set of tag units according to a preset rule, and constructing the HTML file feature vector. Specifically, the initial weight value of each tag type is set according to its importance in the webpage, the weight value of each tag is calculated on the principle that the weight decreases with the depth difference, and the HTML file feature vector is constructed.

Specifically, the JS file feature vector calculation step includes:

File parsing: parsing each function of the JS file to obtain its control flow graph. Specifically, each function of the source JS files and the embedded JS files is parsed with an open source toolkit to obtain a control flow graph, and source JS files and embedded JS files are processed in the same way. JS files are used for two reasons. First, a JS file contains information about the website, so it can assist webpage classification. Second, the source JS files and embedded JS files of different websites have different characteristics, and separating them expresses the characteristics carried by a website's JS files more clearly, which better supports webpage classification;

Vector calculation: constructing the JS file feature vector from the function vectors according to a preset rule; for example, the JS file feature vector is constructed by a calculation such as addition or averaging.

Specifically, in the webpage feature vector calculation step, the combination may be performed in any one of several ways; for example, the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector, and the embedded JS file feature vector may be concatenated in order.

Specifically, in the neural network training step, the webpages are annotated with labels, and each webpage may carry one label or several; the network structure of the neural network is not limited to one form, and, for example, a simple stack of fully connected layers with the ReLU activation function may be employed.

The invention also provides a system for supporting webpage classification, which comprises:

A resource acquisition module: used for acquiring the HTML file and the JS file of a dynamically loaded webpage;

A JS file feature vector calculation module: parsing the JS file to obtain a control flow graph containing attribute parameters, converting the basic blocks of the control flow graph into vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;

A webpage feature vector calculation module: combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the webpage;

A neural network identification module: converting the webpage to be detected into a vector in the above manner, and inputting the vector into the trained neural network to obtain a classification result.

The system for supporting webpage classification provided by the invention can be implemented through the step flow of the method for supporting webpage classification. Those skilled in the art can understand the method for supporting webpage classification as a preferred example of the system for supporting webpage classification.

Preferred examples are further described below.

As shown in FIG. 1, the method for supporting webpage classification according to the embodiment of the present invention includes:

Step 101: calculating partial page features from the HTML file structure and from the JS code of the page, respectively;

Step 102: performing webpage classification training according to the page features, and performing detection.

Through this processing, the type of a dynamically loaded webpage can be identified: when an unknown web page is input, its category can be found quickly by identification. As shown in FIG. 2, the method specifically includes the following processing steps:

Step 201: acquiring the HTML files and the JS files. The dynamically loaded page is crawled, and files with the HTML and JS suffixes are obtained by parsing the src attributes of the iframe and script tags in the HTML. For example, the corresponding JS file is obtained by parsing an HTML tag such as <script src="view/JS/jquery-3.1.0.JS1.5"></script>. At the same time, the source HTML file, embedded HTML files, source JS files, and embedded JS files are distinguished according to the tag and the file path.
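The parsing and classification in step 201 can be sketched as follows. This is a minimal illustration built on the Python standard library, not the patent's implementation: the names `ResourceCollector` and `classify_resources` are hypothetical, the crawl itself is omitted, and treating "same host name as the page" as the test for a source JS file is an assumption about how the source domain name comparison might be done.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects the src attributes of script and iframe/frame tags."""

    def __init__(self):
        super().__init__()
        self.scripts = []  # references to JS files
        self.frames = []   # references to embedded HTML files

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if not src:
            return
        if tag == "script":
            self.scripts.append(src)
        elif tag in ("iframe", "frame"):
            self.frames.append(src)


def classify_resources(page_url, html_text):
    """Resolve each reference and split JS files into source and embedded.

    Assumption: a JS file counts as a source file when its host name
    equals the host name of the crawled page.
    """
    collector = ResourceCollector()
    collector.feed(html_text)
    source_host = urlparse(page_url).netloc
    result = {"embedded_html": [], "source_js": [], "embedded_js": []}
    for src in collector.frames:
        result["embedded_html"].append(urljoin(page_url, src))
    for src in collector.scripts:
        absolute = urljoin(page_url, src)
        same_site = urlparse(absolute).netloc == source_host
        result["source_js" if same_site else "embedded_js"].append(absolute)
    return result
```

Calling `classify_resources(page_url, html_text)` on a crawled page returns three lists of absolute URLs that a downloader could then fetch. Relative paths are resolved against the page URL, so they always fall into the source group.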

Step 202: calculating a feature vector according to the DOM tree of the webpage:

The node tag types and their node positions in the DOM tree of the HTML file are divided into tag units.

The following is an example of a web page DOM tree, which has two child nodes, head and body, under the root node html, each of which in turn contains multiple tags.

<html>
<head>
<title>Web Page title</title>
</head>
<body>
…
</div>
</body>
</html>

For example, taking the html root as level 0, any script tag in the second level can be expressed as the tag unit <script,2>, and likewise the img tag at its first appearance can be expressed as <img,3>.

After the DOM tree has been divided into tag units, the tag units are converted into a feature vector as follows:

First, identical tag units are counted, where identical tag units are those with the same node tag type located at the same level. For example, with html at level 0, the feature units <script src="static/1.js" defer> and <script src="static/2.js" defer> under the node body are both of the script tag type and both belong to the second level, so they are labeled as the same tag unit. In this embodiment, tag units are merged in the form <type, level, count>, which gives the following representation of the tag feature units:

<html,0,1>

<head,1,1>

<body,1,1>

<title,2,1>

<link,2,2>

<style,2,4>

<script,2,5>

<div,2,2>

<div,3,1>

<img,3,1>

<img,4,1>

Since every web page has the fixed format

<html>
<head>
</head>
</html>

the html, head, title, and body tags carry no distinguishing information, and their information is eliminated.
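The division into tag units, the merging by type and level, and the elimination of the fixed tags can be sketched as follows. This is only an illustration under stated assumptions: `TagUnitCounter` and `tag_units` are hypothetical names, the input is assumed to be well-formed HTML with explicit closing tags, and `VOID_TAGS` is a small hand-picked subset of childless elements rather than a complete list.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements treated as childless here (assumption: a subset of HTML void elements).
VOID_TAGS = {"img", "link", "meta", "br", "hr", "input", "source"}
# Tags shared by every page, dropped because they do not distinguish pages.
SKELETON_TAGS = {"html", "head", "title", "body"}


class TagUnitCounter(HTMLParser):
    """Counts (tag type, level) pairs, with the html root at level 0."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.units = Counter()

    def handle_starttag(self, tag, attrs):
        self.units[(tag, self.depth)] += 1
        if tag not in VOID_TAGS:
            self.depth += 1  # children of this tag sit one level deeper

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img/>: count it, depth is unchanged.
        self.units[(tag, self.depth)] += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def tag_units(html_text):
    """Return {(type, level): count} without the fixed skeleton tags."""
    counter = TagUnitCounter()
    counter.feed(html_text)
    return {unit: n for unit, n in counter.units.items()
            if unit[0] not in SKELETON_TAGS}
```

Each key of the returned dictionary together with its value corresponds to one merged unit of the form <type, level, count>.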

Secondly, the representation of the feature vector is constructed and the weight value of each corresponding tag is calculated.

The dimension of the feature vector is a fixed value preset according to the tag types. Here it may be 6, with each dimension holding the weight value of one tag type: <link, style, script, div, img, video>.

The weight value corresponding to each tag type is determined according to a preset rule. Specifically, the weight value represents the importance of the corresponding tag type in the web page. Each occurrence of a tag type is assigned a weight value, and the final weight value is the sum of the weight values of its feature units. The weight value of a feature unit appearing in the web page is determined by a preset rule, which includes the following:

The weight value of a tag type decreases with the depth difference from the level at which that tag type first appears in the DOM tree. A relative depth is used because different tag types have different probabilities of appearing at a given level of the DOM tree, so decreasing by absolute depth would make the weight values of some tags too low. In practice, the weight values of the feature units may decrease geometrically, and only feature units within a limited depth are considered.

For example, the weight value at the highest level of each tag type is preset to 1.0. If the second occurrence of the tag type is also at that highest level, its weight value is still 1.0; if the level of the second occurrence is offset from the highest level by 2, its weight value is multiplied by the square of the attenuation factor (assuming the attenuation factor is preset to 0.5).

Finally, the HTML file feature vector is calculated.

After the weight values of the tag features are determined, the actual value of each tag type in its dimension of the feature vector is determined from the weight values of that tag type in the DOM tree, and the feature vector corresponding to the DOM tree of the web page is determined accordingly.

For example, the link, style, script, and video tags in the above example all appear in the same level, so simple summation gives the partial feature vector <2, 4, 5, div, img, 0>. The div tag appears 2 times in the second level, contributing 1 × 2 = 2, and 1 time in the third level, contributing 1 × 0.5 × 1 = 0.5, so its final weight is 2 + 0.5 = 2.5. Similarly, the final weight of img is 1.5. The final feature vector is <2, 4, 5, 2.5, 1.5, 0>.
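The weighting rule and the worked example can be checked with a short sketch. The function name `html_feature_vector` and its parameters are hypothetical; the base weight of 1.0 and the attenuation factor of 0.5 are the values assumed in the example, and the optional `max_offset` cut-off is one possible reading of "only feature units within a limited depth are considered".

```python
TAG_ORDER = ("link", "style", "script", "div", "img", "video")


def html_feature_vector(units, base_weight=1.0, decay=0.5, max_offset=None):
    """Build the 6-dimensional HTML feature vector.

    units maps (tag type, level) to the number of occurrences.
    Each occurrence weighs base_weight * decay ** offset, where offset
    is the level minus the first (shallowest) level of that tag type.
    """
    first_level = {}
    for tag, level in units:
        if tag in TAG_ORDER:
            first_level[tag] = min(level, first_level.get(tag, level))

    vector = [0.0] * len(TAG_ORDER)
    for (tag, level), count in units.items():
        if tag not in first_level:
            continue  # tag type is not one of the preset dimensions
        offset = level - first_level[tag]
        if max_offset is not None and offset > max_offset:
            continue  # beyond the depth limit (hypothetical cut-off)
        vector[TAG_ORDER.index(tag)] += base_weight * decay ** offset * count
    return vector


# Merged tag units from the example, after removing html, head, title, body.
example_units = {
    ("link", 2): 2, ("style", 2): 4, ("script", 2): 5,
    ("div", 2): 2, ("div", 3): 1, ("img", 3): 1, ("img", 4): 1,
}
print(html_feature_vector(example_units))  # [2.0, 4.0, 5.0, 2.5, 1.5, 0.0]
```

Measuring the offset from each tag type's own first level, rather than from the root, is what keeps img at 1.0 for its first occurrence even though it only appears from level 3 downward.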

Step 202: calculating a feature vector according to the CFG (Control Flow Graph) of the JS file. The implementation is based on an open source toolkit; for example, parsing a JS file with ast-flow-graph directly generates the control flow graphs of all its functions.

After the control flow graphs are obtained, the feature vector of the JS file is determined as follows. The control flow graph of each function consists of basic blocks and the edges connecting them. Features are extracted from each basic block, including statistical features such as the number of string constants, the number of numeric constants, and the number of calls, and structural features such as the number of successor basic blocks.

Here, the vector of a single basic block may have 4 dimensions, each holding the weight value of one feature of the basic block: <number of string constants, number of numeric constants, number of calls, number of successor basic blocks>.

The dimension of the feature vector generated for each basic block equals the number of selected features. The feature vectors of all basic blocks of a function are combined by some calculation, such as addition or averaging, into the feature vector of the function, and the feature vectors of the functions are combined in the same way to obtain the feature vector of the JS file.

The weight value corresponding to each feature is determined according to a preset rule. Specifically, the weight value represents the importance of the corresponding feature in the function.
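The aggregation from basic blocks to functions to the JS file can be sketched as follows. The sketch assumes the control flow graphs have already been extracted by some external tool and reduced to per-block counts; the `BasicBlock` record and the functions `aggregate` and `js_file_vector` are hypothetical names, and per-feature weights are left out (every feature has weight 1).

```python
from dataclasses import dataclass


@dataclass
class BasicBlock:
    """Per-block counts taken from a control flow graph (hypothetical record)."""
    string_constants: int
    numeric_constants: int
    calls: int
    successors: int

    def vector(self):
        return [self.string_constants, self.numeric_constants,
                self.calls, self.successors]


def aggregate(vectors, mode="sum", dim=4):
    """Combine equal-length vectors element-wise by addition or averaging."""
    if mode not in ("sum", "mean"):
        raise ValueError("mode must be 'sum' or 'mean'")
    if not vectors:
        return [0.0] * dim  # nothing to combine: zero vector of fixed length
    totals = [float(sum(column)) for column in zip(*vectors)]
    if mode == "mean":
        return [t / len(vectors) for t in totals]
    return totals


def js_file_vector(functions, mode="sum"):
    """functions is a list of functions, each given as a list of BasicBlock."""
    function_vectors = [
        aggregate([block.vector() for block in blocks], mode)
        for blocks in functions
    ]
    return aggregate(function_vectors, mode)
```

Returning a zero vector for a file with no functions keeps the output length fixed at 4, which matters when the vectors are later concatenated.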

Step 203: combining the obtained feature vectors to obtain the webpage feature vector. Specifically, the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector, and the embedded JS file feature vector are combined, for example concatenated, to serve as the final feature vector of the web page.

The simplest concatenation may be chosen here: the final feature vector of the web page is the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector, and the embedded JS file feature vector, concatenated in that order.
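The concatenation can be written in a few lines. The function name `page_feature_vector` is hypothetical, and the lengths of 6 for the HTML vectors and 4 for the JS vectors are the example dimensions used in this embodiment, not fixed requirements.

```python
HTML_DIM = 6  # example dimension of an HTML file feature vector
JS_DIM = 4    # example dimension of a JS file feature vector


def page_feature_vector(source_html, embedded_html, source_js, embedded_js):
    """Concatenate the four partial vectors in a fixed order."""
    parts = (
        (source_html, HTML_DIM), (embedded_html, HTML_DIM),
        (source_js, JS_DIM), (embedded_js, JS_DIM),
    )
    for vector, expected in parts:
        if len(vector) != expected:
            raise ValueError("partial vector has an unexpected length")
    return [*source_html, *embedded_html, *source_js, *embedded_js]
```

With these example dimensions every page yields a vector of length 20, so the input size of the neural network is known in advance.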

Step 204: inputting the features into a neural network for training. The labeled webpage data set is converted into feature vectors, which are input into the neural network, and a suitable loss function is selected for training.

Because the dimensions of the HTML file feature vector and the JS file feature vector are determined by preset rules, the dimension of the webpage feature vector is fixed, and vector padding does not need to be considered.

In a practical case such as binary classification, i.e. determining whether a web page belongs to the blog category, a simple stack of fully connected layers with the ReLU activation function can be used, with a sigmoid function compressing an arbitrary value into a probability in [0, 1] that serves as the prediction result.
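The forward pass of such a binary classifier can be sketched in plain Python. This is an illustrative sketch only: the layer sizes (20, 16, 16, 1), the random initialization, and the choice of binary cross-entropy as the loss are assumptions, and the gradient-based training loop that a real implementation would run in a deep learning framework is omitted.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    # Numerically stable form for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict(layers, features):
    """ReLU on every hidden layer, sigmoid on the single output unit."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(activations, weights, biases)[0])


def binary_cross_entropy(probability, label, eps=1e-12):
    """Loss for one sample with label 0 or 1 (assumed choice of loss)."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


rng = random.Random(0)
# Assumed sizes: 20 input features, two hidden layers of 16 units, 1 output.
layers = [init_layer(20, 16, rng), init_layer(16, 16, rng),
          init_layer(16, 1, rng)]
```

`predict(layers, vector)` returns a value between 0 and 1 that can be thresholded, for example at 0.5, to decide whether the page belongs to the target category.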

Step 205: inputting the web page to be detected into the neural network, whose output is the category of that web page.

Compared with text-based techniques, the invention is more adaptable because it uses statistical information about HTML tags as part of the webpage features. For example, a text-based classifier built on a Chinese bag-of-words model cannot process English web pages, whereas an approach based on tag statistics is insensitive to the language of the text. It therefore removes the influence of inconsistent text languages and overcomes drawbacks of text-based webpage classification such as language differences. The invention also uses JS file information as part of the webpage features, which makes it more adaptable than classification techniques based on content such as text and images. For example, the text and images of a dynamically loaded webpage cannot be obtained directly from the page source code, and the dynamically loaded code usually resides in JS files. Extracting features from the JS files therefore overcomes the prior art's inability to classify dynamically loaded webpages well.

The invention distinguishes source files from embedded files and takes link information into account, which can improve classification accuracy. For example, video websites in the pornographic category use embedded frames to embed HTML files and dedicated JS files in order to evade conventional classification techniques and supervision, whereas a regular video website generally stores its video resources on a specific server, does not embed new HTML files, and uses common JS files. Distinguishing source files from embedded files and introducing link information amplifies the difference between the two types of websites, which improves classification accuracy and allows finer-grained website classification.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and those skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for supporting classification of web pages, comprising:

a resource obtaining step:

acquiring an HTML file and a JS file of a dynamically loaded webpage;

HTML file feature vector calculation step:

calculating a JS file feature vector:

calculating a webpage feature vector:

training a neural network:

a neural network identification step:

2. The method for supporting webpage classification according to claim 1, wherein the specific steps of the resource obtaining step include:

3. The method for supporting webpage classification according to claim 1, wherein the specific steps of the HTML document feature vector calculating step include:

4. The method for supporting webpage classification according to claim 1, wherein the JS file feature vector calculation step specifically comprises:

5. A system for supporting classification of web pages, comprising:

a neural network training module: converting the labeled web pages into vectors according to the calculation mode of each feature vector calculation module, and using the vectors as input to train the neural network;

a neural network identification module: converting the web page to be detected into a vector in the above manner, inputting the vector into the trained neural network, and obtaining a classification result.