CN113806667B

CN113806667B - Method and system for supporting webpage classification

Info

Publication number: CN113806667B
Application number: CN202111129758.1A
Authority: CN
Inventors: 陈超凡; 王轶骏
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2023-10-03
Anticipated expiration: 2041-09-26
Also published as: CN113806667A

Abstract

The invention relates to a method and a system for supporting webpage classification. The method acquires the HTML files and JS files of the web pages in a data set; calculates a feature vector from the DOM tree of each HTML file; calculates a feature vector from the control flow graph (CFG) of each JS file; combines the feature vectors of the HTML file and the JS file to obtain a webpage feature vector; and uses the obtained webpage feature vectors as the input for training a neural network. The feature vector of a web page to be tested is obtained in the same way and input into the neural network to obtain the output classification. Compared with the prior art, the method improves the recognition accuracy for dynamically loaded web pages, supports large-scale web page classification and detection, and overcomes drawbacks of content-based web page classification such as language differences.

Description

Method and system for supporting webpage classification

Technical Field

The invention relates to the technical field of internet communication, in particular to a method and a system for supporting webpage classification.

Background

Web pages have been important participants throughout the development of the Internet. The W3C World Wide Web began with the sharing of information, and web pages appeared with it. In the course of this sharing, various forms of information theft emerged, typified by the copying of web page source code, which laid the groundwork for similar websites built in similar ways. With the second wave of growth brought by the mobile Internet and the ease of generating web pages at scale, the number of web pages has grown explosively.

Classifying these web pages mainly means processing the pages themselves. Some approaches identify a web page by its text content, some by displayed content such as images and pictures, and some by a screenshot of the page. As web development technologies continue to evolve and replace one another, static web pages are gradually decreasing and more and more websites adopt dynamically loaded web pages.

A static web page is one whose page data and DOM tree structure are stored directly in the HTML file. A dynamically loaded web page uses a programming technique that enhances static web pages: the page changes as JS code dynamically adjusts it while the DOM tree is generated and the page is rendered. Consequently, if a crawler simply fetches the web page source code, it cannot obtain the real data.

Different types of web pages often have different page structures, so the page structure also carries category information, and using it avoids the influence of factors such as page localization and content filling. However, at present a dynamically loaded page cannot be identified well by a simple computation on the HTML DOM tree, and the identification accuracy is not high.

Disclosure of Invention

The present invention is directed to a method and system for supporting web page classification that overcomes the above-mentioned drawbacks of the prior art.

The aim of the invention can be achieved by the following technical scheme:

A first aspect of the present invention provides a method for supporting web page classification, the method comprising the following steps:

a resource acquisition step:

acquiring an HTML file and a JS file of a dynamically loaded web page;

an HTML file feature vector calculation step:

calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;

a JS file feature vector calculation step:

acquiring a control flow graph, containing attribute parameters, of the JS file, converting the basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;

a webpage feature vector calculation step:

combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the web page;

a neural network training step:

converting labeled web pages into feature vectors according to the above steps, and training the neural network with these feature vectors as input;

a neural network identification step:

converting the web page to be detected into its feature vector according to the above steps, inputting the feature vector into the trained neural network, and obtaining a classification result.

Further, the resource acquisition step specifically comprises:

a crawling and parsing step: crawling the HTML file of the dynamically loaded web page and parsing its HTML tags to obtain the files with the suffix JS, the HTML files in frame tags, and their file paths;

a classified downloading step: identifying the HTML files obtained from the frame tags as embedded HTML files, dividing the JS files into source JS files and embedded JS files according to whether their file paths contain the source domain name, and downloading the files.

Further, the specific steps of the HTML file feature vector calculation step include:

a tag identification step: identifying the HTML tags in the DOM tree and converting them into tag units;

a tag unit merging step: merging the tag units with the same attributes to obtain a merged tag unit set;

a vector calculation step: calculating the weight value of each tag from the merged tag unit set, and constructing the HTML file feature vector from the weight values of the tags.

Further, the specific steps of the JS file feature vector calculation step include:

a file parsing step: parsing each function of the JS file to obtain a control flow graph;

a function vector calculation step: converting each basic block of the control flow graph into a fixed-length vector according to the preset attribute parameter features, and calculating the vector of the function;

a vector calculation step: constructing the JS file feature vector from the vectors of the functions.

In another aspect, the present invention provides a system for supporting classification of web pages, the system comprising:

a resource acquisition module: acquiring an HTML file and a JS file of a dynamic loading webpage;

an HTML file feature vector calculation module: calculating a feature vector corresponding to the HTML file according to the DOM tree of the HTML file;

a JS file feature vector calculation module: acquiring a control flow graph, containing attribute parameters, of the JS file, converting the basic blocks of the control flow graph into feature vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;

a webpage feature vector calculation module: combining the feature vector corresponding to the HTML file with the feature vector of the JS file to obtain the feature vector of the dynamically loaded web page;

a neural network training module: converting labeled web pages into vectors in the above manner, and training the neural network with these vectors as input;

a neural network identification module: converting the web page to be detected into a vector by means of the feature vector calculation modules, inputting the vector into the trained neural network, and obtaining a classification result.

Compared with the prior art, the method and the system for supporting webpage classification at least have the following beneficial effects:

1) The invention uses statistical information on HTML tags as part of the web page features, and is therefore more adaptable than text-based techniques. For example, a text-based classification technique built on a Chinese bag-of-words model cannot process English web pages, whereas tag statistics are insensitive to the language of the text. This eliminates the influence of inconsistent text languages and overcomes drawbacks such as language differences in text-based web page classification.

2) The invention uses JS file information as part of the web page features, and is therefore more adaptable than classification techniques based on content such as text and images. For example, in a dynamically loaded web page the corresponding text and images cannot be obtained directly from the page source code, and the dynamically loaded code normally resides entirely in the JS files. Characterizing the JS files therefore overcomes the inability of the prior art to classify dynamically loaded web pages well.

3) The invention distinguishes source files from embedded files and takes link information into account, which improves classification accuracy. For example, a video website that wants to evade conventional classification techniques and supervision typically uses an embedded frame to embed an HTML file and a dedicated JS file, whereas a regular video website generally stores its video resources on a specific server, embeds no new HTML file, and uses generic JS files. Distinguishing source files from embedded files and introducing link information amplifies the difference between the two types of website, so classification accuracy improves and website classification becomes finer grained.

4) The web page structure and the JS files are analyzed statically, so dynamically loaded web pages can be classified with the neural network without executing the JS code, and large-scale web page classification and detection can be supported.

Drawings

FIG. 1 is a schematic diagram of a method for supporting web page classification in an embodiment;

FIG. 2 is a schematic flowchart of an implementation of a method for supporting web page classification in an embodiment.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

Examples

The invention relates to a method for supporting webpage classification, which comprises the following steps:

Resource acquisition step: acquiring an HTML file and a JS file of the dynamically loaded web page.

HTML file feature vector calculation step: calculating the feature vector corresponding to the HTML file according to the DOM tree of the HTML file.

JS file feature vector calculation step: parsing the JS file to obtain a control flow graph containing attribute parameters, converting the basic blocks of the control flow graph into vectors, and calculating the feature vector of the JS file based on the basic block feature vectors.

Webpage feature vector calculation step: combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the web page.

Neural network training step: converting labeled web pages into vectors in the above manner and using them as input to train the neural network.

Neural network identification step: converting the web page to be detected into a vector in the above manner, and inputting the vector into the trained neural network to obtain a classification result.

Specifically, the resource acquisition step is as follows:

a crawling and parsing step: crawling the HTML file of the dynamically loaded web page and parsing its HTML tags to obtain the files with the suffix JS, the HTML files in frame tags, and their file paths;

a classified downloading step: identifying the HTML files obtained from the frame tags as embedded HTML files, dividing the JS files into source JS files and embedded JS files according to whether their file paths contain the source domain name, and downloading the source JS files and embedded JS files.

Specifically, the specific steps of the HTML file feature vector calculation step include:

a tag identification step: identifying the HTML tags in the DOM tree and converting them into tag attribute units according to a certain rule, namely according to the type of each node tag in the DOM tree and the position of its node;

a tag unit merging step: merging the tag units that have the same attributes;

a vector calculation step: calculating the weight value of each tag from the merged tag unit set according to a preset rule, and constructing the HTML file feature vector. Specifically, an initial weight value is set for each tag type according to its importance in the web page, the weight value of each tag is calculated on the principle that the weight decreases with the depth difference, and the HTML file feature vector is constructed.

Specifically, the step of calculating the JS file feature vector includes:

a file parsing step: parsing each function of the JS file to obtain a control flow graph. Specifically, each function of the source JS files and the embedded JS files is parsed with an open source toolkit to obtain its control flow graph, and the source JS files and embedded JS files are processed in the same way. JS files are used because, first, a JS file contains information about the website, so using it helps web page classification; and second, the source JS files and embedded JS files of different websites have different characteristics, so separating the two shows the characteristics carried by a website's JS files more clearly and thus better supports web page classification;

a vector calculation step: constructing the JS file feature vector from the function vectors according to a preset rule, for example by a calculation such as addition or averaging.

Specifically, in the webpage feature vector calculation step, the vectors may be combined in one or more ways; for example, the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector and the embedded JS file feature vector can be spliced together in sequence.

Specifically, in the neural network training step, the web pages are labeled with one or more kinds of labels; the network structure of the neural network is not limited to one type, and may be, for example, a simple stack of fully connected layers with the ReLU activation function.

The invention also provides a system for supporting webpage classification, which comprises:

a resource acquisition module: acquiring an HTML file and a JS file of a dynamically loaded web page;

a JS file feature vector calculation module: parsing the JS file to obtain a control flow graph containing attribute parameters, converting the basic blocks of the control flow graph into vectors, and calculating the feature vector of the JS file based on the feature vectors of the basic blocks;

a webpage feature vector calculation module: combining the obtained HTML file feature vector and JS file feature vector to obtain the feature vector of the web page;

a neural network identification module: converting the web page to be detected into a vector in the above manner, and inputting the vector into the trained neural network to obtain a classification result.

The system for supporting webpage classification can be realized through the step flow of the method for supporting webpage classification. Those skilled in the art will understand the method of supporting web page classification as a preferred example of the system for supporting web page classification.

The preferred embodiments are further described below.

As shown in fig. 1, the method for supporting web page classification provided by the embodiment of the invention includes:

step 101: calculating the partial features of the page from the HTML file structure and from the JS code of the page, respectively;

step 102: performing web page classification training according to the web page features, and performing detection.

Through the above processing, the type of a dynamically loaded web page can be identified: when an unknown web page is input, its category can be quickly determined through identification. As shown in FIG. 2, the method specifically comprises the following processing steps:

Step 201: acquiring the HTML files and JS files. The dynamically loaded page is crawled, and the files with the suffixes HTML and JS are obtained by parsing the src attributes of the iframe and script tags in the HTML; for example, the corresponding JS file is obtained by parsing an HTML tag such as <script src="view/JS/jquery-3.1.0.js1.5"></script>. At the same time, the source HTML file, embedded HTML files, source JS files and embedded JS files are distinguished according to the tags and file paths.
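The parsing part of this step can be sketched as follows. This is a minimal illustration rather than the patented implementation: it uses only the Python standard library, the names `ResourceCollector` and `collect_resources` are hypothetical, and it assumes that a JS file counts as a source JS file when its resolved host equals the host of the page and as an embedded JS file otherwise. Downloading the files is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ResourceCollector(HTMLParser):
    """Collects the src attribute of <script> and <iframe> tags."""

    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.iframe_srcs = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if not src:
            return  # inline scripts and frames without src are skipped
        if tag == "script":
            self.script_srcs.append(src)
        elif tag == "iframe":
            self.iframe_srcs.append(src)


def collect_resources(page_url, html_text):
    """Return resolved URLs grouped as embedded HTML, source JS, embedded JS.

    Assumption: a JS file is 'source' when its resolved host equals the
    page's host, and 'embedded' otherwise.
    """
    collector = ResourceCollector()
    collector.feed(html_text)
    page_host = urlparse(page_url).netloc
    result = {"embedded_html": [], "source_js": [], "embedded_js": []}
    for src in collector.iframe_srcs:
        result["embedded_html"].append(urljoin(page_url, src))
    for src in collector.script_srcs:
        url = urljoin(page_url, src)
        same_host = urlparse(url).netloc == page_host
        result["source_js" if same_host else "embedded_js"].append(url)
    return result
```

Calling `collect_resources(page_url, html_text)` on a crawled page returns three lists of absolute URLs that a downloader could then fetch.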

Step 202: calculating a feature vector according to the DOM tree of the webpage:

The DOM tree of the HTML file is divided into tag units according to the node tag types and node positions.

The following is an example of a web page DOM tree, which has two child nodes, head and body, under the root node html, each of which in turn contains multiple tags.

<html>

<head>

<title>title of web page</title>

</head>

<body>

<div>

</div>

</body>

</html>

For example, taking html as layer 0, any script tag unit in the second layer can be represented as <script,2>, and likewise the first-occurring img tag can be represented as <img,3>.

After dividing the DOM tree into tag units, determining the conversion from the tag units to feature vectors, comprising the following steps:

First, identical tag units are counted, where identical tag units are tag units that have the same node tag type and lie at the same level. For example, taking html as layer 0, the units <script src="/static/1.js" defer> and <script src="/static/2.js" defer> under the node body are both of the script tag type and both lie in the second layer, so their tag units are labeled as identical. In this way, the present embodiment merges tag units into the form <type, level, number> and obtains the following representation of the tag feature units:

<html,0,1>

<head,1,1>

<body,1,1>

<title,2,1>

<link,2,2>

<style,2,4>

<script,2,5>

<div,2,2>

<div,3,1>

<img,3,1>

<img,4,1>

Since any web page has the fixed format

<html>

<head>

</head>

</html>

the html, head, title and body tags carry no discriminative information and are discarded.
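The counting and merging of tag units described above can be sketched with the Python standard library. The class name `TagUnitCounter` and the function `merged_tag_units` are hypothetical, and the sketch assumes reasonably well-formed HTML in which every non-void element has a closing tag; a production parser would need to be more tolerant.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags present in every page, discarded as non-discriminative.
SKIPPED_TAGS = {"html", "head", "title", "body"}


class TagUnitCounter(HTMLParser):
    """Counts tags per (type, level), with <html> at level 0."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.units = Counter()

    def handle_starttag(self, tag, attrs):
        if tag not in SKIPPED_TAGS:
            self.units[(tag, self.depth)] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def merged_tag_units(html_text):
    """Return sorted <type, level, number> triples for an HTML string."""
    counter = TagUnitCounter()
    counter.feed(html_text)
    counter.close()
    return sorted((tag, level, n) for (tag, level), n in counter.units.items())
```

Each returned triple corresponds to one merged tag unit, for example `("script", 2, 5)` for five script tags in the second layer.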

Second, the feature vector representation is constructed and the weight value of each corresponding tag is calculated.

The dimension of the feature vector is a fixed value set in advance according to the tag types. In this example it may be 6, with each dimension holding the weight value of one tag type: <link, style, script, div, img, video>.

The weight value corresponding to each tag type is determined according to a preset rule. Specifically, the weight value represents the importance of the corresponding tag type in the web page: each occurrence of the tag type is assigned a weight value, and the final weight value is the sum of the weight values of all occurrences of the feature unit. The weight value of each occurrence in the web page is determined by the following preset rule:

The weight value of a tag type decreases with the depth difference, in the DOM tree, from its first (shallowest) occurrence. A relative depth is used because different tag types have different probabilities of appearing at a given layer of the DOM tree, and decreasing by absolute depth would make the weight values of some tags too low. In practical applications, the weight values of the feature units may be decreased geometrically, and only feature units within a limited depth are considered.

For example, the weight value of an occurrence in the shallowest layer where a tag type appears is preset to 1.0. If a second occurrence of the tag type lies in that same layer, its weight value is still 1.0; if a later occurrence lies 2 layers below that shallowest layer, its weight value is multiplied by the square of the attenuation factor (assuming that the attenuation factor is preset to 0.5).

Finally, the feature vector of the HTML file is calculated.

After the weight values of the tag features are determined, the real-valued entry for each tag type in the feature vector is determined from the weight values of that tag type in the DOM tree, which gives the feature vector corresponding to the web page DOM tree.

For example, each of the link, style and script tags in the above example appears in a single layer, and video does not appear at all, so simple accumulation gives the partial feature vector <2,4,5,div,img,0>. The div tag appears 2 times in the second layer, contributing 1×1×2=2, and 1 time in the third layer, contributing 1×0.5×1=0.5, so its final weight is 2+0.5=2.5. Similarly, the final weight of img is 1.5. The final feature vector is therefore <2,4,5,2.5,1.5,0>.
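The weighting rule and the worked example can be reproduced with a short sketch. The function name `html_feature_vector` is hypothetical, and the depth cut-off `max_relative_depth=3` is an assumed value, since the text only says that a limited depth is considered.

```python
TAG_ORDER = ["link", "style", "script", "div", "img", "video"]


def html_feature_vector(tag_units, decay=0.5, max_relative_depth=3):
    """Turn <type, level, number> triples into a fixed-length weight vector.

    Each occurrence weighs decay ** (level - shallowest level of that type);
    occurrences more than max_relative_depth layers below the shallowest
    level are ignored (the cut-off value is an assumption).
    """
    shallowest = {}
    for tag, level, _ in tag_units:
        shallowest[tag] = min(level, shallowest.get(tag, level))

    weights = dict.fromkeys(TAG_ORDER, 0.0)
    for tag, level, number in tag_units:
        if tag not in weights:
            continue  # tag types outside the chosen dimensions are ignored
        relative = level - shallowest[tag]
        if relative <= max_relative_depth:
            weights[tag] += number * decay ** relative
    return [weights[tag] for tag in TAG_ORDER]


# Tag units from the example above, with html, head, title and body removed.
example_units = [
    ("link", 2, 2), ("style", 2, 4), ("script", 2, 5),
    ("div", 2, 2), ("div", 3, 1), ("img", 3, 1), ("img", 4, 1),
]
print(html_feature_vector(example_units))  # [2.0, 4.0, 5.0, 2.5, 1.5, 0.0]
```

Because the decay is relative to the shallowest occurrence of each tag type, img starts at weight 1.0 in the third layer instead of being penalized for its absolute depth.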

Step 202: calculating a feature vector according to the CFG (control flow graph) of the JS. The implementation is based on an open source toolkit; for example, parsing the JS file with ast-flow-graph directly generates the control flow graphs of all functions.

After the control flow graphs are acquired, the feature vector of the JS file is determined as follows. The control flow graph of each function consists of basic blocks and the edges connecting them. Each basic block is characterized using statistical features, such as the number of string constants, the number of numeric constants and the number of calls, and structural features, such as the number of successor basic blocks.

Here, the vector of a single basic block may have 4 dimensions, each holding the weight value of one feature of the basic block: <number of string constants, number of numeric constants, number of calls, number of successor basic blocks>.

The dimension of the feature vector generated for each basic block equals the number of selected features. The feature vectors of all basic blocks of a function are combined by some calculation, such as addition or averaging, into the feature vector of the function; similarly, the feature vectors of the functions are combined to obtain the feature vector of the JS file.

The weight value corresponding to each feature is determined according to preset rules. Specifically, the weight value represents the importance of the corresponding feature in the function.
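The aggregation from basic blocks to functions to the JS file can be sketched as follows. The sketch assumes that the control flow graph has already been extracted by a JS tool and summarized into a hypothetical `BasicBlock` record; it does not call ast-flow-graph, and it leaves the per-feature weights at 1 because the text does not give their values.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    """Hypothetical, already-extracted summary of one CFG basic block."""
    string_constants: int = 0
    numeric_constants: int = 0
    calls: int = 0
    successors: list = field(default_factory=list)  # ids of following blocks


def block_vector(block):
    """4-dimensional vector: three statistical features plus out-degree."""
    return [block.string_constants, block.numeric_constants,
            block.calls, len(block.successors)]


def aggregate(vectors, mode="sum", dim=4):
    """Combine equal-length vectors by element-wise addition or averaging."""
    if not vectors:
        return [0.0] * dim
    totals = [float(sum(column)) for column in zip(*vectors)]
    if mode == "mean":
        return [t / len(vectors) for t in totals]
    return totals


def js_file_vector(functions, mode="sum"):
    """functions: list of functions, each given as a list of BasicBlock."""
    function_vectors = [
        aggregate([block_vector(b) for b in blocks], mode)
        for blocks in functions
    ]
    return aggregate(function_vectors, mode)
```

Whichever of addition or averaging is chosen, the result always has 4 dimensions, regardless of how many functions or basic blocks the file contains.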

Step 203: combining the obtained feature vectors to obtain the web page feature vector. Specifically, the source HTML file feature vector, the embedded HTML file feature vector, the source JS file feature vector and the embedded JS file feature vector are combined, for example spliced, into the final feature vector of the web page.

The simplest splicing method may be chosen here: final web page feature vector = [source HTML file feature vector, embedded HTML file feature vector, source JS file feature vector, embedded JS file feature vector], concatenated in this order.
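As a small illustration of this splicing, the sketch below concatenates four part vectors in a fixed order. The function name `web_page_vector` and all numeric values are illustrative only, and using a zero vector for a missing part is an assumption, not something the text specifies.

```python
def web_page_vector(source_html, embedded_html, source_js, embedded_js):
    """Splice the four part vectors together in a fixed order."""
    return [*source_html, *embedded_html, *source_js, *embedded_js]


# Illustrative values: two 6-dimensional HTML vectors, two 4-dimensional JS vectors.
page = web_page_vector(
    [2, 4, 5, 2.5, 1.5, 0],   # source HTML
    [0, 0, 1, 1.0, 0.0, 0],   # embedded HTML
    [3, 3, 4, 3],             # source JS
    [0, 0, 0, 0],             # embedded JS (none found, zero vector assumed)
)
print(len(page))  # 20
```

With 6-dimensional HTML vectors and 4-dimensional JS vectors, every page yields a 20-dimensional input.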

Step 204: inputting the features into a neural network for training. The labeled page data set is converted into feature vectors, which are input into the neural network, and a suitable loss function is selected for training.

Since the dimensions of the HTML file feature vector and the JS file feature vector are both determined by preset rules, the dimension of the web page feature vector is fixed, and vector padding need not be considered.

In practice, taking binary classification as an example, i.e. determining whether a web page belongs to the blog class, a simple stack of fully connected layers with the ReLU activation function may be used, with a sigmoid function at the output compressing an arbitrary value into a probability in [0,1] that serves as the prediction result.
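A binary classifier of this shape can be sketched in pure Python. This is an illustration only: the class name `TinyClassifier`, the layer sizes, the learning rate and the choice of binary cross-entropy as the loss are all assumptions, and a real system would more likely use a deep learning framework.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TinyClassifier:
    """Fully connected ReLU layers followed by one sigmoid output unit.

    sizes lists the layer widths and must end with 1, e.g. [20, 8, 1].
    """

    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # One (weights, biases) pair per layer; weights[j][i] links input i to unit j.
        self.layers = [
            ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        activations = [list(x)]
        last = len(self.layers) - 1
        for index, (weights, biases) in enumerate(self.layers):
            z = [sum(w * a for w, a in zip(row, activations[-1])) + b
                 for row, b in zip(weights, biases)]
            activations.append(
                [sigmoid(v) if index == last else max(0.0, v) for v in z])
        return activations

    def predict(self, x):
        """Probability that x belongs to the positive class."""
        return self.forward(x)[-1][0]

    def train_step(self, x, y, lr=0.1):
        """One SGD step on binary cross-entropy for a single sample."""
        activations = self.forward(x)
        # For sigmoid plus cross-entropy, dLoss/dz at the output is (p - y).
        delta = [activations[-1][0] - y]
        for index in reversed(range(len(self.layers))):
            weights, biases = self.layers[index]
            inputs = activations[index]
            if index > 0:
                # Back-propagate through the weights and the ReLU derivative.
                previous_delta = [
                    sum(weights[j][i] * delta[j] for j in range(len(delta)))
                    * (1.0 if inputs[i] > 0 else 0.0)
                    for i in range(len(inputs))
                ]
            else:
                previous_delta = []
            for j, d in enumerate(delta):
                for i, a in enumerate(inputs):
                    weights[j][i] -= lr * d * a
                biases[j] -= lr * d
            delta = previous_delta
```

For a 20-dimensional web page feature vector, `TinyClassifier([20, 8, 1])` would be one possible configuration; calling `train_step(vector, label)` repeatedly over the labeled data set with labels 0 or 1 trains it, and `predict(vector)` returns the probability of the positive class.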

Step 205: inputting the web page to be tested into the neural network, which outputs the category to which the web page belongs.

The invention uses statistical information on HTML tags as part of the web page features, and is therefore more adaptable than text-based techniques. For example, a text-based classification technique built on a Chinese bag-of-words model cannot process English web pages, whereas tag statistics are insensitive to the language of the text. This eliminates the influence of inconsistent text languages and overcomes drawbacks such as language differences in text-based web page classification. The invention also uses JS file information as part of the web page features, and is therefore more adaptable than classification techniques based on content such as text and images. For example, in a dynamically loaded web page the corresponding text and images cannot be obtained directly from the page source code, and the dynamically loaded code normally resides entirely in the JS files. Characterizing the JS files therefore overcomes the inability of the prior art to classify dynamically loaded web pages well.

The invention distinguishes source files from embedded files and takes link information into account, which improves classification accuracy. For example, among video websites, a pornographic video website typically uses an embedded frame to embed an HTML file and a dedicated JS file in order to evade conventional classification techniques and supervision, whereas a regular video website generally stores its video resources on a specific server, embeds no new HTML file, and uses generic JS files. Distinguishing source files from embedded files and introducing link information amplifies the difference between the two types of website, so classification accuracy improves and website classification becomes finer grained.

While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions may be made without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A method for supporting web page classification, comprising:

a resource acquisition step:

acquiring an HTML file and a JS file of a dynamic loading webpage;

an HTML file feature vector calculation step:

a JS file feature vector calculation step:

a webpage feature vector calculation step:

a neural network training step:

a neural network identification step:

2. The method for supporting web page classification as recited in claim 1, wherein the specific step of the resource obtaining step comprises:

3. The method for supporting web page classification as recited in claim 1, wherein the HTML file feature vector calculating step comprises the specific steps of:

4. The method for supporting web page classification according to claim 1, wherein the specific step of the JS file feature vector calculation step includes:

5. A system for supporting web page classification, comprising:

a neural network training module: converting labeled web pages into vectors according to the calculation performed by each feature vector calculation module, and training the neural network with these vectors as input;

a neural network identification module: converting the web page to be detected into a vector in the above manner, inputting the vector into the trained neural network, and obtaining a classification result.