CN111488953B

CN111488953B - Method for rapidly classifying webpage topics based on HTML source code characteristics

Info

Publication number: CN111488953B
Application number: CN202010597175.0A
Authority: CN
Inventors: 简小云; 朱雨佳; 杨哲; 王莉芳; 陈金辉
Original assignee: Insigma Hengtian Software Ltd
Current assignee: Insigma Hengtian Software Ltd
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-13
Anticipated expiration: 2040-06-28
Also published as: CN111488953A

Abstract

The invention discloses a method for rapidly classifying webpage topics based on HTML source code characteristics. The method automatically parses the webpage source code to obtain image data containing webpage layout characteristics; the content length and link length contained in the selected tags, the hierarchical relationship of the selected tags, and the distance relationship between the selected tags effectively reflect the layout of the webpage. The image data generated from the webpage source code is then used to train a deep learning model that learns the webpage layout characteristics contained in the image data, so that massive numbers of webpages can be classified quickly and accurately by their layout characteristics. The invention makes effective use of the webpage layout information contained in the webpage source code and extracts and learns this layout information automatically; the resulting classification model is robust and classifies quickly.

Description

Method for rapidly classifying webpage topics based on HTML source code characteristics

Technical Field

The invention relates to the rapid classification of massive numbers of web pages. In particular, it performs rapid multi-class classification of web page topics based on HTML source code characteristics, without analyzing text semantics, thereby facilitating the subsequent structured and efficient extraction of web page information.

Background

With the explosive growth of internet information, it is increasingly important for machines to extract information efficiently from large volumes of web page data. The first step in automatically and intelligently extracting web page information is to identify and classify the web page quickly, because different types of web pages require different information extraction methods. Existing web page classification methods mainly use three types of web page data: web page text content, web page layout characteristics, and web page query logs.

Methods that classify web pages by their text content mostly represent the content of a web page as an N-dimensional vector and classify by calculating the similarity between web page vectors. The component of the vector in each dimension is the weight of the corresponding feature in the text, and the weighting method most commonly used is TF-IDF. However, such methods are strongly affected by noisy text data, and the time overhead of text parsing is high.
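For orientation, the text-based prior-art approach can be sketched as follows. This is a minimal illustration only: the tokenized documents, the function names, and the specific TF-IDF variant (relative term frequency times the logarithm of inverse document frequency) are assumptions, not details taken from any particular prior-art method.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document (assumed TF-IDF variant)."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy pages: two news-like pages and one shopping-like page.
docs = [
    "breaking news election results".split(),
    "election news live coverage".split(),
    "buy shoes discount cart".split(),
]
vecs = tfidf_vectors(docs)
```

In this toy data the two news-like pages share weighted terms and therefore score a higher similarity with each other than with the shopping-like page, with which they share no terms.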

Methods that classify web pages by their layout characteristics mostly set rules to divide DOM tree nodes into different types, then extract the nodes containing the core information of the web page to construct layout classification features for the target web page; these methods mainly extract statistical features of the text, pictures, and links contained in the core information nodes. However, a large amount of feature engineering must be performed for each specific web page. Current research therefore focuses mainly on how to extract statistical features of web page layout effectively, and the robustness of the resulting classification models has not been verified.

Methods that classify web pages by query logs mostly exploit the relation between query words and web pages: because a user's query words reflect the user's intention, and the web pages the user clicks are generally the results the user wants, the query words reflect the content of the web pages from different angles. Current research on query-log classification and its variants mostly addresses how to reduce the sparsity of user click data. However, query logs are not easy to obtain; user queries must be tracked through special means, such as deploying a router, which raises user data privacy issues.

Disclosure of Invention

The invention aims to provide a method for rapidly classifying webpage topics based on HTML source code characteristics, addressing the deficiencies of the prior art.

The purpose of the invention is achieved by the following technical solution: a method for rapidly classifying webpage topics based on HTML source code characteristics, comprising the following steps:

(1) parsing the webpage HTML source code to extract target information, wherein the extracted information comprises the tag identifier, the content length contained in the tag, the link length contained in the tag, and the nesting hierarchy to which the tag belongs;

(2) generating an information matrix for the target tags from the information extracted in step (1), and converting the information matrix into image data containing webpage layout characteristics, wherein each row of the matrix represents a selected tag, and the columns comprise the level of the tag in the source code nesting hierarchy, the index number corresponding to the tag, the content length written in the tag, and the link length;

(3) learning and classifying the image data using a deep learning network.

Further, in step (2), the image must show the order of the selected tags in the original HTML code and their hierarchical relationship in the nested structure of the code; it must also express the link length and the content length written in each selected tag, and the content length and the link length must be displayed distinguishably.

Further, in step (2), each row of the information matrix is represented by visual pixels as follows: first, white pixels are filled in, the fill length being the depth value of the nesting level; next, pixels with the gray value equal to the tag index number are filled in, the fill length being the content length value + 1; next, pixels with the gray value equal to the tag index + n are filled in, where n is the number of target tags, the fill length being the link length; finally, white pixels are filled in to pad the row to the width of the image.
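The row-filling rule above can be sketched as a short function. This is an illustrative sketch only: the function name, the parameter names, and the image width used in the example are assumptions; the white value 255 and the fill lengths follow the description.

```python
WHITE = 255  # gray value used for the depth indent and for padding

def encode_row(depth, tag_index, content_len, link_len, n_tags, width):
    """Encode one matrix row as a list of `width` gray values."""
    row = [WHITE] * depth                        # one white pixel per nesting level
    row += [tag_index] * (content_len + 1)       # tag index gray value, content_len + 1 pixels
    row += [tag_index + n_tags] * link_len       # shifted gray value marks the link pixels
    if len(row) > width:
        raise ValueError("row does not fit in the image width")
    row += [WHITE] * (width - len(row))          # pad with white up to the image width
    return row

# Hypothetical example: a tag with index 7, text of length 4, no link, depth 1,
# with n = 31 target tags and an assumed image width of 12 pixels.
example = encode_row(depth=1, tag_index=7, content_len=4, link_len=0, n_tags=31, width=12)
```

Because the tag index is always smaller than n, the shifted value (index + n) never collides with an index value, which keeps content pixels and link pixels distinguishable.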

Further, the four lists of data in the matrix in step (2) are as follows:

tag index list: represents the order of the target tags in the HTML code, wherein the value range of each element in the list is 0 to n-1, and n is the number of selected target tags;

list of content lengths written in the tags: represents the content length contained in each target tag in the HTML code, wherein the minimum content length is set to 0 and the maximum to 150, and the minimum value 0 indicates that the target tag has no text;

list of link lengths written in the tags: represents the link length contained in each target tag in the HTML code, wherein the minimum link length is set to 0 and the maximum to 150, and the minimum value 0 indicates that the target tag contains no link;

list of nesting levels of the tags in the HTML source code: represents the nesting hierarchy to which each target tag in the HTML code belongs, wherein the value range of each element in the list is [0-6], 0 represents a root node in the nesting hierarchy, and nesting deeper than 6 levels is recorded as 6.
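The value ranges of the four lists can be enforced when the matrix is assembled. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the column order (index, content length, link length, nesting level) is one possible arrangement.

```python
MAX_LEN = 150    # upper bound for content and link lengths
MAX_DEPTH = 6    # nesting deeper than 6 levels is recorded as 6

def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))

def build_matrix(tag_indices, content_lens, link_lens, depths, n_tags):
    """Combine the four lists into rows of (index, content_len, link_len, depth)."""
    if not (len(tag_indices) == len(content_lens) == len(link_lens) == len(depths)):
        raise ValueError("the four lists must have equal length")
    matrix = []
    for idx, content, link, depth in zip(tag_indices, content_lens, link_lens, depths):
        if not 0 <= idx < n_tags:
            raise ValueError(f"tag index {idx} outside 0..{n_tags - 1}")
        matrix.append((
            idx,
            clamp(content, 0, MAX_LEN),
            clamp(link, 0, MAX_LEN),
            clamp(depth, 0, MAX_DEPTH),
        ))
    return matrix

# Hypothetical raw values: an over-long text at depth 9, and a link of length 4 at depth 1.
matrix = build_matrix([7, 2], [400, 0], [0, 4], [9, 1], n_tags=31)
```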

Further, in step (3), the webpage classification problem is set as multi-class classification of webpage topics, and a convolutional neural network is used to extract webpage layout features from the image data generated from the webpage source code.

Furthermore, softmax is selected as the activation function of the convolutional neural network, and the log-likelihood function is selected as the loss function of the classification model.

Further, step (3) includes: mapping the source code into a data set of pictures with the same width and different heights to serve as input to the classification model; performing feature extraction with convolution kernels to obtain layout features of the webpage, and performing pooling to retain the main features while reducing parameters, thereby obtaining a new feature group; performing dimension reduction and feature extraction on the input image feature group by convolution, and performing pooling to further reduce the dimension of the feature image; and finally, passing the feature image through the fully connected layer to obtain the category corresponding to the image.

Furthermore, a pyramid pooling layer is added to the classification model; it generates a fixed-size output for any input, which then serves as the fixed-size input required by the fully connected layer. The pyramid pooling layer divides an image into a specified number of small blocks, extracts features from each block and fuses them, and the extracted feature map finally passes through the fully connected layer to yield the classification result.

The beneficial effects of the invention are as follows: the method automatically parses the webpage source code to obtain image data containing webpage layout characteristics; the content length and link length contained in the selected tags, the hierarchical relationship of the selected tags, and the distance relationship between the selected tags effectively reflect the layout of the webpage. The image data generated from the webpage source code is then used to train a deep learning model that learns the webpage layout characteristics contained in the image data, so that massive numbers of webpages can be classified quickly and accurately by their layout characteristics. The invention makes effective use of the webpage layout information contained in the webpage source code and extracts and learns this layout information automatically; the resulting classification model is robust and classifies quickly.

Drawings

FIG. 1 is a schematic diagram of the web page layout information matrix extracted from web page source code;

FIG. 2 is a schematic diagram of the structure of the classification model based on a convolutional neural network.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The invention may, however, be practiced in ways other than those specifically described, as will be readily apparent to those of ordinary skill in the art, without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.

The invention provides a method for quickly and accurately classifying webpage topics based on HTML source code characteristics. The implementation of the invention mainly comprises the following three steps:

(1) Parsing specified HTML tags (e.g., <a>, <address>, <object>, <div>, <font>, <form>, <frame>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <head>, <hr>, <img>, <input>, <li>, <link>, <menu>, <nav>, <object>, <p>, <span>, <table>, <time>, <title>, <ul>, <video>, etc.) to extract information, where the extracted information includes the tag identifier, the content length contained in the tag (including text length, picture size, screen layout size, etc.), the link length contained in the tag, the nesting relationship, etc.

(2) Preprocessing the information extracted from the webpage source code into an information matrix and converting it into image data containing webpage layout characteristics, wherein each row of the matrix represents a selected tag, and the columns comprise the level of the tag in the source code nesting hierarchy (such as parent node, child node, grandchild node, etc.), the index number corresponding to the tag, and the content length and link length written in the tag.

(3) The webpage classification problem is set as multi-class classification of webpage topics, the classes including news websites, e-commerce websites, comment websites (such as Douban, Hokka, Baidu Tieba and the like), multimedia websites, blog websites and the like. A convolutional neural network is then used to extract webpage layout features from the image data generated from the webpage source code. The activation function of the convolutional neural network is softmax, defined as follows:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=0}^{K-1} e^{z_k}}$$

wherein: K represents the number of classes; j = 0, 1, 2, ..., K-1; and z represents the vector of scores, each element of which may be any real number.

With softmax as the activation function, the log-likelihood function is selected as the loss function of the classification model, defined as follows:

$$L = -\log p_y$$

wherein $p_y$ represents the predicted probability corresponding to the category y.
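The softmax activation and the log-likelihood loss named above can be illustrated numerically. This is a minimal sketch: the function names and the example scores are assumptions, and subtracting the maximum score before exponentiation is a standard numerical-stability step that is not mentioned in the description.

```python
import math

def softmax(z):
    """Map a list of real-valued scores to class probabilities."""
    m = max(z)                                   # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_loss(z, y):
    """Negative log of the probability predicted for the true class y."""
    return -math.log(softmax(z)[y])

scores = [2.0, 1.0, 0.1]     # hypothetical scores for K = 3 topic classes
probs = softmax(scores)
loss = log_likelihood_loss(scores, 0)
```

The loss approaches 0 as the predicted probability of the true class approaches 1, and grows without bound as that probability approaches 0.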

Because the depth of the network is crucial to the performance of the model, and a network with more layers can extract more complex feature patterns, a specific network (such as the residual network ResNet34) can be selected in a specific implementation to extract the webpage layout features contained in the webpage source code; such a network avoids the problems of network degradation and training difficulty during training and learning.

The implementation process and effect of the invention are described below with an example. In this example, web pages are classified based on their visual features. First, four lists of data are extracted from the original HTML code:

1. The tag index list represents the order of the target tags in the HTML code. Because 31 target tags are selected in total, the value range of each element in the list is [0-30].

2. The list of content lengths written in the tags represents the content length contained in each target tag in the HTML code. To filter out abnormal information and a few boundary values, the minimum content length is set to 0 and the maximum to 150 (an empirical parameter), where the minimum value 0 indicates that the target tag has no text.

3. The list of link lengths written in the tags represents the link length contained in each target tag in the HTML code. To filter out abnormal information and a few boundary values, the minimum link length is set to 0 and the maximum to 150 (an empirical parameter), where the minimum value 0 indicates that the target tag contains no link.

4. The list of nesting levels of the tags in the HTML source code represents the nesting hierarchy to which each target tag in the HTML code belongs. Based on statistics of tag nesting, the value range of each element in the list can be set to 0-6, where 0 represents a root node in the nesting hierarchy and nesting deeper than 6 levels is recorded as 6.

The invention uses these four lists to generate gray-scale image data containing the webpage layout characteristics, and then uses a convolutional neural network to extract the visual layout features of the webpage from the image data. In the design of the gray-scale image data, the image must show the order of the selected tags in the original HTML code as well as their hierarchical relationship in the nested structure of the code. In addition, because different types of web pages differ markedly in the number of pictures and the amount of text in their core layout, and in the relationship between text and pictures, the gray-scale layout map generated from the HTML code must express the link length and the content length written in each selected tag, and the content length and the link length must be displayed distinguishably.

As shown by way of example in FIG. 1, the specified information of the target tags is extracted from the HTML source code to obtain an information matrix for the target tags, where the first column of the matrix represents the index number of the target tag, the second column represents the content length written in the target tag, the third column represents the link length written in the target tag, and the fourth column represents the nesting level of the target tag in the HTML source code. For example, if an extracted <p> tag (assuming index 7) contains text of length 4 and no link, and is at the second level of the HTML code hierarchy, then the element can be represented by the vector <7,4,0,1>.
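One way to obtain such rows from HTML source is sketched below with Python's standard `html.parser` module. Several details are assumptions made for illustration: the tag-to-index table is a small hypothetical subset, the content length is taken as the length of the stripped text directly inside a tag, the link length as the length of its `href` or `src` attribute, and the nesting level as the number of open ancestor elements in the parsed fragment.

```python
from html.parser import HTMLParser

# Illustrative subset of target tags with assumed index numbers.
TAG_INDEX = {"a": 2, "div": 3, "p": 7, "img": 15}
# Elements without an end tag are never pushed onto the open-element stack.
VOID_TAGS = {"img", "hr", "input", "link", "br", "meta"}

class TagRowExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []     # [index, content_len, link_len, depth] in source order
        self.stack = []    # (tag, row or None) for every currently open element

    def handle_starttag(self, tag, attrs):
        row = None
        if tag in TAG_INDEX:
            attr = dict(attrs)
            link = attr.get("href") or attr.get("src") or ""
            depth = min(len(self.stack), 6)              # levels beyond 6 are recorded as 6
            row = [TAG_INDEX[tag], 0, min(len(link), 150), depth]
            self.rows.append(row)
        if tag not in VOID_TAGS:
            self.stack.append((tag, row))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1][1] is not None:
            row = self.stack[-1][1]                      # innermost open element is a target tag
            row[1] = min(row[1] + len(text), 150)

def extract_rows(html):
    parser = TagRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows

rows = extract_rows('<div><p>text</p><a href="/abc"></a></div>')
```

With this hypothetical fragment, the `<p>` element yields the row [7, 4, 0, 1], matching the vector in the example above.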

To convert the tag information in the HTML source code into a visual picture containing webpage layout information, each row of the matrix must be represented by visual pixels; FIG. 1 illustrates the conversion between the two matrices. First, white pixels are filled in, the fill length being the depth value of the nesting level; next, pixels with the gray value equal to the tag index number are filled in, the fill length being the content length value + 1; next, pixels with the gray value equal to the tag index + 31 are filled in, the fill length being the link length; finally, white pixels are filled in to pad the row to the width of the image.

In FIG. 1, the lower left shows the image pixel distribution generated from the four lists; the pixel values in each row of the matrix are arranged in the order hierarchical relationship, text length, link length. Reading by pixel row, the number of leading 255 values represents the nesting level, and a root node has no leading 255 value. In the visualization result on the right, these appear as leading white lines: for example, in rows 2, 4, and 5 of the matrix, the text pixel block is preceded by a white line whose length represents the depth in the hierarchy.

The next pixel represents the tag index, and the run length of this pixel value represents the content length of the selected tag. For example, in row 6 of the matrix in FIG. 1, the index of the aside tag is 0 and is followed immediately by 255; the pixel is not extended, so the content length is 0. In row 5 of the matrix, the index of the a tag is 2, the hierarchy depth is 1, there is no text, and the link length is 4, so 4 pixels are filled with the gray value 2 + 31 = 33. The remaining empty positions in the matrix are filled with 255, i.e. white pixels, as shown in the last row of the matrix of FIG. 1.
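Reading a pixel row back into its four values follows directly from this layout. The decoder below is a sketch for illustration only; the function name and the convention of returning `None` for an all-white row are assumptions.

```python
WHITE = 255

def decode_row(pixels, n_tags):
    """Recover (tag_index, content_len, link_len, depth) from one pixel row.

    Returns None for a row that is entirely white (an empty padding row).
    """
    i = 0
    while i < len(pixels) and pixels[i] == WHITE:        # leading white pixels give the depth
        i += 1
    if i == len(pixels):
        return None
    depth = i
    tag_index = pixels[i]
    run = 0
    while i < len(pixels) and pixels[i] == tag_index:    # run of the index value: content_len + 1
        run += 1
        i += 1
    link_len = 0
    while i < len(pixels) and pixels[i] == tag_index + n_tags:   # run of the shifted value: link_len
        link_len += 1
        i += 1
    return (tag_index, run - 1, link_len, depth)

# Row 5 of the example: index 2, depth 1, no text, link length 4 (gray value 2 + 31 = 33).
row5 = decode_row([255, 2, 33, 33, 33, 33, 255, 255], n_tags=31)
```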

The invention classifies images using a convolutional neural network. First, the webpage source code is mapped into a data set of pictures with the same width and different heights using the data processing shown in FIG. 1, and this data serves as input to the classification model. FIG. 2 shows a schematic diagram of the structure of the classification model based on a convolutional neural network. Feature extraction is performed with convolution kernels to obtain layout features of the web page (for example, the text on the right is very wide, the link lengths are aligned, or the body text appears to the lower right of the title), and pooling is performed to retain the main features while reducing parameters, yielding a new feature group. Dimension reduction and feature extraction are then performed on the input image feature group by convolution; because the dimension of the feature image is still high at this point, pooling is applied again. Finally, the feature image passes through the fully connected layer to obtain the category corresponding to the image.
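The convolution and pooling stages can be illustrated on a tiny gray-scale image in plain Python. This is a didactic sketch rather than the model of FIG. 2: the kernel values, the pooling window, and the function names are assumptions, and a practical implementation would use a deep learning framework.

```python
def conv2d_valid(image, kernel):
    """2-D cross-correlation without padding, the operation used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool(feature, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    out_h = len(feature) // size
    out_w = len(feature[0]) // size
    return [
        [
            max(feature[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical kernel that responds to left-to-right changes in gray value.
edge_kernel = [[-1, 1], [-1, 1]]
short_page = [[255, 7, 7, 7, 255] for _ in range(4)]   # 4 rows, width 5
tall_page = [[255, 7, 7, 7, 255] for _ in range(6)]    # 6 rows, same width

short_features = max_pool(conv2d_valid(short_page, edge_kernel))
tall_features = max_pool(conv2d_valid(tall_page, edge_kernel))
```

The two inputs share a width but differ in height, so their feature maps also differ in height (1 row versus 2 rows here); a fully connected layer cannot consume such variable-size maps directly.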

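The parameters of the convolutional network in FIG. 2 are mainly convolution kernels, which can be applied to input of any size and produce output of a corresponding size, whereas the fully connected layer requires input of a fixed size. A pyramid pooling layer is therefore added to the classification model; it generates a fixed-size output for any input, which serves as the fixed-size input required by the fully connected layer. The pyramid pooling layer divides an image into a specified number of small blocks and then extracts features from each block and fuses them, so as to accommodate features at multiple scales. Finally, the extracted feature map passes through the fully connected layer to yield the classification result.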
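The fixed-size property of pyramid pooling can be shown with a small sketch. The grid levels (1, 2, 4), the use of the maximum as the per-block feature, and the function name are assumptions chosen for illustration; the description does not specify them.

```python
def pyramid_pool(feature, levels=(1, 2, 4)):
    """Pool a 2-D feature map into a vector of fixed length sum(n * n for n in levels).

    For each level n the map is divided into an n x n grid of blocks and the
    maximum of each block is kept; the per-level results are concatenated.
    """
    h, w = len(feature), len(feature[0])
    pooled = []
    for n in levels:
        for gy in range(n):
            r0 = (gy * h) // n                     # floor of the block's top edge
            r1 = ((gy + 1) * h + n - 1) // n       # ceiling of the block's bottom edge
            for gx in range(n):
                c0 = (gx * w) // n
                c1 = ((gx + 1) * w + n - 1) // n
                pooled.append(max(feature[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

# Two hypothetical feature maps with the same width but different heights.
short_map = [[r * 5 + c for c in range(5)] for r in range(3)]
tall_map = [[r * 5 + c for c in range(5)] for r in range(9)]
short_vec = pyramid_pool(short_map)
tall_vec = pyramid_pool(tall_map)
```

Both vectors have 1 + 4 + 16 = 21 entries regardless of the input height, which is what allows a fully connected layer of fixed input size to follow. When a map is smaller than the grid, neighbouring blocks overlap rather than becoming empty.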

The foregoing is only a preferred embodiment of the present invention; although the invention has been disclosed by way of preferred embodiments, they are not intended to limit it. Using the methods and techniques disclosed above, those skilled in the art can make many possible variations and modifications to the technical solution of the invention, or modify it into equivalent embodiments, without departing from its scope. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims

1. A method for rapidly classifying webpage topics based on HTML source code characteristics, characterized by comprising the following steps:

(1) parsing the webpage HTML source code to extract target information, wherein the extracted information comprises the tag identifier, the content length contained in the tag, the link length contained in the tag, and the nesting hierarchy to which the tag belongs;

(2) generating an information matrix for the target tags from the information extracted in step (1), and converting the information matrix into image data containing webpage layout characteristics, wherein each row of the matrix represents a selected target tag, and the columns comprise the level of the tag in the source code nesting hierarchy, the index number corresponding to the tag, the content length written in the tag, and the link length; each row of the information matrix is represented by visual pixels as follows: first, white pixels are filled in, the fill length being the level in the nesting hierarchy; next, pixels with the gray value equal to the tag index number are filled in, the fill length being the content length value + 1; next, pixels with the gray value equal to the tag index + n are filled in, where n is the number of selected target tags, the fill length being the link length; finally, white pixels are filled in to pad the row to the width of the image;

(3) the image data is learned and classified using a deep learning network.

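2. The method for rapidly classifying webpage topics based on HTML source code characteristics as claimed in claim 1, wherein the four lists of data in the matrix in step (2) are as follows:

tag index list: represents the order of the target tags in the HTML code, wherein the value range of each element in the list is 0 to n-1, and n is the number of selected target tags;

list of content lengths written in the tags: represents the content length contained in each target tag in the HTML code, wherein the minimum content length is set to 0 and the maximum to 150, and the minimum value 0 indicates that the target tag has no text;

list of link lengths written in the tags: represents the link length contained in each target tag in the HTML code, wherein the minimum link length is set to 0 and the maximum to 150, and the minimum value 0 indicates that the target tag contains no link;

list of nesting levels of the tags in the HTML source code: represents the nesting hierarchy to which each target tag in the HTML code belongs, wherein the value range of each element in the list is [0-6], 0 represents a root node in the nesting hierarchy, and nesting deeper than 6 levels is recorded as 6.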

3. The method according to claim 1, wherein in step (3), the web page classification problem is set as multi-classification of web page topics, and a convolutional neural network is used to perform web page layout feature extraction on image data generated by web page source codes.

4. The method of claim 3, wherein the activation function of the convolutional neural network is softmax, and the loss function of the classification model is log-likelihood.

5. The method of claim 3, wherein step (3) comprises: mapping the source code into a data set of pictures with the same width and different heights to serve as input to the classification model; performing feature extraction with convolution kernels to obtain layout features of the webpage, and performing pooling to retain the main features while reducing parameters, thereby obtaining a new feature group; performing dimension reduction and feature extraction on the input image feature group by convolution, and performing pooling to further reduce the dimension of the feature image; and finally, passing the feature image through the fully connected layer to obtain the category corresponding to the image.

6. The method for rapidly classifying webpage topics based on HTML source code characteristics as claimed in claim 5, wherein a pyramid pooling layer is added to the classification model to generate a fixed-size output for any input, which then serves as the fixed-size input required by the fully connected layer; and the pyramid pooling layer divides an image into a specified number of small blocks, extracts features from each block and fuses them, and the extracted feature map finally passes through the fully connected layer to yield the classification result.