CN112287274A

CN112287274A - Method, system and storage medium for classifying website list pages

Info

Publication number: CN112287274A
Application number: CN202011162449.XA
Authority: CN
Inventors: 孟剑; 樊晓然; 郭岩; 贺广福; 陈银鹏; 史存会; 俞晓明; 刘悦; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2021-01-29
Anticipated expiration: 2040-10-27
Also published as: CN112287274B

Abstract

The invention relates to a method for classifying website list pages, which comprises the following steps: step 100, acquiring a group of website page sets, wherein the website page sets belong to the same website; step 200, extracting webpage data features for each website page respectively; step 300, creating a global topological structure of the website through a hash table formed by a hyperlink list of the website pages and a matching relation between the link address (URL) of each website page and a node number; step 400, inputting the webpage data features and the global topological structure of the website into a graph convolution neural network to train the graph convolution neural network and obtain a website list page classifier; step 500, acquiring the website pages to be classified, obtaining the webpage data features of the website pages to be classified and the global topological structure of the website, inputting them into the website list page classifier, and judging whether the website pages to be classified are website list pages.

Description

Method, system and storage medium for classifying website list pages

Technical Field

The invention relates to the technical field of webpage classification, in particular to a Board page classification method and system based on network structure characteristics.

Background

With the development of the internet in recent years, the network has become the largest data source, and collecting data from the internet has long been a task of interest. One common acquisition mode is customized acquisition, i.e., customized development is performed for one specific website or a group of specific websites: the link structure of the website is analyzed, and a data extraction method is then constructed according to the page and network characteristics of the website.

Data on the internet can be divided into different information sources, such as news, forums and blogs, according to how the data is released and interacted with. Each information source has a specific format. For a news data source, for example, the data comprises the news text, news authors, news titles, news comments and the like, and each news page belongs to a category. A forum is likewise divided into sections, and its data includes the main posts, the reply posts and the like. Developing a customized collector for each information source, or even for each website, necessarily results in collectors that cannot be reused, which wastes development effort. Research on a large number of websites from multiple information sources shows that although the network data structures of different information sources take different forms, they share certain common characteristics. For example, a website in a news information source, whether on a content category page or on the website home page, has pages similar to a list: such a page directly and explicitly lists links to related news articles according to a certain rule and, depending on the number of articles under that rule, also carries page-turning links that help to obtain more articles. A website in a blog information source may have a similar structure, most prominently a personal home page or personal timeline. Similar structures also exist for websites in forum information sources.

This structure can be generalized as a Board-Article structure, in which the list page is called the Board page and the actual data page to be collected is called the Article page. Board pages are usually theme-dependent, i.e., all the Article page links on a Board page tend to surround a uniform theme or share uniform strong features. This characteristic ensures that the data under a required theme can be captured through one Board page, so that the collection of redundant data is avoided. With the Board page as the portal page, the Board page and its Article pages form a tree structure instead of an open graph structure, so that data changes can be sensed by scanning the Board page. By analyzing the Board page, changes in the data can be obtained easily, so that the data can be tracked more efficiently. Therefore, how to find the Board pages of a website becomes a problem that customized acquisition must solve.

The Board page is mainly found by the following methods:

(1) Based on manual work: i.e., manually screening out the Board pages from the website. Because web pages are highly diverse, manually screening Board pages is expensive, especially for large-scale websites. Meanwhile, frequent website redesigns make the Board pages unstable, so the Board pages must be re-screened at additional manual cost.

(2) Based on the rule: the experience of manually screening the Board page is converted into rules, and a simulator discovers the Board page from a website based on the rules. Similarly, web pages have significant diversity, so that the rule-based method has the inherent defect of weak generalization capability, and the recall rate and accuracy of the Board page cannot be guaranteed.

Therefore, the existing Board page discovery method is mainly based on manual work and rules, mainly depends on intuitive cognition of people on the Board page, cannot fully utilize various features of the Board page, particularly some hidden regular features, so that the generalization capability of the method is weak, the recall rate and accuracy of the Board page cannot be further ensured, and the quality of data acquired in a customized mode can be influenced to a great extent.

Disclosure of Invention

In order to solve the above technical problems, the present invention aims to provide a method and a system for classifying a Board page based on network structure features, which better utilize various features of the page, especially the global structure features of a website, and better capture various implicit features of the Board page by using a graph convolution neural network model, and have better generalization capability.

Specifically, the invention discloses a method for classifying website list pages, which is characterized by comprising the following steps:

step 100, acquiring a group of website page sets, wherein the website page sets belong to the same website;

step 200, extracting webpage data characteristics aiming at each website page respectively, wherein the webpage data characteristics comprise webpage link address (URL) characteristics, Document Object Model (DOM) tree structure characteristics and webpage visual characteristics;

step 300, creating a global topology structure of the website through a hash table formed by a hyperlink list of the website page and a matching relation between a link address (URL) of the website page and a node number;

step 400, inputting the web page data characteristics and the global topological structure of the website into a graph convolution neural network to train the graph convolution neural network to obtain a website list page classifier;

step 500, acquiring a website page to be classified, respectively acquiring the web page data feature of the website page to be classified and the global topology structure of the website according to the step 200 and the step 300, inputting the web page data feature of the website page to be classified and the global topology structure of the website into the website list page classifier acquired in the step 400, and judging whether the website page to be classified is a website list page.

According to the classification method, the global topological structure of the website is represented as an adjacency matrix A, and the adjacency matrix A is a sparse matrix; the web page data features are represented as a feature matrix X.

The classification method is characterized in that the graph convolution neural network is provided with a spectrogram convolution module for semi-supervised classification of the website pages, and the spectrogram convolution module comprises:

the first spectrogram convolution module comprises a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and

the second spectrogram convolution module comprises a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism;

wherein the first spectrogram convolution layer has a first parameter matrix W⁽⁰⁾ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix W⁽¹⁾ for mapping the hidden layer representation of the website page to a corresponding output.

The classification method according to the above, wherein the graph convolution neural network further comprises an output module connected to the spectrogram convolution module, and the output module is a softmax layer.

According to the classification method, the training formula of the graph convolution neural network is as follows:

$$f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein $\hat{A}$ is the normalized adjacency matrix.

According to the classification method, in the training process of the graph convolution neural network, the first parameter matrix W⁽⁰⁾ and the second parameter matrix W⁽¹⁾ are each updated by the gradient descent method.
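As a hedged illustration of this update, one common form of a gradient descent step is written below. The learning rate $\eta$, the loss $\mathcal{L}$, the set of labeled pages $\mathcal{V}_L$, the one-hot labels $Y$ and the softmax output $Z$ are all assumptions introduced for this sketch; the text only states that both parameter matrices are updated by gradient descent and does not name a loss function.

```latex
W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial \mathcal{L}}{\partial W^{(l)}},
\qquad l \in \{0, 1\},
\qquad
\mathcal{L} = - \sum_{i \in \mathcal{V}_L} \sum_{c \in \{0, 1\}} Y_{ic} \, \ln Z_{ic}
```

Here the cross-entropy sum runs only over the labeled pages, which is one usual way to realize semi-supervised node classification; other losses or optimizers would fit the same description.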

To achieve another object of the present invention, the present invention further provides a system for classifying web site list pages, including:

a webpage acquisition module, used for acquiring a group of website page sets, wherein the website page sets belong to the same website;

the feature extraction module is used for extracting webpage data features for each website page respectively, and creating a global topological structure of the website through a hash table formed by a hyperlink list of the website pages and a matching relation between a link address (URL) of the website pages and a node number;

and the webpage classification module is provided with a pre-trained graph convolution neural network classification model, and the graph convolution neural network classification model is used for judging whether the website webpage is a website list page or not according to the webpage data characteristics and the global topological structure of the website.

The classification system as described above, wherein the web page data features include web page link address (URL) features, tree structure features of a Document Object Model (DOM), and web page visual features.

The classification system according to the above, wherein the classification system further comprises a training module for training the graph convolution neural network.

The classification system according to the above, wherein the graph convolution neural network has a spectrogram convolution module for semi-supervised classification of the website pages, and the spectrogram convolution module comprises the first spectrogram convolution module and the second spectrogram convolution module described above for the classification method.

The classification system according to the above, wherein the graph convolution neural network further comprises an output module connected to the spectrogram convolution module, and the output module is a softmax layer.

According to the classification system, the training formula of the graph convolution neural network is as follows:

$$f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein X is a feature matrix representing the webpage data features; A is an adjacency matrix representing the global topological structure of the website; and $\hat{A}$ is the normalized adjacency matrix.

To achieve another object of the present invention, the present invention further provides a computer-readable storage medium, on which an implementation program for information transfer is stored, the program, when executed by a processor, implementing the steps of the classification method according to any one of the above.

In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

Drawings

FIG. 1 is a flowchart of a method for classifying web site list pages (Board pages) according to an embodiment of the present invention;

FIG. 2 is an architecture diagram of the graph convolution neural network (GCN) model;

FIG. 3 is a block diagram of a classification system for website list pages according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It is to be understood that the embodiments described below are only a few embodiments of the present invention, and not all embodiments.

In addition, the descriptions related to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when the combined technical solutions are contradictory or cannot be realized, such a combination should be considered not to exist and falls outside the protection scope of the present invention.

The invention addresses the problems that existing Board page classification methods cannot fully utilize the various features of the Board page and have weak generalization capability. The invention extracts page features from multiple angles and uses the page data features and the website global topological structure as the input of a graph convolution neural network; during convolution, the graph convolution neural network model can extract the neighborhood information of a web page in the topological structure. A Board page discovery method and system based on network structure features are thereby provided.

In the present invention, the following assumptions exist for the website, the page and their relationship to each other:

1. A website is made up of pages, and a page has one or more identifiers, i.e., URLs. Each URL corresponds to exactly one page.

2. The page itself is composed of HTML, and the HTML contains nodes, node attributes, node contents (text) and node styles, wherein URLs of other pages may appear in the node attributes.

3. A page is considered to point to another page when the current page contains the URL of that other page.

Based on the above assumptions, the website itself constitutes a directed graph, the website pages are nodes in the graph, each node has different characteristics, and the link relationship between the pages described in point 3 serves as an edge, and forms the graph together with the nodes.
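The directed graph described above can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming that links appear as `href` attributes of `<a>` tags; the names `LinkCollector` and `out_links` and the `example.com` pages are hypothetical and do not come from the source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags (URLs stored in node attributes)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def out_links(page_url, html):
    """Return the absolute URLs that the page at page_url points to."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute != page_url and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical two-page site used only for illustration.
pages = {
    "http://example.com/news/": '<a href="a1.html">A1</a> <a href="/news/a2.html">A2</a>',
    "http://example.com/news/a1.html": '<a href="/news/">back</a>',
}

# Each (source, target) pair is one directed edge of the website graph.
edges = [(src, dst) for src, html in pages.items() for dst in out_links(src, html)]
```

In this sketch the list page has two outgoing edges and the article page has one edge back to the list page, which mirrors the pointing relationship of assumption 3.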

In the invention, the characteristics of the Board page are abstracted into the following definition: the Board page is an inherent part of the website and can be regarded as a node in the network formed by the website; the Board page is an aggregate page of the Article pages to which it points, so the node corresponding to a Board page has specific network structure characteristics in the network.

Since the website itself is a graph, i.e., the relational data composed of web pages and the hyperlinks between them can be abstracted into a graph, it is reasonable to use a graph-based network method to find Board pages in a website. Specifically, the Board page is an inherent part of the website and points to Article pages, so the position of the page among the network nodes formed by the website can be utilized; because the Board page points to the Article pages, an Article page has little "exclusive" out-degree compared with the Board page. Page data features are extracted from multiple angles, including the URL, the DOM tree structure, the HTML tags, the visual style of the web page and the topological structure. Massive web page data contains latent regularities: from the topological point of view, pages in different sections and pages under the same section are connected by hyperlinks. The Board page corresponds to a hub in the graph, and its connection information is more complex than that of a non-Board page. Therefore, for Board page discovery it is particularly important to construct the topology formed by the link information of the web pages, i.e., the topology of the website. Accordingly, the invention creates the website global topological structure formed by the web page link information by using a hash table formed by the hyperlink lists of the web page data in the website and the matching between URLs and node numbers.

The page data features and the website global topological structure are used as the input of a graph convolution neural network, and the Board page discovery problem is converted into a node classification problem on the graph, whereby the web pages of the website are divided into two types: Board pages and non-Board pages.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for classifying a website list page (Board page) according to an embodiment of the present invention. As shown in fig. 1, the classification method includes the steps of:

Step 100, a group of website page sets is obtained; in this embodiment, the website page sets belong to the same website. Specifically, in this embodiment, a web page is obtained mainly by fetching its HTML source code according to the web page link.

Step 200, extracting webpage data characteristics aiming at each website page respectively, wherein the webpage data characteristics comprise webpage link address (URL) characteristics, Document Object Model (DOM) tree structure characteristics and webpage visual characteristics. Thus, various data characteristics of the page can be fully utilized. Specific contents of the three data characteristics are shown in table 1.

TABLE 1
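As an illustration of per-page feature extraction, the sketch below turns a URL and its HTML into a small numeric feature row. The specific features (URL path depth, digit count, tag count, anchor count, maximum DOM depth) are assumptions chosen for this example and are not the features listed in Table 1; visual features are omitted because they would require rendering the page. The names `DomStats`, `page_features` and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomStats(HTMLParser):
    """Counts tags and tracks nesting depth while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.anchor_count = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self.anchor_count += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def page_features(url, html):
    """Return one feature row (illustrative URL and DOM features only)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    stats = DomStats()
    stats.feed(html)
    return [
        float(len(segments)),                     # URL path depth
        float(sum(ch.isdigit() for ch in path)),  # digits in the URL path
        float(stats.tag_count),                   # number of DOM elements
        float(stats.anchor_count),                # number of <a> elements
        float(stats.max_depth),                   # maximum DOM nesting depth
    ]


html = ("<html><body><ul><li><a href='/a/1.html'>one</a></li>"
        "<li><a href='/a/2.html'>two</a></li></ul></body></html>")
row = page_features("http://example.com/news/list/", html)
```

Stacking one such row per page would give an n × d feature matrix; which features are actually useful for separating list pages from content pages is an empirical question that this sketch does not answer.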

Step 300, creating a global topological structure of the website through a hash table formed by a hyperlink list of the website page and a matching relation between a link address (URL) of the website page and a node number, so that the global topological characteristics of the website can be fully utilized. That is, the hyperlink lists in the captured web page data of the website, together with a hash table that matches each URL to a node number, are used to create the topological structure formed by the web page link information of the website. Taking one network graph as an example, an adjacency matrix of size 771 × 771 is obtained, and this matrix is a sparse matrix.
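A minimal sketch of this step is given below, assuming each page's hyperlink list is already available. A Python `dict` plays the role of the hash table from URL to node number, and the sparse adjacency matrix is kept in coordinate form as two index lists. The function name `build_topology`, the decision to skip links that leave the crawled page set, and the example URLs are assumptions of this sketch.

```python
def build_topology(link_lists):
    """Build the URL -> node-number table and a sparse adjacency matrix.

    link_lists maps each crawled page URL to the list of URLs it links to.
    Returns (node_of, rows, cols), where A[rows[k]][cols[k]] = 1 for each k.
    """
    node_of = {}  # hash table: URL -> node number
    for url in link_lists:
        node_of.setdefault(url, len(node_of))

    rows, cols = [], []  # coordinate (COO) form of the sparse matrix A
    seen = set()
    for src, targets in link_lists.items():
        for dst in targets:
            if dst not in node_of:
                continue  # link leaves the crawled page set; ignored here
            edge = (node_of[src], node_of[dst])
            if edge not in seen:  # store each directed edge only once
                seen.add(edge)
                rows.append(edge[0])
                cols.append(edge[1])
    return node_of, rows, cols


# Hypothetical four-page site used only for illustration.
link_lists = {
    "http://example.com/": ["http://example.com/news/", "http://example.com/about.html"],
    "http://example.com/news/": [
        "http://example.com/news/a1.html",
        "http://example.com/",
        "http://other.example.org/",
    ],
    "http://example.com/news/a1.html": ["http://example.com/news/"],
    "http://example.com/about.html": [],
}
node_of, rows, cols = build_topology(link_lists)
```

Only the nonzero entries are stored, so memory grows with the number of edges rather than with n × n. The same index lists could be handed to a sparse-matrix library if one is available.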

Step 400, inputting the web page data characteristics and the global topological structure of the website into a graph convolution neural network for training the graph convolution neural network to obtain a website list page classifier; the neural network model has natural learning ability, can mine hidden features, and has strong generalization ability. Therefore, the Board page discovery method using the graph convolution neural network can better capture various implicit characteristics of the Board page, and particularly has better generalization capability by using the global structure characteristics of a website.
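Training a node classifier in a semi-supervised setting usually means that the loss is computed only on the pages that carry a label, while unlabeled pages still take part in the graph convolution. The sketch below shows one such loss. The choice of cross-entropy, the function name `masked_cross_entropy` and the use of `None` for unlabeled pages are assumptions; the text does not name a loss function.

```python
import math


def masked_cross_entropy(probs, labels):
    """Average negative log-likelihood over labeled nodes only.

    probs  : per-node [p_non_board, p_board] rows, each summing to 1
    labels : per-node 0 (non-Board), 1 (Board) or None (unlabeled)
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y is not None]
    if not losses:
        raise ValueError("at least one labeled node is required")
    return sum(losses) / len(losses)


# Three pages: the first two are labeled, the third is unlabeled.
probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
labels = [1, 0, None]
loss = masked_cross_entropy(probs, labels)
```

Because the third page is unlabeled, changing its predicted probabilities leaves the loss unchanged; only the labeled pages drive the parameter updates.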

The graph convolution neural network model (GCN) used in this embodiment is based on a first-order approximation of spectral graph convolution (referred to herein as spectrogram convolution), and two spectrogram convolution layers are constructed for semi-supervised node classification on the graph, so as to implement Board page discovery. The network model framework is given by the following formula:

$$f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein $\hat{A}$ is the normalized adjacency matrix, X is the feature matrix of the nodes in the graph, W⁽⁰⁾ and W⁽¹⁾ are parameter matrices, and ReLU(·) represents the activation function. The global topological structure of the website is represented as an adjacency matrix A, and the adjacency matrix A is a sparse matrix; the web page data features are represented as the feature matrix X.
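The two-layer forward pass can be sketched in plain Python as below. Several points are assumptions of this sketch and not statements from the source: the normalization is taken to be the commonly used D^(-1/2) (A + I) D^(-1/2) applied to a symmetrized adjacency matrix, dense lists are used instead of a sparse matrix purely for readability, and the weights are arbitrary fixed numbers instead of trained values. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def normalize_adjacency(a):
    """Assumed normalization: D^-1/2 (A + I) D^-1/2 on a symmetrized A."""
    n = len(a)
    a_tilde = [[1.0 if i == j else float(bool(a[i][j] or a[j][i])) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(a_hat, x, w0, w1):
    """softmax(A_hat * ReLU(A_hat * X * W0) * W1), one row per page."""
    hidden = relu(matmul(a_hat, matmul(x, w0)))
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))


# Toy graph: page 0 links to pages 1 and 2 (illustrative values only).
a = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w1 = [[0.4, -0.4], [-0.3, 0.3]]
a_hat = normalize_adjacency(a)
out = gcn_forward(a_hat, x, w0, w1)
```

Each row of `out` is a two-class probability distribution for one page. Multiplying by the normalized adjacency matrix is what mixes each page's features with those of its neighbors.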

FIG. 2 is an architecture diagram of the GCN model. The adjacency matrix A input into the GCN model is represented as a sparse matrix, and its memory consumption is O(|H|), where |H| is the number of edges in the graph. The dimension of the feature matrix X is n × d, meaning that each of the n nodes in the graph has a d-dimensional feature. The graph convolution neural network is provided with an input module IM and a spectrogram convolution module GM. In the spectrogram convolution process, two GCN convolution modules with different dimensions are used for semi-supervised classification of the website pages, and each convolution module comprises a GCN convolution layer, a ReLU activation function and Dropout. That is, the spectrogram convolution module GM comprises: a first spectrogram convolution module GM1, including a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and a second spectrogram convolution module GM2, including a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism. The first spectrogram convolution layer has a first parameter matrix W⁽⁰⁾ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix W⁽¹⁾ for mapping the hidden layer representation of the website page to a corresponding output. W⁽⁰⁾ ∈ R^(C×H) is the weight matrix of the first spectrogram convolution layer and maps the feature representation of a node to the corresponding hidden layer state. W⁽¹⁾ ∈ R^(H×F) is the weight matrix of the second spectrogram convolution layer and maps the hidden layer representation of a node to the corresponding output. The parameter matrices W⁽⁰⁾ and W⁽¹⁾ are trained, and their parameters updated, by the gradient descent method.

In addition, a dropout mechanism is adopted in the training process: in each iteration, a portion of the connections between two layers is randomly discarded, i.e., their values are forcibly set to 0, which speeds up network training and improves the generalization capability of the model. Furthermore, the graph convolution neural network is also provided with an output module OM, which is a softmax layer connected after the spectrogram convolution module; the representation of each node is processed by the softmax function to obtain the prediction result for each web page, so that the class of each web page is predicted. The output module OM outputs the classification result as a set of classification labels, for example [0, 1], where 0 denotes a non-list page and 1 denotes a list page. Specifically, the output is a probability result over these two labels; for example, [0.1, 0.9] indicates a 10% probability that the page is a non-list page and a 90% probability that it is a list page. This is a probability distribution: every value lies between 0 and 1, and the values sum to 1.
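The dropout step and the softmax output can be illustrated with a short sketch. The dropout rate of 0.5, the rescaling of the surviving values (often called inverted dropout) and the function names are assumptions; the text only states that part of the values is randomly set to 0 during training and that a softmax layer produces the two class probabilities.

```python
import math
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale survivors (assumed)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return (label, probabilities); label 0 = non-list page, 1 = list page."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, rate=0.5, rng=rng)  # used during training only

# Logits chosen so that the probabilities come out close to [0.1, 0.9].
label, probs = classify([0.0, math.log(9.0)])
```

With these logits the page is assigned label 1, i.e., a list page, matching the [0.1, 0.9] example in the text. At prediction time dropout would be switched off and only the softmax step applied.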

Step 500, acquiring the website pages to be classified, respectively acquiring the webpage data features and the global topology structure of the website pages to be classified according to step 200 and step 300, inputting the webpage data features and the global topology structure of the website pages to be classified into the website list page classifier acquired in step 400, and judging whether the website pages to be classified are website list pages.

Based on the same inventive concept, the present invention further provides a classification system 600 for website list pages. FIG. 3 shows a block diagram of the classification system for website list pages according to an embodiment of the present invention. As shown in FIG. 3, the system includes:

a web page obtaining module 610, configured to obtain a group of website page sets, where the website page sets belong to a same website;

a feature extraction module 620, configured to extract web page data features for each website page, respectively, where the web page data features include a web page link address (URL) feature, a tree structure feature of a Document Object Model (DOM), and a web page visual feature, and create a global topology structure of the website through a hash table formed by a hyperlink list of the website page and a matching relationship between a link address (URL) of the website page and a node number;

the web page classification module 630 is provided with a pre-trained graph convolution neural network classification model, and the graph convolution neural network classification model is used for judging whether the web page of the website is a website list page according to the web page data characteristics and the global topology structure of the website.

Further, the classification system further includes a training module 640 for training the graph convolution neural network.

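The graph convolution neural network is provided with a spectrogram convolution module used for semi-supervised classification of the website pages, and the spectrogram convolution module comprises the first spectrogram convolution module GM1 and the second spectrogram convolution module GM2 described above.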

The graph convolution neural network further comprises an output module which is connected with the spectrogram convolution module, and the output module is a softmax layer.

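The training formula of the graph convolution neural network is as follows:

$$f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein $\hat{A}$ is the normalized adjacency matrix.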

Based on the same inventive concept, the present invention also provides a computer-readable storage medium, on which an information transfer implementation program is stored, which, when being executed by a processor, implements the steps of any one of the classification methods described above.

The method extracts page features from multiple angles and takes the page data features and the website global topological structure as the input of the graph convolution neural network, and the graph convolution neural network model can extract the neighborhood information of a web page in the topological structure during convolution. Therefore, compared with existing Board page discovery methods, the present method makes better use of the various characteristics of the pages, in particular the link relations among the pages of the website and the global topological structure of the website, and uses the graph convolution neural network model to better capture the various implicit characteristics of the Board page, thereby supporting Board page identification; the model thus has better universality and generalization capability while improving recognition accuracy.

The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it should be understood that various changes and modifications can be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for classifying web site list pages is characterized by comprising the following steps:

2. The classification method according to claim 1, wherein the global topology of the website is represented as an adjacency matrix A, which is a sparse matrix; the web page data features are represented as a feature matrix X.

3. The classification method according to claim 2, wherein the graph convolution neural network is provided with a spectrogram convolution module for semi-supervising classification of the website pages, and the spectrogram convolution module comprises:

4. The classification method according to claim 3, further comprising an output module connected to the spectrogram convolution module, wherein the output module is a softmax layer.

5. The classification method according to claim 4, wherein the training formula of the graph convolution neural network is as follows:

$$f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein $\hat{A}$ is the normalized adjacency matrix.

6. The classification method according to claim 4 or 5, wherein, in the training process of the graph convolution neural network, the first parameter matrix W⁽⁰⁾ and the second parameter matrix W⁽¹⁾ are each updated by the gradient descent method.

7. A system for classifying web site listing pages, the system comprising:

8. The classification system according to claim 7, wherein the web page data features include web page link address (URL) features, tree structure features of a Document Object Model (DOM), and web page visual features.

9. The classification system as recited in claim 8, further comprising a training module for training the graph convolution neural network.

10. The classification system according to claim 9, wherein the graph convolution neural network has a spectrogram convolution module thereon for semi-supervising classification of the website pages, the spectrogram convolution module comprising:

11. The classification system according to claim 10, further comprising an output module connected to the spectrogram convolution module, wherein the output module is a softmax layer.

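12. The classification system according to claim 11, wherein the training formula of the graph convolution neural network is:

$$f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$

wherein $\hat{A}$ is the normalized adjacency matrix.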

13. A computer-readable storage medium, characterized in that it has stored thereon a program for implementing the transfer of information, which program, when being executed by a processor, implements the steps of the classification method according to any one of claims 1 to 6.