CN112287274A - Method, system and storage medium for classifying website list pages - Google Patents

Publication number
CN112287274A
Authority
CN
China
Prior art keywords
website
page
spectrogram
module
pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011162449.XA
Other languages
Chinese (zh)
Other versions
CN112287274B (en)
Inventor
孟剑
樊晓然
郭岩
贺广福
陈银鹏
史存会
俞晓明
刘悦
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202011162449.XA
Publication of CN112287274A
Application granted
Publication of CN112287274B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for classifying website list pages, which comprises the following steps: step 100, acquiring a group of website page sets, wherein the website page sets belong to the same website; step 200, extracting web page data features for each website page respectively; step 300, creating a global topological structure of the website through a hash table formed by a hyperlink list of the website pages and a matching relation between the link address (URL) of each website page and a node number; step 400, inputting the web page data features and the global topological structure of the website into a graph convolution neural network to train the graph convolution neural network and obtain a website list page classifier; step 500, acquiring a website page to be classified, inputting the web page data features of the website page to be classified and the global topological structure of the website into the website list page classifier, and judging whether the website page to be classified is a website list page.

Description

Method, system and storage medium for classifying website list pages
Technical Field
The invention relates to the technical field of webpage classification, in particular to a Board page classification method and system based on network structure characteristics.
Background
With the development of the internet in recent years, the web has become the largest data source, and collecting data from the internet has long been a task of interest. One common acquisition mode is customized acquisition: a collector is developed for one specific website or one specific class of websites, the link structure of the website is analyzed, and a data extraction method is then constructed according to the page and network characteristics of the website.
Data on the internet can be divided into different information sources, such as news, forums and blogs, according to how the data is published and interacted with, and each information source has a specific format. For a news source, for example, the data comprises the news text, news authors, news titles, news comments and the like, and each news page belongs to a category. A forum is likewise divided into sections, and its data includes the main posts of the forum, the reply posts and the like. Developing a customized collector for each information source, or even for each website, necessarily yields collectors that cannot be reused, which wastes development effort. Research on a large number of websites across multiple information sources shows that, although the network data structures of different information sources take different forms, they share certain common characteristics. For example, a website in a news information source, whether organized by content category or viewed from its home page, has pages that resemble a list. Such a page directly and explicitly lists links to related news articles according to a certain rule and, depending on the number of articles covered by that rule, also has page-turning links that help to obtain more articles. A website in a blog information source may have a similar structure, most prominently in a personal home page or personal timeline. Similar structures also exist for websites in forum information sources.
This structure can be generalized as a Board-Article structure, in which the list page is called the Board page and the actual data page to be collected is called the Article page. Board pages are usually theme-dependent, i.e. all the Article page links on a Board page tend to revolve around a single theme or share strong common features. This characteristic ensures that the data under a required theme can be captured through one Board page, so that the collection of redundant data is avoided. Because the Board page serves as a portal page and forms a tree structure with its Article pages, rather than an open graph structure, data changes can be sensed by scanning the Board page. By analyzing the Board page, changes in the data can be easily obtained, so that the data can be tracked more efficiently. Therefore, how to find the Board pages of a website is a problem that customized acquisition must solve.
The Board page is mainly found by the following methods:
(1) Based on manual work: the Board pages are manually screened out from the website. Because web pages are highly diverse, manually screening Board pages is very expensive, especially for large-scale websites. Meanwhile, frequent redesigns of a website make its Board pages unstable, so the Board pages must be re-screened at additional manual cost.
(2) Based on rules: the experience of manually screening Board pages is converted into rules, and a simulator discovers the Board pages of a website based on those rules. Because web pages are highly diverse, the rule-based method has the inherent defect of weak generalization capability, and the recall rate and accuracy of Board page discovery cannot be guaranteed.
Therefore, the existing Board page discovery methods are mainly based on manual work and rules and rely mainly on people's intuitive understanding of Board pages. They cannot fully utilize the various features of Board pages, particularly some hidden regular features, so their generalization capability is weak and the recall rate and accuracy of Board page discovery cannot be ensured, which can greatly affect the quality of the data obtained by customized acquisition.
Disclosure of Invention
In order to solve the above technical problems, the present invention aims to provide a method and a system for classifying a Board page based on network structure features, which better utilize various features of the page, especially the global structure features of a website, and better capture various implicit features of the Board page by using a graph convolution neural network model, and have better generalization capability.
Specifically, the invention discloses a method for classifying website list pages, which is characterized by comprising the following steps:
step 100, acquiring a group of website page sets, wherein the website page sets belong to the same website;
step 200, extracting webpage data characteristics aiming at each website page respectively, wherein the webpage data characteristics comprise webpage link address (URL) characteristics, Document Object Model (DOM) tree structure characteristics and webpage visual characteristics;
step 300, creating a global topology structure of the website through a hash table formed by a hyperlink list of the website page and a matching relation between a link address (URL) of the website page and a node number;
step 400, inputting the web page data characteristics and the global topological structure of the website into a graph convolution neural network to train the graph convolution neural network to obtain a website list page classifier;
step 500, acquiring a website page to be classified, respectively acquiring the web page data feature of the website page to be classified and the global topology structure of the website according to the step 200 and the step 300, inputting the web page data feature of the website page to be classified and the global topology structure of the website into the website list page classifier acquired in the step 400, and judging whether the website page to be classified is a website list page.
According to the classification method, the global topological structure of the website is represented as an adjacency matrix A, and the adjacency matrix A is a sparse matrix; the web page data features are represented as a feature matrix X.
According to the classification method, the graph convolution neural network is provided with a spectrogram convolution module for semi-supervised classification of the website pages, and the spectrogram convolution module comprises:
the first spectrogram convolution module comprises a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and
the second spectrogram convolution module comprises a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism;
wherein the first spectrogram convolution layer has a first parameter matrix $W^{(0)}$ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix $W^{(1)}$ for mapping the hidden layer representation of the website page to a corresponding output.
The classification method according to the above, wherein the graph convolution neural network further comprises an output module connected to the spectrogram convolution module, and the output module is a softmax layer.
According to the classification method, the training formula of the graph convolution neural network is as follows:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
where $\hat{A}$ is the normalized adjacency matrix.
According to the classification method, during the training of the graph convolution neural network, the parameters of the first parameter matrix $W^{(0)}$ and the second parameter matrix $W^{(1)}$ are each updated by the gradient descent method.
To achieve another object of the present invention, the present invention further provides a system for classifying website list pages, including:
the web page acquisition module is used for acquiring a group of website page sets, wherein the website page sets belong to the same website;
the feature extraction module is used for extracting webpage data features for each website page respectively, and creating a global topological structure of the website through a hash table formed by a hyperlink list of the website pages and a matching relation between a link address (URL) of the website pages and a node number;
and the webpage classification module is provided with a pre-trained graph convolution neural network classification model, and the graph convolution neural network classification model is used for judging whether the website webpage is a website list page or not according to the webpage data characteristics and the global topological structure of the website.
The classification system as described above, wherein the web page data features include web page link address (URL) features, tree structure features of a Document Object Model (DOM), and web page visual features.
The classification system according to the above, wherein the classification system further comprises: and the training module is used for training the graph convolution neural network.
The classification system according to the above, wherein the graph convolution neural network has a spectrogram convolution module for semi-supervised classification of the website pages, and the spectrogram convolution module includes:
the first spectrogram convolution module comprises a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and
the second spectrogram convolution module comprises a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism;
wherein the first spectrogram convolution layer has a first parameter matrix $W^{(0)}$ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix $W^{(1)}$ for mapping the hidden layer representation of the website page to a corresponding output.
The classification system according to the above, wherein the graph convolution neural network further comprises an output module connected to the spectrogram convolution module, and the output module is a softmax layer.
According to the classification system, the training formula of the graph convolution neural network is as follows:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
where X is a feature matrix used for representing the web page data features; A is an adjacency matrix used for representing the global topological structure of the website; and $\hat{A}$ is the normalized adjacency matrix.
To achieve another object of the present invention, the present invention further provides a computer-readable storage medium, on which an implementation program for information transfer is stored, the program, when executed by a processor, implementing the steps of the classification method according to any one of the above.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
FIG. 1 is a flowchart of a method for classifying website list pages (Board pages) according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of the graph convolution neural network (GCN) model;
FIG. 3 is a block diagram of a classification system for website list pages according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It is to be understood that the embodiments described below are only a few embodiments of the present invention, and not all embodiments.
In addition, the descriptions related to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.
The invention addresses the problems that existing Board page classification methods cannot fully utilize the various features of Board pages and have weak generalization capability. The invention extracts page features from multiple angles and uses the page data features and the global topological structure of the website as the input of a graph convolution neural network; during convolution, the graph convolution neural network model can extract the neighborhood information of each web page in the topological structure. On this basis, the invention provides a Board page discovery method and system based on network structure features.
In the present invention, the following assumptions exist for the website, the page and their relationship to each other:
1. A website is made up of pages, and each page has one or more identifiers, i.e., URLs. Each URL corresponds to exactly one page.
2. The page itself is composed of HTML, and the HTML contains information on nodes, node attributes, node contents (text), and node styles, wherein URLs of other pages may exist in the node attributes.
3. A page is considered to point to another page when the URL of that other page appears in the current page.
Based on the above assumptions, the website itself constitutes a directed graph, the website pages are nodes in the graph, each node has different characteristics, and the link relationship between the pages described in point 3 serves as an edge, and forms the graph together with the nodes.
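The three assumptions above can be illustrated with a minimal sketch that turns a set of crawled pages into directed edges. It uses only the Python standard library; the class `LinkCollector`, the function `build_edges` and the example URLs are hypothetical names chosen for illustration and do not come from the patent.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" suffix.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def build_edges(pages):
    """pages: dict mapping URL -> HTML source of that page.

    Returns the sorted directed edges (source URL, target URL), keeping only
    targets that are themselves pages of the crawled site."""
    edges = set()
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        for target in collector.links:
            if target in pages and target != url:
                edges.add((url, target))
    return sorted(edges)


# Hypothetical three-page site: one list page pointing to two article pages.
pages = {
    "http://example.com/news/": '<a href="a1.html">One</a> <a href="/news/a2.html">Two</a>',
    "http://example.com/news/a1.html": '<a href="/news/">Back</a>',
    "http://example.com/news/a2.html": "<p>No links here</p>",
}
edges = build_edges(pages)
```

In this toy site the list page is the only node with more than one outgoing edge, which is the kind of structural signal the following paragraphs rely on.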
In the invention, the characteristics of the Board page are abstracted, namely the definition of the Board page: the Board page is naturally owned in the website and can be regarded as a node in a network formed by the website; the Board page is an aggregate page of the Article page to which it points, so the nodes corresponding to the Board page have specific network structure characteristics in the network.
Since the website itself is a graph, i.e. the relational data composed of web pages and their internal hyperlinks can be abstracted into a graph, it is reasonable to use a graph-based network method to find Board pages in a website. Specifically, the Board page naturally exists in the website and points to Article pages, so the position of a page among the network nodes formed by the website can be utilized; and because the Board page points to the Article pages, an Article page has little "exclusive" out-degree relative to the Board page. Page data features are extracted from multiple angles and include the URL, the DOM tree structure, HTML tags, web page visual styles and the topological structure. Massive web page data contains latent regularities: from a topological point of view, pages in different sections and pages within the same section are connected by hyperlinks. The Board page corresponds to a hub in the graph, and its connection information is more complex than that of a non-Board page. Therefore, constructing the topology formed by the link information of the web pages, i.e. the topology of the website, is particularly important for Board page discovery. Accordingly, the invention creates the global topological structure of the website, formed by the web page link information within the website, by using a hash table built from the hyperlink lists of the web page data and the matching between URLs and node numbers.
The page data features and the global topological structure of the website are used as the input of a graph convolution neural network, and the Board page discovery problem is converted into a node classification problem on the graph, so that the web pages of the website are divided into two types: Board pages and non-Board pages.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for classifying a website list page (Board page) according to an embodiment of the present invention. As shown in fig. 1, the classification method includes the steps of:
Step 100, acquiring a group of website page sets; in this embodiment, the website page sets belong to the same website. Specifically, in this embodiment, a web page is obtained mainly by fetching its HTML source code according to the web page link.
Step 200, extracting webpage data characteristics aiming at each website page respectively, wherein the webpage data characteristics comprise webpage link address (URL) characteristics, Document Object Model (DOM) tree structure characteristics and webpage visual characteristics. Thus, various data characteristics of the page can be fully utilized. Specific contents of the three data characteristics are shown in table 1.
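TABLE 1: specific contents of the URL features, DOM tree structure features and web page visual features (the table is an image in the original publication and is not reproduced in this text).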
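Since the contents of Table 1 are not available in this text, the following is only a hedged sketch of how one row of the feature matrix X might be assembled from URL features and DOM tree structure features. Every individual feature here (path depth, digits in the path, directory-like URL, tag count, nesting depth, anchor ratio) is a hypothetical choice made for illustration, not the feature set of the patent; visual features are omitted because they would require rendering the page.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class TagCounter(HTMLParser):
    """Counts tags and tracks the maximum nesting depth of a page's DOM."""

    VOID = {"br", "img", "input", "meta", "link", "hr"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.anchors = 0
        self.tags = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            self.anchors += 1
        if tag not in self.VOID:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1


def url_features(url):
    """Hypothetical URL features: path depth, digits in path, directory-like."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    return [
        float(len(segments)),
        float(bool(re.search(r"\d", parsed.path))),
        float(parsed.path.endswith("/") or not segments),
    ]


def dom_features(html):
    """Hypothetical DOM features: tag count, max depth, share of <a> tags."""
    counter = TagCounter()
    counter.feed(html)
    ratio = counter.anchors / counter.tags if counter.tags else 0.0
    return [float(counter.tags), float(counter.max_depth), ratio]


def page_features(url, html):
    """Concatenates the URL and DOM features into one row of X."""
    return url_features(url) + dom_features(html)


row = page_features(
    "http://example.com/news/",
    "<html><body><ul><li><a href='a1.html'>One</a></li></ul></body></html>",
)
```

Stacking one such row per page, in node-number order, would give an n × d feature matrix with d = 6 in this sketch.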
Step 300, creating a global topological structure of the website through a hash table formed by a hyperlink list of the website pages and a matching relation between the link address (URL) of each website page and a node number; in this way, the global topological characteristics of the website can be fully utilized. That is, the hyperlink lists in the captured web page data of the website, together with a hash table that matches each URL to a node number, are used to create the topological structure formed by the web page link information of the website. Taking cybergraphe as an example, an adjacency matrix of size 771 × 771 is obtained, and this matrix is a sparse matrix.
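Step 300 can be sketched as follows, assuming the hyperlink list of each page has already been extracted. The function names `build_topology` and `to_dense` and the example URLs are hypothetical; the adjacency-list dictionary is one simple way to hold a sparse matrix, not necessarily the storage format used by the inventors.

```python
def build_topology(hyperlinks):
    """hyperlinks: dict mapping a page URL to the list of URLs it links to.

    Returns (node_of, adjacency), where node_of is the URL -> node number hash
    table and adjacency maps each node number to the set of node numbers it
    points to (a sparse, adjacency-list form of the matrix A)."""
    node_of = {url: i for i, url in enumerate(hyperlinks)}
    adjacency = {i: set() for i in node_of.values()}
    for url, targets in hyperlinks.items():
        for target in targets:
            if target in node_of:  # ignore links that leave the crawled site
                adjacency[node_of[url]].add(node_of[target])
    return node_of, adjacency


def to_dense(adjacency):
    """Expands the sparse form into a dense 0/1 matrix (small graphs only)."""
    n = len(adjacency)
    matrix = [[0] * n for _ in range(n)]
    for src, dsts in adjacency.items():
        for dst in dsts:
            matrix[src][dst] = 1
    return matrix


# Hypothetical site: one list page, two article pages, one external link.
hyperlinks = {
    "/board": ["/a1", "/a2", "http://other.example/x"],
    "/a1": ["/board"],
    "/a2": [],
}
node_of, adjacency = build_topology(hyperlinks)
A = to_dense(adjacency)
```

Only the existing edges are stored, so memory grows with the number of links rather than with the square of the number of pages, which is why a sparse representation suits a matrix such as the 771 × 771 example.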
Step 400, inputting the web page data features and the global topological structure of the website into a graph convolution neural network to train the graph convolution neural network and obtain a website list page classifier. A neural network model learns from data, can mine hidden features, and has strong generalization capability. Therefore, a Board page discovery method that uses a graph convolution neural network can better capture the various implicit features of Board pages and, in particular by exploiting the global structure features of the website, achieves better generalization capability.
The graph convolution neural network (GCN) model used in this embodiment is based on a first-order approximation of spectral graph convolution (the "spectrogram convolution" referred to throughout this document). Two spectrogram convolution layers are constructed for semi-supervised node classification on the graph, thereby implementing Board page discovery. The network model is given by the following formula:
$$Z = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
where $\hat{A}$ is the normalized adjacency matrix, X is the feature matrix of the nodes in the graph, $W^{(0)}$ and $W^{(1)}$ are parameter matrices, and ReLU(·) denotes the activation function. The global topological structure of the website is represented as an adjacency matrix A, and the adjacency matrix A is a sparse matrix; the web page data features are represented as the feature matrix X.
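The two-layer forward pass described above can be sketched in NumPy as follows. The text only states that the adjacency matrix is normalized; the specific normalization used here, symmetric normalization with added self-loops, is an assumption borrowed from common GCN practice, and the dense arrays, random weights and function names are illustrative only.

```python
import numpy as np


def normalize_adjacency(A):
    """ASSUMED normalization: D^-1/2 (A + I) D^-1/2 with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def softmax(x):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def gcn_forward(A_hat, X, W0, W1):
    """Z = softmax(A_hat ReLU(A_hat X W0) W1), one probability row per node."""
    hidden = np.maximum(A_hat @ X @ W0, 0.0)  # first layer + ReLU
    return softmax(A_hat @ hidden @ W1)       # second layer + softmax output


# Toy graph: node 0 is linked with nodes 1 and 2 (treated as undirected here).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = rng.normal(size=(3, 4))   # 3 nodes, 4 illustrative features each
W0 = rng.normal(size=(4, 8))  # input features -> hidden layer
W1 = rng.normal(size=(8, 2))  # hidden layer -> 2 classes
A_hat = normalize_adjacency(A)
Z = gcn_forward(A_hat, X, W0, W1)
```

Each multiplication by the normalized adjacency matrix mixes a node's representation with those of its neighbors, so two layers let a page's prediction depend on pages up to two links away.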
FIG. 2 is an architecture diagram of the GCN model. The adjacency matrix A input into the GCN model is stored as a sparse matrix, so its memory consumption is O(|H|), where |H| is the number of edges in the graph. The feature matrix X has dimension n × d, meaning that each of the n nodes in the graph has a d-dimensional feature vector. The graph convolution neural network is provided with an input module IM and a spectrogram convolution module GM. In the spectrogram convolution process, two GCN convolution modules of different dimensions are used for semi-supervised classification of the website pages, and each convolution module comprises a GCN convolution layer, a ReLU activation function and Dropout. That is, the spectrogram convolution module GM comprises: a first spectrogram convolution module GM1, including a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and a second spectrogram convolution module GM2, including a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism. The first spectrogram convolution layer has a first parameter matrix $W^{(0)}$ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix $W^{(1)}$ for mapping the hidden layer representation of the website page to a corresponding output. Here $W^{(0)} \in \mathbb{R}^{C \times H}$ is the weight matrix of the first spectrogram convolution layer, which maps the feature representation of a node to the corresponding hidden layer state, and $W^{(1)} \in \mathbb{R}^{H \times F}$ is the weight matrix of the second spectrogram convolution layer, which maps the hidden layer representation of a node to the corresponding output. In these dimensions, C is the input feature dimension (the d above), H is the hidden layer dimension (not the edge count), and F is the output dimension. The parameters of $W^{(0)}$ and $W^{(1)}$ are updated by the gradient descent method during training.
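As a hedged illustration of how the two parameter matrices might be updated by gradient descent in a semi-supervised setting, the NumPy sketch below trains a two-layer GCN with a cross-entropy loss computed only on labeled nodes. The loss function, the hand-written gradients, the self-loop normalization, the toy graph and all hyperparameters are assumptions made for this example; dropout is left out to keep the gradients easy to follow.

```python
import numpy as np


def normalize_adjacency(A):
    """ASSUMED normalization: D^-1/2 (A + I) D^-1/2 with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def train_gcn(A_hat, X, labels, labeled_mask, hidden=8, lr=0.2, epochs=200, seed=0):
    """Plain gradient descent on W0 and W1; returns them with the loss history."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    W0 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    W1 = rng.normal(scale=0.1, size=(hidden, n_classes))
    Y = np.eye(n_classes)[labels]        # one-hot labels
    idx = np.flatnonzero(labeled_mask)   # nodes whose labels are known
    AX = A_hat @ X                       # constant across epochs
    losses = []
    for _ in range(epochs):
        pre = AX @ W0
        H = np.maximum(pre, 0.0)         # ReLU hidden representation
        AH = A_hat @ H
        Z = softmax(AH @ W1)
        losses.append(float(-np.log(Z[idx, labels[idx]]).mean()))
        # Gradient of the mean cross-entropy w.r.t. the logits; unlabeled
        # nodes contribute nothing (semi-supervised training).
        dlogits = (Z - Y) * labeled_mask[:, None] / len(idx)
        dW1 = AH.T @ dlogits
        dH = A_hat.T @ dlogits @ W1.T
        dW0 = AX.T @ (dH * (pre > 0))    # back through the ReLU
        W0 -= lr * dW0
        W1 -= lr * dW1
    return W0, W1, losses


# Toy graph: nodes 0 and 3 act as list pages, the others as article pages.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (3, 4), (3, 5), (0, 3)]:
    A[i, j] = A[j, i] = 1.0
X = np.array([[1.0, 1.0], [0.1, 1.0], [0.2, 1.0], [0.9, 1.0], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([1, 0, 0, 1, 0, 0])
labeled_mask = np.array([True, True, False, True, False, True])
W0, W1, losses = train_gcn(normalize_adjacency(A), X, labels, labeled_mask)
```

Nodes 2 and 4 carry no label during training but still influence the result, because their features flow into their neighbors through the normalized adjacency matrix.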
In addition, a dropout mechanism is adopted during training: in each iteration, a portion of the connections is randomly discarded, or equivalently their values are forced to 0, which accelerates network training and improves the generalization capability of the model. Furthermore, the graph convolution neural network is provided with an output module OM, which is a softmax layer connected after the spectrogram convolution module. The representation of each node is processed by the softmax function to obtain the prediction result for each web page, so that the category of each web page is predicted. The output module OM outputs the classification result as a classification label from the set [0, 1], where 0 denotes a non-list page and 1 denotes a list page. Specifically, the output is a probability distribution over the two labels 0 and 1; for example, [0.1, 0.9] means a 10 percent probability that the page is a non-list page and a 90 percent probability that it is a list page. Every value lies between 0 and 1, and the values sum to 1.
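The dropout mechanism and the reading of the softmax output can be sketched as below. The rescaling of surviving values ("inverted dropout") is a common convention assumed here rather than something stated in the text, and the function names are hypothetical.

```python
import numpy as np


def dropout(H, rate, rng):
    """Zeroes a random fraction `rate` of the values in H during training.

    ASSUMPTION: survivors are rescaled by 1 / (1 - rate) so that the expected
    value is unchanged (inverted dropout)."""
    if rate <= 0.0:
        return H
    keep = rng.random(H.shape) >= rate
    return H * keep / (1.0 - rate)


def predict_labels(probabilities):
    """Picks the more probable class per page: 0 = non-list, 1 = list page."""
    return np.argmax(probabilities, axis=1)


# First row is the [0.1, 0.9] example from the text, second row is made up.
probs = np.array([[0.1, 0.9], [0.7, 0.3]])
labels = predict_labels(probs)

rng = np.random.default_rng(0)
dropped = dropout(np.ones((4, 5)), 0.5, rng)
```

With a rate of 0.5, every entry of `dropped` is either 0 (discarded) or 2 (kept and rescaled); at prediction time dropout is simply not applied.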
Step 500, acquiring the website pages to be classified, respectively acquiring the webpage data features and the global topology structure of the website pages to be classified according to step 200 and step 300, inputting the webpage data features and the global topology structure of the website pages to be classified into the website list page classifier acquired in step 400, and judging whether the website pages to be classified are website list pages.
Based on the same inventive concept, the present invention further provides a classification system 600 for website list pages. FIG. 3 shows a block diagram of the classification system for website list pages according to an embodiment of the present invention. As shown in FIG. 3, the system includes:
a web page obtaining module 610, configured to obtain a group of website page sets, where the website page sets belong to a same website;
a feature extraction module 620, configured to extract web page data features for each website page, respectively, where the web page data features include a web page link address (URL) feature, a tree structure feature of a Document Object Model (DOM), and a web page visual feature, and create a global topology structure of the website through a hash table formed by a hyperlink list of the website page and a matching relationship between a link address (URL) of the website page and a node number;
the web page classification module 630 is provided with a pre-trained graph convolution neural network classification model, and the graph convolution neural network classification model is used for judging whether the web page of the website is a website list page according to the web page data characteristics and the global topology structure of the website.
Further, the classification system further includes a training module 640 for training the graph convolution neural network.
The graph convolution neural network is provided with a spectrogram convolution module used for semi-supervised classification of the website pages, and the spectrogram convolution module comprises:
the first spectrogram convolution module comprises a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and
the second spectrogram convolution module comprises a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism;
wherein the first spectrogram convolution layer has a first parameter matrix $W^{(0)}$ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix $W^{(1)}$ for mapping the hidden layer representation of the website page to a corresponding output.
The graph convolution neural network further comprises an output module which is connected with the spectrogram convolution module, and the output module is a softmax layer.
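The description calls the classification semi-supervised and names a softmax output layer, but it does not state the loss function at this point. A common choice, assumed in the sketch below, is cross-entropy averaged over the labeled pages only, so that unlabeled pages influence training solely through the graph structure. The function names and the example values are hypothetical.

```python
import math
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one page's class scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_entropy(
    logits: List[List[float]], labels: List[Optional[int]]
) -> float:
    """Mean cross-entropy over labeled pages; `None` marks an unlabeled page."""
    losses = []
    for row, label in zip(logits, labels):
        if label is None:
            continue  # unlabeled pages contribute no loss term
        losses.append(-math.log(softmax(row)[label]))
    return sum(losses) / len(losses) if losses else 0.0


# Three pages, two classes (0 = not a list page, 1 = list page); page 1 is unlabeled.
logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 3.0]]
labels = [0, None, 1]
loss = masked_cross_entropy(logits, labels)
```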
The training formula of the graph convolution neural network is as follows:
$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
wherein X is a feature matrix representing the web page data features; A is an adjacency matrix representing the global topology structure of the website; and $\hat{A}$ is the normalized adjacency matrix.
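A forward pass through two graph convolution layers followed by softmax, consistent with the modules described above, can be sketched in plain Python as follows. Several points are assumptions rather than statements of the patent: the normalization is taken to be the widely used symmetric form $D^{-1/2}(A + I)D^{-1/2}$ with self-loops added, Dropout is omitted because it is only active during training, no ReLU is applied before the softmax, and all function names and example matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Dense matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(a: Matrix) -> Matrix:
    """Return D^-1/2 (A + I) D^-1/2, where D is the degree matrix of A + I."""
    n = len(a)
    a_tilde = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [
        [inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(x: Matrix, a: Matrix, w0: Matrix, w1: Matrix) -> Matrix:
    """Two graph convolution layers, then a row-wise softmax over classes."""
    a_hat = normalize_adjacency(a)
    hidden = relu(matmul(a_hat, matmul(x, w0)))           # first layer, uses W(0)
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))  # second layer, uses W(1)


# Hypothetical three-page site linked as a chain 0 - 1 - 2, two features per page.
a = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [[0.5, -0.2], [0.1, 0.3]]
w1 = [[1.0, -1.0], [0.5, 0.5]]
probs = gcn_forward(x, a, w0, w1)  # one row of class probabilities per page
```

Each multiplication by the normalized adjacency matrix mixes a page's representation with those of its linked neighbours, which is how neighbourhood information from the site topology enters the prediction.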
Based on the same inventive concept, the present invention also provides a computer-readable storage medium, on which an information transfer implementation program is stored, which, when being executed by a processor, implements the steps of any one of the classification methods described above.
The method extracts page features from multiple angles and takes the web page data features and the global topology structure of the website as the input of the graph convolution neural network, so that the graph convolution neural network model can extract neighborhood information of each web page in the topology structure during convolution. Therefore, compared with existing Board page (website list page) discovery methods, the present method makes better use of the various features of the pages, in particular the link relations among pages within the website and the global topology structure of the website, and uses the graph convolution neural network model to better capture the implicit features of Board pages to aid their identification, so that the model has better universality and generalization capability while improving identification accuracy.
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it should be understood that various changes and modifications can be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (13)

1. A method for classifying website list pages, characterized by comprising the following steps:
step 100, acquiring a group of website page sets, wherein the website page sets belong to the same website;
step 200, extracting webpage data characteristics aiming at each website page respectively, wherein the webpage data characteristics comprise webpage link address (URL) characteristics, Document Object Model (DOM) tree structure characteristics and webpage visual characteristics;
step 300, creating a global topology structure of the website through a hash table formed by a hyperlink list of the website page and a matching relation between a link address (URL) of the website page and a node number;
step 400, inputting the web page data characteristics and the global topological structure of the website into a graph convolution neural network to train the graph convolution neural network to obtain a website list page classifier;
step 500, acquiring a website page to be classified, respectively acquiring the web page data feature of the website page to be classified and the global topology structure of the website according to the step 200 and the step 300, inputting the web page data feature of the website page to be classified and the global topology structure of the website into the website list page classifier acquired in the step 400, and judging whether the website page to be classified is a website list page.
2. The classification method according to claim 1, wherein the global topology of the website is represented as an adjacency matrix A, which is a sparse matrix; the web page data features are represented as a feature matrix X.
3. The classification method according to claim 2, wherein the graph convolution neural network is provided with a spectrogram convolution module for semi-supervised classification of the website pages, and the spectrogram convolution module comprises:
a first spectrogram convolution module, comprising a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and
a second spectrogram convolution module, comprising a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism;
wherein the first spectrogram convolution layer has a first parameter matrix $W^{(0)}$ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix $W^{(1)}$ for mapping the hidden layer representation of the website page to a corresponding output.
4. The classification method according to claim 3, further comprising an output module connected to the spectrogram convolution module, wherein the output module is a softmax layer.
5. The classification method according to claim 4, wherein the training formula of the graph convolution neural network is as follows:
$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
wherein $\hat{A}$ is the normalized adjacency matrix.
6. The classification method according to claim 4 or 5, wherein, in the training process of the graph convolution neural network, the parameters of the first parameter matrix $W^{(0)}$ and of the second parameter matrix $W^{(1)}$ are each updated by the gradient descent method.
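For illustration only, a plain gradient descent update of the two parameter matrices can be written as below. The learning rate $\eta$ and the training loss $\mathcal{L}$ are symbols assumed for this example; the claim itself does not name them or fix a particular gradient descent variant.

```latex
W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial \mathcal{L}}{\partial W^{(l)}}, \qquad l \in \{0, 1\}
```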
7. A system for classifying website list pages, the system comprising:
a web page acquisition module, configured to acquire a group of website page sets, wherein the website page sets belong to the same website;
the feature extraction module is used for extracting webpage data features for each website page respectively, and creating a global topological structure of the website through a hash table formed by a hyperlink list of the website pages and a matching relation between a link address (URL) of the website pages and a node number;
and the webpage classification module is provided with a pre-trained graph convolution neural network classification model, and the graph convolution neural network classification model is used for judging whether the website webpage is a website list page or not according to the webpage data characteristics and the global topological structure of the website.
8. The classification system according to claim 7, wherein the web page data features include web page link address (URL) features, tree structure features of a Document Object Model (DOM), and web page visual features.
9. The classification system according to claim 8, further comprising a training module for training the graph convolution neural network.
10. The classification system according to claim 9, wherein the graph convolution neural network is provided with a spectrogram convolution module for semi-supervised classification of the website pages, the spectrogram convolution module comprising:
a first spectrogram convolution module, comprising a first spectrogram convolution layer, a ReLU activation function and a Dropout mechanism; and
a second spectrogram convolution module, comprising a second spectrogram convolution layer, a ReLU activation function and a Dropout mechanism;
wherein the first spectrogram convolution layer has a first parameter matrix $W^{(0)}$ for mapping the feature representation of the website page to a corresponding hidden layer representation; the second spectrogram convolution layer has a second parameter matrix $W^{(1)}$ for mapping the hidden layer representation of the website page to a corresponding output.
11. The classification system according to claim 10, further comprising an output module connected to the spectrogram convolution module, wherein the output module is a softmax layer.
12. The classification system according to claim 11, wherein the training formula of the graph convolution neural network is:
$$Z = f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A} X W^{(0)}\right) W^{(1)}\right)$$
wherein X is a feature matrix representing the web page data features; A is an adjacency matrix representing the global topology structure of the website; and $\hat{A}$ is the normalized adjacency matrix.
13. A computer-readable storage medium, characterized in that it has stored thereon a program for implementing the transfer of information, which program, when being executed by a processor, implements the steps of the classification method according to any one of claims 1 to 6.
CN202011162449.XA 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages Active CN112287274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011162449.XA CN112287274B (en) 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages

Publications (2)

Publication Number Publication Date
CN112287274A true CN112287274A (en) 2021-01-29
CN112287274B CN112287274B (en) 2022-10-18

Family

ID=74372974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011162449.XA Active CN112287274B (en) 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages

Country Status (1)

Country Link
CN (1) CN112287274B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017430A (en) * 2022-06-27 2022-09-06 京东科技控股股份有限公司 List page determination method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5895470A (en) * 1997-04-09 1999-04-20 Xerox Corporation System for categorizing documents in a linked collection of documents
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN106446124A (en) * 2016-09-19 2017-02-22 成都知道创宇信息技术有限公司 Website classification method based on network relation graph
CN108364028A (en) * 2018-03-06 2018-08-03 中国科学院信息工程研究所 A kind of internet site automatic classification method based on deep learning
CN111797299A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Model training method, webpage classification method, device, storage medium and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
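Xie Zhenliang et al.: "Automatic classification of Web documents based on website structure mining", 《计算机应用》 (Journal of Computer Applications) *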

Also Published As

Publication number Publication date
CN112287274B (en) 2022-10-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant