CN112287272B

CN112287272B - Method, system and storage medium for classifying website list pages

Info

Publication number: CN112287272B
Application number: CN202011161424.8A
Authority: CN
Inventors: 孟剑; 郭岩; 贺广福; 史存会; 陈银鹏; 俞晓明; 刘悦; 程学旗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2020-10-27
Filing date: 2020-10-27
Publication date: 2023-05-23
Anticipated expiration: 2040-10-27
Also published as: CN112287272A

Abstract

The invention relates to a method for classifying website list pages based on hypertext markup language tags (HTML tags), comprising the following steps: step 100, acquiring a group of website webpages; step 200, extracting, for each website webpage, its statistical features and structural features to obtain a feature sequence corresponding to that webpage; step 300, inputting the feature sequences into a neural network to train it, thereby obtaining a website list page classifier; step 400, acquiring a website webpage to be classified, obtaining its feature sequence as in step 200, inputting that feature sequence into the website list page classifier obtained in step 300, and judging whether the webpage to be classified is a website list page.

Description

Method, system and storage medium for classifying website list pages

Technical Field

The invention relates to the technical field of webpage classification, and in particular to a method and a system for classifying website list pages (Board pages) based on N-gram features of HTML Tags.

Background

With the recent development of the internet, the web has become the largest source of data, and internet data collection has long been a focus of attention. One common collection method is customized collection: a collector is custom-developed for one specific website or a specific group of websites, the link structure of the website is analyzed, and a data extraction method is then constructed according to the page and network characteristics of that website.

Data on the internet can often be divided, according to how it is published and interacted with, into different information sources such as news, forums and blogs. Each information source has a specific format. For example, the data of a news source includes news texts, authors, topics, comments and so on, and each news page belongs to a category. Likewise, a forum is divided into boards, and its data includes main posts, replies and so on. Custom-developing a collector for each information source, or even for each website, inevitably yields collectors that cannot be reused, which wastes development effort. Research on a large number of websites across multiple information sources shows that, although the network data structures of different information sources take different forms, they share certain general characteristics. For example, websites in news information sources, whether on content category pages or on the home page, have list-like pages that directly and explicitly list links to related news articles according to some rule; depending on the number of articles under that rule, these pages also carry page-turning links through which more articles can be reached. Websites in blog information sources have a similar structure, most visibly the personal home page or personal timeline, and so do websites in forum information sources.

This structure can be generalized as a Board-Article structure, in which the list page is called the Board page and the actual data page to be collected is called the Article page. Board pages are often topic-related, i.e., all of the Article page links on a Board page often revolve around a unified topic or share strong common features. This property ensures that the data under a required topic can be captured through one Board page, avoiding the collection of redundant data. With the Board page serving as the entry page, the Board and Article pages form a tree structure rather than an open graph structure, which makes it possible to perceive data changes by scanning the Board page. By analyzing the Board page, changes in the data can be easily obtained, so the data can be tracked more efficiently. Therefore, how to find Board pages within a website is a problem that customized collection must solve.

The discovery methods of the Board page mainly comprise the following steps:

(1) Manual: Board pages are screened out of the website by hand. Because web pages are highly diverse, manual screening is quite expensive, especially for large-scale websites. At the same time, frequent website revisions make Board pages unstable, requiring further manual effort to rescreen them.

(2) Rule-based: the experience gained from manually screening Board pages is converted into rules, and a program imitating a human applies these rules to discover Board pages in the website. Again because web pages are highly diverse, rule-based methods have the inherent drawback of weak generalization capability, and the recall rate and accuracy of Board page discovery cannot be guaranteed.

Therefore, existing Board page discovery methods rely mainly on people's intuitive knowledge of pages and cannot make full use of the various features of the pages, especially hidden regular features. As a result, these methods generalize poorly and cannot guarantee the recall rate and accuracy of Board page discovery, which can greatly affect the quality of the customized collected data.

Disclosure of Invention

In order to solve the technical problems, the invention aims to provide a classification method and a classification system for a website list page (Board page) based on an HTML Tag. The classification method of the Board pages better utilizes the visual characteristics of the Board pages, better captures various hidden characteristics of the Board pages by utilizing a neural network model, and has better generalization capability.

Specifically, the invention discloses a classification method of website list pages, which is based on hypertext markup language tags (HTML tags), and comprises the following steps:

step 100, acquiring a group of website webpages;

step 200, respectively extracting statistical features and structural features of the website webpages aiming at each website webpage to obtain a feature sequence corresponding to each website webpage;

step 300, inputting the characteristic sequence into a neural network to train the neural network, and obtaining a website list page classifier;

step 400, obtaining a website webpage to be classified, inputting the feature sequence of the website webpage to be classified into the website list page classifier obtained in the step 300 according to the feature sequence of the website webpage to be classified obtained in the step 200, and judging whether the website webpage to be classified is a website list page.

In the above method for classifying website list pages, the statistical features comprise:

the number of occurrences of each hypertext markup language tag (HTML Tag) and the reciprocal of that number;

the number of times each HTML Tag has a link and the reciprocal of that number;

the number of times each HTML Tag has text and the reciprocal of that number;

the link length of the website page and the reciprocal of the link length;

the link depth of the website page and the reciprocal of the link depth;

the extreme values of the number of texts in plain-text HTML Tags and the reciprocals of the extreme values;

the variance of the number of texts in the HTML Tags and the reciprocal of the variance;

the mean of the number of texts in the HTML Tags and the reciprocal of the mean;

the mean square error of the number of texts in the HTML Tags and the reciprocal of the mean square error.
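As an illustration of the first three per-tag features, the following is a minimal sketch using only the Python standard library. The names `TagStats`, `tag_count_features` and `reciprocal` are hypothetical, and several details are assumptions rather than part of the disclosure: a tag "has a link" when it carries a non-empty `href` attribute, a tag "has text" when non-whitespace text appears directly inside it, and the reciprocal of zero is taken to be zero.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class TagStats(HTMLParser):
    """Per tag name: occurrences, occurrences with a link, occurrences with text."""

    def __init__(self):
        super().__init__()
        self.occurrences = Counter()
        self.with_link = Counter()
        self.with_text = Counter()
        self._open = []  # stack of [tag name, text already counted?]

    def handle_starttag(self, tag, attrs):
        self.occurrences[tag] += 1
        # Assumption: a non-empty href attribute means the tag "has a link".
        if any(name == "href" and value for name, value in attrs):
            self.with_link[tag] += 1
        if tag not in VOID_TAGS:
            self._open.append([tag, False])

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerate sloppy markup.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Count each open tag at most once, for text directly inside it.
        if data.strip() and self._open and not self._open[-1][1]:
            self._open[-1][1] = True
            self.with_text[self._open[-1][0]] += 1


def reciprocal(x):
    """Return 1/x, or 0.0 when x is 0 (an assumed convention)."""
    return 1.0 / x if x else 0.0


def tag_count_features(html, tags=("a", "div", "li", "p")):
    """Return [count, 1/count] for each statistic of each tag in `tags`."""
    parser = TagStats()
    parser.feed(html)
    parser.close()
    features = []
    for tag in tags:
        for counter in (parser.occurrences, parser.with_link, parser.with_text):
            n = counter[tag]
            features += [float(n), reciprocal(n)]
    return features
```

The fixed `tags` tuple is only a placeholder; which tags are tracked would be a design choice for the feature extractor.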

In the above method for classifying website list pages, the structural features include N-gram features, and the N-gram features include unigram (uni-gram) features and bigram (bi-gram) features.

In the above method for classifying website list pages, the step of extracting the N-gram features comprises:

step 210, parsing each website webpage into a Document Object Model (DOM) tree, and expressing the Document Object Model (DOM) tree as an HTML tag sequence;

step 220, classifying each tag element in the HTML tag sequence;

step 230, extracting the N-gram feature for each different class of tag elements in the HTML tag sequence.

In the above method for classifying website list pages, in step 220, the tag elements in the HTML tag sequence are divided into: tags containing an external link, tags without an external link, and text tags;

a tag containing an external link includes a link address (URL) pointing to the outside; a tag without an external link does not include a link address pointing to the outside; and a text tag consists of the parts other than the tags containing an external link and the tags without an external link.

In the above method for classifying website list pages, the neural network is a fully connected neural network comprising an input layer, a hidden layer and an output layer.

According to the classification method of the website list pages, the activation function of the fully-connected neural network is a Gelu function, and the loss function is a cross entropy function.

The method for classifying the website list pages according to the present invention, wherein the training step of the fully-connected neural network comprises:

step 310, inputting the feature sequence into the input layer;

step 320, the hidden layer calculates and trains the fully connected neural network according to the Gelu function and the cross entropy function for the feature sequence to obtain the classification parameters of the fully connected neural network;

step 330, the output layer outputs the classification result of the website webpage according to the classification parameter;

and when the classification result is [0,1], the input website webpage is a list page (Board page).

In the above method for classifying website list pages, in step 320, training of the fully connected neural network is accelerated using a label smoothing method, an exponential moving average (EMA) method, and/or a batch normalization method.

According to the method for classifying web site list pages, in step 320, classification parameters of the fully connected neural network are obtained through a back propagation algorithm and a gradient descent method.

To achieve another object of the present invention, there is also provided a classification system of web site list pages, the classification system being based on hypertext markup language tags (HTML tags), the classification system comprising:

the webpage acquisition module is used for acquiring a group of website webpages to be classified;

the feature extraction module is used for extracting statistical features and structural features of the website webpages according to the website webpages respectively to obtain feature sequences corresponding to the website webpages;

the webpage classification module is provided with a pre-trained neural network classification model, and the neural network classification model is used for judging whether the website webpages to be classified are website list pages or not according to the feature sequence.

The classification system of the website list pages according to the above, wherein the classification system further comprises: the training module is used for training the neural network;

the neural network is a fully-connected neural network, the activation function of the fully-connected neural network is a Gelu function, and the loss function is a cross entropy function; the fully-connected neural network comprises an input layer, a hidden layer and an output layer; the input layer acquires the characteristic sequence; and the hidden layer carries out operation on the characteristic sequence according to a Gelu function and a cross entropy function and trains the fully-connected neural network to obtain the classification parameters of the fully-connected neural network.

To achieve another object of the present invention, there is also provided a computer-readable storage medium having stored thereon an information delivery implementation program which, when executed by a processor, implements the steps of the method for classifying web site list pages as set forth in any one of the above.

The classification method provided by the invention uses the N-gram characteristics of the HTML Tag sequence, so that the classification method can capture the visual characteristics of the HTML pages contained in the DOM tree structure, and the visual characteristics can better describe the cognition of people on the layout and the content of the webpage, thus being one of the best quality characteristics in webpage analysis. In addition, the classification method also uses a neural network model, and the neural network model has natural learning ability, can mine hidden characteristics and has strong generalization ability. Therefore, compared with the existing Board page discovery method, the Board page classification method of the invention better utilizes the visual characteristics of the Board pages, better captures various hidden characteristics of the Board pages by utilizing the neural network model, and has better generalization capability.

In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a method for classifying web site list pages according to an embodiment of the present invention;

FIG. 2 is a flowchart of extracting N-gram features of a classification method of web site list pages according to an embodiment of the present invention;

FIG. 3 is a diagram showing an example of labels with external links in a method for classifying web site listing pages according to an embodiment of the present invention;

FIG. 4 is a diagram showing an example of labels without external links in a method for classifying web site listing pages according to an embodiment of the present invention;

FIG. 5 is an exemplary diagram of text labels in a method for classifying web site listing pages according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a fully connected neural network in a method for classifying web site list pages according to an embodiment of the present invention;

FIG. 7 is a block diagram of a classification system for web site listing pages according to an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It will be apparent that the embodiments described below are only some, but not all, embodiments of the invention.

Furthermore, descriptions such as "first" and "second" in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when a combination of technical solutions is contradictory or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.

The invention addresses the problem that the prior art cannot make full use of the various features of Board pages, especially hidden regular features, which leaves existing methods with weak generalization capability and no guarantee on the recall rate and accuracy of Board page discovery. It abstracts Board page discovery into a binary classification problem, dividing the web pages of a website into two classes, Board pages and non-Board pages, and classifies the web pages with a neural network using the structural features of the page DOM (Document Object Model) tree and the features of the web page URLs, thereby finding the Board pages.

In the present invention, the following assumptions exist for the website, page, and their relationship to each other:

1. A website is made up of pages, each of which has one or more identifiers, or URLs. Each URL uniquely corresponds to a page;

2. A page itself is composed of HTML, which contains nodes, node attributes, node content (text) and node style information, where the node attributes may contain the URLs of other pages;

3. Through the URLs of other pages contained in the current page, one page can be considered to point to another page.

Based on the above assumptions, the web sites themselves constitute a directed network in which the nodes have individual characteristics. In the present invention, the features of the Board page are abstracted, i.e. the definition of the Board page: the Board page is naturally owned by the website and can be regarded as a node in a network formed by the website; the Board page is an aggregate page of the Article page to which the Board page points, so that the node corresponding to the Board page has specific network structure characteristics in the network.
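The directed network implied by these assumptions can be sketched as an adjacency map from each page URL to the set of same-site pages it links to. This is only an illustration: `LinkCollector` and `build_site_graph` are hypothetical names, the input is assumed to be a dictionary of already-fetched pages, and only `href` attributes are treated as links.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href attribute values found in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def build_site_graph(pages):
    """pages: dict of page URL -> HTML source. Returns URL -> set of linked site URLs."""
    graph = {}
    for url, html in pages.items():
        collector = LinkCollector()
        collector.feed(html)
        collector.close()
        targets = set()
        for href in collector.hrefs:
            # Resolve relative links and drop "#fragment" parts.
            target, _fragment = urldefrag(urljoin(url, href))
            # Keep only edges to other known pages of the same site.
            if target in pages and target != url:
                targets.add(target)
        graph[url] = targets
    return graph
```

In such a graph, a Board page would be expected to show up as a node with many outgoing edges to Article pages, which is the kind of network structure characteristic the definition above refers to.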

The invention mainly uses the structural features of the page DOM (Document Object Model) tree and the features of the web page URLs to classify web pages with a neural network, and in particular provides a Board page classification method based on the N-gram features of HTML Tags.

FIG. 1 is a flowchart of a method for classifying website list pages (Board pages) according to an embodiment of the invention. As shown in FIG. 1, the classification method includes the following steps:

Step 100: a set of website webpages is acquired. Specifically, in this embodiment, a webpage is obtained mainly by fetching its HTML source code via the webpage link.

Step 200: for each website webpage, the statistical features and structural features of the webpage are extracted to obtain a feature sequence corresponding to that webpage.

In this embodiment, the statistical features of the website webpages include: the number of occurrences of each hypertext markup language tag (HTML Tag) and the reciprocal of that number; the number of times each HTML Tag has a link and its reciprocal; the number of times each HTML Tag has text and its reciprocal; the link length of the website page and its reciprocal; the link depth of the website page and its reciprocal; the extreme values of the number of texts in plain-text HTML Tags and their reciprocals; the variance of the number of texts in the HTML Tags and its reciprocal; the mean of the number of texts in the HTML Tags and its reciprocal; and the mean square error of the number of texts in the HTML Tags and its reciprocal. Specifically, website pages share some common statistical features that help distinguish Board pages from other pages. These common statistical features include: the number of occurrences of each common HTML Tag; the number of times each HTML Tag has a link; the number of times each HTML Tag has text; the link length of the page itself and its link depth (the number of occurrences of "/" in the URL); the extreme values of the number of texts in plain-text HTML Tags; the variance, mean and mean square error of the number of texts in the HTML Tags; and the reciprocals of the above features (by analogy with the FM method).
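The URL-based and text-count statistics can be sketched as follows. The function names are hypothetical, and the sketch rests on assumptions not stated in the source: the link length is the character length of the URL, the "number of texts" in a tag is measured as a character count, the "mean square error" is read as the population standard deviation, and the reciprocal of zero is taken to be zero.

```python
import statistics


def reciprocal(x):
    """Return 1/x, or 0.0 when x is 0 (an assumed convention)."""
    return 1.0 / x if x else 0.0


def url_features(url):
    """Link length and link depth of a page URL, each followed by its reciprocal."""
    length = len(url)
    # Literal reading of the text: depth is the number of "/" in the URL,
    # so the two slashes after the scheme are counted as well.
    depth = url.count("/")
    return [float(length), reciprocal(length), float(depth), reciprocal(depth)]


def text_count_features(text_lengths):
    """Max, min, variance, mean and standard deviation, each with its reciprocal.

    text_lengths: one number per text-bearing tag (assumed: characters of text).
    """
    if text_lengths:
        values = [
            float(max(text_lengths)),
            float(min(text_lengths)),
            float(statistics.pvariance(text_lengths)),
            float(statistics.mean(text_lengths)),
            float(statistics.pstdev(text_lengths)),
        ]
    else:
        values = [0.0] * 5  # page without any text-bearing tag
    features = []
    for value in values:
        features += [value, reciprocal(value)]
    return features
```

Appending the reciprocal next to each raw value gives the classifier a second, inversely scaled view of the same quantity, which appears to be the intent of the reciprocal features listed above.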

In addition to the above common statistical features, this embodiment extracts structural features such as N-gram features for each page, so that the invention can capture the visual features of HTML pages embedded in the DOM tree structure. Visual features describe well how people perceive the layout and content of a web page, and are among the highest-quality features in web page analysis. Thus, compared with existing Board page discovery methods, the method of the invention makes better use of the visual features of Board pages. HTML Tags form a tree, the DOM tree, according to the relationships between them. The tree structure itself is a salient feature, and the DOM tree structure is, to some extent, a basic building block of the visual features presented by an HTML page. It is therefore worthwhile to introduce DOM features beyond the statistical ones by using the HTML Tag sequence of the DOM tree. In this embodiment, the DOM tree structure features include, for example, N-gram features, which are extracted by the following steps, as shown in FIG. 2:

Step 210: each website webpage is parsed into a Document Object Model (DOM) tree, and the DOM tree is expressed as an HTML tag sequence. The specific steps are as follows:

The HTML Tag sequence in the DOM tree is extracted, i.e., the DOM tree is expressed as an HTML Tag sequence. For example, the DOM tree of a page can be compressed into the sequence 'html div a external link a div div p you good! p div html'.
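The flattening of a page into such a sequence might look like the sketch below, which uses the standard-library `html.parser`. The HTML snippet is a hypothetical page chosen to be consistent with the example sequence, not the page from the original figure, and `TagSequence` and `to_tag_sequence` are illustrative names.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Flattens a page into tag names and text pieces, in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are recorded by name as well, as in the example sequence.
        self.tokens.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.tokens.append(text)


def to_tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical page consistent with the example sequence.
EXAMPLE_PAGE = (
    '<html><div><a href="http://example.com/">external link</a></div>'
    "<div><p>you good!</p></div></html>"
)
```

Calling `to_tag_sequence(EXAMPLE_PAGE)` yields the twelve tokens `html, div, a, external link, a, div, div, p, you good!, p, div, html`.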

Step 220: each tag element in the HTML tag sequence is classified. In the present invention, the elements of the above HTML Tag sequence are called tokens, and the tokens are classified into the following three types:

(1) Tag A, containing an external link: a token consisting of a single tag whose attributes contain a link URL pointing to the outside, as shown in FIG. 3.

(2) Tag B, without an external link: a token consisting of a single tag whose attributes do not contain a link URL pointing to the outside, as shown in FIG. 4.

(3) Text tag C: a token consisting of the portions other than the above two kinds of tag tokens, i.e., the text between tags, as shown in FIG. 5.
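The three token types can be illustrated with the following sketch. It is an interpretation under stated assumptions, not the patented implementation: a tag is treated as type A when it has a non-empty `href` attribute, closing tags (which carry no attributes) fall under type B, and `TokenClassifier` and `classify_tokens` are hypothetical names.

```python
from html.parser import HTMLParser

LINK_TAG = "A"   # tag whose attributes contain a link URL
PLAIN_TAG = "B"  # tag without a link URL
TEXT = "C"       # text between tags


class TokenClassifier(HTMLParser):
    """Emits (category, value) pairs for every token of a page."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Assumption: any non-empty href counts as a link pointing outside.
        has_link = any(name == "href" and value for name, value in attrs)
        self.tokens.append((LINK_TAG if has_link else PLAIN_TAG, tag))

    def handle_endtag(self, tag):
        # Closing tags have no attributes, so they are treated as type B.
        self.tokens.append((PLAIN_TAG, tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append((TEXT, text))


def classify_tokens(html):
    parser = TokenClassifier()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

A stricter reading of "pointing to the outside" could compare the link's host with the page's own host; that refinement is left out here to keep the sketch short.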

Step 230, extracting the N-gram features for each different class of Tag elements in the HTML Tag sequence, that is, extracting N-gram features, such as uni-gram features, bi-gram features, and the like, from the HTML Tag sequence.
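A minimal sketch of uni-gram and bi-gram counting over a token sequence is given below. The function names and the idea of mapping counts onto a fixed vocabulary are assumptions made for illustration; the tokens are taken to be the category labels (A, B, C) from step 220, though tag names could be used in the same way.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token windows of the sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, vocabulary):
    """Counts of each uni-gram and bi-gram in `vocabulary`, in vocabulary order."""
    counts = Counter(ngrams(tokens, 1)) + Counter(ngrams(tokens, 2))
    # Grams absent from the page contribute 0.0.
    return [float(counts[gram]) for gram in vocabulary]


def build_vocabulary(sequences):
    """Sorted list of every uni-gram and bi-gram seen in the training sequences."""
    seen = set()
    for tokens in sequences:
        seen.update(ngrams(tokens, 1))
        seen.update(ngrams(tokens, 2))
    return sorted(seen)
```

For instance, for the tokens `["B", "A", "C", "B", "A", "C"]` the bi-gram `("B", "A")` occurs twice, which is the kind of repeated link pattern a list page would be expected to produce.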

Step 300: the feature sequences are input into a neural network to train it, yielding a website list page classifier. The invention uses a neural network model, which has a natural learning ability, can mine hidden features, and has strong generalization capability.

Artificial neural networks (ANNs), also simply called neural networks (NNs) or connection models, are algorithmic mathematical models that mimic the behavior of animal neural networks and perform distributed parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of nodes. Neural networks have a strong learning ability and have been applied with excellent results in many fields. Therefore, the invention uses a neural network to perform binary classification of the web pages in a website, thereby finding the Board pages. Common statistical features and N-gram features are extracted from the training web pages and input into the neural network, which learns a web page classifier.

The extracted statistical features and structural features are taken as the feature sequence of each training webpage and input into the neural network. This embodiment takes a four-layer fully connected neural network as an example, shown schematically in FIG. 6. The fully connected neural network comprises an input layer IL, a hidden layer HL and an output layer OL. In this embodiment, the input layer IL has, for example, 1024 neurons, and the hidden layer HL comprises three layers of 2048, 1024 and 512 neurons respectively, but the invention is not limited thereto. The fully connected neural network uses Gelu as the activation function and cross entropy as the loss function.
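The forward pass of such a network can be sketched in plain Python as below. This is a hedged illustration only: the helper names are hypothetical, the two-unit softmax output layer and the weight initialization are assumptions, and training by back propagation is not shown. The layer sizes are a parameter, so the 1024-2048-1024-512 configuration of this embodiment is one possible argument.

```python
import math
import random


def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_network(sizes, seed=0):
    """One (weights, biases) pair per pair of consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)  # assumed initialization range
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def dense(x, layer):
    """Fully connected layer: every output sees every input."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """GELU after every layer except the last, which feeds a softmax."""
    for layer in layers[:-1]:
        x = [gelu(v) for v in dense(x, layer)]
    return softmax(dense(x, layers[-1]))


def cross_entropy(probs, target):
    """Cross entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)
```

A scaled-down call such as `forward([0.5] * 8, build_network((8, 16, 8, 4, 2)))` runs quickly and returns two class probabilities; `build_network((1024, 2048, 1024, 512, 2))` would mirror the sizes named in this embodiment, at a much higher cost in pure Python.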

Specifically, the training of the fully connected neural network includes: step 310, inputting the feature sequence into the input layer; step 320, the hidden layer operates on the feature sequence according to the Gelu function and the cross entropy function and the fully connected neural network is trained, obtaining its classification parameters; step 330, the output layer outputs the classification result of the website webpage according to the classification parameters. The output layer outputs [0,1], where 1 indicates that the input website webpage is a list page (Board page).

Considering that non-Board pages are numerous in a website, the loss on Board pages is weighted with an increased penalty factor. Meanwhile, label smoothing, the exponential moving average (EMA) method and batch normalization are used to accelerate network training and improve model generalization. Finally, the model parameters are obtained using the back propagation algorithm and gradient descent, yielding the Board page classifier.
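Three of the training devices mentioned here (the class-weighted loss, label smoothing and EMA of parameters) can be sketched as small helper functions. The function names and all numeric values (the smoothing factor 0.1, the class weights, the decay 0.999) are illustrative assumptions, not values from the disclosure; batch normalization and the gradient step itself are not shown.

```python
import math


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move epsilon of the mass to a uniform distribution."""
    k = len(one_hot)
    return [(1.0 - epsilon) * t + epsilon / k for t in one_hot]


def weighted_cross_entropy(probs, target, class_weights):
    """Cross entropy in which each class term is scaled by a penalty weight."""
    return -sum(w * t * math.log(p)
                for w, t, p in zip(class_weights, target, probs))


def ema_update(shadow, params, decay=0.999):
    """Exponential moving average of parameters, kept alongside the live ones."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```

With class order (non-Board, Board), a call such as `weighted_cross_entropy(probs, smooth_labels([0.0, 1.0]), [1.0, 3.0])` penalizes errors on the rarer Board class more heavily, which is the effect the weighting described above aims for.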

Step 400: website webpages to be classified are acquired, the feature sequence of each webpage to be classified is obtained as in step 200, and the feature sequence is input into the website list page classifier to judge whether the webpage is a website list page.

Based on the same inventive concept, the present invention further provides a classification system 500 for website list pages, the classification system being based on hypertext markup language tags (HTML tags). FIG. 7 shows a block diagram of the classification system according to an embodiment of the present invention. As shown in FIG. 7, the classification system comprises:

the web page obtaining module 510 is configured to obtain a set of web pages of a website to be classified;

the feature extraction module 520 is configured to extract, for each website webpage, a statistical feature and a structural feature of the website webpage, so as to obtain a feature sequence corresponding to each website webpage;

the web page classification module 530 has a pre-trained neural network classification model, where the neural network classification model is used to determine whether the web pages of the web sites to be classified are web site list pages according to the feature sequence.

The classification system further comprises a training module, which is used for training the neural network.

Based on the same inventive concept, the present invention further provides a computer readable storage medium, on which an information transfer implementation program is stored, which when executed by a processor, implements the steps of any of the above classification methods.

Of course, the present invention is capable of various other embodiments, and those skilled in the art may make modifications and variations to it without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method of classifying web site listing pages, the method based on hypertext markup language tags (HTML tags), the method comprising:

step 100, acquiring a group of website webpages;

step 400, acquiring a website webpage to be classified, inputting the characteristic sequence of the website webpage to be classified into the website list page classifier obtained in the step 300 according to the characteristic sequence of the website webpage to be classified obtained in the step 200, and judging whether the website webpage to be classified is a website list page;

the structural features include N-gram (N-gram) features, wherein the N-gram features include unigram (uni-gram) features and binary grammar (bi-gram) features;

the extracting step of the N-gram features comprises the following steps:

step 220, classifying each tag element in the HTML tag sequence;

step 230, extracting the N-gram feature for each different category of tag elements in the HTML tag sequence;

the tag elements in the HTML tag sequence are divided into: labels containing external links, labels without external links and text labels;

2. The method of classifying web site listing pages according to claim 1, wherein the statistical features include:

the link length of the website page and the reciprocal of the link length;

the link depth of the website page and the reciprocal of the link depth;

3. The method of classifying web site listing pages according to any one of claims 1 or 2, wherein the neural network is a fully connected neural network comprising an input layer, a hidden layer and an output layer.

4. A method of classifying web site listing pages in accordance with claim 3 wherein the activation function of the fully connected neural network is a Gelu function and the loss function is a cross entropy function.

5. The method for classifying web site listing pages according to claim 4, wherein in the training step of the fully-connected neural network, the method comprises:

step 310, inputting the feature sequence into the input layer;

and when the classification result is [0,1], the input website webpage is a list page.

6. The method of claim 5, wherein in step 320, training of the fully connected neural network is accelerated using a label smoothing method, an exponential sliding average method, and/or a batch normalization method.

7. The method according to claim 5, wherein in the step 320, classification parameters of the fully connected neural network are obtained by a back propagation algorithm and a gradient descent method.

8. A classification system for web site listing pages, the classification system based on hypertext markup language tags (HTML tags), the classification system comprising:

the webpage classification module is provided with a pre-trained neural network classification model, and the neural network classification model is used for judging whether the website webpages to be classified are website list pages or not according to the feature sequence;

the extraction process of the N-gram features comprises the following steps:

parsing each website webpage into a Document Object Model (DOM) tree, and expressing the Document Object Model (DOM) tree as an HTML tag sequence;

classifying each tag element in the HTML tag sequence;

extracting the N-gram features for tag elements of different categories in the HTML tag sequence;

9. The classification system of web site listing pages of claim 8, wherein the classification system further comprises: the training module is used for training the neural network;

10. A computer-readable storage medium, wherein a program for realizing information transfer is stored on the computer-readable storage medium, which when executed by a processor, realizes the steps of the method for classifying web site list pages according to any one of claims 1 to 7.