CN112287272B - Method, system and storage medium for classifying website list pages - Google Patents

Publication number
CN112287272B
Authority
CN
China
Prior art keywords
website
tag
neural network
features
gram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011161424.8A
Other languages
Chinese (zh)
Other versions
CN112287272A (en
Inventor
孟剑
郭岩
贺广福
史存会
陈银鹏
俞晓明
刘悦
程学旗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202011161424.8A
Publication of CN112287272A
Application granted
Publication of CN112287272B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for classifying website list pages based on hypertext markup language tags (HTML tags), comprising the following steps: step 100, acquiring a group of website webpages; step 200, extracting, for each website webpage, its statistical features and structural features to obtain a feature sequence corresponding to that webpage; step 300, inputting the feature sequences into a neural network to train the neural network, obtaining a website list page classifier; step 400, acquiring a website webpage to be classified, obtaining its feature sequence as in step 200, inputting that feature sequence into the website list page classifier obtained in step 300, and judging whether the website webpage to be classified is a website list page.

Description

Method, system and storage medium for classifying website list pages
Technical Field
The invention relates to the technical field of webpage classification, and in particular to a method and a system for classifying website list pages (Board pages) based on N-gram features of HTML Tags.
Background
With the recent development of the internet, the network has become the largest source of data, and internet data collection has long been a focus of attention. One common collection method is customized collection: a collector is custom-developed for one or several specific websites, the link structure of the website is analyzed, and a data extraction method is then constructed according to the page and network characteristics of the website.
Data on the internet can often be divided, according to how it is published and interacted with, into different information sources such as news, forums and blogs. Each information source has a specific format. For a news source, for example, the data includes news text, news authors, news topics, news comments and the like, and each news page belongs to a category. Likewise, a forum is divided into boards, and its data includes original posts, replies and the like. Custom-developing a collector for each information source, or even each website, inevitably produces collectors that cannot be reused, which wastes development effort. Research on a large number of websites from multiple information sources shows that the network data structures of different information sources take different forms but share certain general characteristics. For example, websites in news information sources, whether on category pages or on the home page, have list-like pages that directly and explicitly list links to related news articles according to some rule; depending on the number of articles under that rule, such pages also carry page-turning links through which more articles can be reached. Websites in blog information sources often have a similar structure, most visibly a personal home page or personal timeline. Similar structures also exist for websites in forum information sources.
This structure can be generalized as a Board-Article structure, in which the list page is called the Board page and the actual data page to be collected is called the Article page. Board pages are often topic-related, i.e., the links to Article pages on a Board page usually revolve around a unified topic or share strong common features. This property ensures that data under a required topic can be captured through one Board page, avoiding the collection of redundant data. With the Board page as the entry page, the Board page and its Article pages form a tree structure rather than an open graph structure, so data changes can be perceived by scanning the Board page. By analyzing the Board page, changes in the data can be obtained easily and the data tracked more efficiently. Therefore, how to find Board pages in a website is a problem that customized collection must solve.
The discovery methods of the Board page mainly comprise the following steps:
(1) Manual methods: Board pages are screened out of the website by hand. Because web pages are highly diverse, manual screening of Board pages is very expensive, especially for large-scale websites. At the same time, frequent website redesigns make Board pages unstable, so further manual effort is needed to rescreen them.
(2) Rule-based methods: the experience gained from manually screening Board pages is converted into rules, and a program imitating a human discovers Board pages in the website based on those rules. Because web pages are highly diverse, rule-based methods have the inherent drawback of weak generalization ability, and the recall rate and accuracy of Board page discovery cannot be guaranteed.
Therefore, the existing methods for finding Board pages rely mainly on people's intuitive understanding of the pages and cannot make full use of the various features of Board pages, especially hidden regular features. As a result, their generalization ability is weak and the recall rate and accuracy of Board page discovery cannot be guaranteed, which can greatly affect the quality of the data obtained by customized collection.
Disclosure of Invention
In order to solve the technical problems, the invention aims to provide a classification method and a classification system for a website list page (Board page) based on an HTML Tag. The classification method of the Board pages better utilizes the visual characteristics of the Board pages, better captures various hidden characteristics of the Board pages by utilizing a neural network model, and has better generalization capability.
Specifically, the invention discloses a classification method of website list pages, which is based on hypertext markup language tags (HTML tags), and comprises the following steps:
step 100, acquiring a group of website webpages;
step 200, respectively extracting statistical features and structural features of the website webpages aiming at each website webpage to obtain a feature sequence corresponding to each website webpage;
step 300, inputting the characteristic sequence into a neural network to train the neural network, and obtaining a website list page classifier;
step 400, obtaining a website webpage to be classified, inputting the feature sequence of the website webpage to be classified into the website list page classifier obtained in the step 300 according to the feature sequence of the website webpage to be classified obtained in the step 200, and judging whether the website webpage to be classified is a website list page.
According to the method for classifying website list pages, the statistical features comprise:
the number of occurrences of each of the hypertext markup language tags (HTML tags) and the inverse of the number;
each hypertext markup language Tag (HTML Tag) has a number of links and a reciprocal of the number;
each hypertext markup language Tag (HTML Tag) has a number of times that text is present and a reciprocal of the number of times;
the link length of the website page and the reciprocal of the link length;
the link depth of the website page and the reciprocal of the link depth;
extreme values of the number of texts in hypertext markup language tags (HTML tags) of plain text and the reciprocal of the extreme value;
variance of the number of texts in the hypertext markup language Tag (HTML Tag) and reciprocal of the variance;
the mean value of the number of texts in the hypertext markup language Tag (HTML Tag) and the reciprocal of the mean value;
the mean square error of the number of text in the hypertext markup language Tag (HTML Tag) and the inverse of the mean square error.
According to the method for classifying website list pages, the structural features include N-gram features, wherein the N-gram features include unigram (uni-gram) features and bigram (bi-gram) features.
According to the method for classifying website list pages, the step of extracting the N-gram features comprises:
step 210, parsing each website webpage into a Document Object Model (DOM) tree, and expressing the Document Object Model (DOM) tree as an HTML tag sequence;
step 220, classifying each tag element in the HTML tag sequence;
step 230, extracting the N-gram feature for each different class of tag elements in the HTML tag sequence.
According to the method for classifying website list pages of the present invention, in the step 220, the tag elements in the HTML tag sequence are divided into: tags containing external links, tags without external links, and text tags;
a tag containing an external link includes a link address (URL) pointing to the outside, a tag without an external link does not include a link address pointing to the outside, and a text tag consists of the parts other than the tags containing external links and the tags without external links.
According to the method for classifying website list pages, the neural network is a fully-connected neural network comprising an input layer, a hidden layer and an output layer.
According to the method for classifying website list pages, the activation function of the fully-connected neural network is the Gelu function, and the loss function is the cross entropy function.
The method for classifying the website list pages according to the present invention, wherein the training step of the fully-connected neural network comprises:
step 310, inputting the feature sequence into the input layer;
step 320, the hidden layer calculates and trains the fully connected neural network according to the Gelu function and the cross entropy function for the feature sequence to obtain the classification parameters of the fully connected neural network;
step 330, the output layer outputs the classification result of the website webpage according to the classification parameter;
and when the classification result is [0,1], the input website webpage is a list page (Board page).
According to the method for classifying website list pages of the present invention, in the step 320, training of the fully connected neural network is accelerated using label smoothing, an exponential moving average (EMA), and/or batch normalization.
According to the method for classifying web site list pages, in step 320, classification parameters of the fully connected neural network are obtained through a back propagation algorithm and a gradient descent method.
To achieve another object of the present invention, there is also provided a classification system of web site list pages, the classification system being based on hypertext markup language tags (HTML tags), the classification system comprising:
the webpage acquisition module is used for acquiring a group of website webpages to be classified;
the feature extraction module is used for extracting statistical features and structural features of the website webpages according to the website webpages respectively to obtain feature sequences corresponding to the website webpages;
the webpage classification module is provided with a pre-trained neural network classification model, and the neural network classification model is used for judging whether the website webpages to be classified are website list pages or not according to the feature sequence.
The classification system of the website list pages according to the above, wherein the classification system further comprises: the training module is used for training the neural network;
the neural network is a fully-connected neural network, the activation function of the fully-connected neural network is a Gelu function, and the loss function is a cross entropy function; the fully-connected neural network comprises an input layer, a hidden layer and an output layer; the input layer acquires the characteristic sequence; and the hidden layer carries out operation on the characteristic sequence according to a Gelu function and a cross entropy function and trains the fully-connected neural network to obtain the classification parameters of the fully-connected neural network.
To achieve another object of the present invention, there is also provided a computer-readable storage medium having stored thereon an information delivery implementation program which, when executed by a processor, implements the steps of the method for classifying web site list pages as set forth in any one of the above.
The classification method provided by the invention uses the N-gram characteristics of the HTML Tag sequence, so that the classification method can capture the visual characteristics of the HTML pages contained in the DOM tree structure, and the visual characteristics can better describe the cognition of people on the layout and the content of the webpage, thus being one of the best quality characteristics in webpage analysis. In addition, the classification method also uses a neural network model, and the neural network model has natural learning ability, can mine hidden characteristics and has strong generalization ability. Therefore, compared with the existing Board page discovery method, the Board page classification method of the invention better utilizes the visual characteristics of the Board pages, better captures various hidden characteristics of the Board pages by utilizing the neural network model, and has better generalization capability.
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
Drawings
FIG. 1 is a flow chart of a method for classifying web site list pages according to an embodiment of the present invention;
FIG. 2 is a flowchart of extracting N-gram features of a classification method of web site list pages according to an embodiment of the present invention;
FIG. 3 is a diagram showing an example of labels with external links in a method for classifying web site listing pages according to an embodiment of the present invention;
FIG. 4 is a diagram showing an example of labels without external links in a method for classifying web site listing pages according to an embodiment of the present invention;
FIG. 5 is an exemplary diagram of text labels in a method for classifying web site listing pages according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a fully connected neural network in a method for classifying web site list pages according to an embodiment of the invention;
FIG. 7 is a block diagram of a classification system for web site listing pages according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It will be apparent that the embodiments described below are only some, but not all, embodiments of the invention.
Furthermore, descriptions such as "first" and "second" in this disclosure are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions contradict each other or cannot be realized, the combination should be considered not to exist and not to fall within the scope of protection claimed by the present invention.
The invention addresses the problem that the prior art cannot make full use of the various features of Board pages, especially hidden regular features, so that its generalization ability is weak and the recall rate and accuracy of Board page discovery cannot be guaranteed. The invention abstracts Board page discovery into a binary classification problem, i.e., the web pages of a website are divided into two classes, Board pages and non-Board pages. The web pages are classified by a neural network using the structural features of the page DOM (Document Object Model) tree and the features of the web page URLs, thereby finding the Board pages.
In the present invention, the following assumptions exist for the website, page, and their relationship to each other:
1. A website is made up of pages that have one or more identifiers, i.e., URLs. Each URL uniquely corresponds to a page;
2. A page itself is composed of HTML, which contains nodes, node attributes, node content (text) and node style information, wherein the node attributes may contain the URLs of other pages;
3. Through the URLs of other pages contained in the current page, one page can be considered to point to another page.
Based on the above assumptions, the web sites themselves constitute a directed network in which the nodes have individual characteristics. In the present invention, the features of the Board page are abstracted, i.e. the definition of the Board page: the Board page is naturally owned by the website and can be regarded as a node in a network formed by the website; the Board page is an aggregate page of the Article page to which the Board page points, so that the node corresponding to the Board page has specific network structure characteristics in the network.
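The directed network described by these assumptions can be illustrated with a short sketch. It is not part of the disclosed method: the class `LinkCollector`, the function `build_link_graph`, and the `example.com` pages are hypothetical names and data chosen for illustration, and only the Python standard library is assumed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href targets of <a> tags found in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.targets.append(urljoin(self.base_url, href))


def build_link_graph(pages):
    """pages maps URL -> HTML source; returns URL -> set of URLs it points to."""
    graph = {}
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        graph[url] = set(collector.targets)
    return graph


# Hypothetical three-page site: one list page pointing to two article pages.
pages = {
    "http://example.com/board": '<a href="/a1">one</a><a href="/a2">two</a>',
    "http://example.com/a1": "<p>article one</p>",
    "http://example.com/a2": '<p>article two</p><a href="/board">back</a>',
}
graph = build_link_graph(pages)
out_degree = {url: len(targets) for url, targets in graph.items()}
```

In this toy site the list page has the highest out-degree, which is the kind of network structure characteristic the paragraph above attributes to Board pages.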
The invention mainly uses the structural features of the page DOM (Document Object Model) tree and the features of web page URLs to classify web pages with a neural network, and in particular provides a Board page classification method based on the N-gram features of HTML Tags.
Referring to fig. 1, fig. 1 is a flowchart illustrating a classification method of a website list page (Board page) according to an embodiment of the invention. As shown in fig. 1, the classification method includes the steps of:
step 100, a set of website webpages is acquired. Specifically, in this embodiment, the method for obtaining the web page mainly obtains html source code of the web page according to the web page link.
Step 200, extracting, for each website webpage, its statistical features and structural features to obtain a feature sequence corresponding to that webpage.
In this embodiment, the statistical features of the website webpages include: the number of occurrences of each hypertext markup language Tag (HTML Tag) and the reciprocal of that number; the number of times each HTML Tag has a link and the reciprocal of that number; the number of times each HTML Tag has text and the reciprocal of that number; the link length of the website page and the reciprocal of the link length; the link depth of the website page and the reciprocal of the link depth; the extreme values of the number of texts in plain-text HTML Tags and the reciprocals of the extreme values; the variance of the number of texts in the HTML Tags and the reciprocal of the variance; the mean of the number of texts in the HTML Tags and the reciprocal of the mean; and the mean square error of the number of texts in the HTML Tags and the reciprocal of the mean square error. Specifically, website pages have some common statistical features that help distinguish Board pages from other pages. These common statistical features include: the number of occurrences of each common HTML Tag; the number of times each HTML Tag has a link; the number of times each HTML Tag has text; the link length of the page itself and its link depth (the number of occurrences of "/" in the URL); the extreme values of the amount of Chinese text in plain-text HTML Tags; the variance, mean and mean square error of the number of texts in the HTML Tags; and the reciprocals of the above features (by analogy with the FM method).
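A minimal sketch of how a subset of these statistical features and their reciprocals might be computed is given below. Several choices are assumptions made for illustration and are not stated in the source: the tag list `("a", "div", "li", "p")`, measuring the amount of text by character count, counting "/" only in the URL path, and using 0.0 as the reciprocal of zero. The mean square error feature is omitted for brevity.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import mean, pvariance
from urllib.parse import urlparse

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed


class TagStats(HTMLParser):
    """Counts tags, tags carrying a link, and tags that directly hold text."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.link_counts = Counter()
        self.text_counts = Counter()
        self.text_lengths = []
        self._open = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if any(name == "href" and value for name, value in attrs):
            self.link_counts[tag] += 1
        if tag not in VOID_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._open:
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            self.text_counts[self._open[-1]] += 1
            self.text_lengths.append(len(text))


def with_reciprocal(value):
    """Return [value, 1/value]; the reciprocal of zero is set to 0.0 (assumption)."""
    return [float(value), 1.0 / value if value else 0.0]


def statistical_features(url, html, tags=("a", "div", "li", "p")):
    stats = TagStats()
    stats.feed(html)
    features = []
    for tag in tags:
        features += with_reciprocal(stats.tag_counts[tag])
        features += with_reciprocal(stats.link_counts[tag])
        features += with_reciprocal(stats.text_counts[tag])
    features += with_reciprocal(len(url))                       # link length
    features += with_reciprocal(urlparse(url).path.count("/"))  # link depth
    lengths = stats.text_lengths or [0]
    for value in (max(lengths), min(lengths), pvariance(lengths), mean(lengths)):
        features += with_reciprocal(value)
    return features


features = statistical_features(
    "http://example.com/news/list",
    '<div><a href="/a1">First post</a><a href="/a2">Second</a><p>Hi</p></div>',
)
```

Pairing each value with its reciprocal doubles the feature count; with the four hypothetical tags above the sketch yields 36 numbers per page.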
In addition to the above common statistical features, each website webpage of this embodiment also has structural features such as N-gram features, so that the present invention can capture the visual features of HTML pages contained in the DOM tree structure. Visual features describe well how people perceive the layout and content of a webpage and are among the best-quality features in webpage analysis. Thus, the Board page discovery method of the present invention makes better use of the visual features of Board pages than existing Board page discovery methods. HTML tags form a tree, the DOM tree, based on the relationships between them. The tree structure itself is a salient feature, and the DOM tree structure is, to some extent, a fundamental constituent of the visual features formed by HTML pages. It is therefore worthwhile to introduce DOM features beyond the statistical ones by using the HTML Tag sequence in the DOM tree. In this embodiment, the DOM tree structure features include, for example, N-gram features, and, as shown in FIG. 2, the step of extracting the N-gram features includes:
Step 210, parsing each website webpage into a Document Object Model (DOM) tree, and expressing the DOM tree as an HTML tag sequence. The specific steps are as follows:
The HTML Tag sequence in the DOM tree is extracted, i.e., the DOM tree is expressed as an HTML Tag sequence. For example, the DOM tree of a page is as follows:
[Figure BDA0002744349710000081: DOM tree of the example page]
This DOM tree can be compressed into the sequence 'html div a external link a div div p yougood-! p div html'.
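The flattening of a DOM tree into a tag sequence can be sketched as follows. This is an illustrative reading, not the patented implementation: the class `TagSequence` and function `to_tag_sequence` are hypothetical names, the event order of Python's standard `html.parser` is used as a stand-in for a DOM traversal, and the sample page is a guess constructed to resemble the sequence quoted above, since the original figure is not reproduced.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Flattens a page into the order in which tags, links and text are met."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        self.sequence.append(tag)
        href = dict(attrs).get("href")
        if href:
            # Keep the link target right after the tag that carries it.
            self.sequence.append(href)

    def handle_endtag(self, tag):
        self.sequence.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.sequence.append(text)


def to_tag_sequence(html):
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.sequence


# Hypothetical page; the real example page is only available as a figure.
page = '<html><div><a href="external link"></a></div><div><p>hello</p></div></html>'
sequence = to_tag_sequence(page)
```

Each opening and closing tag contributes one element, which is why tag names appear twice in the flattened sequence.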
Step 220, classifying each tag element in the HTML tag sequence. In the present invention, the elements of the above HTML Tag sequence are called tokens, and the tokens are classified into the following three types:
(1) Tag A, a tag containing an external link: a token consisting of a single tag whose attributes contain a link URL pointing to the outside, as shown in FIG. 3.
(2) Tag B, a tag without an external link: a token consisting of a single tag whose attributes do not contain a link URL pointing to the outside, as shown in FIG. 4.
(3) Text tag C: a token consisting of the parts other than the above two kinds of tag tokens, i.e., the parts between tags, as shown in FIG. 5.
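The three token classes can be illustrated with the sketch below. The source does not define precisely what counts as "pointing to the outside"; this sketch assumes, purely for illustration, that any tag with a non-empty `href` attribute is class A, and that closing tags (which carry no attributes) are class B. The names `TokenClassifier`, `classify_tokens` and the class labels are hypothetical.

```python
from html.parser import HTMLParser

LINK_TAG, PLAIN_TAG, TEXT = "A", "B", "C"


class TokenClassifier(HTMLParser):
    """Labels every token of a page as class A, B or C."""

    def __init__(self):
        super().__init__()
        self.tokens = []  # list of (token class, token value)

    def handle_starttag(self, tag, attrs):
        # Assumption: a non-empty href marks a tag containing an external link.
        has_link = any(name == "href" and value for name, value in attrs)
        self.tokens.append((LINK_TAG if has_link else PLAIN_TAG, tag))

    def handle_endtag(self, tag):
        # Closing tags have no attributes, so they are treated as class B here.
        self.tokens.append((PLAIN_TAG, "/" + tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append((TEXT, text))


def classify_tokens(html):
    parser = TokenClassifier()
    parser.feed(html)
    parser.close()
    return parser.tokens


tokens = classify_tokens('<div><a href="http://example.org/x">news</a><p>hi</p></div>')
```

A stricter reading of "external" (for example, comparing the link's host with the page's host) would only change the `has_link` test.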
Step 230, extracting the N-gram features for each different class of Tag elements in the HTML Tag sequence, that is, extracting N-gram features, such as uni-gram features, bi-gram features, and the like, from the HTML Tag sequence.
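Uni-gram and bi-gram extraction over a token sequence can be sketched as below. The helper names `ngrams`, `ngram_counts` and `to_vector`, and the idea of projecting counts onto a fixed vocabulary to obtain a fixed-length vector, are assumptions for illustration; the source does not state how the N-gram features are encoded. The same functions could be applied separately to the tokens of each class.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokens, max_n=2):
    """Counts of every n-gram for n = 1..max_n (uni-grams and bi-grams by default)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def to_vector(counts, vocabulary):
    """Project the counts onto a fixed, ordered vocabulary of n-grams."""
    return [counts.get(gram, 0) for gram in vocabulary]


counts = ngram_counts(["div", "a", "a", "div"])
vector = to_vector(counts, [("a",), ("div", "a"), ("p",)])
```

A list page that repeats the same link markup many times would show up here as a few bi-grams with very high counts.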
Step 300, inputting the feature sequence into a neural network to train the neural network, obtaining a website list page classifier. The neural network model used in the invention has natural learning ability, can mine hidden features, and has strong generalization ability.
Artificial neural networks (ANNs), also simply called neural networks (NNs) or connection models, are algorithmic mathematical models that mimic the behavior of animal neural networks and perform distributed parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of nodes. Neural networks have strong learning ability and have been applied successfully in many fields. Therefore, the invention uses a neural network to perform binary classification of the web pages in a website, thereby finding the Board pages. Common statistical features and N-gram features are extracted from training web pages and input into the neural network, and a neural-network-based web page classifier is learned.
The statistical features and structural features extracted above are taken as the feature sequence of each training webpage and input into the neural network. In this embodiment, a four-layer fully connected neural network is taken as an example, and FIG. 6 shows a schematic diagram of it. The fully connected neural network comprises an input layer IL, a hidden layer HL and an output layer OL. In this embodiment, the input layer IL has, for example, 1024 neurons, and the hidden layer HL includes three layers of neural units with 2048, 1024 and 512 neurons respectively, but the invention is not limited thereto. The fully connected neural network uses Gelu as the activation function and cross entropy as the loss function.
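The forward pass of such a fully connected network with Gelu activations can be sketched in plain Python. This is only an illustration under stated assumptions: the layer sizes are shrunk to `[8, 16, 8, 4, 2]` so the example runs quickly (the embodiment mentions 1024, 2048, 1024 and 512 neurons), the two-way softmax output and the uniform weight initialisation are assumptions, and the names `gelu`, `init_layers` and `forward` are hypothetical.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layers(sizes, seed=0):
    """One (weights, biases) pair per pair of consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, features):
    """Apply each dense layer; Gelu after hidden layers, softmax at the output."""
    activations = features
    for index, (weights, biases) in enumerate(layers):
        outputs = [sum(w * a for w, a in zip(row, activations)) + b
                   for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        activations = outputs if is_last else [gelu(v) for v in outputs]
    return softmax(activations)


# Toy sizes for illustration only; not the sizes of the embodiment.
layers = init_layers([8, 16, 8, 4, 2])
probabilities = forward(layers, [0.5] * 8)
```

With untrained weights the two output probabilities carry no meaning; training is what turns the second value into a usable Board page score.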
Specifically, the training of the fully connected neural network includes: step 310, inputting the feature sequence into the input layer; step 320, the hidden layer operates on the feature sequence according to the Gelu function and the cross entropy function and trains the fully connected neural network to obtain the classification parameters of the fully connected neural network; step 330, the output layer outputs the classification result of the website webpage according to the classification parameters. The output layer outputs [0,1], wherein 1 represents that the input website webpage is a list page (Board page).
Considering that a website contains a large number of non-Board pages, the loss on Board pages is weighted and their penalty factor is increased. Meanwhile, label smoothing, EMA (exponential moving average) and batch normalization are used to accelerate network training and improve model generalization. Finally, the model parameters are obtained using the back propagation algorithm and gradient descent, yielding the Board page classifier.
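The class-weighted loss, label smoothing and EMA mentioned above can be sketched as follows. All numeric values (the smoothing factor 0.1, the class weights, the decay) are hypothetical, the smoothing formula is the commonly used uniform variant rather than one confirmed by the source, and batch normalization and the gradient step itself are left out.

```python
import math


def smooth_labels(label, num_classes=2, epsilon=0.1):
    """Uniform label smoothing: move epsilon of the mass onto all classes."""
    off = epsilon / num_classes
    return [1.0 - epsilon + off if k == label else off
            for k in range(num_classes)]


def weighted_cross_entropy(probabilities, label, class_weights, epsilon=0.1):
    """Cross entropy against smoothed targets, scaled by the true class weight."""
    target = smooth_labels(label, len(probabilities), epsilon)
    loss = -sum(t * math.log(p) for t, p in zip(target, probabilities))
    return class_weights[label] * loss


def ema_update(shadow, params, decay=0.999):
    """Exponential moving average of the parameters, kept alongside training."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Hypothetical weights: errors on Board pages (class 1) cost three times more.
class_weights = [1.0, 3.0]
board_loss = weighted_cross_entropy([0.2, 0.8], 1, class_weights)
other_loss = weighted_cross_entropy([0.8, 0.2], 0, class_weights)
shadow = ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9)
```

For equally confident predictions, the loss on the Board example is three times the loss on the non-Board example, which is the effect of the increased penalty factor.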
Step 400, acquiring the website webpages to be classified, obtaining the feature sequence of each website webpage to be classified according to the above steps, inputting the feature sequence into the website list page classifier, and judging whether each website webpage to be classified is a website list page.
Based on the same inventive concept, the present invention further provides a classification system 500 for website list pages, the classification system being based on hypertext markup language tags (HTML tags). Fig. 7 shows a block diagram of the classification system for website list pages according to an embodiment of the present invention. As shown in fig. 7, the classification system comprises:
the web page obtaining module 510 is configured to obtain a set of web pages of a website to be classified;
the feature extraction module 520 is configured to extract, for each website webpage, a statistical feature and a structural feature of the website webpage, so as to obtain a feature sequence corresponding to each website webpage;
the web page classification module 530 has a pre-trained neural network classification model, where the neural network classification model is used to determine whether the web pages of the web sites to be classified are web site list pages according to the feature sequence.
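The structural part of the feature extraction module 520 can be sketched roughly as follows: the page is flattened into an HTML tag sequence, each tag is assigned to a category, and uni-gram and bi-gram counts are taken over the sequence. This is a simplified sketch under stated assumptions: it uses Python's streaming `html.parser` instead of building a DOM tree, it treats any tag with a non-empty `href` attribute as a tag containing a link, and the token format (`a:link`, `li:nolink`, `#text`) and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Flatten a page into a sequence of category-marked tag tokens."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # Assumption: a tag with a non-empty href is a "tag containing a link".
        has_link = any(name == "href" and value for name, value in attrs)
        self.sequence.append(f"{tag}:link" if has_link else f"{tag}:nolink")

    def handle_data(self, data):
        # Assumption: any non-blank text node is recorded as a text token.
        if data.strip():
            self.sequence.append("#text")


def tag_sequence(html):
    """Return the category-marked tag sequence of an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.sequence


def ngram_counts(sequence, n):
    """Count contiguous n-grams (n=1 for uni-grams, n=2 for bi-grams)."""
    grams = zip(*(sequence[i:] for i in range(n)))
    return Counter(grams)
```

On a typical list fragment such as `<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>`, the bi-gram `("li:nolink", "a:link")` repeats once per list item, which is the kind of regularity the N-gram features are meant to expose.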
The classification system further comprises a training module for training the neural network;
the neural network is a fully-connected neural network, the activation function of the fully-connected neural network is a Gelu function, and the loss function is a cross entropy function; the fully-connected neural network comprises an input layer, a hidden layer and an output layer; the input layer acquires the characteristic sequence; and the hidden layer carries out operation on the characteristic sequence according to a Gelu function and a cross entropy function and trains the fully-connected neural network to obtain the classification parameters of the fully-connected neural network.
Based on the same inventive concept, the present invention further provides a computer-readable storage medium on which a program for realizing information transfer is stored, the program, when executed by a processor, implementing the steps of any of the above classification methods.
Of course, the present invention is capable of various other embodiments, and its details are capable of modification and variation by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method of classifying web site listing pages, the method based on hypertext markup language tags (HTML tags), the method comprising:
step 100, acquiring a group of website webpages;
step 200, respectively extracting statistical features and structural features of the website webpages aiming at each website webpage to obtain a feature sequence corresponding to each website webpage;
step 300, inputting the feature sequence into a neural network to train the neural network, and obtaining a website list page classifier;
step 400, acquiring a website webpage to be classified, obtaining the feature sequence of the website webpage to be classified according to step 200, inputting the feature sequence into the website list page classifier obtained in step 300, and judging whether the website webpage to be classified is a website list page;
the structural features include N-gram features, wherein the N-gram features include unigram (uni-gram) features and bigram (bi-gram) features;
the extracting step of the N-gram features comprises the following steps:
step 210, parsing each website webpage into a Document Object Model (DOM) tree, and expressing the Document Object Model (DOM) tree as an HTML tag sequence;
step 220, classifying each tag element in the HTML tag sequence;
step 230, extracting the N-gram feature for each different category of tag elements in the HTML tag sequence;
the tag elements in the HTML tag sequence are divided into: tags containing external links, tags without external links, and text tags;
a tag containing an external link comprises a link address (URL) pointing to the outside, a tag without an external link does not comprise a link address pointing to the outside, and the text tags consist of the parts other than the tags containing external links and the tags without external links.
2. The method of classifying web site listing pages according to claim 1, wherein the statistical features include:
the number of occurrences of each hypertext markup language tag (HTML tag) and the reciprocal of that number;
the number of links in each hypertext markup language tag (HTML tag) and the reciprocal of that number;
the number of times text is present in each hypertext markup language tag (HTML tag) and the reciprocal of that number;
the link length of the website webpage and the reciprocal of the link length;
the link depth of the website webpage and the reciprocal of the link depth;
the extreme values of the number of texts in plain-text hypertext markup language tags (HTML tags) and the reciprocals of the extreme values;
the variance of the number of texts in the hypertext markup language tags (HTML tags) and the reciprocal of the variance;
the mean value of the number of texts in the hypertext markup language tags (HTML tags) and the reciprocal of the mean value;
the mean square error of the number of texts in the hypertext markup language tags (HTML tags) and the reciprocal of the mean square error.
3. The method of classifying web site listing pages according to any one of claims 1 or 2, wherein the neural network is a fully connected neural network comprising an input layer, a hidden layer and an output layer.
4. A method of classifying web site listing pages in accordance with claim 3 wherein the activation function of the fully connected neural network is a Gelu function and the loss function is a cross entropy function.
5. The method for classifying web site listing pages according to claim 4, wherein the training step of the fully-connected neural network comprises:
step 310, inputting the feature sequence into the input layer;
step 320, the hidden layer calculates and trains the fully connected neural network according to the Gelu function and the cross entropy function for the feature sequence to obtain the classification parameters of the fully connected neural network;
step 330, the output layer outputs the classification result of the website webpage according to the classification parameter;
and when the classification result is [0,1], the input website webpage is a list page.
6. The method of claim 5, wherein in step 320, training of the fully connected neural network is accelerated using a label smoothing method, an exponential moving average method, and/or a batch normalization method.
7. The method according to claim 5, wherein in the step 320, classification parameters of the fully connected neural network are obtained by a back propagation algorithm and a gradient descent method.
8. A classification system for web site listing pages, the classification system based on hypertext markup language tags (HTML tags), the classification system comprising:
the webpage acquisition module is used for acquiring a group of website webpages to be classified;
the feature extraction module is used for extracting, for each website webpage, statistical features and structural features of the website webpage to obtain a feature sequence corresponding to each website webpage;
the webpage classification module is provided with a pre-trained neural network classification model, and the neural network classification model is used for judging whether the website webpages to be classified are website list pages or not according to the feature sequence;
the structural features include N-gram features, wherein the N-gram features include unigram (uni-gram) features and bigram (bi-gram) features;
the extraction process of the N-gram features comprises the following steps:
parsing each website webpage into a Document Object Model (DOM) tree, and expressing the Document Object Model (DOM) tree as an HTML tag sequence;
classifying each tag element in the HTML tag sequence;
extracting the N-gram features for tag elements of different categories in the HTML tag sequence;
the tag elements in the HTML tag sequence are divided into: tags containing external links, tags without external links, and text tags;
a tag containing an external link comprises a link address (URL) pointing to the outside, a tag without an external link does not comprise a link address pointing to the outside, and the text tags consist of the parts other than the tags containing external links and the tags without external links.
9. The classification system of web site listing pages of claim 8, wherein the classification system further comprises: the training module is used for training the neural network;
the neural network is a fully-connected neural network, the activation function of the fully-connected neural network is a Gelu function, and the loss function is a cross entropy function; the fully-connected neural network comprises an input layer, a hidden layer and an output layer; the input layer acquires the characteristic sequence; and the hidden layer carries out operation on the characteristic sequence according to a Gelu function and a cross entropy function and trains the fully-connected neural network to obtain the classification parameters of the fully-connected neural network.
10. A computer-readable storage medium, wherein a program for realizing information transfer is stored on the computer-readable storage medium, which when executed by a processor, realizes the steps of the method for classifying web site list pages according to any one of claims 1 to 7.
CN202011161424.8A 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages Active CN112287272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011161424.8A CN112287272B (en) 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011161424.8A CN112287272B (en) 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages

Publications (2)

Publication Number Publication Date
CN112287272A CN112287272A (en) 2021-01-29
CN112287272B true CN112287272B (en) 2023-05-23

Family

ID=74372392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011161424.8A Active CN112287272B (en) 2020-10-27 2020-10-27 Method, system and storage medium for classifying website list pages

Country Status (1)

Country Link
CN (1) CN112287272B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884053B (en) * 2021-02-28 2022-04-15 江苏匠算天诚信息科技有限公司 Website classification method, system, equipment and medium based on image-text mixed characteristics
CN113806667B (en) * 2021-09-26 2023-10-03 上海交通大学 Method and system for supporting webpage classification
CN114817811B (en) * 2022-05-07 2024-03-19 盐城天眼察微科技有限公司 Website analysis method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020067A (en) * 2011-09-21 2013-04-03 北京百度网讯科技有限公司 Method and device for determining webpage type
CN103544210A (en) * 2013-09-02 2014-01-29 烟台中科网络技术研究所 System and method for identifying webpage types
CN109144513A (en) * 2018-08-22 2019-01-04 上海嘉道信息技术有限公司 A kind of method of automatic extraction list page
CN110110075A (en) * 2017-12-25 2019-08-09 中国电信股份有限公司 Web page classification method, device and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180025012A1 (en) * 2016-07-19 2018-01-25 Fortinet, Inc. Web page classification based on noise removal

Also Published As

Publication number Publication date
CN112287272A (en) 2021-01-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant