CN110457579B - Webpage denoising method and system based on cooperative work of template and classifier

Info

Publication number: CN110457579B
Application number: CN201910694087.XA
Authority: CN
Inventors: 王运锋; 严金承
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2022-03-22
Anticipated expiration: 2039-07-30
Also published as: CN110457579A

Abstract

The invention discloses a webpage denoising method and system based on the cooperative work of a template and a classifier. The denoising method comprises the following steps: parsing the obtained original HTML document, deleting irrelevant tag nodes, and generating a simplified DOM tree that meets the requirements; calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set; adding the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggering a template generation algorithm to update the template node set of that website; filtering the original node set of the target webpage with the template node set of the website to which the target webpage belongs to obtain a filtered target webpage node set; and classifying the filtered target webpage node set with a trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes. The method requires little manual intervention, is efficient, and is suitable for denoising topic-type webpages of all kinds.

Description

Webpage denoising method and system based on cooperative work of template and classifier

Technical Field

The invention relates to the technical field of webpage denoising, in particular to a webpage denoising method and system based on cooperative work of a template and a classifier.

Background

With the continuous development of internet technology, the amount of information on the internet is growing explosively. The massive volume of webpage information is the main embodiment of internet information and a natural data source for many other research fields, including search engines, public opinion analysis, and natural language processing. However, besides its main content, a webpage also carries information unrelated to that content, such as commercial advertisements, navigation bars, copyright information, and announcements. This information can be called webpage noise. How to remove the noise from a webpage and extract its main content for analysis and use in these fields is therefore of significant research and practical value.

At present, the main approaches to webpage denoising include rule-based methods, template-based methods, and methods based on visual content. Rule-based methods preset heuristic rules and select the text content that satisfies them; they suit only simple webpages, because pages with complex structure require correspondingly complex heuristic rules, which limits the approach. Template-based methods denoise quickly, but a template for a specific website usually has to be constructed manually, so they cannot serve as a general webpage denoiser. In 2010, lilililii et al., in the thesis "Document information extraction method research based on an HTML tree and a template", used webpage similarity calculation to classify webpages and constructed a template for each class. The template relies on the position of the main content: when the main content is spread over multiple Document Object Model (DOM) nodes, the nearest parent node containing the main content is selected as the template, so the extracted content may contain a large amount of noise and the denoising effect suffers considerably. Methods based on visual content first divide a webpage into blocks, predict the importance of each block using manual labeling together with a neural network and a support vector machine, and finally select the block with the highest importance; these methods are computationally expensive and inefficient.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a webpage denoising method and system based on the cooperative work of a template and a classifier, which automatically generates a denoising template for preprocessing, then uses a classifier in cooperation with the template to classify DOM nodes, and finally extracts the main content. The method requires little manual intervention, is efficient, and is suitable for denoising topic-type webpages of all kinds.

In order to solve the technical problems, the invention adopts the technical scheme that:

a webpage denoising method based on cooperative work of a template and a classifier comprises the following steps:

step 1: downloading a target webpage and acquiring an original HTML document;

step 2: parsing the original HTML document, deleting irrelevant tag nodes, correcting the DOM tree, and generating a simplified DOM tree that meets the requirements;

step 3: calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set of the target webpage;

step 4: generating a template, namely adding the original node set to the cache node set of the corresponding website and, when the number of elements in the cache node set reaches a preset threshold, triggering a template generation algorithm to update the template node set of the corresponding website;

step 5: filtering the original node set of the target webpage with the template node set of the website to which the target webpage belongs, and outputting a filtered target webpage node set;

step 6: training a classifier, namely labeling some nodes in advance as noise or main content, and training the classifier with the labeled nodes as training samples until the classifier achieves a preset classification effect;

step 7: classifying the filtered target webpage node set with the trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes.

Further, step 1 specifically comprises webpage downloading and webpage discovery; webpage downloading is responsible for downloading the target webpage and storing it in the database, classified by its domain name and address, and webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.

Further, step 2 specifically comprises preprocessing and correction; preprocessing is responsible for deleting tags that contain no text content, including comments, scripts, and styles, and correction repairs correctable errors in the DOM tree, including "< >" matching errors and tag-pair matching errors.

Further, in step 3, the node features include: the ratio of node text length to document text length; the node text length; the ratio of punctuation length in the node text to node text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link-text characters inside the node to node text length; and the ratio of the number of link tags plus image tags inside the node to node text length.

Further, in step 6, a classifier model adopted by the classifier is a Support Vector Machine (SVM) or a Classification And Regression Tree (CART).

A webpage denoising system based on cooperative work of a template and a classifier comprises a webpage crawler module, an HTML preprocessing module, a DOM tree feature vector calculation module, a template generation module, a template preprocessing module, a classifier training module and a classifier prediction module;

the webpage crawler module is used for downloading a target webpage and acquiring an original HTML document;

the HTML preprocessing module is used for analyzing an original HTML document, deleting irrelevant tag nodes, correcting the DOM tree and generating a simplified DOM tree meeting the requirement;

the DOM tree feature vector calculation module is used for calculating the feature of each block level node in the DOM tree of the target webpage to obtain an original node set of the target webpage;

the template generation module is used for adding the original node set into a cache node set of the corresponding website, and when the number of elements in the cache node set reaches a preset threshold value, triggering a template generation algorithm to update the template node set of the corresponding website;

the template preprocessing module is used for filtering an original node set of a target webpage by using a template node set of a website to which the target webpage belongs and outputting a filtered target webpage node set;

the classifier training module is used for training a classifier, namely, some nodes are marked as a noise class and a main body class in advance, and the marked nodes are used as training samples to train the classifier until the classifier achieves a preset classification effect;

the classifier prediction module is used for classifying the filtered target webpage node set with the trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes.

Compared with the prior art, the invention has the following beneficial effects: the template and the classifier work cooperatively, denoising is carried out in two stages, and the denoising effect is good. In the first stage, the noise information common to the pages of a target website is identified automatically and used as a template to filter noise from the target webpage; in the second stage, webpage denoising is treated as a classification problem, and the classifier selects the main content. The first stage is fast and needs no manual intervention, and because it filters out part of the noise it greatly reduces the processing burden of the second stage. The method is widely applicable and is a general denoising method for topic-type webpages.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a schematic diagram of the system of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, the denoising method of the present invention comprises the following steps:

First, an original HTML document is acquired with web crawler technology, which comprises webpage downloading and webpage discovery. Webpage downloading is responsible for downloading the target webpage and storing it in a database, classified by its domain name and address; webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.
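The download and discovery loop can be illustrated with a minimal sketch. The names `crawl`, `fetch`, `allowed_domains`, and `max_pages` are hypothetical and not taken from the patent, the in-memory `store` stands in for the database, and the regular expression is only a crude link finder used for illustration.

```python
import re
from collections import defaultdict, deque
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href="([^"]+)"')  # crude link finder, illustration only


def crawl(seeds, fetch, allowed_domains, max_pages=100):
    """Download pages and discover new URLs.

    fetch(url) -> HTML string is a hypothetical callable supplied by the caller.
    Returns {domain: {url: html}}, i.e. pages classified by domain and address.
    """
    frontier = deque(seeds)      # list of URLs waiting to be crawled
    seen = set(seeds)
    store = defaultdict(dict)    # stand-in for the database
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        html = fetch(url)                         # webpage downloading
        store[urlparse(url).netloc][url] = html
        downloaded += 1
        for href in HREF_RE.findall(html):        # webpage discovery
            link = urljoin(url, href)
            if urlparse(link).netloc in allowed_domains and link not in seen:
                seen.add(link)
                frontier.append(link)
    return store
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client; a real crawler would also need politeness delays and error handling.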

Second, the original HTML document is processed, which includes preprocessing and correction. Preprocessing deletes tags that contain no text content, such as comments, scripts, and styles; correction repairs correctable errors in the DOM tree, including "< >" matching errors and tag-pair matching errors. After processing, a simplified DOM tree that meets the requirements is output.
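One possible way to approximate this step with the Python standard library is sketched below. It is an assumption-laden illustration, not the patent's procedure: the class name `Simplifier` and function `simplify` are hypothetical, attributes are discarded for brevity, and tag-pair repair is done with a simple stack that closes any tags left open.

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}                          # tags removed with their content
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags that never have a close tag


class Simplifier(HTMLParser):
    """Drops comments, scripts and styles, and repairs unbalanced tag pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the cleaned document
        self.stack = []       # currently open tags
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        self.out.append(f"<{tag}>")   # attributes are dropped in this sketch
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.stack:
            return                    # stray closing tag: ignore it
        while self.stack:             # close everything opened after `tag`
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.out.append(data.strip())

    # handle_comment is not overridden, so comments are silently discarded.


def simplify(html):
    parser = Simplifier()
    parser.feed(html)
    parser.close()
    while parser.stack:               # close tags still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

For example, `simplify('<div><!-- c --><script>var a="<p>";</script><p>Hello<b>world</div>')` yields a balanced fragment with the comment and script removed.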

Third, features are calculated for each block-level node in the DOM tree and stored in the node structure, and the original node set OriginNodes corresponding to the DOM tree is output. The features are: the ratio of node text length to document text length; the node text length; the ratio of punctuation length in the node text to node text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link-text characters inside the node to node text length; and the ratio of the number of link tags plus image tags inside the node to node text length. When these features are computed, the content of block-level nodes nested below a block-level node is excluded, and the feature vector of each block-level node is calculated from the top down.
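A minimal sketch of the eight-component feature vector follows. The `BlockNode` structure, the `PUNCT` set, and the zero-safe `ratio` helper are assumptions made for illustration; in particular, the source does not define how the node weight score is obtained, so it appears here only as a placeholder field supplied by the caller.

```python
from dataclasses import dataclass

PUNCT = set(",.;:!?，。；：！？、")  # assumed punctuation set (ASCII plus common CJK marks)


@dataclass
class BlockNode:
    text: str            # own text, excluding nested block-level nodes
    link_text: str       # characters that sit inside <a> tags of this node
    n_links: int         # number of link tags in this node
    n_images: int        # number of image tags in this node
    weight: float = 0.0  # node weight score; not defined in the source, placeholder


def ratio(a, b):
    """a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def feature_vector(node, doc_text_len, doc_links, doc_images):
    n = len(node.text)
    punct = sum(ch in PUNCT for ch in node.text)
    return [
        ratio(n, doc_text_len),                 # node text / document text
        n,                                      # node text length
        ratio(punct, n),                        # punctuation / node text
        ratio(node.n_links, doc_links),         # node links / document links
        ratio(node.n_images, doc_images),       # node images / document images
        node.weight,                            # node weight score (placeholder)
        ratio(len(node.link_text), n),          # link-text characters / node text
        ratio(node.n_links + node.n_images, n), # (links + images) / node text
    ]
```

Long, punctuation-rich nodes with few links tend to be main content, while short, link-dense nodes tend to be navigation or advertising, which is the intuition these ratios capture.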

Fourth, a template is generated automatically for the webpages of a site. A template node set PatternNodes and a cache node set TempNodes are maintained for each website. The original node set OriginNodes of a target webpage is added to the cache node set TempNodes of the corresponding site. Once the number of elements in TempNodes exceeds a set threshold, each node in TempNodes is counted. Nodes whose text repeats with high frequency are generally nodes carrying the website's copyright information, repeated advertisements, and other noise content; these nodes are added to the template node set PatternNodes. This set is the template of the website and records the noise information common to the webpages under that website.
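The per-site cache and template update can be sketched as follows. This is only an illustration under stated assumptions: nodes are represented by their text, the class name `TemplateStore` and the parameters `cache_threshold` and `min_repeats` are hypothetical, the trigger uses "reaches" (>=) as in the claims, and clearing the cache after each update is a design choice the source does not specify.

```python
from collections import Counter, defaultdict


class TemplateStore:
    """Keeps TempNodes (cache) and PatternNodes (template) for each website."""

    def __init__(self, cache_threshold=1000, min_repeats=5):
        self.cache_threshold = cache_threshold   # assumed value, not from the source
        self.min_repeats = min_repeats           # assumed value, not from the source
        self.temp_nodes = defaultdict(list)      # site -> cached node texts
        self.pattern_nodes = defaultdict(set)    # site -> template node texts

    def add_page(self, site, origin_nodes):
        cache = self.temp_nodes[site]
        cache.extend(origin_nodes)
        if len(cache) >= self.cache_threshold:
            counts = Counter(cache)
            # Text repeated across many pages is treated as site-wide noise.
            self.pattern_nodes[site].update(
                text for text, c in counts.items() if c >= self.min_repeats
            )
            cache.clear()   # assumption: start a fresh cache after each update
```

With a threshold of 6 and `min_repeats` of 3, three pages that each contain the same copyright line and one unique story would place only the copyright line in the template.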

Fifth, part of the noise in the target webpage is filtered out using the template node set PatternNodes, and the filtered target webpage node set PreNodes is output, where PreNodes = OriginNodes - PatternNodes.
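The set difference can be written directly. The sketch below assumes nodes are compared by their text and keeps document order, which a plain set subtraction would lose; the function name is hypothetical.

```python
def filter_with_template(origin_nodes, pattern_nodes):
    """PreNodes = OriginNodes - PatternNodes, preserving document order."""
    return [node for node in origin_nodes if node not in pattern_nodes]


pre_nodes = filter_with_template(
    ["Home | News", "Main story text.", "Copyright 2019"],
    {"Home | News", "Copyright 2019"},
)
```

Because membership tests against a set are constant time on average, this stage stays cheap even for large templates, consistent with the speed claimed for the first stage.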

Sixth, an SVM or CART classifier is trained. Some nodes are labeled in advance as noise or main content, the labeled nodes are used as training samples to train the classifier, training stops when the classifier achieves the preset classification effect, and the trained classifier is output.
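To make the training step concrete without depending on an external library, the sketch below fits a single Gini-impurity split, which is the building block of CART rather than a full CART tree or an SVM. The function names, the 0/1 label encoding (0 for noise, 1 for main content), and the toy data are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def majority(labels):
    """Most common label; ties go to 1 (main content)."""
    return int(sum(labels) * 2 >= len(labels))


def train_stump(X, y):
    """Find the (feature, threshold) split with the lowest weighted Gini impurity.

    Returns (feature_index, threshold, left_label, right_label).
    """
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, majority(left), majority(right))
    if best is None:
        raise ValueError("no usable split in the training data")
    return best[1:]


def predict(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


# Toy samples: [node text length, link-text ratio]; label 1 = main content.
X = [[5, 0.9], [8, 0.8], [300, 0.1], [450, 0.05]]
y = [0, 0, 1, 1]
stump = train_stump(X, y)
```

A full CART tree applies this split search recursively to each side; checking accuracy on held-out labeled nodes is one way to decide whether the "preset classification effect" has been reached.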

Seventh, the classifier is used to classify the nodes in the filtered target webpage node set PreNodes into a noise node set and a main content node set RstNodes, and the text in RstNodes is output.
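The final step can be sketched as a partition followed by text extraction. Here each node is assumed to be a dictionary with `text` and `features` keys, and `is_main` is a hypothetical callable standing in for the trained classifier; the toy rule used below is not the patent's classifier.

```python
def extract_main_content(pre_nodes, is_main):
    """Split PreNodes into noise and RstNodes, then join the RstNodes text."""
    rst_nodes = [n for n in pre_nodes if is_main(n["features"])]
    noise_nodes = [n for n in pre_nodes if not is_main(n["features"])]
    return "\n".join(n["text"] for n in rst_nodes), noise_nodes


pre_nodes = [
    {"text": "Related links", "features": [13, 0.9]},
    {"text": "The council approved the plan on Monday.", "features": [40, 0.0]},
]
# Toy stand-in for the trained classifier: long text with a low link ratio.
main_text, noise = extract_main_content(
    pre_nodes, lambda f: f[0] > 20 and f[1] < 0.5
)
```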

As shown in FIG. 2, the system implementing the method includes: a web crawler module 101, an HTML preprocessing module 102, a DOM tree feature vector calculation module 103, a template generation module 104, a database 105, a template preprocessing module 106, a classifier training module 107, and a classifier prediction module 108.

The web crawler module 101 is responsible for continuously crawling new target webpages that meet the requirements;

the HTML preprocessing module 102 is connected with module 101; it deletes irrelevant tags from the target webpage, corrects mismatched tag pairs, and outputs a simplified DOM tree;

the DOM tree feature vector calculation module 103 is connected with module 102; it calculates the feature vectors of the simplified DOM tree and outputs the original node set OriginNodes of the target webpage;

the template generation module 104 is connected with module 103; it performs template generation on the original node set OriginNodes to produce the template node set PatternNodes;

the database 105 is connected with module 104 and persists the generated template node set PatternNodes;

the template preprocessing module 106 is connected with module 103 to obtain the original node set OriginNodes generated by module 103, and is also connected with the database 105 to query the template node set PatternNodes of the website to which the target webpage belongs; it outputs the filtered target webpage node set PreNodes;

the classifier training module 107 is responsible for training the classifier;

the classifier prediction module 108 is connected with module 106 to receive the filtered target webpage node set PreNodes, and is also connected with module 107 to receive the trained classifier; it uses the classifier to divide the set into noise and main content and outputs the main content.

The technical effect of the invention is verified by the specific examples below.

Step S201: A Uniform Resource Locator (URL) is taken from the queue to be crawled and the webpage is downloaded; URLs in the webpage that meet the conditions are added to the queue to be crawled, and the process returns to step S201, so that webpages are collected without interruption. At the same time, the webpage is preprocessed, which includes deleting irrelevant tags and correcting mismatched tag pairs. The webpage is then parsed into a DOM tree, and the process proceeds in parallel to step S202.

Step S202: For the DOM tree output in step S201, a feature vector is calculated for each block-level node from the bottom up. The feature components are: the ratio of node text length to document text length; the node text length; the ratio of punctuation length in the node text to node text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link-text characters inside the node to node text length; and the ratio of the number of link tags plus image tags inside the node to node text length. When these features are computed, the content of descendant block-level nodes is not counted toward the node, and each feature vector is stored in its node, so that the whole DOM tree yields an original node set OriginNodes; the process then goes to step S203. If the original node set OriginNodes is to be used for classifier training, the process also proceeds in parallel to step S204.

Step S203: The original node set OriginNodes output in step S202 is added to the cache maintained for the website to which the target webpage belongs. Once the number of elements in the cache reaches a set threshold, the common noise information is extracted, the extraction result is added to the template node set PatternNodes, and the process goes to step S205; otherwise, the process goes directly to step S205.

Step S204: The original node set OriginNodes is labeled manually and used to train the classifier; training can stop once the classifier meets the system requirements. This step is needed only when the current classifier does not meet the system requirements and a new classifier must be trained. After training is finished, the process goes to step S206 to update the current classifier.

Step S205: The template node set PatternNodes is used to filter the original node set OriginNodes. The effect is to filter out part of the noise in the target webpage, namely the noise common to the target website, including website copyright information, some advertisements, and website page-structure information. The resulting node set is the filtered target webpage node set PreNodes, and the process goes to step S206.

Step S206: The filtered target webpage node set PreNodes is classified with the current classifier, and the content of the nodes classified as main content is output.

In this way, 24,334 topic-type webpages were collected from websites including Reference News (Cankao Xiaoxi), People's Daily, Sichuan Daily, West China Metropolis Daily, Tencent News, Sohu News, Sina News, Toutiao, Phoenix (ifeng.com), Guangming Online, Huanqiu.com, the Sichuan Provincial People's Government, and municipal people's governments, and were denoised. A random sample of 2,000 was inspected: the average denoising accuracy was 98.64% and the average recall was 93.46%. The method has been applied in a public opinion analysis system, where it improves the quality of the system's corpus and is of great significance for improving the accuracy of public opinion analysis.
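The source does not state how accuracy and recall were measured. As one plausible reading, assuming "accuracy" here means precision and that extraction is compared with a manually labeled reference at the node level, the two figures for a single page could be computed as in this sketch (the function name and inputs are hypothetical):

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted main-content nodes against a reference."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


p, r = precision_recall(["a", "b", "c"], ["a", "b", "d", "e"])
```

Averaging these per-page values over the inspected sample would give figures of the kind reported above.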

Claims

1. A webpage denoising method based on cooperative work of a template and a classifier is characterized by comprising the following steps:

step 1: downloading a target webpage and acquiring an original HTML document;

step 3: calculating the features of each block-level node in the DOM tree of the target webpage to obtain an original node set of the target webpage; the node features include: the ratio of node text length to document text length; the node text length; the ratio of punctuation length in the node text to node text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link-text characters inside the node to node text length; and the ratio of the number of link tags plus image tags inside the node to node text length; when the features are computed, the content of block-level nodes nested below a block-level node is excluded, and the feature vector of each block-level node is calculated from the top down;

step 4: generating a template, namely automatically generating a template for the webpages of a website: maintaining a template node set and a cache node set for each website, adding the original node set to the cache node set of the corresponding website, and triggering a template generation algorithm when the number of elements in the cache node set reaches a preset threshold, namely counting each node in the cache node set and adding the nodes whose text repetition frequency exceeds a set value to the template node set of the corresponding website, so as to update that template node set;

step 6: training a classifier, namely labeling some nodes in advance as noise or main content, adding the labeled nodes to a set, and training the classifier with the nodes in the set as training samples until the classifier achieves a preset classification effect;

step 7: classifying the nodes in the filtered target webpage node set with the trained classifier, retaining the nodes classified as main content, and extracting the main content text from those nodes.

2. The webpage denoising method based on cooperative work of a template and a classifier as claimed in claim 1, wherein step 1 specifically comprises webpage downloading and webpage discovery; webpage downloading is responsible for downloading the target webpage and storing it in the database, classified by its domain name and address, and webpage discovery is responsible for finding new webpage addresses that meet the requirements and adding them to the list to be crawled.

3. The webpage denoising method based on cooperative work of a template and a classifier as claimed in claim 1, wherein step 2 specifically comprises preprocessing and correction; preprocessing is responsible for deleting tags that contain no text content, including comments, scripts, and styles, and correction repairs correctable errors in the DOM tree, including "< >" matching errors and tag-pair matching errors.

4. The method for denoising web pages based on the cooperation of the template and the classifier as claimed in claim 1, wherein in step 6, the classifier model adopted by the classifier is a support vector machine or a classification regression tree.

5. A webpage denoising system based on cooperative work of a template and a classifier is characterized by comprising a webpage crawler module, an HTML preprocessing module, a DOM tree feature vector calculation module, a template generation module, a template preprocessing module, a classifier training module and a classifier prediction module;

the node features include: the ratio of node text length to document text length; the node text length; the ratio of punctuation length in the node text to node text length; the ratio of the number of link tags in the node to the number of link tags in the document; the ratio of the number of image tags in the node to the number of image tags in the document; the node weight score; the ratio of link-text characters inside the node to node text length; and the ratio of the number of link tags plus image tags inside the node to node text length; when the features are computed, the content of block-level nodes nested below a block-level node is excluded, and the feature vector of each block-level node is calculated from the top down;

the template generation module is used for adding the original node set into a cache node set of a corresponding website, when the number of elements in the cache node set reaches a preset threshold value, a template generation algorithm is triggered, namely counting each node in the cache node set, and adding the node with the text repetition frequency exceeding a set value into the template node set of the corresponding website so as to update the template node set of the corresponding website;

the classifier prediction module is used for classifying the nodes in the filtered target webpage node set by using the trained classifier, reserving the nodes with classification results as main content, and extracting a main content text from the nodes.