CN102298638A - Method and system for extracting news webpage contents by clustering webpage labels

Info

Publication number: CN102298638A
Application number: CN2011102704180A
Authority: CN
Inventors: 高勇; 王放; 许欢庆; 郭永福; 陈沛
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: Beijing Zhongsou Network Technology Co ltd
Priority date: 2011-08-31
Filing date: 2011-08-31
Publication date: 2011-12-28

Abstract

The invention provides a method and a system for extracting news webpage content by clustering webpage tags. The method comprises the following steps: preprocessing the webpage content, namely parsing the webpage content into a document object model (DOM) tree and collecting statistics for each node of the DOM tree; deleting nodes of the DOM tree heuristically; deleting nodes of the DOM tree by rule; and deleting nodes of the DOM tree on the basis of tag structure clustering, so as to generate and output the final DOM tree.

Description

Method and system for extracting news web page content using web page tag clustering

Technical field

The present invention relates generally to the field of news web page content extraction and, more particularly, to a method and system for extracting news web page content using web page tag clustering.

Background art

In the field of news (or information) search, body text extraction is an indispensable step, and the quality of the extracted text directly determines the quality of news search and the user experience.

Many body text extraction methods exist at present. Depending on whether a template is used, they fall into two broad classes: extraction based on templates (or wrappers) and extraction without templates.

In template-based extraction, a template is first defined, and code then parses and executes the template to obtain the data. Depending on how the template is generated, this class is further divided into manual template extraction and automatic template extraction. In manual template extraction, a template is written by hand for each target site; the template may use regular-expression matching or simple string matching. In automatic template extraction, a machine learning algorithm is used: a portion of the web page data is first obtained from the target website for training, a template is learned from it, and a program then uses the template to extract data.

Non-template extraction is mostly implemented with statistical and learning-based approaches. The main algorithms at present are rule-based, block-based, vision-based, and so on. A representative example is Microsoft's vision-based page segmentation algorithm, which determines the main semantic blocks of a web page in three steps: page block extraction, separator extraction, and semantic block reconstruction.

The drawback of hand-written templates is that writing them consumes enormous human resources, and as the target websites change, the cost of maintaining the templates is also high. The drawback of automatic templates is that the algorithms are complex and the target websites must also be monitored periodically so that the templates keep up with changes. Whether templates are produced manually or automatically, the approach assumes that a website's data is generated from templates. For some large websites this is largely unproblematic, although different entry points may use different templates. For the many small and medium-sized websites, however, templating is poor: template-based extraction can recover only most of the information and is more likely to include junk information.

Because its rules are complex and its performance is low, the vision-based page segmentation algorithm is not well suited to news search engines.

Therefore, an effective method for extracting news web page content is needed.

Summary of the invention

The present invention is proposed to solve the above shortcomings and problems of the prior art. Addressing the deficiencies of existing news web page content extraction techniques, the invention extracts news content with an algorithm based on tag clustering, avoiding the drawbacks of manual rules and templates.

According to one aspect of the present invention, there is provided a method for extracting news web page content using web page tag clustering, comprising: preprocessing web page content, including parsing the web page content into a DOM tree and collecting statistics for each node of the DOM tree; heuristically deleting nodes of the DOM tree; deleting nodes of the DOM tree by rule; and deleting nodes of the DOM tree based on tag structure clustering, thereby generating a final DOM tree for output.

In the method, deleting nodes of the DOM tree based on tag structure clustering may comprise: collecting tag structure information for all nodes of the DOM tree that remain after heuristic deletion and rule-based deletion; performing similarity clustering on the collected tag structure information to obtain a plurality of classes; selecting the class with the most content among the plurality of classes, and taking the common parent node of the nodes in the selected class as the content node; and processing all other nodes according to the content node to form the final DOM tree.

The method may further comprise performing fine processing on the web page content.

The method may comprise taking the MD5 hash of the collected tag structure information and performing strict clustering in which identical MD5 values fall into the same class.

The information comprises: the number of punctuation symbols, the number of characters, the number of links, and the number of images. The symbol and character counts are further divided according to links into: the number of Chinese symbols in anchor text, the number of English symbols in anchor text, the number of Chinese characters in anchor text, and the number of English characters in anchor text; and the number of Chinese symbols in non-anchor text, the number of English symbols in non-anchor text, the number of Chinese characters in non-anchor text, and the number of English characters in non-anchor text.

In the method, deleting nodes of the DOM tree by rule may comprise: computing, for a node, the ratio of its link count to its non-anchor-text character count, and marking the node as deletable if the ratio is greater than a threshold. The threshold is based on the ratio of the page's global link count to its global non-anchor-text character count, or is an empirical value.

According to another aspect of the present invention, there is provided a system for extracting news web page content using web page tag clustering, comprising: a preprocessing module for preprocessing web page content, so as to parse the web page content into a DOM tree and collect statistics for each node of the DOM tree; a heuristic deletion module for heuristically deleting the nodes corresponding to specified tag objects in the DOM tree; a rule-based deletion module for deleting, by rule, nodes whose ratio of link count to non-anchor-text character count is greater than a specified threshold; and a tag structure clustering deletion module for deleting nodes based on tag structure clustering.

The system may further comprise a fine processing module for performing fine processing on the web page content.

Brief description of the drawings

The above and other aspects, features, and advantages of certain exemplary embodiments of the present invention will become apparent to those skilled in the art from the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart illustrating a method of extracting news web page content according to an exemplary embodiment of the present invention;

Fig. 2 is a flowchart illustrating a process of deleting nodes based on tag structure clustering according to an exemplary embodiment of the present invention; and

Fig. 3 is a block diagram illustrating a system according to an exemplary embodiment of the present invention.

Detailed description of the embodiments

The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present invention. It includes various specific details to assist in that understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

The present invention is described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart illustrating a method 100 of extracting news web page content according to an exemplary embodiment of the present invention.

As shown in Fig. 1, method 100 begins at step 110. In step 110, the web page content is preprocessed: it is parsed into a DOM (Document Object Model) tree, and statistics are collected for each node of the DOM tree. How to parse web page content into a DOM tree is well known to those skilled in the art and is therefore not described in detail here.

Preprocessing the web page content may comprise: tidying the tags of the web page content, that is, turning mismatched or unclosed tags into properly closed tags according to rules, and then parsing the tidied content into a DOM tree; and collecting statistics for each node.
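As a rough illustration of this preprocessing step, the following Python sketch builds a simple tree from possibly malformed HTML using the standard-library `html.parser` module. The `Node` and `TreeBuilder` classes and the closing rule used here (an end tag closes the nearest open element of that name, and stray end tags are ignored) are assumptions made for illustration, not the tidying rules of the invention.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node (not a real DOM API)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Builds a tree and implicitly closes elements left open."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:       # found: close it and everything inside
            self.current = node.parent
        # otherwise the end tag matches nothing and is ignored

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `parse("<div><b>bold<i>both</div>")` yields a `div` containing `b` containing `i`, with both inline elements treated as closed when `</div>` is reached.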

The information for a node may include: the number of punctuation symbols, the number of characters (for example, Chinese, English, digits, and others), the number of links, the number of images, and so on. Symbols and characters may be further divided according to links into: the number of Chinese symbols in anchor text, the number of English symbols in anchor text, the number of Chinese characters in anchor text, and the number of English characters in anchor text; and the number of Chinese symbols in non-anchor text, the number of English symbols in non-anchor text, the number of Chinese characters in non-anchor text, and the number of English characters in non-anchor text.
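The per-node statistics might be collected along the following lines. This is a minimal sketch: the `Node` class, the key names, and the character classification (CJK Unified Ideographs as Chinese characters, ASCII letters and digits as English characters, Unicode punctuation split into ASCII and non-ASCII) are all assumptions, since the description does not define them precisely.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical simplified DOM node used only for this sketch."""
    tag: str
    text: str = ""                               # text directly inside the node
    children: list = field(default_factory=list)
    stats: dict = field(default_factory=dict)    # filled in by collect_stats


def count_text(text):
    """Count Chinese/English characters and Chinese/English punctuation."""
    counts = {"zh_chars": 0, "en_chars": 0, "zh_punct": 0, "en_punct": 0}
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":           # CJK Unified Ideographs
            counts["zh_chars"] += 1
        elif ch.isascii() and ch.isalnum():
            counts["en_chars"] += 1
        elif unicodedata.category(ch).startswith("P"):
            counts["en_punct" if ch.isascii() else "zh_punct"] += 1
    return counts


def collect_stats(node, in_anchor=False):
    """Store subtree totals on each node, split into anchor and non-anchor text."""
    in_anchor = in_anchor or node.tag == "a"
    prefix = "anchor_" if in_anchor else "plain_"
    stats = {"links": int(node.tag == "a"), "images": int(node.tag == "img")}
    for key, value in count_text(node.text).items():
        stats[prefix + key] = value
    for child in node.children:
        for key, value in collect_stats(child, in_anchor).items():
            stats[key] = stats.get(key, 0) + value
    node.stats = stats
    return stats
```

Because each node stores totals for its whole subtree, later steps can compare a container's link count with its non-anchor text without walking the tree again.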

In step 120, nodes are deleted heuristically. This step operates on specified tag objects in the DOM tree. For example, the HTML tags involved in heuristic removal mainly include META, HR, IMG, STYLE, SCRIPT, NOSCRIPT, INPUT, SELECT, EMBED, BUTTON, OPTION, and OPTGROUP.

These HTML tag objects are deleted because they do not affect the content itself. For example, a SCRIPT section contains a good deal of punctuation and text, but it is not content that the page itself displays; likewise, a STYLE section controls how content is displayed and does not affect the content itself. These two kinds of nodes are therefore marked as deletable in the DOM tree. Deletion proceeds bottom-up, and when a node is deleted, its statistics are also removed from those of its parent node.
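The bottom-up deletion with statistics correction could look like the sketch below. The `Node` class and its single `chars` counter (standing in for the full set of per-node statistics) are hypothetical simplifications; only the tag list comes from the description.

```python
from dataclasses import dataclass, field

# Tags named in the description as targets of heuristic deletion.
NOISE_TAGS = {"meta", "hr", "img", "style", "script", "noscript", "input",
              "select", "embed", "button", "option", "optgroup"}


@dataclass
class Node:
    """Hypothetical node; chars is the character total of the whole subtree."""
    tag: str
    chars: int = 0
    children: list = field(default_factory=list)
    deletable: bool = False


def heuristic_delete(node):
    """Mark noise nodes bottom-up and return the character count removed."""
    # Children first, so their counts are subtracted before this node is judged.
    removed = sum(heuristic_delete(child) for child in node.children)
    node.chars -= removed
    if node.tag.lower() in NOISE_TAGS:
        node.deletable = True
        removed += node.chars       # what is left of this subtree goes too
        node.chars = 0
    return removed
```

Returning the removed amount lets every ancestor subtract exactly what disappeared beneath it, so the statistics stay consistent after deletion.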

In step 130, nodes are deleted by rule. This step operates on content tags such as table and div. In this step, the ratio of a node's link count to its non-anchor-text character count can be computed; if the ratio is greater than a threshold (for example, 0.05), the node can be marked as deletable.
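A minimal sketch of this rule follows. The `Node` class and the handling of nodes with links but no non-anchor text (treated here as an infinite ratio) are assumptions; the 0.05 default is the example threshold given above.

```python
from dataclasses import dataclass, field

CONTAINER_TAGS = {"table", "div"}   # content tags named in the description


@dataclass
class Node:
    """Hypothetical node carrying subtree totals."""
    tag: str
    links: int = 0                  # links in the subtree
    plain_chars: int = 0            # non-anchor-text characters in the subtree
    children: list = field(default_factory=list)
    deletable: bool = False


def link_ratio(links, plain_chars):
    """Links per non-anchor character; infinite if there are only links."""
    if plain_chars == 0:
        return float("inf") if links else 0.0
    return links / plain_chars


def rule_based_delete(node, threshold=0.05):
    """Mark link-heavy containers as deletable, top-down."""
    if (node.tag in CONTAINER_TAGS
            and link_ratio(node.links, node.plain_chars) > threshold):
        node.deletable = True
        return                      # the whole subtree is dropped
    for child in node.children:
        rule_based_delete(child, threshold)
```

A navigation block with 10 links and 20 characters of plain text has a ratio of 0.5 and is marked, while an article block with 1 link in 980 characters is kept.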

In step 140, nodes are deleted based on tag structure clustering, thereby generating the final DOM tree for output. The process of deleting nodes based on tag structure clustering is described below with reference to Fig. 2.

Fig. 2 is a flowchart illustrating a process 200 of deleting nodes based on tag structure clustering according to an exemplary embodiment of the present invention.

In step 210, tag structure information is collected for all nodes of the DOM tree that remain after heuristic deletion and rule-based deletion.

In step 220, similarity clustering is performed on the collected tag structure information to obtain a plurality of classes. For each node, there is a path from its parent node to the node itself; this path is called the structure path of the node, and the character string formed by all the tags on the path (for example, joined with a separator such as "-" or "/") is the path value. Performing similarity clustering on the collected tag structure information therefore means performing similarity clustering on the path values. For example, any of the K-means algorithm, the C-means algorithm, the EM algorithm, and the like can be used to cluster all path values by similarity. Each tag structure forms one (or more) classes, and each class comprises: an ancestor node pointer, which points to the ancestor of the content node that is finally found; a word count, used to determine the class with the most content; a junk word count and a deletion ratio, which can be used to eliminate noise in the content region; and a node set, from which the common ancestor node can be found.
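To make the notion of a path value concrete, the sketch below builds one for every node and groups similar values. Two assumptions are worth stating: the path is taken from the root of the tree down to the node, and the grouping is a greedy single-pass clustering on `difflib.SequenceMatcher` similarity. That grouping is a simple stand-in chosen for brevity; it is not K-means, C-means, or EM, and the `min_similarity` parameter is hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    """Hypothetical minimal node."""
    tag: str
    children: list = field(default_factory=list)


def path_values(node, prefix=""):
    """Yield (path_value, node) pairs, joining tag names with '/'."""
    path = f"{prefix}/{node.tag}" if prefix else node.tag
    yield path, node
    for child in node.children:
        yield from path_values(child, path)


def cluster_paths(paths, min_similarity=0.8):
    """Greedy clustering: join the first class whose representative is similar."""
    clusters = []                   # list of (representative, members)
    for path in paths:
        for representative, members in clusters:
            ratio = SequenceMatcher(None, representative, path).ratio()
            if ratio >= min_similarity:
                members.append(path)
                break
        else:
            clusters.append((path, [path]))
    return clusters
```

Paragraphs of an article typically share one path value, such as `html/body/div/p`, which is why the class holding the article text tends to be large.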

In step 230, the class with the most content (measured by the effective punctuation count and the non-anchor-text character count) is selected, and the common parent node of the nodes in that class is taken as the content node of the whole news web page.
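Selecting the largest class and locating its common parent might be sketched as follows. The `Node` class, the use of a simple sum of punctuation and non-anchor character counts as the size measure, and the dictionary layout of `classes` are assumptions for illustration.

```python
class Node:
    """Hypothetical node with a parent pointer and two content counters."""

    def __init__(self, tag, punct=0, plain_chars=0, parent=None):
        self.tag = tag
        self.punct = punct              # effective punctuation count
        self.plain_chars = plain_chars  # non-anchor-text character count
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def ancestors(node):
    """Ancestors of a node, nearest first."""
    chain = []
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return chain


def common_parent(nodes):
    """Lowest node that is an ancestor of every node in the class."""
    first, *rest = nodes
    for candidate in ancestors(first):
        if all(candidate in ancestors(other) for other in rest):
            return candidate
    return None


def content_size(nodes):
    return sum(n.punct + n.plain_chars for n in nodes)


def pick_content_node(classes):
    """classes maps a path value to the list of nodes sharing it."""
    largest = max(classes.values(), key=content_size)
    return common_parent(largest)
```

With two long paragraphs under an article `div` and a short list item under a navigation `div`, the paragraph class wins and the article `div` becomes the content node.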

In step 240, all nodes under the content node are processed to further eliminate noise, thereby forming the final DOM tree.

In addition, to further improve program efficiency in step 220, the MD5 hash of the collected tag structure information (that is, the path values) can be taken, and strict clustering can be performed in which identical MD5 values fall into the same class.
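Strict clustering by MD5 value reduces to grouping by hash key, as in this short sketch. The function names are hypothetical; `hashlib.md5` is the standard-library implementation.

```python
import hashlib
from collections import defaultdict


def md5_key(path_value):
    """Hex MD5 digest of a path value such as 'html/body/div/p'."""
    return hashlib.md5(path_value.encode("utf-8")).hexdigest()


def strict_cluster(path_values):
    """Group node indices whose path values have the same MD5 digest."""
    clusters = defaultdict(list)
    for index, path in enumerate(path_values):
        clusters[md5_key(path)].append(index)
    return dict(clusters)
```

Comparing fixed-length digests avoids pairwise similarity calculations, at the cost of treating paths that differ even slightly as separate classes.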

The method of the present invention has been described above; it can, however, be refined by fine processing of the news content. Some finance and economics websites carry many disclaimers, for example: "Sohu Securities statement: the information in this channel is quoted from partner media and partner institutions and does not represent the views or position of Sohu Securities. Investors are advised to judge this information carefully; anyone who trades on this basis does so at their own risk." In such a case, if the statement is treated as part of the news content while the actual news content is only a sentence or even shorter, the statement will affect computations on the news content, such as fingerprint calculation and news similarity calculation. To eliminate this adverse effect, the following two fine-processing approaches can be adopted: (1) for every node inside the news content node, if its deletion rate (number of retained nodes on the same path / total number of nodes on the same path) is high (above 90%), the node is deleted; (2) a vocabulary is built whose entries are word strings and word attributes, where each word string is a non-word string of 3 to 4 Chinese characters and its attribute is marked as statement, navigation, copyright, advertisement, and so on; the content of a tag is segmented by forward maximum matching to obtain the set of word attributes for that tag, and the node is kept or discarded according to the ratio (number of word attributes in the set / text length).
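Approach (2) can be illustrated with a small forward maximum matching routine. The vocabulary entries, the `max_ratio` cutoff, and the use of the raw number of matches as the numerator are all assumptions; the description gives neither a vocabulary nor a threshold.

```python
# Hypothetical vocabulary: word string -> attribute.
VOCAB = {
    "版权所有": "copyright",
    "免责声明": "statement",
    "责任自负": "statement",
    "返回首页": "navigation",
}
MAX_LEN = max(len(word) for word in VOCAB)


def forward_max_match(text):
    """Scan left to right, always trying the longest vocabulary entry first."""
    hits, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in VOCAB:
                hits.append(VOCAB[piece])
                i += size
                break
        else:
            i += 1                  # no entry starts here, move on
    return hits


def noise_ratio(text):
    """Number of vocabulary matches divided by text length."""
    return len(forward_max_match(text)) / len(text) if text else 0.0


def keep_node(text, max_ratio=0.1):
    """Keep a node unless boilerplate terms are dense in its text."""
    return noise_ratio(text) <= max_ratio
```

A short disclaimer made up mostly of vocabulary terms yields a high ratio and is dropped, while ordinary news sentences rarely match any entry and are kept.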

Fig. 3 is a block diagram illustrating a system 300 according to an exemplary embodiment of the present invention.

As shown in Fig. 3, system 300 may comprise a preprocessing module 310, a heuristic deletion module 320, a rule-based deletion module 330, and a tag structure clustering deletion module 340. Optionally, system 300 may further comprise a fine processing module (not shown in Fig. 3).

Specifically, the preprocessing module 310 preprocesses the web page content, so as to parse the web page content into a DOM tree and collect statistics for each node of the DOM tree.

The heuristic deletion module 320 heuristically deletes the nodes corresponding to specified tag objects in the DOM tree.

The rule-based deletion module 330 deletes, by rule, nodes whose ratio of link count to non-anchor-text character count is greater than a specified threshold.

The tag structure clustering deletion module 340 deletes nodes based on tag structure clustering. Since deleting nodes based on tag structure clustering has been described in detail above with reference to Fig. 2, it is not described again here.

According to the present invention, a method and system are provided for extracting news web page content using web page tag clustering. Those of ordinary skill in the art will recognize that the method and system of the present invention can achieve the following advantages: (1) analysis is based on a single web page and needs no template, saving a great deal of manual work; (2) the algorithm is simple and analysis is efficient; (3) high-quality data is provided for subsequent fingerprint calculation, content clustering, and news event clustering.

It should be noted that the system and method embodiments of the present invention are described separately above, but details described for one embodiment also apply to the other.

The basic principles of the present invention have been described above in connection with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and system of the present invention can be implemented in software, hardware, firmware, or a combination thereof, and that those of ordinary skill in the art can do so using their basic programming skills after reading the description of the invention.

Therefore, the object of the present invention can also be achieved by running a software module or a set of software modules on any computing device. The computing device may be a well-known general-purpose device. Accordingly, the object of the present invention can also be achieved merely by providing a program product containing program code that implements the method or system. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.

Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms, or portions of code).

The above embodiments do not limit the scope of protection of the present invention. Those skilled in the art should understand that, depending on design requirements and other factors, various modifications, combinations, subcombinations, and substitutions may occur. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for extracting news web page content using web page tag clustering, comprising:

preprocessing web page content, including parsing the web page content into a DOM tree and collecting statistics for each node of the DOM tree;

heuristically deleting nodes of the DOM tree;

deleting nodes of the DOM tree by rule; and

deleting nodes of the DOM tree based on tag structure clustering, thereby generating a final DOM tree for output.

2. The method according to claim 1, wherein deleting nodes of the DOM tree based on tag structure clustering comprises:

collecting tag structure information for all nodes of the DOM tree that remain after heuristic deletion and rule-based deletion;

performing similarity clustering on the collected tag structure information to obtain a plurality of classes;

selecting the class with the most content among the plurality of classes, and taking the common parent node of the nodes in the selected class as the content node; and

processing all nodes under the content node to form the final DOM tree.

3. The method according to claim 1, wherein the method further comprises performing fine processing on the web page content.

4. The method according to claim 2, wherein the MD5 hash of the collected tag structure information is taken, and strict clustering is performed in which identical MD5 values fall into the same class.

5. The method according to claim 1, wherein the information comprises: the number of punctuation symbols, the number of characters, the number of links, and the number of images.

6. The method according to claim 1, wherein the symbol and character counts are divided according to links into: the number of Chinese symbols in anchor text, the number of English symbols in anchor text, the number of Chinese characters in anchor text, and the number of English characters in anchor text; and the number of Chinese symbols in non-anchor text, the number of English symbols in non-anchor text, the number of Chinese characters in non-anchor text, and the number of English characters in non-anchor text.

7. The method according to claim 1, wherein deleting nodes of the DOM tree by rule comprises: computing, for a node, the ratio of its link count to its non-anchor-text character count, and marking the node as deletable if the ratio is greater than a threshold.

8. The method according to claim 1, wherein the threshold is based on the ratio of the page's global link count to its global non-anchor-text character count, or is an empirical value.

9. A system for extracting news web page content using web page tag clustering, comprising:

a preprocessing module for preprocessing web page content, so as to parse the web page content into a DOM tree and collect statistics for each node of the DOM tree;

a heuristic deletion module for heuristically deleting the nodes corresponding to specified tag objects in the DOM tree;

a rule-based deletion module for deleting, by rule, nodes whose ratio of link count to non-anchor-text character count is greater than a specified threshold; and

a tag structure clustering deletion module for deleting nodes based on tag structure clustering.

10. The system according to claim 9, wherein the system further comprises:

a fine processing module for performing fine processing on the web page content.