CN102541874A - Webpage text content extracting method and device

Info

Publication number: CN102541874A
Application number: CN2010105915066A
Authority: CN
Inventors: 周奕; 周宇煜; 吴淑燕
Original assignee: China Mobile Communications Group Co Ltd
Current assignee: China Mobile Communications Group Co Ltd
Priority date: 2010-12-16
Filing date: 2010-12-16
Publication date: 2012-07-04
Anticipated expiration: 2030-12-16
Also published as: CN102541874B

Abstract

The invention discloses a method and device for extracting the body text content of a webpage. The method comprises: acquiring two webpages that belong to the same directory level under the same site; and, for each acquired webpage: dividing the webpage into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density meets the corresponding preset condition; extracting, from the selected content blocks, those whose text content is inconsistent with the text content of every content block selected from the other webpage; and determining the extracted content blocks as the body text content of the webpage. The technical scheme addresses the low accuracy of prior-art webpage body text extraction.

Description

Webpage text content extracting method and device

Technical field

The present invention relates to the field of Internet information processing technology, and in particular to a webpage text content extracting method and device.

Background art

With the rapid development of Internet technology, the information on webpages has become increasingly abundant. To make better use of this information, people continually pursue techniques that can effectively organize and exploit network information. Unlike traditional text, however, a webpage is not neat and clean: it contains a large amount of noise content, such as scripts added to enhance user interactivity, navigation links added to ease browsing, and advertising links added for commercial reasons. This noise content reduces both the efficiency and the accuracy of webpage information retrieval. Accurately extracting the body text content of a webpage not only filters out the interference of navigation information, advertising information, copyright information, related links and similar content with retrieval results, but also allows automatic word segmentation, named entity recognition, automatic summarization, automatic classification and automatic clustering to be performed on the webpage.

Fig. 1 is a flowchart of a prior-art webpage text content extracting method. Its processing flow is as follows:

Step 11: for a single webpage, determine the total number of characters and the number of Chinese characters in line i and line (i+1);

Step 12: calculate the text density of line i and line (i+1), where the text density is the number of Chinese characters divided by the total number of characters;

Step 13: compare the calculated text density with a preset threshold;

Step 14: if the text density is not less than the preset threshold, determine that line i and line (i+1) are body text content; if the text density is less than the preset threshold, determine that line i and line (i+1) are not body text content;

Step 15: if line i and line (i+1) are determined to be body text content, determine in the same way whether line i, line (i+1) and line (i+2) are body text content;

Step 16: if line i and line (i+1) are determined not to be body text content, determine in the same way whether line (i+2) and line (i+3) are body text content;

Step 17: repeat the above steps until all lines of the webpage have been traversed.
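
The sliding-window procedure of Steps 11 to 17 can be sketched in Python as follows. This is only one possible reading of the prior-art flow. The function names, the use of the range U+4E00 to U+9FFF to recognise Chinese characters, and the choice to resume scanning at the first line that could not be added to a body window are assumptions made for illustration; the text does not specify them.

```python
def is_chinese(ch):
    # Assumption: "Chinese character" means the CJK Unified Ideographs range.
    return "\u4e00" <= ch <= "\u9fff"


def text_density(lines):
    """Number of Chinese characters divided by the total number of characters."""
    text = "".join(lines)
    if not text:
        return 0.0  # assumption: an empty window has zero density
    return sum(is_chinese(c) for c in text) / len(text)


def prior_art_body_lines(lines, threshold):
    """Return the indices of the lines judged to be body text (Steps 11-17)."""
    body = set()
    i = 0
    while i + 1 < len(lines):
        end = i + 2  # current window covers lines[i:end]
        if text_density(lines[i:end]) >= threshold:
            # Step 15: keep adding the next line while the window stays dense.
            while end < len(lines) and text_density(lines[i:end + 1]) >= threshold:
                end += 1
            body.update(range(i, end))
            i = end
        else:
            # Step 16: skip both lines and test the next pair.
            i += 2
    return sorted(body)
```

Because the decision depends only on character statistics, any dense run of Chinese text passes the threshold whether or not it is the body of the page.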

In the above method, if the text density of several consecutive lines is not less than the preset threshold, those consecutive lines are taken to be body text content. Many current webpages, however, contain highly disruptive non-body content such as personal information, short articles and disclaimers. The text density of such non-body content is relatively high and is likely to exceed the preset threshold, so it may be mistaken for body text content, which lowers the accuracy of body text extraction.

Summary of the invention

The embodiments of the invention provide a webpage text content extracting method and device to solve the prior-art problem of low accuracy when extracting the body text content of a webpage.

The technical scheme of the embodiments of the invention is as follows:

A webpage text content extracting method comprises the steps of: acquiring two webpages that belong to the same directory level under the same website; and, for each acquired webpage: dividing the webpage into content blocks; determining the label density and/or link density of each divided content block, and selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition; extracting, from the selected content blocks, those whose text content is inconsistent with the text content of every content block selected in the other webpage; and determining the extracted content blocks as the body text content of the webpage.

A webpage text content extracting device comprises: an acquiring unit, configured to acquire two webpages that belong to the same directory level under the same website; a dividing unit, configured to divide each webpage acquired by the acquiring unit into content blocks; a first determining unit, configured to determine, for each webpage acquired by the acquiring unit, the label density and/or link density of each content block divided by the dividing unit; a selecting unit, configured to select, for each webpage acquired by the acquiring unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition; an extracting unit, configured to extract, for each webpage acquired by the acquiring unit, from the content blocks selected by the selecting unit, those whose text content is inconsistent with the text content of every content block selected in the other webpage; and a second determining unit, configured to determine, for each webpage acquired by the acquiring unit, the content blocks extracted by the extracting unit as the body text content of the webpage.

In the technical scheme of the embodiments of the invention, webpages that belong to the same directory level under the same website are generated from the same template, so their page structures are similar or identical. For two such webpages, the embodiments therefore first select candidate body text blocks according to label density and/or link density, and then remove from the selected content blocks the non-body blocks whose text content is identical in both webpages. The remaining blocks are extracted as body text blocks, which effectively improves the accuracy of webpage body text extraction.

Description of the drawings

Fig. 1 is a schematic flowchart of a prior-art webpage text content extracting method;

Fig. 2 is a schematic flowchart of the webpage text content extracting method in an embodiment of the invention;

Fig. 3 is a schematic flowchart of a specific implementation of the webpage text content extracting method in an embodiment of the invention;

Fig. 4 is a schematic structural diagram of the webpage text content extracting device in an embodiment of the invention.

Detailed description of the embodiments

The main implementation principle, specific embodiments and achievable beneficial effects of the technical scheme of the embodiments of the invention are set forth in detail below with reference to the accompanying drawings.

Fig. 2 is a flowchart of the webpage text content extracting method in an embodiment of the invention. Its processing flow is as follows:

Step 21: acquire two webpages that belong to the same directory level under the same website;

The embodiments of the invention observe that different pages at the same directory level under the same website are usually generated from the same Hypertext Markup Language (HTML) template, so the page structures of different webpages at the same directory level under the same website are identical or similar. For example, different pages at the same directory level under the same website may all contain personal information, a disclaimer or a copyright statement with identical content; the position of this content may differ from page to page, but the content itself is the same.

Step 22: for each acquired webpage, divide the webpage into content blocks;

Before a webpage is divided into content blocks, it is first normalized by preprocessing so that it conforms to the HTML language specification. The preprocessed webpage is then structured to generate a Document Object Model (DOM) tree, which gives a structured HTML representation of the webpage. According to the <table> or <div> tags in the generated DOM tree, the webpage is divided visually into content blocks. Multi-level blocking may be used, without limitation. With two-level blocking, for example, the webpage is first divided into first-level content blocks, and each first-level content block is then divided into second-level content blocks. Other multi-level blocking schemes are similar and are not described again here.
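
As a rough illustration of block division, the Python sketch below treats each outermost <div> or <table> element as one content block and records the text, the number of tags and the number of links it contains. It is a simplification under stated assumptions. It uses the standard-library `html.parser` event stream instead of building a full DOM tree, it performs only one level of blocking, it skips the normalization step, and it counts the block's own opening tag among its tags. The names `BlockSplitter` and `split_blocks` are hypothetical.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Collects every outermost <div>/<table> element as one content block."""

    BLOCK_TAGS = {"div", "table"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside block tags
        self.blocks = []      # finished blocks, in document order
        self._current = None  # block being filled, or None outside any block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            if self.depth == 0:
                self._current = {"text": [], "tags": 0, "links": 0}
            self.depth += 1
        if self._current is not None:
            self._current["tags"] += 1
            if tag == "a":
                self._current["links"] += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost block is closed: store its stripped text.
                self._current["text"] = "".join(self._current["text"]).strip()
                self.blocks.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)


def split_blocks(html):
    """Return a list of dicts with the keys 'text', 'tags' and 'links'."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks
```

Two-level blocking could be approximated by calling the same splitter again on the inner markup of each first-level block.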

After the webpage has been divided into content blocks, each content block may be identified, without limitation, by the webpage number together with the block numbers at each level. For example, when the webpage is divided by two-level blocking, a content block is denoted C_i(j, k), where i indicates that the block belongs to the i-th webpage, j indicates that it belongs to the j-th first-level content block of the i-th webpage, and k indicates that it is the k-th second-level content block of that first-level block. That is, C_i(j, k) identifies the k-th second-level content block in the j-th first-level content block of the i-th webpage.

Step 23: for each acquired webpage, determine the label density and/or link density of each divided content block;

In the embodiments of the invention, the label density of each content block may be determined and candidate body text blocks selected according to label density; or the link density of each content block may be determined and candidate body text blocks selected according to link density; or both the label density and the link density of each content block may be determined and candidate body text blocks selected according to both.

The label density of a content block is the ratio of the number of labels (HTML tags) in the content block to the number of text characters in it, and the link density of a content block is the ratio of the number of links in the content block to the number of text characters in it.

Let the text content of content block C_i(j, k) be T_i(j, k), its number of text characters be N_i(j, k), its number of labels be Q_i(j, k), and its number of links be P_i(j, k). The label density Y_i(j, k) and the link density X_i(j, k) are then determined as follows:

Y_{i}(j, k) = \frac{Q_{i}(j, k)}{N_{i}(j, k)}

X_{i}(j, k) = \frac{P_{i}(j, k)}{N_{i}(j, k)}
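
The two ratios translate directly into code. The sketch below is a minimal Python illustration. The function names are hypothetical, and returning infinity for a block with no text is an assumption, since the text does not say how a block with N_i(j, k) = 0 should be handled.

```python
def density(count, text_length):
    """Ratio of a count (labels or links) to the number of text characters."""
    if text_length == 0:
        # Assumption: a block without text is treated as maximally dense,
        # so it can never pass a "not greater than threshold" test.
        return float("inf")
    return count / text_length


def label_density(label_count, text_length):
    """Y = Q / N: number of labels divided by number of text characters."""
    return density(label_count, text_length)


def link_density(link_count, text_length):
    """X = P / N: number of links divided by number of text characters."""
    return density(link_count, text_length)
```

For a navigation block with 3 labels, 2 links and 8 text characters, this gives a label density of 0.375 and a link density of 0.25.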

Step 24: for each acquired webpage, select the content blocks whose label density and/or link density satisfies the corresponding preset condition;

If candidate body text blocks are selected according to label density, the process may be, without limitation, as follows:

First obtain the label density threshold of each divided content block, then select the content blocks whose label density is not greater than the corresponding label density threshold, and determine the selected content blocks as the content blocks that satisfy the preset condition, that is, the candidate body text blocks;

If candidate body text blocks are selected according to link density, the process may be, without limitation, as follows:

First obtain the link density threshold of each divided content block, then select the content blocks whose link density is not greater than the corresponding link density threshold, and determine the selected content blocks as the content blocks that satisfy the preset condition, that is, the candidate body text blocks;

If candidate body text blocks are selected according to both label density and link density, the process may be, without limitation, as follows:

First obtain the label density threshold and the link density threshold of each divided content block, then select the content blocks whose label density is not greater than the corresponding label density threshold and whose link density is not greater than the corresponding link density threshold, and determine the selected content blocks as the content blocks that satisfy the preset conditions, that is, the candidate body text blocks.

The label density threshold may be obtained, without limitation, as follows:

For each divided content block, determine the label density variance of the content block according to its label density, and determine the label density threshold of the content block according to the determined label density variance;

The link density threshold may be obtained, without limitation, as follows:

For each divided content block, determine the link density variance of the content block according to its link density, and determine the link density threshold of the content block according to the determined link density variance;

Let the label density of content block C_i(j, k) be Y_i(j, k) and its link density be X_i(j, k). From the label density Y_i(j, k), the label density variance D(Y_i(j, k)) of content block C_i(j, k) can be determined, and from the link density X_i(j, k), the link density variance D(X_i(j, k)) can be determined. From the label density variance D(Y_i(j, k)), the label density threshold B(Y) of content block C_i(j, k) can further be determined, and from the link density variance D(X_i(j, k)), the link density threshold B(X) can further be determined.

The label density Y_i(j, k) is compared with the label density threshold B(Y). If Y_i(j, k) is greater than B(Y), the value of Y_i(j, k) is set to 0; if Y_i(j, k) is not greater than B(Y), the value of Y_i(j, k) is set to 1, that is:

\begin{cases} Y_{i}(j, k) = 0, & Y_{i}(j, k) > B(Y) \\ Y_{i}(j, k) = 1, & Y_{i}(j, k) \leq B(Y) \end{cases}

The link density X_i(j, k) is compared with the link density threshold B(X). If X_i(j, k) is greater than B(X), the value of X_i(j, k) is set to 0; if X_i(j, k) is not greater than B(X), the value of X_i(j, k) is set to 1, that is:

\begin{cases} X_{i}(j, k) = 0, & X_{i}(j, k) > B(X) \\ X_{i}(j, k) = 1, & X_{i}(j, k) \leq B(X) \end{cases}

If candidate body text blocks are selected according to label density, the content blocks for which Y_i(j, k) is 1 are selected as candidate body text blocks, that is, as the content blocks that satisfy the corresponding preset condition;

If candidate body text blocks are selected according to link density, the content blocks for which X_i(j, k) is 1 are selected as candidate body text blocks, that is, as the content blocks that satisfy the corresponding preset condition;

If candidate body text blocks are selected according to both label density and link density, then for content block C_i(j, k), X_i(j, k) * Y_i(j, k) is calculated. If the result is 1, the content block is selected as a candidate body text block, that is, as a content block that satisfies the corresponding preset conditions. After the above operations, X_i(j, k) and Y_i(j, k) each take the value 1 or 0, so X_i(j, k) * Y_i(j, k) equals 1 only when both X_i(j, k) and Y_i(j, k) are 1.
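
The binarization and the product test can be sketched in Python as follows. The thresholds are passed in as plain parameters because the text does not give the formula that derives them from the density variances. Using one shared pair of thresholds for all blocks, and the names `indicator` and `select_candidates`, are assumptions made for illustration.

```python
def indicator(density, threshold):
    """1 if the density is not greater than the threshold, otherwise 0."""
    return 1 if density <= threshold else 0


def select_candidates(blocks, label_threshold, link_threshold):
    """Select the candidate body text blocks by the product X * Y.

    blocks maps a block id (j, k) to a pair (label density, link density).
    """
    selected = []
    for block_id, (label_dens, link_dens) in blocks.items():
        y = indicator(label_dens, label_threshold)
        x = indicator(link_dens, link_threshold)
        if x * y == 1:  # both densities are within their thresholds
            selected.append(block_id)
    return selected
```

Selecting on label density alone or on link density alone corresponds to testing only `y == 1` or only `x == 1`.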

In the embodiments of the invention, candidate body text blocks are selected according to label density and/or link density rather than body text content being determined according to text density. Part of the highly disruptive non-body content is thereby removed first, which effectively improves the accuracy of webpage body text extraction.

Step 25: for each acquired webpage, extract, from the selected content blocks, those whose text content is inconsistent with the text content of every content block selected in the other webpage;

The embodiments of the invention observe that, among different pages at the same directory level under the same website, the body text content differs greatly while the noise content is identical. Therefore, after the candidate body text blocks have been selected, the content blocks whose text content is identical in the two pages can be further removed. These content blocks are noise content and hence not body text content; the remaining candidate body text blocks are the body text content of the webpage.

The content blocks whose text content is inconsistent with the text content of every content block selected in the other webpage may be extracted, without limitation, by polling, for example:

Suppose the candidate body text blocks selected for webpage 1 are content blocks C_1(1,1) and C_1(1,2), and those selected for webpage 2 are content blocks C_2(1,3), C_2(2,2) and C_2(3,1). First, the text content T_1(1,1) of content block C_1(1,1) is compared with the text content T_2(1,3) of content block C_2(1,3), and the comparison result is inconsistent. T_1(1,1) is then compared with the text content T_2(2,2) of content block C_2(2,2), and the comparison result is consistent. Content blocks C_1(1,1) and C_2(2,2) are therefore determined to be non-body content, so C_1(1,1) is removed from the candidate content blocks of webpage 1 and C_2(2,2) is removed from the candidate content blocks of webpage 2;

Next, the text content T_1(1,2) of content block C_1(1,2) is compared with the text content T_2(1,3) of content block C_2(1,3), and the result is inconsistent. T_1(1,2) is then compared with the text content T_2(3,1) of content block C_2(3,1), and the result is again inconsistent. Content blocks C_1(1,2), C_2(1,3) and C_2(3,1) are therefore determined to be body text content: the body text content of webpage 1 is C_1(1,2), and the body text content of webpage 2 is C_2(1,3) and C_2(3,1).

Although the noise content of different pages at the same directory level under the same website is identical, its position in the page may differ. For example, personal information may be located at the upper left of page 1 but at the lower left of page 2. If identical subtrees were searched for according to node coordinates in the DOM tree, both content and position would have to be identical, so noise content with identical content but a different position could be mistaken for body text content. The embodiments of the present application instead use the polling approach described above to extract the webpage body text content from the candidate body text blocks. Noise content with identical content but a different position can thus be removed, and only the content blocks with differing text content are extracted as body text content, which effectively improves the accuracy of webpage body text extraction.
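
The polling comparison can be sketched in Python as follows. The sketch treats two blocks as "consistent" only when their text is exactly equal, which is an assumption, since the text does not define the comparison more precisely. The function name `extract_body_blocks` and the sample texts are hypothetical; the block ids are those of the example with webpage 1 and webpage 2.

```python
def extract_body_blocks(candidates_1, candidates_2):
    """Drop the candidate blocks whose text also occurs in the other page.

    Each argument maps a block id (j, k) to the text content of that block.
    Returns the ids of the remaining (body text) blocks of page 1 and page 2.
    """
    noise_1, noise_2 = set(), set()
    # Poll every candidate of page 1 against every candidate of page 2,
    # regardless of where the blocks sit in their pages.
    for id_1, text_1 in candidates_1.items():
        for id_2, text_2 in candidates_2.items():
            if text_1 == text_2:  # assumption: consistent means equal text
                noise_1.add(id_1)
                noise_2.add(id_2)
    body_1 = [b for b in candidates_1 if b not in noise_1]
    body_2 = [b for b in candidates_2 if b not in noise_2]
    return body_1, body_2


page_1 = {(1, 1): "Copyright notice", (1, 2): "Article A"}
page_2 = {(1, 3): "Article B, part 1", (2, 2): "Copyright notice",
          (3, 1): "Article B, part 2"}
body_1, body_2 = extract_body_blocks(page_1, page_2)
```

Because only the text is compared, the shared copyright block is removed even though it has block id (1, 1) in one page and (2, 2) in the other.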

Step 26: for each acquired webpage, determine the extracted content blocks as the body text content of the webpage.

As the above processing shows, in the technical scheme of the embodiments of the invention, webpages that belong to the same directory level under the same website are generated from the same template, so their page structures are similar or identical. For two such webpages, the embodiments therefore first select candidate body text blocks according to label density and/or link density, and then remove from the selected content blocks the non-body blocks whose text content is identical in both webpages. The remaining blocks are extracted as body text blocks, which effectively improves the accuracy of webpage body text extraction.

A more detailed embodiment is given below.

Fig. 3 is a flowchart of a specific implementation of the webpage text content extracting method in an embodiment of the invention. Its processing flow is as follows:

Step 31: acquire webpage 1 and webpage 2, which belong to the same directory level under the same website;

Step 32: normalize each acquired webpage by preprocessing so that it conforms to the HTML language specification;

Step 33: structure each preprocessed webpage to generate a DOM tree;

Step 34: divide the webpage visually into blocks according to the <table> or <div> tags in the generated DOM tree;

Step 35: calculate the label density and the link density of each content block;

Step 36: for each content block, compare the label density with the label density threshold and the link density with the link density threshold;

Step 37: determine the content blocks whose link density is not greater than the corresponding link density threshold as candidate body text blocks;

Step 38: by polling, compare the candidate body text blocks of webpage 1 with the candidate body text blocks of webpage 2 for similarity;

Step 39: for each webpage, according to the comparison result, extract the content blocks whose text content is inconsistent with the text content of every candidate body text block in the other webpage; the extracted content blocks are the body text content of the webpage.

Correspondingly, an embodiment of the invention provides a webpage text content extracting device whose structure is shown in Fig. 4. It comprises an acquiring unit 41, a dividing unit 42, a first determining unit 43, a selecting unit 44, an extracting unit 45 and a second determining unit 46, wherein:

The acquiring unit 41 is configured to acquire two webpages that belong to the same directory level under the same website;

The dividing unit 42 is configured to divide each webpage acquired by the acquiring unit 41 into content blocks;

The first determining unit 43 is configured to determine, for each webpage acquired by the acquiring unit 41, the label density and/or link density of each content block divided by the dividing unit 42;

The selecting unit 44 is configured to select, for each webpage acquired by the acquiring unit 41, the content blocks whose label density and/or link density satisfies the corresponding preset condition;

The extracting unit 45 is configured to extract, for each webpage acquired by the acquiring unit 41, from the content blocks selected by the selecting unit 44, those whose text content is inconsistent with the text content of every content block selected in the other webpage;

The second determining unit 46 is configured to determine, for each webpage acquired by the acquiring unit 41, the content blocks extracted by the extracting unit 45 as the body text content of the webpage.

Preferably, for each content block divided by the dividing unit 42, the first determining unit 43 determines the ratio of the number of labels to the number of text characters in the content block as the label density of the content block, and determines the ratio of the number of links to the number of text characters in the content block as the link density of the content block.

Preferably, the selecting unit 44 specifically comprises an obtaining subunit, a selecting subunit and a determining subunit, wherein:

The obtaining subunit is configured to obtain, for each webpage acquired by the acquiring unit 41, the label density threshold and/or link density threshold of each content block divided by the dividing unit 42;

The selecting subunit is configured to select, for each webpage acquired by the acquiring unit 41, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;

The determining subunit is configured to determine, for each webpage acquired by the acquiring unit 41, the content blocks selected by the selecting subunit as the content blocks that satisfy the preset condition.

More preferably, for each content block divided by the dividing unit 42, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and determines the link density threshold of the content block according to the link density variance of the content block.

Preferably, the dividing unit 42 specifically comprises a preprocessing subunit, a generating subunit and a dividing subunit, wherein:

The preprocessing subunit is configured to normalize, by preprocessing, each webpage acquired by the acquiring unit 41;

The generating subunit is configured to generate, for each webpage acquired by the acquiring unit 41, the corresponding DOM tree from the webpage preprocessed by the preprocessing subunit;

The dividing subunit is configured to divide, for each webpage acquired by the acquiring unit 41, the webpage into content blocks according to the DOM tree generated by the generating subunit.

Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims

1. A webpage text content extracting method, characterized by comprising:

acquiring two webpages that belong to the same directory level under the same website;

for each acquired webpage, respectively performing:

dividing the webpage into content blocks;

determining the label density and/or link density of each divided content block, and selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition;

extracting, from the selected content blocks, those whose text content is inconsistent with the text content of every content block selected in the other webpage;

determining the extracted content blocks as the body text content of the webpage.

2. The webpage text content extracting method as claimed in claim 1, characterized in that determining the label density of each divided content block specifically comprises:

for each divided content block, determining the ratio of the number of labels to the number of text characters in the content block as the label density of the content block;

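determining the link density of each divided content block specifically comprises:

for each divided content block, determining the ratio of the number of links to the number of text characters in the content block as the link density of the content block.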

3. The webpage text content extracting method as claimed in claim 1, characterized in that selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition specifically comprises:

obtaining the label density threshold and/or link density threshold of each divided content block;

selecting the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;

determining the selected content blocks as the content blocks that satisfy the preset condition.

4. The webpage text content extracting method as claimed in claim 3, characterized in that obtaining the label density threshold of each divided content block specifically comprises:

for each divided content block, determining the label density variance of the content block according to its label density, and determining the label density threshold of the content block according to the determined label density variance;

obtaining the link density threshold of each divided content block specifically comprises:

for each divided content block, determining the link density variance of the content block according to its link density, and determining the link density threshold of the content block according to the determined link density variance.

5. The webpage text content extracting method as claimed in claim 1, characterized in that dividing the webpage into content blocks specifically comprises:

normalizing the webpage by preprocessing;

generating the corresponding Document Object Model (DOM) tree from the preprocessed webpage;

dividing the webpage into content blocks according to the generated DOM tree.

6. A webpage text content extracting device, characterized by comprising:

an acquiring unit, configured to acquire two webpages that belong to the same directory level under the same website;

a dividing unit, configured to divide each webpage acquired by the acquiring unit into content blocks;

a first determining unit, configured to determine, for each webpage acquired by the acquiring unit, the label density and/or link density of each content block divided by the dividing unit;

a selecting unit, configured to select, for each webpage acquired by the acquiring unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition;

an extracting unit, configured to extract, for each webpage acquired by the acquiring unit, from the content blocks selected by the selecting unit, those whose text content is inconsistent with the text content of every content block selected in the other webpage;

a second determining unit, configured to determine, for each webpage acquired by the acquiring unit, the content blocks extracted by the extracting unit as the body text content of the webpage.

7. The webpage text content extracting device as claimed in claim 6, characterized in that, for each content block divided by the dividing unit, the first determining unit determines the ratio of the number of labels to the number of text characters in the content block as the label density of the content block, and determines the ratio of the number of links to the number of text characters in the content block as the link density of the content block.

8. The webpage text content extracting device as claimed in claim 6, characterized in that the selecting unit specifically comprises:

an obtaining subunit, configured to obtain, for each webpage acquired by the acquiring unit, the label density threshold and/or link density threshold of each content block divided by the dividing unit;

a selecting subunit, configured to select, for each webpage acquired by the acquiring unit, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;

a determining subunit, configured to determine, for each webpage acquired by the acquiring unit, the content blocks selected by the selecting subunit as the content blocks that satisfy the preset condition.

9. The webpage text content extracting device as claimed in claim 8, characterized in that, for each content block divided by the dividing unit, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and determines the link density threshold of the content block according to the link density variance of the content block.

10. The webpage text content extracting device as claimed in claim 6, characterized in that the dividing unit specifically comprises:

a preprocessing subunit, configured to normalize, by preprocessing, each webpage acquired by the acquiring unit;

a generating subunit, configured to generate, for each webpage acquired by the acquiring unit, the corresponding Document Object Model (DOM) tree from the webpage preprocessed by the preprocessing subunit;

a dividing subunit, configured to divide, for each webpage acquired by the acquiring unit, the webpage into content blocks according to the DOM tree generated by the generating subunit.