CN102541874A - Webpage text content extracting method and device - Google Patents

Webpage text content extracting method and device

Info

Publication number
CN102541874A
Authority
CN
China
Prior art keywords
content blocks
webpage
content
density
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105915066A
Other languages
Chinese (zh)
Other versions
CN102541874B (en)
Inventor
周奕
周宇煜
吴淑燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd
Priority to CN 201010591506
Publication of CN102541874A
Application granted
Publication of CN102541874B
Legal status: Active

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and device for extracting the text content of a webpage. The method comprises: acquiring two webpages that belong to a directory at the same level under the same site; and, for each acquired webpage: dividing the webpage into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density meets the corresponding preset condition; extracting, from the selected content blocks, those whose text content is not consistent with the text content of any content block selected from the other webpage; and determining the extracted content blocks to be the text content of the webpage. This technical scheme addresses the low accuracy of prior-art methods for extracting the text content of a webpage.

Description

Method and device for extracting the body content of a web page
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and device for extracting the body content of a web page.
Background
With the rapid development of Internet technology, the information on web pages is becoming ever more abundant. To make better use of it, people keep pursuing techniques that can effectively organize and exploit network information. Unlike traditional text, however, web pages are not neat and clean: they contain a large amount of noise, such as scripts added to enhance user interactivity, navigation links added to ease browsing, and advertising links added for commercial reasons. This noise reduces both the efficiency and the accuracy of web information retrieval. Accurately extracting the body content of a web page not only filters out the interference of navigation information, advertising, copyright notices, related links and similar content with retrieval results, but also makes it possible to apply automatic word segmentation, named entity recognition, automatic summarization, classification and automatic clustering to the web page.
Fig. 1 is a flowchart of a prior-art method for extracting the body content of a web page. The processing flow is as follows:
Step 11: for a single web page, determine the total number of characters and the number of Chinese characters in line i and line (i+1);
Step 12: calculate the text density of line i and line (i+1), which can be obtained by dividing the number of Chinese characters by the total number of characters;
Step 13: compare the calculated text density with a preset threshold;
Step 14: if the text density is not less than the preset threshold, determine that line i and line (i+1) are body content; if the text density is less than the preset threshold, determine that line i and line (i+1) are non-body content;
Step 15: if line i and line (i+1) are determined to be body content, determine in the same way whether line i, line (i+1) and line (i+2) are body content;
Step 16: if line i and line (i+1) are determined to be non-body content, determine in the same way whether line (i+2) and line (i+3) are body content;
Step 17: repeat the above steps until all lines of the web page have been traversed.
In this method, consecutive lines are treated as body content whenever their text density is not less than the preset threshold. Many current web pages, however, contain non-body content that is highly misleading, such as personal information, short articles and disclaimers. The text density of such non-body content is high and is likely to exceed the preset threshold, so it may be mistaken for body content, which lowers the accuracy of body content extraction.
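The prior-art flow of Steps 11 to 17 can be sketched in Python as follows. This is only an illustrative reading of the steps: the function names, the threshold value of 0.5, the use of the basic CJK Unified Ideographs range to recognize Chinese characters, and the way the scan resumes after a run of body lines are all assumptions, not details given in the text.

```python
import re

# Assumption: "Chinese character" means the basic CJK Unified Ideographs range.
CJK = re.compile(r"[\u4e00-\u9fff]")

def text_density(lines):
    """Number of Chinese characters divided by total characters (Step 12)."""
    total = sum(len(line) for line in lines)
    chinese = sum(len(CJK.findall(line)) for line in lines)
    return chinese / total if total else 0.0

def prior_art_body_lines(lines, threshold=0.5):
    """Return the indices of lines judged to be body content."""
    body = set()
    i = 0
    while i + 1 < len(lines):
        end = i + 2  # window initially covers lines i and i+1 (Step 11)
        if text_density(lines[i:end]) >= threshold:
            # Step 15: keep adding the next line while density stays high.
            while end < len(lines) and text_density(lines[i:end + 1]) >= threshold:
                end += 1
            body.update(range(i, end))
            i = end  # assumption: resume scanning after the accepted run
        else:
            i += 2   # Step 16: move on to lines i+2 and i+3
    return sorted(body)

page = [
    "<div class='nav'><a href='/'>Home</a></div>",
    "<script>var x = 1;</script>",
    "今天天气很好我们去公园散步",
    "公园里有很多人在锻炼身体",
    "<a href='/about'>About us</a><a href='/contact'>Contact</a>",
]
print(prior_art_body_lines(page))
```

Because the decision depends only on character statistics, any dense run of Chinese text, including a disclaimer, would pass the same test, which is the weakness described above.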
Summary of the invention
The embodiments of the invention provide a method and device for extracting the body content of a web page, in order to solve the problem of low extraction accuracy in the prior art.
The technical scheme of the embodiments of the invention is as follows:
A method for extracting the body content of a web page comprises: obtaining two web pages that belong to the same-level directory under the same website; and, for each obtained web page: dividing the web page into content blocks; determining the label density and/or link density of each content block; selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition; extracting, from the selected content blocks, those whose text content is inconsistent with the text content of every content block selected in the other web page; and determining the extracted content blocks to be the body content of the web page.
A device for extracting the body content of a web page comprises: an obtaining unit, configured to obtain two web pages that belong to the same-level directory under the same website; a division unit, configured to divide each web page obtained by the obtaining unit into content blocks; a first determining unit, configured to determine, for each obtained web page, the label density and/or link density of each content block produced by the division unit; a selecting unit, configured to select, for each obtained web page, the content blocks whose label density and/or link density satisfies the corresponding preset condition; an extraction unit, configured to extract, for each obtained web page, from the content blocks selected by the selecting unit, those whose text content is inconsistent with the text content of every content block selected in the other web page; and a second determining unit, configured to determine, for each obtained web page, the content blocks extracted by the extraction unit to be the body content of the web page.
In the technical scheme of the embodiments of the invention, web pages that belong to the same-level directory under the same website are all generated from the same template, so their page structures are similar or identical. For two such web pages, the embodiments first select candidate body content blocks according to label density and/or link density, and then remove from the selected blocks the non-body content blocks whose text content is identical in both pages. What remains is the body content, which effectively improves the accuracy of body content extraction.
Description of drawings
Fig. 1 is a schematic flowchart of a prior-art method for extracting the body content of a web page;
Fig. 2 is a schematic flowchart of the method for extracting the body content of a web page in an embodiment of the invention;
Fig. 3 is a schematic flowchart of a specific implementation of the method in an embodiment of the invention;
Fig. 4 is a schematic structural diagram of the device for extracting the body content of a web page in an embodiment of the invention.
Embodiment
The main implementation principle, the specific embodiments and the beneficial effects of the technical scheme of the embodiments of the invention are set forth in detail below with reference to the accompanying drawings.
Fig. 2 is a flowchart of the method for extracting the body content of a web page in an embodiment of the invention. The processing flow is as follows:
Step 21: obtain two web pages that belong to the same-level directory under the same website;
The embodiments of the invention observe that different pages in the same-level directory under the same website are usually generated from the same HyperText Markup Language (HTML) template, so their page structures are identical or similar. For example, such pages may all contain personal information, a disclaimer or a copyright statement with identical content. The position of this content may differ from page to page, but the content itself is the same.
Step 22: divide each obtained web page into content blocks;
When dividing a web page into content blocks, the web page is first normalized so that it conforms to the HTML language standard. The preprocessed web page is then structured to generate a Document Object Model (DOM) tree, which gives the structured HTML representation of the page. According to the <table> or <div> tags in the generated DOM tree, the web page is partitioned visually into content blocks. Multi-level partitioning may be used, but the method is not limited to it. With two-level partitioning, for example, the web page is first divided into first-level content blocks, and each first-level content block is then divided into second-level content blocks. Other multi-level partitioning schemes work similarly and are not described further here.
After the web page has been divided into content blocks, each content block may be identified by, but is not limited to, the web page number together with the block's number at each level. For example, when a web page is partitioned in two levels, a content block is identified as C_i(j, k), where i indicates that the block belongs to the i-th web page, j indicates that it lies in the j-th first-level content block of that page, and k indicates that it is the k-th second-level content block within that first-level block. In other words, C_i(j, k) identifies the k-th second-level content block in the j-th first-level content block of the i-th web page.
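A minimal sketch of the two-level partitioning and the (j, k) numbering is given below. It uses Python's standard `html.parser` rather than a full DOM tree, assumes the input is already well-formed, and treats nesting depth of `<div>`/`<table>` tags as the block level. The class name `BlockSplitter`, the function `split_into_blocks`, and the convention that text lying directly inside a first-level block gets k = 0 are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table"}

class BlockSplitter(HTMLParser):
    """Collect text under two levels of <div>/<table> nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current nesting depth of block tags
        self.j = 0        # index of the current first-level block
        self.k = 0        # index of the current second-level block
        self.blocks = {}  # (j, k) -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1
            if self.depth == 1:
                self.j += 1
                self.k = 0
            elif self.depth == 2:
                self.k += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth >= 1:
            # Assumption: text directly inside a first-level block gets k = 0.
            key = (self.j, self.k if self.depth >= 2 else 0)
            self.blocks.setdefault(key, []).append(text)

def split_into_blocks(html):
    """Return a dict mapping (j, k) to the text of that content block."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return {key: " ".join(parts) for key, parts in parser.blocks.items()}

html = (
    "<div><div>Home | News</div><div>Article text here.</div></div>"
    "<div><table><tr><td>Copyright</td></tr></table></div>"
)
blocks = split_into_blocks(html)
```

For page i, the entry `blocks[(j, k)]` then plays the role of the text content of C_i(j, k).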
Step 23: for each obtained web page, determine the label density and/or link density of each content block;
According to the embodiments of the invention, the label density of each content block may be determined and candidate body content blocks selected according to label density; or the link density of each content block may be determined and candidate body content blocks selected according to link density; or both the label density and the link density of each content block may be determined and candidate body content blocks selected according to both.
The label density of a content block is the ratio of the number of labels (HTML tags) in the block to the word count of its text, and the link density of a content block is the ratio of the number of links in the block to the word count of its text.
Suppose the text content of content block C_i(j, k) is T_i(j, k), its text word count is N_i(j, k), its label count is Q_i(j, k) and its link count is P_i(j, k). The label density Y_i(j, k) and the link density X_i(j, k) are then determined as follows:
$$Y_i(j,k) = \frac{Q_i(j,k)}{N_i(j,k)}$$
$$X_i(j,k) = \frac{P_i(j,k)}{N_i(j,k)}$$
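The two ratios can be computed for a block of HTML with a short Python sketch. Several counting conventions here are assumptions rather than details from the text: a "label" is counted once per opening tag, a "link" is an `<a>` element, and the text word count N is taken as the number of non-whitespace characters, which is a natural reading for Chinese text. The names `DensityCounter` and `densities` are hypothetical.

```python
from html.parser import HTMLParser

class DensityCounter(HTMLParser):
    """Count tags (Q), links (P) and text characters (N) in one block."""

    def __init__(self):
        super().__init__()
        self.tags = 0   # Q: number of opening tags
        self.links = 0  # P: number of <a> elements
        self.chars = 0  # N: number of non-whitespace text characters

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            self.links += 1

    def handle_data(self, data):
        self.chars += len("".join(data.split()))

def densities(block_html):
    """Return (label density Y, link density X) for one content block."""
    counter = DensityCounter()
    counter.feed(block_html)
    counter.close()
    if counter.chars == 0:
        # Assumption: a block without text is treated as infinitely dense.
        return float("inf"), float("inf")
    return counter.tags / counter.chars, counter.links / counter.chars

nav = "<div><a href='/'>Home</a><a href='/news'>News</a></div>"
article = "<div><p>" + "word " * 20 + "</p></div>"
print(densities(nav))      # navigation: few characters per tag and per link
print(densities(article))  # article: many characters per tag, no links
```

A navigation block yields high values for both densities, while a block of running text yields low ones, which is why low density is taken as the mark of a candidate body content block.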
Step 24: for each obtained web page, select the content blocks whose label density and/or link density satisfies the corresponding preset condition;
If candidate body content blocks are selected according to label density, the process may be, but is not limited to, the following:
First obtain the label density threshold for the content blocks, then select the content blocks whose label density is not greater than the corresponding label density threshold. The selected content blocks are determined to satisfy the preset condition and are the candidate body content blocks;
If candidate body content blocks are selected according to link density, the process may be, but is not limited to, the following:
First obtain the link density threshold for the content blocks, then select the content blocks whose link density is not greater than the corresponding link density threshold. The selected content blocks are determined to satisfy the preset condition and are the candidate body content blocks;
If candidate body content blocks are selected according to both label density and link density, the process may be, but is not limited to, the following:
First obtain the label density threshold and the link density threshold for the content blocks, then select the content blocks whose label density is not greater than the corresponding label density threshold and whose link density is not greater than the corresponding link density threshold. The selected content blocks are determined to satisfy the preset condition and are the candidate body content blocks.
The label density threshold may be obtained, but is not limited to being obtained, as follows:
For each content block, determine the label density variance of the block according to its label density, and determine the label density threshold of the block according to that variance;
The link density threshold may be obtained, but is not limited to being obtained, as follows:
For each content block, determine the link density variance of the block according to its link density, and determine the link density threshold of the block according to that variance;
Let the label density of content block C_i(j, k) be Y_i(j, k) and its link density be X_i(j, k). From the label density Y_i(j, k), the label density variance D(Y_i(j, k)) of the block can be determined, and from the link density X_i(j, k), the link density variance D(X_i(j, k)) can be determined. From the label density variance D(Y_i(j, k)), the label density threshold B(Y) of the block can further be determined, and from the link density variance D(X_i(j, k)), the link density threshold B(X).
The label density Y_i(j, k) is compared with the label density threshold B(Y). If Y_i(j, k) is greater than B(Y), the value of Y_i(j, k) is set to 0; if Y_i(j, k) is not greater than B(Y), the value of Y_i(j, k) is set to 1. That is:
$$Y_i(j,k) = \begin{cases} 0, & Y_i(j,k) > B(Y) \\ 1, & Y_i(j,k) \le B(Y) \end{cases}$$
The link density X_i(j, k) is compared with the link density threshold B(X). If X_i(j, k) is greater than B(X), the value of X_i(j, k) is set to 0; if X_i(j, k) is not greater than B(X), the value of X_i(j, k) is set to 1. That is:
$$X_i(j,k) = \begin{cases} 0, & X_i(j,k) > B(X) \\ 1, & X_i(j,k) \le B(X) \end{cases}$$
If candidate body content blocks are selected according to label density, the content blocks whose Y_i(j, k) is 1 are selected as candidate body content blocks, that is, as content blocks satisfying the corresponding preset condition;
If candidate body content blocks are selected according to link density, the content blocks whose X_i(j, k) is 1 are selected as candidate body content blocks, that is, as content blocks satisfying the corresponding preset condition;
If candidate body content blocks are selected according to both label density and link density, then for each content block C_i(j, k) the product X_i(j, k) * Y_i(j, k) is calculated. If the result is 1, the content block is selected as a candidate body content block, that is, as a content block satisfying the corresponding preset condition. After the recoding above, X_i(j, k) and Y_i(j, k) each take the value 1 or 0, so the product X_i(j, k) * Y_i(j, k) is 1 only when both values are 1.
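The 0/1 recoding and the product rule can be sketched as below. The text does not state how B(Y) and B(X) are derived from the variances, so this sketch deliberately takes both thresholds as inputs instead of guessing a formula. The function names, the `use_label`/`use_link` switches and the sample densities are hypothetical.

```python
def indicator(density, threshold):
    """Recode a density as 1 if it is not greater than the threshold, else 0."""
    return 1 if density <= threshold else 0

def select_candidates(blocks, b_y, b_x, use_label=True, use_link=True):
    """Select candidate body content blocks.

    blocks maps a block id (j, k) to (label_density, link_density);
    b_y and b_x are the label and link density thresholds, supplied by
    the caller because their derivation is not specified in the text.
    """
    selected = []
    for key, (y, x) in blocks.items():
        score = 1
        if use_label:
            score *= indicator(y, b_y)
        if use_link:
            score *= indicator(x, b_x)
        if score == 1:  # the product is 1 only if every used indicator is 1
            selected.append(key)
    return selected

sample = {
    (1, 1): (0.375, 0.25),  # navigation-like block
    (1, 2): (0.025, 0.0),   # article-like block
    (2, 1): (0.05, 0.2),    # low tag density but link-heavy
}
both = select_candidates(sample, b_y=0.1, b_x=0.1)
label_only = select_candidates(sample, b_y=0.1, b_x=0.1, use_link=False)
```

With these sample values, selecting on both densities keeps only the article-like block, while selecting on label density alone also keeps the link-heavy block.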
In the embodiments of the invention, candidate body content blocks are selected according to label density and/or link density rather than body content being determined according to text density. Part of the highly misleading non-body content is therefore removed at this stage, which effectively improves the accuracy of body content extraction.
Step 25: for each obtained web page, extract from the selected content blocks those whose text content is inconsistent with the text content of every content block selected in the other web page;
The embodiments of the invention observe that, among different pages in the same-level directory under the same website, the body text differs greatly while the noise content is identical. After the candidate body content blocks have been selected, the content blocks whose text content is identical in the two pages can therefore be removed. These blocks are noise, and hence non-body content; the remaining candidate body content blocks are the body content of the web page.
The content blocks whose text content is inconsistent with that of every content block selected in the other web page may be extracted by polling, but the method is not limited to this. For example:
Suppose the candidate body content blocks selected for web page 1 are C_1(1,1) and C_1(1,2), and those selected for web page 2 are C_2(1,3), C_2(2,2) and C_2(3,1). First, the text content T_1(1,1) of content block C_1(1,1) is compared with the text content T_2(1,3) of content block C_2(1,3); the result is inconsistent. Then T_1(1,1) is compared with the text content T_2(2,2) of content block C_2(2,2); the result is consistent. Content blocks C_1(1,1) and C_2(2,2) are therefore determined to be non-body content, so C_1(1,1) is removed from the candidate content blocks of web page 1 and C_2(2,2) is removed from the candidate content blocks of web page 2;
Next, the text content T_1(1,2) of content block C_1(1,2) is compared with the text content T_2(1,3) of content block C_2(1,3); the result is inconsistent. Then T_1(1,2) is compared with the text content T_2(3,1) of content block C_2(3,1); the result is again inconsistent. Content blocks C_1(1,2), C_2(1,3) and C_2(3,1) are therefore determined to be body content: the body content of web page 1 is C_1(1,2), and the body content of web page 2 is C_2(1,3) and C_2(3,1).
Although the noise content is identical among different pages in the same-level directory under the same website, its position in the page may differ. For example, personal information may appear at the upper left of page 1 but at the lower left of page 2. If identical subtrees were searched for according to the coordinates of nodes in the DOM tree, both content and position would have to match, so noise with identical content but a different position could be mistaken for body content. The embodiments of the present application instead extract the body content from the candidate body content blocks by the polling described above. Noise with identical content but a different position is thereby removed, and only the content blocks whose text content differs are extracted as body content, which effectively improves the accuracy of body content extraction.
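The polling comparison in this example can be sketched as a pair of nested loops. The sketch assumes that "consistent" means equal text after collapsing whitespace; the text does not define the comparison, so a real system might use a softer similarity measure. The function names and the sample strings are hypothetical, with block ids taken from the example above.

```python
def normalize(text):
    """Collapse runs of whitespace so that layout differences are ignored."""
    return " ".join(text.split())

def extract_body(page_a, page_b):
    """Compare every candidate block of one page with every block of the other.

    Each argument maps a block id (j, k) to its text content. Blocks whose
    text matches some block in the other page are treated as noise; the
    remaining block ids are returned as body content for each page.
    """
    shared_a, shared_b = set(), set()
    for id_a, text_a in page_a.items():
        for id_b, text_b in page_b.items():
            if normalize(text_a) == normalize(text_b):
                shared_a.add(id_a)
                shared_b.add(id_b)
    body_a = [block_id for block_id in page_a if block_id not in shared_a]
    body_b = [block_id for block_id in page_b if block_id not in shared_b]
    return body_a, body_b

page_1 = {
    (1, 1): "Disclaimer: content is provided for reference only.",
    (1, 2): "Text of the first article.",
}
page_2 = {
    (1, 3): "Text of the second article.",
    (2, 2): "Disclaimer: content is provided   for reference only.",
    (3, 1): "More text of the second article.",
}
body_1, body_2 = extract_body(page_1, page_2)
```

Because the comparison looks only at text and not at block ids, the shared disclaimer is removed even though it sits at (1, 1) in one page and at (2, 2) in the other.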
Step 26: for each obtained web page, determine the extracted content blocks to be the body content of the web page.
As the above processing shows, in the technical scheme of the embodiments of the invention, web pages that belong to the same-level directory under the same website are all generated from the same template, so their page structures are similar or identical. For two such web pages, the embodiments first select candidate body content blocks according to label density and/or link density, and then remove from the selected blocks the non-body content blocks whose text content is identical in both pages. What remains is the body content, which effectively improves the accuracy of body content extraction.
A more detailed embodiment is given below.
Fig. 3 is a flowchart of a specific implementation of the method for extracting the body content of a web page in an embodiment of the invention. The processing flow is as follows:
Step 31: obtain web page 1 and web page 2, which belong to the same-level directory under the same website;
Step 32: normalize each obtained web page so that it conforms to the HTML language standard;
Step 33: structure each preprocessed web page to generate a DOM tree;
Step 34: according to the <table> or <div> tags in the generated DOM tree, partition the web page visually into content blocks;
Step 35: calculate the label density and the link density of each content block;
Step 36: for each content block, compare the label density with the label density threshold and the link density with the link density threshold;
Step 37: determine the content blocks whose densities are not greater than the corresponding density thresholds to be candidate body content blocks;
Step 38: by polling, compare the candidate body content blocks of web page 1 with the candidate body content blocks of web page 2 for similarity;
Step 39: for each web page, according to the comparison result, extract the content blocks whose text content is inconsistent with that of every candidate body content block in the other web page. The extracted content blocks are the body content of the web page.
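The flow of Steps 31 to 39 can be tied together in one compact Python sketch. It makes several simplifying assumptions that are not part of the text: only one level of `<div>`/`<table>` partitioning is used, the normalization of Step 32 is skipped because the sample pages are already well-formed, the thresholds `b_y` and `b_x` are fixed example values rather than values derived from variances, and blocks are compared by exact text equality. All names are hypothetical.

```python
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    """Split a page at top-level <div>/<table> tags and count per block."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.blocks = []  # each entry: {"text": [...], "tags": n, "links": n}

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "table"):
            self.depth += 1
            if self.depth == 1:
                self.blocks.append({"text": [], "tags": 0, "links": 0})
        if self.depth >= 1:
            self.blocks[-1]["tags"] += 1
            if tag == "a":
                self.blocks[-1]["links"] += 1

    def handle_endtag(self, tag):
        if tag in ("div", "table") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth >= 1 and data.strip():
            self.blocks[-1]["text"].append(" ".join(data.split()))

def candidates(html, b_y, b_x):
    """Steps 33-37: partition, compute densities, keep low-density blocks."""
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        text = " ".join(block["text"])
        n = len(text.replace(" ", ""))  # assumed word count: non-space chars
        if n and block["tags"] / n <= b_y and block["links"] / n <= b_x:
            kept.append(text)
    return kept

def extract(html_1, html_2, b_y=0.1, b_x=0.05):
    """Steps 38-39: drop candidate blocks whose text appears in both pages."""
    c1 = candidates(html_1, b_y, b_x)
    c2 = candidates(html_2, b_y, b_x)
    return [t for t in c1 if t not in c2], [t for t in c2 if t not in c1]

TEMPLATE = (
    "<div><a href='/'>Home</a><a href='/n'>News</a></div>"
    "<div><p>{body}</p></div>"
    "<div><p>All content on this site is provided for reference only.</p></div>"
)
BODY_1 = "The first article discusses extraction of web page text in detail."
BODY_2 = "The second article covers a completely different topic at length."
result = extract(TEMPLATE.format(body=BODY_1), TEMPLATE.format(body=BODY_2))
```

In this sample the navigation block is rejected by the density test, while the disclaimer passes it and is removed only by the cross-page comparison, which illustrates why both stages are used.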
Correspondingly, an embodiment of the invention provides a device for extracting the body content of a web page. Its structure is shown in Fig. 4 and comprises an obtaining unit 41, a division unit 42, a first determining unit 43, a selecting unit 44, an extraction unit 45 and a second determining unit 46, wherein:
The obtaining unit 41 is configured to obtain two web pages that belong to the same-level directory under the same website;
The division unit 42 is configured to divide each web page obtained by the obtaining unit 41 into content blocks;
The first determining unit 43 is configured to determine, for each web page obtained by the obtaining unit 41, the label density and/or link density of each content block produced by the division unit 42;
The selecting unit 44 is configured to select, for each web page obtained by the obtaining unit 41, the content blocks whose label density and/or link density satisfies the corresponding preset condition;
The extraction unit 45 is configured to extract, for each web page obtained by the obtaining unit 41, from the content blocks selected by the selecting unit 44, those whose text content is inconsistent with the text content of every content block selected in the other web page;
The second determining unit 46 is configured to determine, for each web page obtained by the obtaining unit 41, the content blocks extracted by the extraction unit 45 to be the body content of the web page.
Preferably, for each content block produced by the division unit 42, the first determining unit 43 determines the ratio of the number of labels in the content block to its text word count to be the label density of the content block, and
determines the ratio of the number of links in the content block to its text word count to be the link density of the content block.
Preferably, the selecting unit 44 specifically comprises an obtaining subunit, a selecting subunit and a determining subunit, wherein:
The obtaining subunit is configured to obtain, for each web page obtained by the obtaining unit 41, the label density threshold and/or link density threshold of each content block produced by the division unit 42;
The selecting subunit is configured to select, for each web page obtained by the obtaining unit 41, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;
The determining subunit is configured to determine, for each web page obtained by the obtaining unit 41, the content blocks selected by the selecting subunit to be content blocks satisfying the preset condition.
More preferably, for each content block produced by the division unit 42, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and
determines the link density threshold of the content block according to the link density variance of the content block.
Preferably, the division unit 42 specifically comprises a preprocessing subunit, a generating subunit and a dividing subunit, wherein:
The preprocessing subunit is configured to normalize each web page obtained by the obtaining unit 41;
The generating subunit is configured to generate, for each web page obtained by the obtaining unit 41, the corresponding DOM tree from the web page preprocessed by the preprocessing subunit;
The dividing subunit is configured to divide each web page obtained by the obtaining unit 41 into content blocks according to the DOM tree generated by the generating subunit.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

1. A method for extracting the body content of a web page, characterized by comprising:
Obtaining two web pages that belong to the same-level directory under the same website;
For each obtained web page, performing the following respectively:
Dividing the web page into content blocks;
Determining the label density and/or link density of each content block, and selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition;
Extracting, from the selected content blocks, the content blocks whose text content is inconsistent with the text content of every content block selected in the other web page;
Determining the extracted content blocks to be the body content of the web page.
2. The method for extracting the body content of a web page as claimed in claim 1, characterized in that determining the label density of each content block specifically comprises:
For each content block, determining the ratio of the number of labels in the content block to its text word count to be the label density of the content block;
And determining the link density of each content block specifically comprises:
Determining the ratio of the number of links in the content block to its text word count to be the link density of the content block.
3. The method for extracting the body content of a web page as claimed in claim 1, characterized in that selecting the content blocks whose label density and/or link density satisfies the corresponding preset condition specifically comprises:
Obtaining the label density threshold and/or link density threshold of each content block;
Selecting the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;
Determining the selected content blocks to be content blocks satisfying the preset condition.
4. The method for extracting the body content of a web page as claimed in claim 3, characterized in that obtaining the label density threshold of each content block specifically comprises:
For each content block, determining the label density variance of the content block according to its label density, and determining the label density threshold of the content block according to the determined label density variance;
And obtaining the link density threshold of each content block specifically comprises:
For each content block, determining the link density variance of the content block according to its link density, and determining the link density threshold of the content block according to the determined link density variance.
5. The method for extracting the body content of a web page as claimed in claim 1, characterized in that dividing the web page into content blocks specifically comprises:
Normalizing the web page;
Generating the corresponding Document Object Model (DOM) tree from the preprocessed web page;
Dividing the web page into content blocks according to the generated DOM tree.
6. A webpage text content extraction device, characterized by comprising:
An obtaining unit, configured to obtain two webpages belonging to the same-level directory under the same website;
A division unit, configured to divide each webpage obtained by the obtaining unit into content blocks;
A first determining unit, configured to determine, for each webpage obtained by the obtaining unit, the label density and/or link density of each content block divided by the division unit;
A selecting unit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density and/or link density satisfies the corresponding preset condition;
An extraction unit, configured to extract, for each webpage obtained by the obtaining unit, from the content blocks selected by the selecting unit, the content blocks whose text content is inconsistent with the text content of every content block selected from the other webpage;
A second determining unit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks extracted by the extraction unit as the text content of the webpage.
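The role of the extraction unit can be illustrated with a short Python sketch. It assumes that "inconsistent" means the block text is not identical to any selected block of the other page after whitespace is collapsed; the claims do not define the comparison, so a real system might use a similarity measure instead. The function name `extract_text_content` is hypothetical.

```python
def extract_text_content(selected_a, selected_b):
    """Keep, for each page, the blocks whose text matches no block of the other page.

    selected_a, selected_b: lists of block texts already selected by density.
    Assumption: two blocks are "consistent" when their whitespace-normalized
    texts are exactly equal.
    """
    def normalize(text):
        return " ".join(text.split())

    seen_a = {normalize(t) for t in selected_a}
    seen_b = {normalize(t) for t in selected_b}
    body_a = [t for t in selected_a if normalize(t) not in seen_b]
    body_b = [t for t in selected_b if normalize(t) not in seen_a]
    return body_a, body_b


# Two pages from the same directory share navigation and footer text.
page_a = ["Home | News | Sports", "Story one body text.", "Copyright 2010"]
page_b = ["Home | News | Sports", "Story two body text.", "Copyright  2010"]
body_a, body_b = extract_text_content(page_a, page_b)
```

Blocks that repeat across both pages, such as navigation bars and copyright notices, drop out, leaving the page-specific text.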
7. The webpage text content extraction device as claimed in claim 6, characterized in that, for each content block divided by the division unit, the first determining unit determines the ratio of the number of labels in the content block to the word count of its text as the label density of the content block, and
Determines the ratio of the number of links in the content block to the word count of its text as the link density of the content block.
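The two ratios translate directly into code. The Python sketch below assumes that the text length is measured in non-whitespace characters, which is a natural reading for Chinese text but is not stated in the claim, and that a block without text gets an infinite density so that it can never pass a threshold. The class `ContentBlock` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContentBlock:
    text: str         # visible text of the block
    label_count: int  # number of labels (markup tags) inside the block
    link_count: int   # number of hyperlinks inside the block


def text_length(block):
    """Count non-whitespace characters (assumed unit of text length)."""
    return sum(1 for ch in block.text if not ch.isspace())


def label_density(block):
    """Number of labels divided by text length."""
    n = text_length(block)
    return block.label_count / n if n else float("inf")


def link_density(block):
    """Number of links divided by text length."""
    n = text_length(block)
    return block.link_count / n if n else float("inf")


block = ContentBlock(text="abcde fghij", label_count=2, link_count=1)
```

Navigation bars and link lists tend to score high on both measures, while article text scores low.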
8. The webpage text content extraction device as claimed in claim 6, characterized in that the selecting unit specifically comprises:
An obtaining subunit, configured to obtain, for each webpage obtained by the obtaining unit, the label density threshold and/or link density threshold of each content block divided by the division unit;
A selecting subunit, configured to select, for each webpage obtained by the obtaining unit, the content blocks whose label density is not greater than the corresponding label density threshold and/or whose link density is not greater than the corresponding link density threshold;
A determining subunit, configured to determine, for each webpage obtained by the obtaining unit, the content blocks selected by the selecting subunit as the content blocks satisfying the preset condition.
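The selection rule is a pair of "not greater than" comparisons. The Python sketch below covers both the "and" and the "or" variants allowed by the claim wording. For brevity it uses one threshold per page for each measure, whereas the claim speaks of a threshold corresponding to each block; the function name `select_blocks` and the flag `require_both` are hypothetical.

```python
def select_blocks(blocks, label_threshold, link_threshold, require_both=True):
    """Select blocks whose densities do not exceed the thresholds.

    blocks: iterable of (text, label_density, link_density) tuples.
    require_both: True applies both conditions ("and");
                  False accepts a block that meets either one ("or").
    """
    selected = []
    for text, label_d, link_d in blocks:
        label_ok = label_d <= label_threshold
        link_ok = link_d <= link_threshold
        keep = (label_ok and link_ok) if require_both else (label_ok or link_ok)
        if keep:
            selected.append(text)
    return selected


# Made-up densities for three blocks of one page.
blocks = [("nav", 0.5, 0.4), ("article", 0.05, 0.01), ("footer", 0.1, 0.3)]
strict = select_blocks(blocks, label_threshold=0.2, link_threshold=0.1)
loose = select_blocks(blocks, 0.2, 0.1, require_both=False)
```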
9. The webpage text content extraction device as claimed in claim 8, characterized in that, for each content block divided by the division unit, the obtaining subunit determines the label density threshold of the content block according to the label density variance of the content block, and
Determines the link density threshold of the content block according to the link density variance of the content block.
10. The webpage text content extraction device as claimed in claim 6, characterized in that the division unit specifically comprises:
A preprocessing subunit, configured to perform normalization preprocessing on each webpage obtained by the obtaining unit;
A generating subunit, configured to generate, for each webpage obtained by the obtaining unit, a corresponding Document Object Model (DOM) tree from the webpage preprocessed by the preprocessing subunit;
A dividing subunit, configured to divide, for each webpage obtained by the obtaining unit, the webpage into content blocks according to the DOM tree generated by the generating subunit.
CN 201010591506 2010-12-16 2010-12-16 Webpage text content extracting method and device Active CN102541874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010591506 CN102541874B (en) 2010-12-16 2010-12-16 Webpage text content extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010591506 CN102541874B (en) 2010-12-16 2010-12-16 Webpage text content extracting method and device

Publications (2)

Publication Number Publication Date
CN102541874A (en) 2012-07-04
CN102541874B (en) 2013-11-06

Family

ID=46348795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010591506 Active CN102541874B (en) 2010-12-16 2010-12-16 Webpage text content extracting method and device

Country Status (1)

Country Link
CN (1) CN102541874B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763740A (en) * 2003-09-18 2006-04-26 富士通株式会社 Info web piece extracting method and device
CN1786947A (en) * 2004-12-07 2006-06-14 国际商业机器公司 System, method and program for extracting web page core content based on web page layout

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064827A (en) * 2013-01-16 2013-04-24 盘古文化传播有限公司 Method and device for extracting webpage content
CN103970755A (en) * 2013-01-28 2014-08-06 腾讯科技(深圳)有限公司 Novel catalog entry identification method, device and system
CN103970755B (en) * 2013-01-28 2018-12-11 腾讯科技(深圳)有限公司 A kind of recognition methods of listing of novel item, device and system
CN103116760A (en) * 2013-02-18 2013-05-22 人民搜索网络股份公司 Method and device for identifying text-missing web pages
CN103258000A (en) * 2013-03-29 2013-08-21 北界创想(北京)软件有限公司 Method and device for clustering high-frequency keywords in webpages
CN103309961B (en) * 2013-05-30 2015-07-15 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
CN103309961A (en) * 2013-05-30 2013-09-18 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
CN103425765A (en) * 2013-08-06 2013-12-04 优视科技有限公司 Method and device for extracting webpage text and method and system for webpage preview
WO2015165324A1 (en) * 2014-04-30 2015-11-05 广州市动景计算机科技有限公司 Webpage text extraction method and device, and webpage advertisement handling method and device
CN105335382A (en) * 2014-06-27 2016-02-17 优视科技有限公司 Webpage text extraction method and device
CN105335382B (en) * 2014-06-27 2018-11-16 优视科技有限公司 The extracting method and device of Web page text
CN104268192A (en) * 2014-09-20 2015-01-07 广州金山网络科技有限公司 Webpage information extracting method, device and terminal
CN104268192B (en) * 2014-09-20 2018-08-07 广州猎豹网络科技有限公司 A kind of webpage information extracting method, device and terminal
CN104484451B (en) * 2014-12-25 2017-12-19 北京国双科技有限公司 The extracting method and device of Webpage information
CN104484451A (en) * 2014-12-25 2015-04-01 北京国双科技有限公司 Web page information extraction method and web page information extraction device
CN105095466A (en) * 2015-07-31 2015-11-25 山东大学 Web text information extraction method
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 web page text extracting method and device
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106855859A (en) * 2015-12-08 2017-06-16 北京搜狗科技发展有限公司 A kind of webpage context extraction method and device
CN107103012A (en) * 2016-01-28 2017-08-29 阿里巴巴集团控股有限公司 Recognize method, device and the server of violated webpage
CN105808644A (en) * 2016-02-25 2016-07-27 浪潮软件集团有限公司 Method and device for determining text node
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN107203527A (en) * 2016-03-16 2017-09-26 北大方正集团有限公司 The text extracting method and system of news web page
CN106227858B (en) * 2016-07-28 2019-06-25 北京橘子文化传媒有限公司 A kind of accurate extracting method of mobile Internet webpage or media platform article content
CN106227858A (en) * 2016-07-28 2016-12-14 北京橘子文化传媒有限公司 A kind of mobile Internet webpage or the accurate extracting method of media platform article content
CN106960057A (en) * 2017-04-05 2017-07-18 上海威固信息技术有限公司 A kind of method that Web page text is extracted based on information density
CN110020312A (en) * 2017-12-11 2019-07-16 北京京东尚科信息技术有限公司 The method and apparatus for extracting Web page text
CN110020312B (en) * 2017-12-11 2022-09-06 北京京东尚科信息技术有限公司 Method and device for extracting webpage text
CN110020247A (en) * 2017-12-22 2019-07-16 中移(苏州)软件技术有限公司 A kind of webpage key modules extracting method and device
CN110020247B (en) * 2017-12-22 2021-05-14 中移(苏州)软件技术有限公司 Webpage key module extraction method and device
CN108763591A (en) * 2018-06-21 2018-11-06 湖南星汉数智科技有限公司 A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN110334300A (en) * 2019-07-10 2019-10-15 哈尔滨工业大学 Text aid reading method towards the analysis of public opinion
CN112749528A (en) * 2019-10-31 2021-05-04 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN102541874B (en) 2013-11-06

Similar Documents

Publication Publication Date Title
CN102541874B (en) Webpage text content extracting method and device
CN101727461B (en) Method for extracting content of web page
CN105630941B (en) Web body matter abstracting methods based on statistics and structure of web page
CN107590219A (en) Webpage personage subject correlation message extracting method
CN103853760B (en) Method and device for extracting contents of bodies of web pages
CN105426539A (en) Dictionary-based lucene Chinese word segmentation method
CN105956052A (en) Building method of knowledge map based on vertical field
CN102270206A (en) Method and device for capturing valid web page contents
WO2010125463A1 (en) Method and apparatus for identifying synonyms and using synonyms to search
CN103473338A (en) Webpage content extraction method and webpage content extraction system
CN104598577A (en) Extraction method for webpage text
CN103810251A (en) Method and device for extracting text
CN102750390A (en) Automatic news webpage element extracting method
CN104750820A (en) Filtering method and device for corpuses
CN102117289A (en) Method and device for extracting comment content from webpage
CN102799638B (en) In-page navigation generation method facing barrier-free access to webpage contents
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN106528509B (en) Webpage information extraction method and device
CN103455572B (en) Obtain the method and device of video display main body in webpage
CN111814476A (en) Method and device for extracting entity relationship
CN105243120A (en) Retrieval method and apparatus
CN107577713A (en) Text handling method based on electric power dictionary
CN104615728B (en) A kind of webpage context extraction method and device
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction
CN102591976A (en) Text characteristic extracting method and document copy detection system based on sentence level

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant