CN102810097A - Method and device for extracting webpage text content - Google Patents


Info

Publication number
CN102810097A
Authority
CN
China
Prior art keywords
content
blocks
text
link text
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101475837A
Other languages
Chinese (zh)
Other versions
CN102810097B (en
Inventor
朱海军
姜吉发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Autonavi Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Autonavi Software Co Ltd filed Critical Autonavi Software Co Ltd
Priority to CN201110147583.7A priority Critical patent/CN102810097B/en
Publication of CN102810097A publication Critical patent/CN102810097A/en
Application granted granted Critical
Publication of CN102810097B publication Critical patent/CN102810097B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for extracting the body text content of a web page. The method comprises: dividing a web page from which body content is to be extracted into content blocks; for each content block, determining its link text length and non-link text length; determining the link text density of the block from those two lengths; and, when the link text density is not greater than a first preset threshold, determining that the block is body content of the page. The method and device address the low accuracy of prior-art web page body content extraction.

Description

Method and device for extracting webpage text content
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and a device for extracting the body text content of a web page.
Background
With the rapid development of Internet technology, the information on web pages grows ever richer. To make better use of it, people keep pursuing techniques that organize and exploit network information effectively. Unlike traditional text, however, web pages are not neat and clean: they contain a large amount of noise, such as scripts added to enhance user interactivity, navigation links added to ease browsing, and advertising links added for commercial reasons.
Web page body extraction means removing from an HTML (HyperText Markup Language) page the information unrelated to the body, such as navigation bars, sidebars, pictures, text-link advertisements and copyright notices, and extracting the body content of the page. Body content extraction is an indispensable step in a search engine.
Prior-art methods for extracting web page body content mainly include extraction based on visual features and extraction based on statistics. Both are described below.
1. Extraction based on visual features
The structure of the web page is first mined from its visual features, which include font, font size, background color, white space, position information and so on. The page is divided into visual information blocks according to these features, and each block is then judged against visual-feature rules to decide whether it is body content. For example, the title of the body content generally uses a fixed font size, the body text immediately follows the title, and the body font size is generally smaller than the title font size, so body content can be extracted according to such rules.
This method relies mainly on the visual features of the page, which sometimes cannot accurately distinguish the boundary between body and non-body content, so its accuracy is low. In addition, as web page formats become ever more varied, for pages in some formats visual features such as font, font size, background color, white space and position information may be unobtainable, or the features obtained may be inaccurate. Extraction based on visual features is therefore not very accurate.
2. Extraction based on statistics
It is generally assumed that the parts of a web page that change little are redundant content, i.e. noise, such as navigation bars, side advertisements and copyright information, while the parts that change often are the body content. A training set containing a large number of web pages can therefore be built, the regions that change little and the regions that change often can be counted from it, and corresponding web page templates can be summarized. To extract the body content of a page, the page is compared with the corresponding template from the training set.
This method relies mainly on the summarized templates. For any individual page, however, the distribution of the body content may differ slightly from the template, so extraction with a uniform template is relatively inaccurate. Moreover, as page formats become more varied, pages in many formats cannot be matched to a uniform template. For example, suppose page 1 and page 2 belong to the same website, page 1 dating from before a site redesign and page 2 from after it, and the body content occupies different positions in the two pages. The template corresponding to page 1 then does not fit page 2, and extracting the body content of page 2 with that template gives inaccurate results.
The prior-art web page body extraction techniques therefore suffer from low extraction accuracy.
Summary of the invention
Embodiments of the invention provide a method and a device for extracting the body content of a web page, in order to solve the prior-art problem of low extraction accuracy.
The technical solution of the embodiments is as follows:
A method for extracting web page body content comprises the steps of: dividing the web page from which body content is to be extracted into content blocks; for each content block obtained, determining the link text length and the non-link text length of the block, and determining the link text density of the block from the determined link text length and non-link text length; and, when the link text density is not greater than a preset first threshold, determining that the block is body content of the page.
A device for extracting web page body content comprises: a content block division unit for dividing the web page from which body content is to be extracted into content blocks; a first text length determination unit for determining, for each content block, its link text length and non-link text length; a first link text density determination unit for determining the link text density of the block from the lengths determined by the first text length determination unit; a first link text density judging unit for judging whether the link text density determined by the first link text density determination unit is greater than the preset first threshold; and a body content determination unit for determining that the block is body content of the page when the judgment of the first link text density judging unit is negative.
For each web page from which body content is to be extracted, the technical solution uses the proportion of a content block taken up by link text (the link text density) to decide whether the block is body content. The larger the proportion of link text, the less likely the block is to be body content: if the density exceeds the first threshold the block is determined to be non-body content, and otherwise it is determined to be body content. Because extraction is performed on each page individually, it is not affected by differences between page formats, and the proportion of link text in a block reflects fairly objectively and accurately how likely the block is to be body content, which effectively improves extraction accuracy.
Description of drawings
Fig. 1 is a flow chart of the method for extracting web page body content in an embodiment of the invention;
Fig. 2 is a flow chart of a specific implementation of the method for extracting web page body content in an embodiment of the invention;
Fig. 3 is a structural diagram of the device for extracting web page body content in an embodiment of the invention.
Detailed description
The main implementation principle of the technical solution, its specific embodiments and the beneficial effects it can achieve are described in detail below with reference to the drawings.
Fig. 1 is a flow chart of the method for extracting web page body content in an embodiment of the invention. The method proceeds as follows:
Step 11: divide the web page from which body content is to be extracted into content blocks.
A web page usually describes one or more topics in paragraphs of text. It also contains pictures, links and other content, but these are not the main part of the page and are small compared with the body content.
Dividing the page into content blocks means dividing it according to the container tag pairs in the page: the content inside each container tag pair becomes one content block. This comprises the following sub-steps:
normalize (pre-process) the web page from which body content is to be extracted;
obtain each container tag pair in the pre-processed page;
divide the pre-processed page into content blocks according to the container tag pairs obtained.
Normalization pre-processing makes the page conform to the HTML (HyperText Markup Language) specification. It mainly comprises unifying the page encoding, simplifying tags, and deleting code segments unrelated to the body text, each of which is described below.
1. Unifying the page encoding
Different websites do not necessarily use the same encoding, so pages may be encoded differently. To extract body content correctly, each page is converted from its own encoding to a single unified encoding. The conversion can rely on, but is not limited to, the charset attribute of the meta tag.
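The encoding unification described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `to_unicode`, the regular expression and the UTF-8 fallback are all assumptions made for the example, and the unified encoding is taken to be Python's internal Unicode string.

```python
import re

# Finds charset=... inside a <meta> tag, e.g. <meta charset="gbk"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)

def to_unicode(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw page bytes using the charset declared in the meta tag.

    Hypothetical helper: falls back to `fallback` when no charset is
    declared or the declared name is not a codec Python knows.
    """
    match = _META_CHARSET.search(raw)
    declared = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(declared, errors="replace")
    except LookupError:  # unknown codec name declared by the page
        return raw.decode(fallback, errors="replace")
```

Undecodable bytes are replaced rather than raising an error, which is one possible choice for noisy real-world pages.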
2. Simplifying tags
The main purpose of simplifying tags is to ease the later extraction of body content and further improve the accuracy of the result. It mainly covers the following:
Different tag attributes play different roles when the page is rendered; some attributes, for example, prevent content from being displayed. Such attributes should be deleted during tag simplification so that they do not appear in the extracted body content. For example, the attributes in "<td height="29" align="right">" are deleted, simplifying the tag to "<td>".
HTML tag names are case-insensitive, so for ease of later processing all tags can be converted to a single case during simplification, for example upper case: the tag td in "<td height="29" align="right">" becomes "TD".
Tags are also processed according to their attributes, for example deleted or replaced. For instance, "_ATTR_DEL" can denote a delete mark and "_ATTR_REP" a replace mark. During simplification, if the mark corresponding to a tag attribute is "_ATTR_DEL", the tag pair containing that attribute is deleted together with all the content it encloses; if the mark is "_ATTR_REP", the tag carrying that attribute is replaced.
In practice tag simplification can include much more. Those skilled in the art can design it according to how each page is written, and no limitation is imposed here.
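Two of the simplifications above, dropping tag attributes and unifying tag case, can be sketched with a single regular expression. The function name `simplify_tags` is hypothetical, the "_ATTR_DEL" and "_ATTR_REP" handling is not covered, and the sketch assumes attribute values never contain a ">" character.

```python
import re

# Matches an opening or closing tag and captures the optional slash and the
# tag name; everything after the name (the attributes) is discarded.
_TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

def simplify_tags(html: str) -> str:
    """Strip all attributes and convert tag names to upper case."""
    return _TAG.sub(lambda m: f"<{m.group(1)}{m.group(2).upper()}>", html)

print(simplify_tags('<td height="29" align="right">text</td>'))
# prints: <TD>text</TD>
```

Comments and doctype declarations start with "<!" and are left untouched by this pattern.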
3. Deleting code segments unrelated to the body text
Some HTML code exists to implement page functions and is unrelated to the body content, for example CSS code and script code. This code must be deleted during normalization pre-processing. Table 1 lists the tag pairs of the code to be deleted.
Table 1
Start tag    End tag      Remarks
<SCRIPT>     </SCRIPT>    Script code
<STYLE>      </STYLE>     CSS code
<FORM>       </FORM>      Form
The normalization pre-processing above is a conventional step before body extraction. Those skilled in the art may adapt and vary it on the basis of the description above, and the embodiments do not limit it.
Because the page is normalized before its body content is extracted, the problem of body content not being extracted correctly because of mistakes or non-standard code in the HTML is avoided, which makes the proposed extraction method more fault-tolerant.
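The deletion of the tag pairs listed in Table 1 can be sketched as below. The function name `strip_unrelated_code` is an assumption for illustration, and a regular expression is used only for brevity; it does not handle the same tag nested inside itself.

```python
import re

# One pattern for the three tag pairs of Table 1. The backreference \1 makes
# the end tag match the start tag; DOTALL lets the content span several lines.
_UNRELATED = re.compile(
    r"<(script|style|form)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)

def strip_unrelated_code(html: str) -> str:
    """Remove SCRIPT, STYLE and FORM tag pairs together with their content."""
    return _UNRELATED.sub("", html)
```

Matching is case-insensitive, so the sketch works whether or not tag names have already been converted to upper case.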
After normalization pre-processing, the page is divided into content blocks. Specifically, each container tag pair in the pre-processed page is obtained first, and the page is then divided into content blocks according to those pairs.
Common container tag pairs include the <TABLE>, <TR>, <TD>, <DIV> and <P> tag pairs. The content between the start tag and the end tag of a container tag pair is the content block corresponding to that pair.
Because web pages generally contain container tag pairs, dividing content blocks by container tag pairs is highly general: it is not restricted by page format and is not affected by website redesigns.
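A minimal sketch of dividing a page into content blocks by container tag pairs, using Python's standard `html.parser`. The class and function names (`Block`, `BlockSplitter`, `split_blocks`) are hypothetical, and the set of container tags is limited to the five named in the text.

```python
from html.parser import HTMLParser

CONTAINER_TAGS = {"table", "tr", "td", "div", "p"}  # tags named in the text

class Block:
    """One content block: the content between a container start and end tag."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # nested content blocks
        self.text = []      # text pieces inside this block, at any depth

class BlockSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in CONTAINER_TAGS:
            block = Block(tag, self.current)
            self.current.children.append(block)
            self.current = block

    def handle_endtag(self, tag):
        # Close the nearest open block with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        piece = data.strip()
        node = self.current
        while piece and node is not None:
            node.text.append(piece)  # text belongs to every enclosing block
            node = node.parent

def split_blocks(html):
    """Return the root block whose children are the top-level content blocks."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.root
```

Keeping the nesting as a tree, rather than a flat list, preserves the inner and outer relationships between blocks.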
Step 12: for each content block obtained, determine its link text length and non-link text length.
Here, text means the character string in the content, and text length means the length of that string.
To determine the link text length of a content block, the text length of each link in the block, i.e. the number of characters the link contains, is determined first, and the link text length of the block is then determined from these per-link lengths.
Specifically, the link text length of the block can be, but is not limited to, the sum of the text lengths of the links in the block:
$$\mathrm{Len}(\mathrm{LinkText})_i = \sum_{j=1}^{n} \mathrm{Len}(\mathrm{LinkText})_{ij}$$
where Len(LinkText)_i is the link text length of content block i;
n is the number of links contained in content block i;
Len(LinkText)_ij is the text length of the j-th link in content block i, with 1 ≤ j ≤ n.
To determine the non-link text length of a content block, the total text length of the block can be determined first and the link text length of the block subtracted from it; the result is the non-link text length Len(NonLinkText)_i of the block. Alternatively, the lengths of the individual non-link texts in the block can be summed. The embodiments do not limit this.
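The two lengths can be computed in one pass over a block's HTML, as in the sketch below. The names `text_lengths` and `_LengthCounter` are hypothetical. Two choices are assumptions rather than anything the text specifies: leading and trailing whitespace of each text piece is not counted, and the non-link length is obtained by subtracting the link length from the total.

```python
from html.parser import HTMLParser

class _LengthCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while inside an <a> element
        self.link_lengths = []  # one entry per link: Len(LinkText)_ij
        self.total_len = 0      # total text length of the block

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
            self.link_lengths.append(0)

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())  # assumption: surrounding whitespace is ignored
        self.total_len += n
        if self.anchor_depth > 0:
            self.link_lengths[-1] += n

def text_lengths(block_html):
    """Return (link_len, non_link_len, n_links) for one content block."""
    counter = _LengthCounter()
    counter.feed(block_html)
    counter.close()
    link_len = sum(counter.link_lengths)
    return link_len, counter.total_len - link_len, len(counter.link_lengths)

print(text_lengths('<p>Hello world <a href="#">home</a> <a href="#">about</a></p>'))
# prints: (9, 11, 2)
```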
Step 13: for each content block obtained, determine the link text density of the block from the determined link text length and non-link text length.
Web page body content is usually presented as long passages of text. Link content, such as navigation links and advertising links, is generally regarded as not being body content, while the non-link content of the page is regarded as body content. The core idea of the embodiments is to decide whether a block is body content from the proportion taken up by link text: the larger that proportion, the less likely the block is to be body content, and the smaller that proportion, the more likely.
There are many ways to determine the link text density of a block (the proportion of link text described above) from the two lengths. It can be determined from the ratio of link text length to non-link text length, or from the ratio of link text length to the total text length of the block, among others. Any measure that reflects the proportion of link text to some degree will do, and the embodiments do not limit this.
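The two ratio-based definitions mentioned above are monotonically related, which the short derivation below makes explicit. The shorthand L_i, N_i, f_i and g_i is introduced here only for illustration, and the relation is stated for N_i > 0 and for the plain ratios, without any further adjustment.

```latex
L_i = \mathrm{Len}(\mathrm{LinkText})_i, \qquad N_i = \mathrm{Len}(\mathrm{NonLinkText})_i

f_i = \frac{L_i}{N_i}, \qquad
g_i = \frac{L_i}{L_i + N_i} = \frac{L_i / N_i}{1 + L_i / N_i} = \frac{f_i}{1 + f_i}
```

Because g_i increases whenever f_i increases, a threshold t on the link-to-non-link ratio corresponds to a threshold t / (1 + t) on the link-to-total ratio. For instance, a threshold of 0.9 on f_i corresponds to 0.9 / 1.9 ≈ 0.474 on g_i.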
In the embodiment based on the ratio of link text length to non-link text length, that ratio can be used directly as the link text density of the block:
$$f_i = \frac{\mathrm{Len}(\mathrm{LinkText})_i}{\mathrm{Len}(\mathrm{NonLinkText})_i}$$
where f_i is the link text density of content block i;
Len(LinkText)_i is the link text length of content block i;
Len(NonLinkText)_i is the non-link text length of content block i.
The longer the link text a content block contains, the less likely the block is to be body content. To reflect this, when the link text density is determined from the ratio of link text length to non-link text length, the density can additionally be adjusted with a penalty factor: the ratio is multiplied by the penalty factor to give the link text density of the block.
The penalty factor can be, but is not limited to, the number of links the block contains, in which case the link text density of the block is:
$$f_i = \frac{\mathrm{Len}(\mathrm{LinkText})_i}{\mathrm{Len}(\mathrm{NonLinkText})_i} \times n$$
where n is the number of links contained in content block i, i.e. the penalty factor.
Table 2 shows a content block i from a web page that contains navigation links and copyright information. The link text length of block i, i.e. the total number of characters in its links, is 28, and its non-link text length is 56. Without the penalty factor n, f_i is 28/56 = 0.5, which is less than the first threshold of 0.9, so the block would be mistaken for non-link content and extracted as body content. With the penalty factor, where n = 7, f_i is 0.5 × 7 = 3.5, which is greater than the first threshold of 0.9, so the block is judged to be link content and is deleted from the HTML code. Adjusting the density with the penalty factor thus further improves the accuracy of body extraction.
Table 2 (content block i, containing navigation links and copyright information; supplied as an image in the original publication and not reproduced here)
Because the penalty factor amplifies the link text density of a content block, it prevents some link content from being misjudged as body content, and so improves the accuracy of body content extraction.
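The density formula with and without the penalty factor can be checked against the Table 2 numbers with a few lines of Python. The function name `link_text_density` is hypothetical, and the handling of a block with no non-link text (division by zero) is an assumption, since the text does not address that case.

```python
def link_text_density(link_len, non_link_len, n_links=None):
    """Ratio of link text to non-link text, optionally times the link count.

    n_links=None gives the plain ratio; passing the number of links applies
    it as the penalty factor.
    """
    if non_link_len == 0:
        # Assumption: a block made only of link text is maximally dense.
        return float("inf") if link_len > 0 else 0.0
    ratio = link_len / non_link_len
    return ratio if n_links is None else ratio * n_links

# Worked example from the text: link length 28, non-link length 56, 7 links.
plain = link_text_density(28, 56)         # 0.5, below the 0.9 threshold
penalized = link_text_density(28, 56, 7)  # 3.5, above the 0.9 threshold
print(plain, penalized)
```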
Step 14: for each content block obtained, when the link text density of the block is not greater than the preset first threshold, determine that the block is body content of the page.
After the link text density f_i of a block has been determined, it is compared with the preset first threshold. If f_i is greater than the first threshold, the block is considered non-body content and is deleted from the HTML code. If f_i is not greater than the first threshold, the block is considered body content of the page and can be added to the body content result set.
The first threshold is usually an empirical value. Those skilled in the art can set it according to the required extraction accuracy, and the embodiments do not limit it.
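The threshold comparison of step 14 can be sketched as a function that splits blocks into a body content result set and a discarded set. The name `classify_blocks`, the tuple layout and the sample blocks are illustrative assumptions; the default threshold of 0.9 is the value used in the worked example, not a recommended setting.

```python
import math

FIRST_THRESHOLD = 0.9  # value from the worked example; an empirical setting

def classify_blocks(blocks, threshold=FIRST_THRESHOLD):
    """Split blocks into (body, non_body) lists of block names.

    blocks: iterable of (name, link_len, non_link_len, n_links) tuples.
    The density uses the link count as the penalty factor.
    """
    body, non_body = [], []
    for name, link_len, non_link_len, n_links in blocks:
        if non_link_len == 0:
            # Assumption: link-only blocks are treated as infinitely dense.
            density = math.inf if link_len else 0.0
        else:
            density = link_len / non_link_len * n_links
        # "Not greater than the threshold" means body content.
        (body if density <= threshold else non_body).append(name)
    return body, non_body

sample = [("article", 12, 600, 2), ("nav", 28, 56, 7), ("footer", 10, 0, 1)]
print(classify_blocks(sample))
# prints: (['article'], ['nav', 'footer'])
```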
Container tag pairs such as <TABLE>, <TR>, <TD>, <DIV> and <P> may be nested. The <P> tag pair is generally used to split text into paragraphs; it is a relatively inner tag pair and seldom contains nested tag pairs, whereas the <TABLE> tag pair often does. When nesting occurs, the embodiments call the tag pair with the larger scope the outer tag pair and the tag pair with the smaller scope the inner tag pair. For example, if a <P> tag pair is nested inside a <TABLE> tag pair, i.e. the content block of the <P> pair lies within the content block of the <TABLE> pair, then the <P> pair is the inner tag pair, the <TABLE> pair is the outer tag pair, and the content block of the <P> pair is called a nested content block of the content block of the <TABLE> pair, as shown in Table 3.
Table 3
Start tag    End tag      Tag pair scope
<P>          </P>         Inner tag
<SPAN>       </SPAN>
<DIV>        </DIV>
<UL>         </UL>
<TR>         </TR>
<TABLE>      </TABLE>     Outer tag
In practice tag pairs may be nested one level deep or several levels deep. Body extraction can proceed from inner tag pairs to outer tag pairs, i.e. the innermost content blocks are processed first and then each layer in turn from the inside out. It can also proceed from outer tag pairs to inner tag pairs, i.e. the outermost content blocks are processed first and then each layer in turn from the outside in. The embodiments call layer-by-layer processing from inner to outer tag pairs the sequential processing mode, and layer-by-layer processing from outer to inner tag pairs the reverse-order processing mode.
Sequential and reverse-order processing are described in detail below for one level of nested content blocks. The principle for multiple levels of nesting is the same as for one level and is not repeated here.
In the sequential processing mode, before steps 12 to 14 are performed on a content block, it is first judged whether at least one nested content block is nested in it. If not, steps 12 to 14 are performed on the block directly. If so, then for each nested content block the link text length and non-link text length are determined, the link text density of the nested block is determined from them, and when that density is greater than the preset first threshold the nested block is determined to be non-body content of the page. The content of the block other than the nested blocks determined to be non-body content is then taken as the content block. In other words, each nested block in a content block is judged first; nested blocks that are non-body content are deleted from the enclosing block, and the enclosing block is then judged after the deletion. For example, suppose the content block of a <TABLE> tag pair contains the content blocks of three <P> tag pairs. The three nested <P> blocks are judged first. If two of them are body content and one is non-body content, the non-body block is deleted from the <TABLE> block, and the <TABLE> block after the deletion is then judged: if it is body content it is added to the body content result set, and if it is non-body content it is deleted from the HTML code. The sequential processing mode effectively prevents body content from being filtered out.
In the reverse-order processing mode, after step 14 has determined that a content block is body content of the page, it is further judged whether at least one nested content block is nested in it. If not, the process ends. If so, then for each nested content block the link text length and non-link text length are determined, the link text density of the nested block is determined from them, and when that density is greater than the preset first threshold the nested block is determined to be non-body content and is deleted from the body content; the content of the block other than the nested blocks determined to be non-body content is then determined to be body content of the page. In other words, it is first determined whether the link text density of the block is greater than the preset first threshold. If it is, the block is deleted from the HTML code. If it is not, each nested block is judged in turn, the nested blocks determined to be non-body content are deleted from the HTML code, and what remains of the block is body content of the page. For example, suppose the content block of a <TABLE> tag pair contains the nested content blocks of three <P> tag pairs. It is first determined whether the link text density of the <TABLE> block is greater than the preset first threshold. If it is, the block is deleted from the HTML code. If it is not, the three nested <P> blocks are judged in turn, those determined to be non-body content are deleted from the HTML code, and the block after the deletion is added to the body content result set. The reverse-order processing mode effectively prevents link content from being misjudged as body content.
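The difference between the two processing modes can be seen in a small sketch over a tree of blocks. Everything here is an illustrative assumption: the `Block` dataclass, the function names, the sample lengths, and the choice to compute a block's density from its own text plus that of the nested blocks it still contains, with the link count as the penalty factor.

```python
from dataclasses import dataclass, field
from typing import List, Optional

THRESHOLD = 0.9  # first threshold, value taken from the worked example

@dataclass
class Block:
    name: str
    link_len: int = 0      # link text directly in this block
    non_link_len: int = 0  # non-link text directly in this block
    n_links: int = 0       # links directly in this block
    children: List["Block"] = field(default_factory=list)

def totals(b: Block):
    """Lengths and link count of a block including its nested blocks."""
    link, non, n = b.link_len, b.non_link_len, b.n_links
    for c in b.children:
        cl, cn, ck = totals(c)
        link, non, n = link + cl, non + cn, n + ck
    return link, non, n

def density(b: Block) -> float:
    link, non, n = totals(b)
    if non == 0:
        return float("inf") if link else 0.0  # assumption for link-only blocks
    return link / non * n

def sequential(b: Block) -> Optional[Block]:
    """Inner first: prune non-body nested blocks, then judge what remains."""
    kept = [k for k in (sequential(c) for c in b.children) if k is not None]
    pruned = Block(b.name, b.link_len, b.non_link_len, b.n_links, kept)
    return pruned if density(pruned) <= THRESHOLD else None

def reverse_order(b: Block) -> Optional[Block]:
    """Outer first: judge the whole block, then prune non-body nested blocks."""
    if density(b) > THRESHOLD:
        return None
    kept = [k for k in (reverse_order(c) for c in b.children) if k is not None]
    return Block(b.name, b.link_len, b.non_link_len, b.n_links, kept)

# A TABLE block with two text paragraphs and one link-heavy paragraph.
table = Block("table", children=[
    Block("p1", non_link_len=100),
    Block("p2", non_link_len=100),
    Block("p3", link_len=40, non_link_len=5, n_links=8),
])
```

With these sample numbers the whole table has density 40 / 205 × 8 ≈ 1.56, so the reverse-order mode discards it entirely, while the sequential mode first removes p3 and then keeps the table with p1 and p2. When the text paragraphs are long enough for the table as a whole to pass the threshold, both modes end with the same result.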
The embodiments propose that, when judging whether each content block obtained is body content of the page, the reverse-order processing mode is preferably used to handle nested content blocks.
Can know that by above-mentioned processing procedure embodiment of the invention technical scheme need be extracted the webpage of body matter to each; Through link text length what (being link text density) of shared ratio in content blocks, determine whether body matter into webpage, when link text length proportion many more; Think that then to become the possibility of body matter more little; If surpass first defined threshold, then confirm as non-body matter, otherwise in like manner.It is thus clear that the embodiment of the invention is to carry out to each webpage that need extract body matter when extracting the Web page text content; Can not receive the influence of different web pages form difference, and link text length what of proportion in content blocks, can be comparatively objective and accurate reflect this content blocks become Web page text possibility what; Adopt other factor of judgment more accurately rationally relatively; Such as, adopting number of links relatively, this programme can effectively avoid less for number of links; But it is the situation appearance of Web page text that the content that overall link text length is long is mistaken as, and then improves the accuracy of extracting Web page text effectively.
In an alternative embodiment of the invention, for further improving the accuracy of Web page text contents extraction, the embodiment of the invention also comprises the high frequency unit filtration step.
In the Web page text content of confirming according to link text density; Also possibly comprise " noises " such as weather forecast, source, website, click volume, copyright informations; The frequency that these redundant informations occur in webpage is higher; Through the Web page text content of confirming according to link text density is carried out the high frequency unit filtration treatment,, make that Web page text contents extraction result is more accurate to filter out above-mentioned redundant information.
To perform high-frequency unit filtering on the body text determined according to link text density, the content elements are determined first. Specifically, each tag in the body text is obtained, and the content between each two adjacent tags is determined to be a content element. For example, suppose that when body text is determined according to link text density, the content block corresponding to a <TABLE> tag pair is determined to be body text, and two <P> tag pairs are nested inside the <TABLE> tag pair. The content between the start tag and the end tag of the <TABLE> tag pair can then be divided into five content elements: the content between the start tag of the <TABLE> tag pair and the start tag of the first <P> tag pair is the first content element; the content between the start tag and the end tag of the first <P> tag pair is the second content element; the content between the end tag of the first <P> tag pair and the start tag of the second <P> tag pair is the third content element; the content between the start tag and the end tag of the second <P> tag pair is the fourth content element; and the content between the end tag of the second <P> tag pair and the end tag of the <TABLE> tag pair is the fifth content element.
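The division into content elements can be illustrated with a short sketch. The function name and the simple regular expression used to locate tags are assumptions made for this example only; a production implementation would more likely rely on a proper HTML parser.

```python
import re
from typing import List

# Naive tag matcher, assumed sufficient for the simplified, preprocessed HTML.
TAG_RE = re.compile(r"<[^>]+>")


def split_content_elements(html_fragment: str) -> List[str]:
    """Return the content found between each pair of adjacent tags."""
    tags = list(TAG_RE.finditer(html_fragment))
    elements = []
    for current, following in zip(tags, tags[1:]):
        # The content element is the text between the end of one tag
        # and the start of the next tag.
        elements.append(html_fragment[current.end():following.start()])
    return elements
```

Applied to a <TABLE> tag pair containing two <P> tag pairs, this yields five content elements, matching the example in the preceding paragraph.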
After the body text determined according to link text density has been divided into content elements, the following operations are performed for each content element. The content element is matched against a preset content template library. If the match succeeds, the matching frequency of the matched content template in the content template library is incremented by 1, and it is then judged whether the matching frequency is greater than a preset second specified threshold. If it is, the content element is considered to occur frequently, so it is determined to be non-body content of the webpage and is deleted from the body text. If it is not, the content element is considered to occur infrequently, so it is determined to be body text of the webpage, and the high-frequency unit filtering of this content element ends.
Preferably, the process of matching a content element against the preset content template library may be, but is not limited to, the following:
First, the content templates in the content template library are searched for a content template consistent with the content of the content element, where the content templates in the library were obtained by previously matching the content elements of at least one webpage. If a content template with consistent content is found, the match is considered successful; otherwise the match is considered to have failed.
Further, before the search for a content template consistent with the content of the content element, it may first be judged whether any content template is stored in the content template library. If none is stored, the match fails. If content templates are stored, the search of the content templates in the library for one consistent with the content of the content element is performed. That is, when a content element is matched against the library, the library may already contain content templates, in which case the search can be performed directly; or the library may contain none, for example when the library has just been created, in which case the result of matching the content element against the library is deemed a failure.
If the result of matching a content element against the content template library is a failure, the high-frequency filtering of this content element may simply end. Preferably, the content element may instead be stored in the content template library as a new content template, with its matching frequency set to an initial value, which may be, but is not limited to, 1.
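The matching and frequency-counting behaviour described above might be sketched as follows. The class and method names are hypothetical, exact string equality is assumed as the meaning of "consistent content", and a dictionary stands in for the content template library.

```python
from typing import Dict


class TemplateLibrary:
    """Hypothetical content template library keyed by template content."""

    def __init__(self, second_threshold: int, initial_frequency: int = 1):
        self.second_threshold = second_threshold
        self.initial_frequency = initial_frequency
        self.frequencies: Dict[str, int] = {}

    def is_redundant(self, element: str) -> bool:
        """Match one content element and report whether it is non-body content."""
        if element not in self.frequencies:
            # Match failed (including the empty-library case): store the
            # element as a new template with the initial matching frequency.
            self.frequencies[element] = self.initial_frequency
            return False
        # Match succeeded: increment the frequency, then compare it with
        # the second specified threshold.
        self.frequencies[element] += 1
        return self.frequencies[element] > self.second_threshold
```

With a second specified threshold of 2, the same copyright line is kept the first two times it is seen and reported as redundant from the third time onward.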
To save storage space in the content template library, a content template may be encoded using a preset encoding rule before it is stored, and the encoded content template is then stored in the library. Because the library then holds encoded content templates, a content element must first be encoded with the same preset encoding rule before it is matched against the library, and the encoded content element is then matched against the library.
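The original text does not specify the preset encoding rule. Purely as an assumption for illustration, a short fixed-length hash digest could serve as the encoded form, so that every stored template occupies the same small amount of space regardless of its length. The function name below is hypothetical.

```python
import hashlib


def encode_template(content: str) -> bytes:
    """Encode template content as an 8-byte digest (assumed encoding rule)."""
    # BLAKE2b allows the digest size to be chosen; 8 bytes is an arbitrary
    # choice for this sketch, trading a small collision risk for compactness.
    return hashlib.blake2b(content.encode("utf-8"), digest_size=8).digest()
```

Both the stored templates and each incoming content element would be passed through the same function before comparison, which mirrors the requirement that the content element be encoded with the same rule before matching.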
High-frequency unit filtering is mainly intended to filter out redundant information such as weather forecasts, website sources, click counts and copyright information, and the text of such redundant information is generally not very long. Therefore, to reduce the storage size of the content template library, it is preferable to judge, before a content element is matched against the library, whether the text length of the content element is greater than a preset third specified threshold. If it is, the content element is considered to be body text, no high-frequency unit filtering is needed, and the high-frequency unit filtering of this content element ends. If it is not, the content element may be redundant information, and the matching of the content element against the preset content template library is performed.
In practice, the search for a content template consistent with the content of a content element may be a direct search, that is, the content of the content element is compared directly with the content of each content template to judge whether they are consistent. Alternatively, when a content element is stored in the library as a content template, keywords may be extracted from the content element and stored in the library as the content template. Each content template stored in the library then consists of the corresponding keywords, and when searching for a content template consistent with the content of a content element, the content of the element is compared with the keywords in each content template to determine whether they are consistent. For example, the keywords of the content element are extracted first, and the content templates in the library are then searched for one consistent with those keywords.
Because the content of redundant information such as website sources and copyright information is generally fairly fixed and varies little in wording, the method of directly searching for a content template consistent with the content of the content element is preferred.
In HTML code, characters such as "<", ">" and "&" have special meanings. They are reserved characters of the HTML language and therefore cannot be used directly; when such a character is needed, its escape sequence is used in place of the display character. The correspondence between escape sequences and display characters may be, but is not limited to, that shown in Table 4.
Table 4:
Escape sequence | Display character | Escape sequence | Display character | Escape sequence | Display character
&nbsp; | (non-breaking space) | &quot; | " | &ldquo; | “
&times; | × | &copy; | © | &rdquo; | ”
&divide; | ÷ | &reg; | ® | &mdash; | —
&amp; | & | &trade; | ™ | &#8240; | ‰
Further, before high-frequency unit filtering is performed, the escape sequences in the HTML code need to be restored to the corresponding display characters, that is, each escape sequence in the HTML code is replaced with its corresponding display character.
A preferred high-frequency unit filtering flow is illustrated below.
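As one possible way of carrying out this restoration, the Python standard library function `html.unescape` replaces named and numeric HTML escape sequences with their display characters. The wrapper function name below is hypothetical; the original text does not prescribe any particular implementation.

```python
import html


def restore_display_characters(text: str) -> str:
    """Replace HTML escape sequences such as &amp; or &#8240; with characters."""
    return html.unescape(text)
```

For instance, a copyright line written with `&copy;` and `&amp;` is restored to the characters a reader would actually see, so that the same redundant line matches the same content template however it was escaped.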
Initially, no content template is stored in the content template library. When high-frequency unit filtering is performed on the first webpage (called webpage 1; note that webpage 1 may be the webpage from which body text is to be extracted, or may be another webpage), the body text of webpage 1 obtained using link text density is first divided into a plurality of content elements, for example content element 1A, content element 1B and content element 1C. Content element 1A is processed first: its text length is obtained and judged to be greater than the third specified threshold, so content element 1A is considered to be body text and its high-frequency unit filtering ends. Content element 1B is processed next: its text length is obtained and judged to be not greater than the third specified threshold, so content element 1B may be redundant information, and it is then judged whether any content template is stored in the content template library. Because the library is in its initial state and stores no content template, content element 1B is stored directly in the library as a new content template, with the matching frequency of this content template set to the initial value 1. Because this matching frequency is less than the preset second specified threshold, 5, content element 1B is determined to be body text. Content element 1C is processed next: its text length is obtained and judged to be not greater than the third specified threshold, so content element 1C may be redundant information, and it is then judged whether any content template is stored in the library. Because the library stores the content template corresponding to content element 1B, the library is searched for a content template consistent with the content of content element 1C. None is found, so content element 1C is determined to be body text and is stored in the library as a new content template, with the matching frequency of this content template set to the initial value 1.
When high-frequency unit filtering is performed on the second webpage (called webpage 2; note that webpage 2 may be the webpage from which body text is to be extracted, or may be another webpage), the body text of webpage 2 obtained by link filtering is first divided into a plurality of content elements, for example content element 2A and content element 2B. Content element 2A is processed first: its text length is obtained and judged to be greater than the third specified threshold, so content element 2A is considered to be body text and its high-frequency unit filtering ends. Content element 2B is processed next: its text length is obtained and judged to be not greater than the third specified threshold, so content element 2B may be redundant information, and it is then judged whether any content template is stored in the content template library. Because the library stores the content templates corresponding to content element 1B and content element 1C, the library is searched for a content template consistent with the content of content element 2B. None is found, so content element 2B is determined to be body text and is stored in the library as a new content template, with the matching frequency of this content template set to the initial value 1.
Webpage 3 through webpage (N-1) are each subjected to high-frequency unit filtering according to the above flow. When high-frequency unit filtering is performed on the N-th webpage (called webpage N; note that webpage N may be the webpage from which body text is to be extracted, or may be another webpage), the body text of webpage N obtained by link filtering is first divided into a plurality of content elements, for example content element NA and content element NB. Content element NA is processed first: its text length is obtained and judged to be greater than the third specified threshold, so content element NA is considered to be body text and its high-frequency unit filtering ends. Content element NB is processed next: its text length is obtained and judged to be not greater than the third specified threshold, so content element NB may be redundant information, and it is then judged whether any content template is stored in the content template library. Because content templates are stored in the library, the content templates stored in the library are searched for one consistent with the content of content element NB. Because the content of content element NB is identical to that of content element 1B, a content template with consistent content is found, and the matching frequency of the found content template (the content template corresponding to content element 1B) is incremented by 1. The matching frequency of the found content template is now 10, which is greater than the preset second specified threshold, so content element NB is determined to be redundant information and is deleted from the body text.
The values of click counts and page views in redundant information change constantly. To prevent these constantly changing values from causing click counts and page views to be misjudged as body text during high-frequency unit filtering, the numbers may additionally be normalized before the determined content elements are matched against the content template library: each numeric character contained in a content element is converted into a single preset character.
The embodiment of the invention constructs two filtering models for extracting webpage text content: a link filtering model and a high-frequency unit filtering model. The link filtering model mainly filters out link content unrelated to the body text, such as navigation bar links and advertisement links; the high-frequency unit filtering model mainly filters out redundant information that occurs with high frequency in webpages. When the body text of a webpage is extracted, after preprocessing is complete, the preprocessed webpage can be pictured as being passed through the two filtering models in turn, first the link filtering model and then the high-frequency unit filtering model, and the result after filtering by both models is the body text of the webpage.
To test the effect of link filtering, 100 webpages can be randomly selected from crawled news webpages, and 100 non-body paragraphs containing link content chosen from these 100 webpages as test material. Of this link content, 98% is filtered out by the link filtering model; the remaining 2% (the link content shown in Table 5) is not filtered out by the link filtering model.
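The digit normalization can be sketched in a few lines. The function name and the choice of "0" as the unified preset character are assumptions for illustration; the original text only requires that every numeric character be converted into the same preset character.

```python
import re

DIGIT_RE = re.compile(r"\d")


def normalize_digits(element: str, placeholder: str = "0") -> str:
    """Replace every numeric character with one preset placeholder character."""
    # A function is used as the replacement so that the placeholder is
    # inserted literally, whatever character is chosen.
    return DIGIT_RE.sub(lambda _match: placeholder, element)
```

After normalization, two click-count lines that differ only in their digits become identical and therefore match the same content template. Because the replacement is per character, values with different numbers of digits still yield different strings.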
Table 5:
[Image BDA0000065942530000181: link content not filtered out by the link filtering model]
Although this 2% of link content is not correctly filtered by the link filtering model, it appears repeatedly in multiple webpages and can therefore be correctly filtered by the high-frequency unit filtering model. Because the embodiment of the invention uses both the link filtering model and the high-frequency unit filtering model for double filtering, redundant information is removed to the greatest possible extent, which further improves the accuracy of webpage text content extraction.
A more detailed embodiment is given below.
Fig. 2 is a detailed flow chart of the webpage text content extraction method in the embodiment of the invention. The extraction process is divided into a preprocessing part, a link filtering part and a high-frequency unit filtering part: steps 21 to 23 are the preprocessing part, steps 24 to 29 are the link filtering part, and steps 210 to 218 are the high-frequency unit filtering part. The detailed process of the webpage text content extraction method is as follows:
Step 21, unify the encoding format of the webpage from which body text is to be extracted;
Step 22, simplify the tags of the webpage from which body text is to be extracted;
Step 23, delete from the webpage from which body text is to be extracted the code segments irrelevant to the body text;
Step 24, after the normalization preprocessing has been performed, obtain each container tag pair in the preprocessed webpage;
Step 25, divide the preprocessed webpage into content blocks according to the container tag pairs obtained;
Step 26, for each content block obtained by the division, determine the link text length and non-link text length of the content block;
Step 27, for each content block obtained by the division, determine the link text density of the content block according to the determined link text length and non-link text length;
Step 28, for each content block obtained by the division, judge whether the link text density of the content block is greater than the preset first specified threshold; if so, go to step 29; if not, go to step 210;
Step 29, determine that the content block is non-body content and delete the content block from the HTML code;
Step 210, restore the escape sequences in the HTML code to the corresponding display characters;
Step 211, for each tag in the HTML code, determine the content between this tag and the next adjacent tag to be a content element;
Step 212, convert each numeric character contained in each content element into a single preset character;
Step 213, for each content element determined, judge whether the text length of the content element is greater than the third specified threshold; if so, go to step 218; if not, go to step 214;
Step 214, search the content templates in the content template library for a content template consistent with the content of the content element; if one is found, go to step 216; if none is found, go to step 215;
Step 215, store the content element in the content template library as a new content template, set the matching frequency of this content template to the initial value, and then go to step 218;
Step 216, increment the matching frequency of this content template by 1, and then go to step 217;
Step 217, judge whether the current matching frequency for the content element is greater than the preset second specified threshold; if so, go to step 29; if not, go to step 218;
Step 218, determine that the content element is body text of the webpage and add the content element to the body text result set.
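The high-frequency unit filtering part of the flow (steps 212 to 218) can be condensed into the following sketch. The function and parameter names are hypothetical, a dictionary stands in for the content template library, "0" is assumed as the preset character, and the original (un-normalized) text of a retained element is what is added to the result set, which is a design choice of this sketch rather than something the steps above specify.

```python
import re
from typing import Dict, List


def high_frequency_filter(
    elements: List[str],
    templates: Dict[str, int],
    second_threshold: int,
    third_threshold: int,
) -> List[str]:
    """Return the content elements kept as body text; `templates` is updated."""
    kept = []
    for element in elements:
        key = re.sub(r"\d", "0", element)          # step 212: normalize digits
        if len(key) > third_threshold:             # step 213: long text is body text
            kept.append(element)                   # step 218
            continue
        if key not in templates:                   # step 214: no matching template
            templates[key] = 1                     # step 215: store with initial value
            kept.append(element)                   # step 218
            continue
        templates[key] += 1                        # step 216: increment frequency
        if templates[key] <= second_threshold:     # step 217: still below threshold
            kept.append(element)                   # step 218
        # Otherwise the element is treated as redundant and dropped.
    return kept
```

Passing the same `templates` dictionary for successive webpages lets the matching frequencies accumulate, so a short line that recurs across pages is eventually filtered out while long article paragraphs are always kept.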
Correspondingly, the embodiment of the invention provides a webpage text content extraction device whose structure is shown in Fig. 3. The device comprises a content block division unit 31, a first text length determining unit 32, a first link text density determining unit 33, a first link text density judging unit 34 and a body text determining unit 35, wherein:
The content block division unit 31 is configured to divide the webpage from which body text is to be extracted into content blocks;
The first text length determining unit 32 is configured to determine, for each content block, the link text length and non-link text length of the content block;
The first link text density determining unit 33 is configured to determine the link text density of the content block according to the link text length and non-link text length determined by the first text length determining unit 32;
The first link text density judging unit 34 is configured to judge whether the link text density determined by the first link text density determining unit 33 is greater than the preset first specified threshold;
The body text determining unit 35 is configured to determine that the content block is body text of the webpage when the judgment result of the first link text density judging unit 34 is negative.
Preferably, the content block division unit 31 specifically comprises a preprocessing subunit, a tag pair obtaining subunit and a content block division subunit, wherein:
The preprocessing subunit is configured to perform normalization preprocessing on the webpage from which body text is to be extracted;
The tag pair obtaining subunit is configured to obtain each container tag pair in the webpage preprocessed by the preprocessing subunit;
The content block division subunit is configured to divide the webpage preprocessed by the preprocessing subunit into content blocks according to the container tag pairs obtained by the tag pair obtaining subunit.
Preferably, the first link text density determining unit 33 specifically comprises a ratio calculation subunit and a link text density determining subunit, wherein:
The ratio calculation subunit is configured to calculate the ratio of the link text length to the non-link text length;
The link text density determining subunit is configured to determine the link text density of the content block according to the ratio calculated by the ratio calculation subunit.
More preferably, the link text density determining subunit is specifically configured to multiply the ratio calculated by the ratio calculation subunit by a penalty factor to obtain the link text density of the content block.
More preferably, the penalty factor is the number of links contained in the content block.
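The density calculation performed by these subunits can be written as a small function. The function name, the default penalty factor of 1 (meaning no penalty) and the handling of a block with no non-link text are assumptions made for this sketch; the original text only states that the ratio is multiplied by the penalty factor.

```python
def link_text_density(link_len: int, non_link_len: int, link_count: int = 1) -> float:
    """Ratio of link text length to non-link text length, times a penalty factor."""
    if non_link_len == 0:
        # Assumed convention: a block made only of link text is maximally dense.
        return float("inf") if link_len > 0 else 0.0
    return (link_len / non_link_len) * link_count
```

For example, 50 characters of link text against 100 characters of other text gives a ratio of 0.5; if those 50 characters are spread over three links and the number of links is used as the penalty factor, the density becomes 1.5.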
Preferably, the webpage text content extraction device further comprises a first nested content block judging unit, a second text length determining unit, a second link text density determining unit, a second link text density judging unit and a content block deletion unit, wherein:
The first nested content block judging unit is configured to judge, after the body text determining unit 35 determines that the content block is body text of the webpage, whether at least one nested content block is nested in the content block;
The second text length determining unit is configured to determine, when the judgment result of the first nested content block judging unit is affirmative, the link text length and non-link text length of each nested content block;
The second link text density determining unit is configured to determine the link text density of the nested content block according to the link text length and non-link text length determined by the second text length determining unit;
The second link text density judging unit is configured to judge whether the link text density of the nested content block is greater than the preset first specified threshold;
The content block deletion unit is configured to delete the nested content block from the body text when the judgment result of the second link text density judging unit is affirmative.
Preferably, the webpage text content extraction device further comprises a second nested content block judging unit, a third text length determining unit, a third link text density determining unit, a third link text density judging unit, a non-body content determining unit and a content determining unit, wherein:
The second nested content block judging unit is configured to judge whether at least one nested content block is nested in the content block; if not, the first text length determining unit 32 performs the determination, for each content block, of the link text length and non-link text length of the content block;
The third text length determining unit is configured to determine, for each nested content block, the link text length and non-link text length of the nested content block;
The third link text density determining unit is configured to determine the link text density of the nested content block according to the link text length and non-link text length determined by the third text length determining unit;
The third link text density judging unit is configured to judge whether the link text density of the nested content block is greater than the preset first specified threshold;
The non-body content determining unit is configured to determine that the nested content block is non-body content of the webpage when the judgment result of the third link text density judging unit is affirmative;
The content determining unit is configured to take the content of the content block other than the nested content blocks determined by the non-body content determining unit to be non-body content as the content block anew.
For each webpage from which body text needs to be extracted, the technical solution of the embodiment of the invention determines whether a content block is body text of the webpage according to the proportion of the content block taken up by link text (that is, the link text density). The larger the proportion of link text, the less likely the content block is to be body text: if the link text density exceeds the first specified threshold, the content block is determined to be non-body content; otherwise it is determined to be body text. The extraction is thus carried out separately for each webpage from which body text needs to be extracted and is not affected by differences in format between webpages. Moreover, the proportion of link text in a content block reflects fairly objectively and accurately how likely the block is to be body text, and is a more accurate and reasonable criterion than other factors. For example, compared with using the number of links, this solution effectively avoids misjudging as body text a content block that contains few links but whose overall link text is long, and thereby effectively improves the accuracy of body text extraction.
Preferably, the webpage text content extraction device further comprises a content element division unit, a content matching unit, a matching frequency processing unit and a content deletion unit, wherein:
The content element division unit is configured to obtain each tag in the body text and determine the content between each two adjacent tags to be a content element;
The content matching unit is configured to match each content element against the preset content template library;
The matching frequency processing unit is configured to increment by 1, when the content matching unit matches successfully, the matching frequency of the matched content template in the content template library;
The content deletion unit is configured to judge whether the matching frequency is greater than the preset second specified threshold and, if so, to delete the content element from the body text.
More preferably, the content matching unit specifically comprises a content template search subunit and a match determining subunit, wherein:
The content template search subunit is configured to search the content templates in the content template library for a content template consistent with the content of the content element, the content templates having been obtained by previously matching the content elements of at least one webpage;
The match determining subunit is configured to determine that the match succeeds when the content template search subunit finds a content template, and to determine that the match fails when the content template search subunit finds none.
More preferably, the webpage text content extraction device further comprises a content template judging unit and a matching result confirming unit, wherein:
The content template judging unit is configured to judge whether any content template is stored in the content template library;
The matching result confirming unit is configured to confirm that the match fails when the judgment result of the content template judging unit is negative; when the judgment result of the content template judging unit is affirmative, the content template search subunit searches the content templates in the content template library for a content template consistent with the content of the content element.
Preferably; Said Web page text contents extraction device also comprises the content template storage unit, is used for content match units match when failure, with said content element as new content template; Deposit in the said content template storehouse, and the coupling frequency that it is corresponding is made as initial value.
Preferably, the webpage text content extraction device further comprises a text length judging unit and a matching unit, wherein:
The text length judging unit is configured to judge, before the content matching unit matches a content element against the preset content template library, whether the text length of the content element is greater than a preset third threshold;
The matching unit is configured so that, when the judgment of the text length judging unit is negative, the content matching unit matches each content element against the preset content template library.
Preferably, the webpage text content extraction device further comprises a character conversion unit, configured to convert each numeric character contained in the content element into a uniform preset character before the content matching unit matches the content element against the preset content template library.
Because the embodiment of the invention applies a link filtering model and a high-frequency content element filtering model in succession as a double filter, redundant information is removed to the greatest possible extent, which further improves the accuracy of webpage text content extraction.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to cover them as well.

Claims (25)

1. A method for extracting webpage text content, characterized by comprising:
dividing a webpage whose body content is to be extracted into content blocks;
performing, for each content block obtained by the division:
determining the link text length and the non-link text length of the content block; and
determining the link text density of the content block according to the determined link text length and non-link text length;
when the link text density is not greater than a preset first threshold, determining that the content block is body content of the webpage.
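The selection step of claim 1 can be illustrated with a minimal Python sketch. It is not part of the claims. The `Block` class, the function names, the handling of a block with no non-link text, and the threshold value used in the test are all assumptions made for illustration; the density is taken here as the plain ratio of link text length to non-link text length.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical record for one content block."""
    link_text_len: int      # characters that sit inside links
    non_link_text_len: int  # all remaining characters


def density(block: Block) -> float:
    # Assumption: density is the ratio of link text to non-link text.
    if block.non_link_text_len == 0:
        # Assumed edge-case handling: a block made only of links is maximally dense.
        return float("inf") if block.link_text_len > 0 else 0.0
    return block.link_text_len / block.non_link_text_len


def select_body_blocks(blocks: List[Block], first_threshold: float) -> List[Block]:
    # A block counts as body content when its density does not exceed the threshold.
    return [b for b in blocks if density(b) <= first_threshold]
```

A block with little link text passes the filter, while a navigation-style block made mostly of links is rejected.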
2. The method for extracting webpage text content as claimed in claim 1, characterized in that dividing the webpage whose body content is to be extracted into content blocks specifically comprises:
performing normalization preprocessing on the webpage whose body content is to be extracted;
obtaining each container tag pair in the preprocessed webpage;
dividing the preprocessed webpage into a plurality of content blocks according to the obtained container tag pairs.
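As an illustration only, the division by container tag pairs might be sketched with Python's standard `html.parser` module. The set of container tags, the class and function names, and the decision to count text toward every enclosing container are assumptions; the claim does not specify them. The sketch also assumes the input has already been normalized, so every container tag is properly closed.

```python
from html.parser import HTMLParser

# Assumed set of container tags; the claim does not enumerate them.
CONTAINER_TAGS = {"div", "table", "td", "ul"}


class BlockSplitter(HTMLParser):
    """Collects one record per container tag pair with its text lengths."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # containers opened but not yet closed
        self.blocks = []       # finished blocks, in closing order
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.open_blocks.append({"tag": tag, "link_len": 0, "non_link_len": 0})
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.open_blocks:
            # Assumes normalized input, so this closes the most recently opened container.
            self.blocks.append(self.open_blocks.pop())
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if not n:
            return
        key = "link_len" if self.link_depth else "non_link_len"
        # Text counts toward every enclosing container, so outer blocks include nested text.
        for block in self.open_blocks:
            block[key] += n


def split_into_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Each returned record already carries the link and non-link text lengths that the later density step needs.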
3. The method for extracting webpage text content as claimed in claim 1, characterized in that determining the link text density of the content block according to the determined link text length and non-link text length specifically comprises:
calculating the ratio of the link text length to the non-link text length;
determining the link text density of the content block according to the ratio.
4. The method for extracting webpage text content as claimed in claim 3, characterized in that determining the link text density of the content block according to the ratio specifically comprises:
multiplying the ratio by a penalty factor to obtain the link text density of the content block.
5. The method for extracting webpage text content as claimed in claim 4, characterized in that the penalty factor is the number of links contained in the content block.
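Read together, claims 3 to 5 give the density as the ratio of link text length to non-link text length, multiplied by the number of links in the block. A minimal Python sketch follows; the function name and the treatment of a zero non-link length are assumptions, since the claims do not address that case.

```python
def link_text_density(link_len: int, non_link_len: int, link_count: int) -> float:
    """Ratio of link text to non-link text, scaled by the number of links."""
    if non_link_len == 0:
        # Assumed edge case: only link text (or no text at all) in the block.
        return float("inf") if link_len > 0 else 0.0
    ratio = link_len / non_link_len
    # Penalty factor per claim 5: the number of links in the block.
    return ratio * link_count
```

With the link count as the penalty factor, two blocks with the same ratio are ranked differently when one of them spreads its link text over many short links, which is typical of navigation bars.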
6. The method for extracting webpage text content as claimed in claim 1, characterized in that, after determining that the content block is body content of the webpage, the method further comprises:
judging whether at least one nested content block is nested in the content block; if not, ending;
if so, performing, for each nested content block:
determining the link text length and the non-link text length of the nested content block; and
determining the link text density of the nested content block according to the determined link text length and non-link text length;
when the link text density of the nested content block is greater than the preset first threshold, deleting the nested content block from the body content.
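The nested-block check of claim 6 might look like the following Python sketch, offered only as an illustration. The `Block` class and its fields are hypothetical, the density is simplified to the plain ratio without a penalty factor, and only one level of nesting is examined, as in the claim.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical content block that may contain nested blocks."""
    name: str
    link_len: int
    non_link_len: int
    children: List["Block"] = field(default_factory=list)


def density(block: Block) -> float:
    # Simplified density: plain ratio, with an assumed rule for zero non-link text.
    if block.non_link_len == 0:
        return float("inf") if block.link_len > 0 else 0.0
    return block.link_len / block.non_link_len


def prune_nested(body_block: Block, first_threshold: float) -> Block:
    # Drop every directly nested block whose density exceeds the threshold.
    body_block.children = [
        child for child in body_block.children
        if density(child) <= first_threshold
    ]
    return body_block
```

This lets a block that passed the filter as a whole still shed an embedded list of related links.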
7. The method for extracting webpage text content as claimed in claim 1, characterized in that, before performing said operations for each content block obtained by the division, the method further comprises:
judging whether at least one nested content block is nested in the content block; if not, performing said operations for each content block obtained by the division;
if so, performing, for each nested content block:
determining the link text length and the non-link text length of the nested content block; and
determining the link text density of the nested content block according to the determined link text length and non-link text length;
when the link text density of the nested content block is greater than the preset first threshold, determining that the nested content block is non-body content of the webpage;
taking the content of the content block other than the nested content blocks determined to be non-body content as the content block anew.
8. The method for extracting webpage text content as claimed in claim 1, characterized in that, after determining that the content block is body content of the webpage, the method further comprises:
obtaining each tag in the body content, and treating the content between each pair of adjacent tags as a content element;
performing, for each content element:
matching the content element against a preset content template library;
if the match succeeds, incrementing by 1 the match frequency of the matching content template in the content template library;
judging whether the match frequency is greater than a preset second threshold and, if so, deleting the content element from the body content.
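The high-frequency filtering of claim 8 can be sketched in Python as follows. This is an illustrative sketch under assumptions: the content template library is modelled as a plain dictionary from template text to match frequency, tags are stripped with a simple regular expression rather than a full HTML parser, and all names are hypothetical.

```python
import re
from typing import Dict, List


def split_content_elements(body_html: str) -> List[str]:
    # Treat the text between adjacent tags as one content element.
    parts = re.split(r"<[^>]+>", body_html)
    return [p.strip() for p in parts if p.strip()]


def filter_frequent_elements(elements: List[str],
                             library: Dict[str, int],
                             second_threshold: int) -> List[str]:
    kept = []
    for element in elements:
        if element in library:
            # Successful match: raise the frequency of the matching template.
            library[element] += 1
            if library[element] > second_threshold:
                continue  # seen too often across pages, drop it from the body
        kept.append(element)
    return kept
```

Elements such as copyright lines that recur on many pages of a site accumulate a high frequency and are removed, while text unique to the page is kept.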
9. The method for extracting webpage text content as claimed in claim 8, characterized in that matching the content element against the preset content template library specifically comprises:
searching the content templates in the content template library for one that is identical to the content of the content element, the content templates having been obtained in advance by matching the content elements of at least one webpage;
if one is found, the match succeeds; otherwise the match fails.
10. The method for extracting webpage text content as claimed in claim 9, characterized in that, before searching the content templates in the content template library for one that is identical to the content of the content element, the method further comprises:
judging whether any content template is stored in the content template library;
if no content template is stored, the match fails;
if a content template is stored, performing the operation of searching the content templates in the content template library for one that is identical to the content of the content element.
11. The method for extracting webpage text content as claimed in any one of claims 8 to 10, characterized in that, if the match fails, the method further comprises:
storing the content element in the content template library as a new content template, and setting the corresponding match frequency to an initial value.
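The lookup and storage behaviour of claims 9 to 11 might be sketched as below. The dictionary-based library, the function name, and the initial frequency of 1 are assumptions for illustration; the claims only speak of "an initial value".

```python
from typing import Dict


def match_or_store(library: Dict[str, int], element: str,
                   initial_frequency: int = 1) -> bool:
    """Return True on a successful match; otherwise store the element as a new template."""
    if not library:
        # Claim 10: an empty library means the match fails without any lookup.
        matched = False
    else:
        # Claim 9: the match succeeds only if an identical template exists.
        matched = element in library
    if matched:
        library[element] += 1
    else:
        # Claim 11: a failed match adds the element as a new template.
        library[element] = initial_frequency  # assumed initial value
    return matched
```

With a dictionary the explicit emptiness check is redundant, but it is kept here to mirror the order of steps in the claims. The library therefore builds itself up as pages are processed.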
12. The method for extracting webpage text content as claimed in claim 8, characterized in that, before matching the content element against the preset content template library, the method further comprises:
judging whether the text length of the content element is greater than a preset third threshold;
if so, ending;
if not, performing the operation of matching the content element against the preset content template library.
13. The method for extracting webpage text content as claimed in claim 8, characterized in that, before matching the content element against the preset content template library, the method further comprises:
converting each numeric character contained in the content element into a uniform preset character.
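The two preparation steps of claims 12 and 13 can be combined in a short Python sketch. The placeholder character `#`, the default length limit of 30, and the function names are assumptions chosen for illustration, not values given in the claims.

```python
import re
from typing import Optional


def normalize_digits(text: str, placeholder: str = "#") -> str:
    # Replace every digit with the same placeholder so that dates and
    # counters that differ only in their numbers map to one template.
    return re.sub(r"\d", placeholder, text)


def prepare_for_matching(text: str, third_threshold: int = 30) -> Optional[str]:
    # Long elements are assumed to be genuine body text and skip matching.
    if len(text) > third_threshold:
        return None
    return normalize_digits(text)
```

A result of `None` means the element is left in the body content without consulting the template library; otherwise the normalized string is what gets matched.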
14. A device for extracting webpage text content, characterized by comprising:
a content block division unit, configured to divide a webpage whose body content is to be extracted into content blocks;
a first text length determination unit, configured to determine, for each content block, the link text length and the non-link text length of the content block;
a first link text density determination unit, configured to determine the link text density of the content block according to the link text length and non-link text length determined by the first text length determination unit;
a first link text density judging unit, configured to judge whether the link text density determined by the first link text density determination unit is greater than a preset first threshold;
a body content determination unit, configured to determine that the content block is body content of the webpage when the judgment of the first link text density judging unit is negative.
15. The device for extracting webpage text content as claimed in claim 14, characterized in that the content block division unit specifically comprises:
a preprocessing subunit, configured to perform normalization preprocessing on the webpage whose body content is to be extracted;
a tag pair obtaining subunit, configured to obtain each container tag pair in the webpage preprocessed by the preprocessing subunit;
a content block division subunit, configured to divide the webpage preprocessed by the preprocessing subunit into content blocks according to the container tag pairs obtained by the tag pair obtaining subunit.
16. The device for extracting webpage text content as claimed in claim 14, characterized in that the first link text density determination unit specifically comprises:
a ratio calculation subunit, configured to calculate the ratio of the link text length to the non-link text length;
a link text density determination subunit, configured to determine the link text density of the content block according to the ratio calculated by the ratio calculation subunit.
17. The device for extracting webpage text content as claimed in claim 16, characterized in that the link text density determination subunit is specifically configured to multiply the ratio calculated by the ratio calculation subunit by a penalty factor to obtain the link text density of the content block.
18. The device for extracting webpage text content as claimed in claim 14, characterized by further comprising:
a first nested content block judging unit, configured to judge, after the body content determination unit determines that the content block is body content of the webpage, whether at least one nested content block is nested in the content block;
a second text length determination unit, configured to determine, for each nested content block, the link text length and the non-link text length of the nested content block when the judgment of the first nested content block judging unit is affirmative;
a second link text density determination unit, configured to determine the link text density of the nested content block according to the link text length and non-link text length determined by the second text length determination unit;
a second link text density judging unit, configured to judge whether the link text density of the nested content block is greater than the preset first threshold;
a content block deletion unit, configured to delete the nested content block from the body content when the judgment of the second link text density judging unit is affirmative.
19. The device for extracting webpage text content as claimed in claim 14, characterized by further comprising:
a second nested content block judging unit, configured to judge whether at least one nested content block is nested in the content block; if the judgment is negative, the first text length determination unit determines, for each content block, the link text length and the non-link text length of the content block;
a third text length determination unit, configured to determine, for each nested content block, the link text length and the non-link text length of the nested content block;
a third link text density determination unit, configured to determine the link text density of the nested content block according to the link text length and non-link text length determined by the third text length determination unit;
a third link text density judging unit, configured to judge whether the link text density of the nested content block is greater than the preset first threshold;
a non-body content determination unit, configured to determine that the nested content block is non-body content of the webpage when the judgment of the third link text density judging unit is affirmative;
a content determination unit, configured to take the content of the content block other than the nested content blocks that the non-body content determination unit has determined to be non-body content as the content block anew.
20. The device for extracting webpage text content as claimed in claim 14, characterized by further comprising:
a content element division unit, configured to obtain each tag in the body content and to treat the content between each pair of adjacent tags as a content element;
a content matching unit, configured to match each content element against a preset content template library;
a match frequency processing unit, configured to increment by 1 the match frequency of the matching content template in the content template library when the content matching unit reports a successful match;
a content deletion unit, configured to judge whether the match frequency is greater than a preset second threshold and, if so, to delete the content element from the body content.
21. The device for extracting webpage text content as claimed in claim 20, characterized in that the content matching unit specifically comprises:
a content template lookup subunit, configured to search the content templates in the content template library for one that is identical to the content of the content element, the content templates having been obtained in advance by matching the content elements of at least one webpage;
a match determination subunit, configured to determine that the match succeeds when the content template lookup subunit finds a content template, and that the match fails when it does not.
22. The device for extracting webpage text content as claimed in claim 21, characterized by further comprising:
a content template judging unit, configured to judge whether any content template is stored in the content template library;
a matching result confirmation unit, configured to confirm that the match fails when the judgment of the content template judging unit is negative; when the judgment is affirmative, the content template lookup subunit searches the content templates in the content template library for one that is identical to the content of the content element.
23. The device for extracting webpage text content as claimed in any one of claims 20 to 22, characterized by further comprising:
a content template storage unit, configured to store the content element in the content template library as a new content template when the match fails, and to set the corresponding match frequency to an initial value.
24. The device for extracting webpage text content as claimed in claim 20, characterized by further comprising:
a text length judging unit, configured to judge, before the content matching unit matches a content element against the preset content template library, whether the text length of the content element is greater than a preset third threshold;
a matching unit, configured so that, when the judgment of the text length judging unit is negative, the content matching unit matches each content element against the preset content template library.
25. The device for extracting webpage text content as claimed in claim 20, characterized by further comprising:
a character conversion unit, configured to convert each numeric character contained in the content element into a uniform preset character before the content matching unit matches the content element against the preset content template library.
CN201110147583.7A 2011-06-02 2011-06-02 Webpage text content extracting method and device Active CN102810097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110147583.7A CN102810097B (en) 2011-06-02 2011-06-02 Webpage text content extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110147583.7A CN102810097B (en) 2011-06-02 2011-06-02 Webpage text content extracting method and device

Publications (2)

Publication Number Publication Date
CN102810097A true CN102810097A (en) 2012-12-05
CN102810097B CN102810097B (en) 2016-03-02

Family

ID=47233804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110147583.7A Active CN102810097B (en) 2011-06-02 2011-06-02 Webpage text content extracting method and device

Country Status (1)

Country Link
CN (1) CN102810097B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085468A1 (en) * 2002-07-18 2006-04-20 Xerox Corporation Method for automatic wrapper repair
CN1696933A (en) * 2005-05-27 2005-11-16 清华大学 Method for automatic picking up conceptual relationship of text based on dynamic programming
CN101937438A (en) * 2009-06-30 2011-01-05 富士通株式会社 Method and device for extracting webpage content
CN101950309A (en) * 2010-10-08 2011-01-19 华中师范大学 Subject area-oriented method for recognizing new specialized vocabulary

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
黄文蓓等: "基于分块的网页正文信息提取算法研究", 《计算机应用》, vol. 27, 30 June 2007 (2007-06-30), pages 24 - 26 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064827A (en) * 2013-01-16 2013-04-24 盘古文化传播有限公司 Method and device for extracting webpage content
CN103714176A (en) * 2014-01-08 2014-04-09 同济大学 Webpage text extraction method based on maximum text density
CN103870606B (en) * 2014-04-08 2017-05-10 上海语天信息技术有限公司 Webpage information extracting system and extracting method
CN103927397B (en) * 2014-05-05 2017-02-22 湖北文理学院 Recognition method for Web page link blocks based on block tree
CN103927397A (en) * 2014-05-05 2014-07-16 湖北文理学院 Recognition method for Web page link blocks based on block tree
CN103955632A (en) * 2014-05-07 2014-07-30 百度在线网络技术(北京)有限公司 Encryption display method and device for webpage words
CN103955632B (en) * 2014-05-07 2018-03-06 百度在线网络技术(北京)有限公司 The encryption display methods and device of webpage word
CN104598577A (en) * 2015-01-14 2015-05-06 晶赞广告(上海)有限公司 Extraction method for webpage text
CN104598577B (en) * 2015-01-14 2017-09-15 晶赞广告(上海)有限公司 A kind of extracting method of Web page text
CN106407217A (en) * 2015-07-31 2017-02-15 北京国双科技有限公司 Navigation webpage identification method and device
CN106528504A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Data screening method and device for social application
CN106802899B (en) * 2015-11-26 2020-11-24 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN106802899A (en) * 2015-11-26 2017-06-06 北京搜狗科技发展有限公司 web page text extracting method and device
CN106855859A (en) * 2015-12-08 2017-06-16 北京搜狗科技发展有限公司 A kind of webpage context extraction method and device
CN106855859B (en) * 2015-12-08 2020-11-10 北京搜狗科技发展有限公司 Webpage text extraction method and device
CN105808644A (en) * 2016-02-25 2016-07-27 浪潮软件集团有限公司 Method and device for determining text node
CN107203527A (en) * 2016-03-16 2017-09-26 北大方正集团有限公司 The text extracting method and system of news web page
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN106776886A (en) * 2016-11-29 2017-05-31 中国农业银行股份有限公司 A kind of Webpage body matter abstracting method and device
CN106776886B (en) * 2016-11-29 2019-09-24 中国农业银行股份有限公司 A kind of Webpage body matter abstracting method and device
CN108628817A (en) * 2017-03-15 2018-10-09 腾讯科技(深圳)有限公司 A kind of data processing method and device
CN108628817B (en) * 2017-03-15 2022-07-26 腾讯科技(深圳)有限公司 Data processing method and device
CN107391559A (en) * 2017-06-08 2017-11-24 广东工业大学 Based on block, the universal forum text extraction algorithm of pattern-recognition and style of writing originally
CN107391559B (en) * 2017-06-08 2020-06-02 广东工业大学 General forum text extraction algorithm based on block, pattern recognition and line text
CN109033282A (en) * 2018-07-11 2018-12-18 山东邦尼信息科技有限公司 A kind of Web page text extracting method and device based on extraction template
CN110968807A (en) * 2018-09-27 2020-04-07 北京国双科技有限公司 Webpage text extraction method and device

Also Published As

Publication number Publication date
CN102810097B (en) 2016-03-02

Similar Documents

Publication Publication Date Title
CN102810097A (en) Method and device for extracting webpage text content
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
US8868621B2 (en) Data extraction from HTML documents into tables for user comparison
CN102663023B (en) Implementation method for extracting web content
CN103927397B (en) Recognition method for Web page link blocks based on block tree
CN103473338B (en) Webpage content extraction method and webpage content extraction system
US8140533B1 (en) Harvesting relational tables from lists on the web
CN103064827A (en) Method and device for extracting webpage content
CN103177120B (en) A kind of XPath query pattern tree matching method based on index
CN101127042A (en) Sensibility classification method based on language model
CN105022806B (en) The method and system of the internet web page construction movement page based on translation template
CN101114281A (en) Open type document isomorphism engines system
CN102609427A (en) Public opinion vertical search analysis system and method
CN110704570A (en) Continuous page layout document structured information extraction method
CN105389389A (en) Network public opinion transmission situation media linked analysis method
CN102073654A (en) Methods and equipment for generating and maintaining web content extraction template
CN109657114B (en) Method for extracting webpage semi-structured data
CN103778141A (en) Mixed PDF book catalogue automatic extracting algorithm
CN105718584A (en) Web page content extracting method and device
CN105740355A (en) Aggregated text density based webpage body text extraction method and apparatus
CN107145591B (en) Title-based webpage effective metadata content extraction method
Schöch et al. Smart Modelling for Literary History
CN110334188A (en) A kind of multi-document summary generation method and system
Garofalakis et al. A semi-automatic system for the consolidation of Greek legislative texts
CN109783784A (en) A kind of data processing method and form builder based on the combination of minimum list

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200424

Address after: 310012 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 102200, No. 8, No., Changsheng Road, Changping District science and Technology Park, Beijing, China. 1-5

Patentee before: AUTONAVI SOFTWARE Co.,Ltd.