CN103049536A

CN103049536A - Webpage main text content extracting method and webpage text content extracting system

Info

Publication number: CN103049536A
Application number: CN2012105701935A
Authority: CN
Inventors: 王海山
Original assignee: GUANGZHOU VOSON MARKETING CONSULTING Co Ltd
Current assignee: GUANGZHOU VOSON MARKETING CONSULTING Co Ltd
Priority date: 2012-11-01
Filing date: 2012-12-25
Publication date: 2013-04-17

Abstract

The invention provides a method and a system for extracting the main text content of a webpage. The method comprises the steps of: acquiring an HTML source file and converting it into a character stream; removing invalid tags from the character stream; converting the remaining tags into a tag tree and converting the tag tree into a tag queue; processing each tag in the tag queue to obtain a valid tag set; and converting the valid tag set into text and returning it as the main text. The method and the system are highly general, cover a wide range of websites, require little custom development and are easy to maintain. They can effectively extract the main text of a webpage in a targeted manner even when the webpage structure is complex and contains various kinds of interfering information.

Description

Method and system for extracting the main text content of a webpage

Technical field

The present invention relates to the field of Internet information processing, and in particular to a method and a system for extracting the main text content of a webpage.

Background technology

With the rapid development of the Internet, the amount of information online is growing geometrically. People need to find the information they want in this massive body of information, and may need to further process and analyze the information they obtain. However, many pages surround the original content with navigation links added for the convenience of browsing, advertising links added for commercial reasons, copyright notices, recommended links to related topics, and so on. This information is mixed into the webpage and interferes with the user's reading of the main content. How to extract the main text accurately and completely from webpages containing a large amount of noise has therefore become a research topic.

Two approaches are commonly used at present:

The first approach uses an RSS (Really Simple Syndication, also called content aggregation, a format for describing and synchronizing website content) feed file as the information source. Because RSS feed files are normally written according to the RSS standard, the required information, such as the title, publication time and body content, can be separated out with simple XML parsing. RSS readers, for example, all work this way.

The second approach directly uses the web pages of certain specific websites as the information source, and develops a dedicated parser based on the coding characteristics of those pages to obtain the required information. Most news-reading clients currently in use work this way.

However, with the first approach, many websites do not provide an RSS feed, and many of those that do include only summary content in the feed file so as not to reduce traffic to the site itself. As a result, much information is excluded from the range of available sources, and the information the user obtains may be incomplete.

The second approach entails a large amount of custom development, and its rather rigid layout-recognition requirements lead to considerable maintenance work as the layout of the target websites keeps changing. The growing workload of this custom development and maintenance means that only a limited number of mainstream websites can be covered, so much information is likewise excluded from the range of available sources.

Therefore, the problems that currently need to be solved in extracting the main content of webpages are narrow coverage and poor maintainability.

Summary of the invention

The object of the present invention is to provide a method and a system for extracting the main text content of a webpage that offer wide coverage and good maintainability.

The object of the present invention is achieved through the following technical solutions:

A method for extracting the main text content of a webpage comprises the following steps:

obtaining an HTML source file, and converting the HTML source file into a character stream;

removing invalid tags from the character stream;

converting the remaining tags into a tag tree, and converting the tag tree into a tag queue;

processing each tag in the tag queue until the queue is empty, to obtain a valid tag set;

converting the valid tag set into text, and returning the text as the main text.

A system for extracting the main text content of a webpage comprises:

an acquisition module, configured to obtain an HTML source file and convert the HTML source file into a character stream;

a filtering module, configured to remove invalid tags from the character stream;

a tag tree generation module, configured to convert the remaining tags into a tag tree and convert the tag tree into a tag queue;

a tag queue traversal module, configured to process each tag in the tag queue until the queue is empty, to obtain a valid tag set;

a main text determination module, configured to convert the valid tag set into text and return the text as the main text.

According to the above scheme of the present invention, the HTML source file is obtained and converted into a character stream; invalid tags in the character stream are removed; the remaining tags are converted into a tag tree, which is converted into a tag queue; each tag in the tag queue is processed to obtain a valid tag set; and the valid tag set is converted into text, which is returned as the main text. Because the whole process handles the HTML source file only at the level of HTML tags and does not rely on information from other levels, the scheme is highly general and has wide coverage. Even if the webpage structure is complex and contains many kinds of interfering information, the main text of the webpage can still be extracted effectively and in a targeted manner, while little custom development is needed and maintainability is good.

Description of drawings

Fig. 1 is a schematic flowchart of an embodiment of the method for extracting the main text content of a webpage according to the present invention;

Fig. 2 is a schematic structural diagram of an embodiment of the system for extracting the main text content of a webpage according to the present invention;

Fig. 3 shows an original webpage before the present invention is applied to extract its main text content;

Fig. 4 is a schematic diagram of the result of applying the present invention to extract the main text content of the webpage.

Detailed description of the embodiments

The present invention is further described below with reference to embodiments and the accompanying drawings, but implementations of the present invention are not limited thereto.

Fig. 1 is a schematic flowchart of an embodiment of the method for extracting the main text content of a webpage according to the present invention. As shown in Fig. 1, the method in this embodiment comprises the following steps:

Step S101: obtain an HTML source file and convert the HTML source file into a character stream, then proceed to step S102;

Step S102: remove invalid tags from the character stream, then proceed to step S103;

Step S103: convert the remaining tags into a tag tree and convert the tag tree into a tag queue, then proceed to step S104;

Step S104: process each tag in the tag queue to obtain a valid tag set, then proceed to step S105;

Step S105: obtain text from the valid tag set and return the text as the main text.

Accordingly, in the scheme of this embodiment, the HTML source file is obtained and converted into a character stream; invalid tags in the character stream are removed; the remaining tags are converted into a tag tree, which is converted into a tag queue; each tag in the tag queue is processed until the queue is empty to obtain a valid tag set; and the valid tag set is converted into text, which is returned as the main text. Because the present invention processes the webpage from which the main text is to be extracted at the level of HTML tags, judging the function of each tag from its name and attributes, the main text can be handled automatically. The scheme is highly general and has wide coverage; even if the webpage structure is complex and contains many kinds of interfering information, the main text of the webpage can still be extracted effectively and in a targeted manner, while little custom development is needed and maintainability is good.

Each of the above steps is described in detail below.

First, in step S101, the HTML source file may be obtained in any existing manner, which is not described in detail here. The character stream may be a UTF-8 encoded character stream: the text of the vast majority of webpages can be represented in the UTF-8 character set, and using one uniform encoding simplifies subsequent processing of the character stream. The encoding is not, however, limited to UTF-8.
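The conversion to a uniform character stream in step S101 can be sketched as follows. This is only an illustrative sketch in Python: the function name `to_character_stream` and the fallback order (declared encoding first, then UTF-8, then UTF-8 with replacement characters) are assumptions, not details given in the patent.

```python
from typing import Optional


def to_character_stream(raw: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode raw HTML bytes into one uniform Unicode character stream.

    Hypothetical helper: tries the declared encoding, then UTF-8,
    and finally falls back to UTF-8 with replacement characters.
    """
    for encoding in (declared_encoding, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding: try the next candidate
    return raw.decode("utf-8", errors="replace")


# Usage: a GBK-encoded page and a UTF-8 page both end up as the same str.
gbk_page = "正文".encode("gbk")
print(to_character_stream(gbk_page, "gbk"))
```

Once the bytes are decoded, every later step works on the same kind of string regardless of the encoding the page was served in.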

Next, in step S102, removing invalid tags may include a process of removing noise tags, which speeds up subsequent processing and thus improves the efficiency of main text extraction. This includes deleting noise such as comments, scripts and the content of the "head" tag. These noise tags are present in the HTML source file, but not only do they not help extraction of the main text content, they actually interfere with it. Examples are comment blocks in which developers annotate the page source (<!--.*?-->) and script blocks used for auxiliary functions (<(no)?script.*?</(no)?script>).
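The noise removal described above can be sketched with Python's standard `re` module. The comment and script patterns follow the two expressions quoted in the text; the `<head>` pattern, the added `\b` word boundaries, the flags and the name `strip_noise` are assumptions made for this illustration.

```python
import re

# DOTALL lets ".*?" span line breaks; IGNORECASE covers <SCRIPT>, <Head>, etc.
NOISE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),                                  # comment blocks
    re.compile(r"<(no)?script\b.*?</(no)?script>", re.DOTALL | re.IGNORECASE),  # script blocks
    re.compile(r"<head\b.*?</head\s*>", re.DOTALL | re.IGNORECASE),        # head content (assumed pattern)
]


def strip_noise(html: str) -> str:
    """Remove comment, script/noscript and head blocks from the character stream."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    return html


page = (
    "<html><head><title>t</title></head><body><!-- c -->"
    "<script>var a=1;</script><p>Hello</p><noscript>x</noscript></body></html>"
)
print(strip_noise(page))
```

Regular expressions are used here because the step operates on the raw character stream before any tree is built; a production implementation might prefer a tolerant parser for malformed markup.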

In one embodiment, removing invalid tags may further include removing first tags and second tags from the character stream. The first tags are mainly tags that fine-tune how text is displayed, for example by changing the typeface, color, font size or weight. Since their presence or absence does not change the layout of the page, these tags generally do not affect how the webpage is divided into blocks and do not help extraction of the main text content, so tags of this kind are generally removed first. The first tags generally include "A", "ABBR", "ACRONYM", "AREA", "B", "BASE", "BASEFONT", "BDO", "BIG", "BUTTON", "CAPTION", "CITE", "CODE", "DD", "DEL", "DFN", "EM", "FONT", "H1", "H2", "H3", "H4", "H5", "H6", "I", "INS", "KBD", "LABEL", "SMALL", "STRIKE", "STRONG", "SUB", "SUP", "Q", "S", "SAMP", "SPAN", "THEAD", "TFOOT", "TEXTAREA", "U", "TT", "VAR", "O:SMARTTAGTYPE". The second tags are tags that do not contribute to page layout and are subordinate to other tags. Because they generally do not occur on their own, their effect on page layout is reflected in the main tag to which they belong, so to speed up subsequent processing this kind of tag can also be deleted when the invalid tags are deleted. The second tags generally include "FRAME", "INPUT", "ISINDEX", "LEGEND", "LINK", "MAP", "META", "OPTION", "OPTGROUP", "PARAM", "TD", "TH", "TR", "TBODY", "TITLE".
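One possible reading of removing the first tags is to drop the tag markup while keeping the enclosed text, since these tags only adjust presentation. The sketch below takes that reading; it is an assumption, as are the function name `unwrap_tags` and the use of a shortened tag list.

```python
import re

# A small subset of the first tags listed above, for illustration only.
FIRST_TAGS = ("a", "b", "em", "font", "i", "span", "strong", "u")


def unwrap_tags(html: str, names) -> str:
    """Delete opening and closing markers of the named tags, keeping inner text."""
    alternation = "|".join(re.escape(name) for name in names)
    # \b stops "b" from matching "<br>" or "<body>"; [^>]* skips attributes.
    pattern = re.compile(rf"</?(?:{alternation})\b[^>]*>", re.IGNORECASE)
    return pattern.sub("", html)


snippet = '<p>Hello <b>big</b> <span class="x">world</span><br></p>'
print(unwrap_tags(snippet, FIRST_TAGS))
```

Keeping the text matters because the later steps rely on text length and punctuation counts; deleting the content of an inline tag would distort those measures.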

Then, in step S103, the remaining tags are converted into a tag tree. HTML is the HyperText Markup Language, an application of the Standard Generalized Markup Language (SGML), and parsing tools such as neko or htmlparser can easily represent HTML source code in the form of a tag tree. Because some invalid tags were deleted in step S102, it is the remaining tags that are converted into the tag tree here. The tag tree is then converted into a tag queue; the conversion may use pre-order traversal, post-order traversal, or another method.
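The tree-to-queue conversion can be sketched as below. The patent names the Java tools neko and htmlparser; this sketch substitutes Python's standard `html.parser` module, and the `Node` and `TreeBuilder` classes, the synthetic `#root` node and the short list of void tags are all assumptions introduced for illustration.

```python
from collections import deque
from html.parser import HTMLParser


class Node:
    """Minimal tag-tree node (hypothetical structure)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.text = ""
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    VOID = {"br", "hr", "img", "col"}  # tags that never have a closing tag

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in self.VOID:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def preorder_queue(root):
    """Flatten the tag tree into a queue using pre-order traversal."""
    queue, stack = deque(), [root]
    while stack:
        node = stack.pop()
        queue.append(node)
        stack.extend(reversed(node.children))  # leftmost child is visited next
    return queue


builder = TreeBuilder()
builder.feed("<html><body><div><p>one</p><p>two</p></div><ul><li>x</li></ul></body></html>")
builder.close()
print([node.tag for node in preorder_queue(builder.root)])
```

Pre-order puts each parent ahead of its children in the queue; a post-order variant would instead yield children first, which the text also allows.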

Because HTML is a formatting language, its text information has to be placed inside HTML tags, and the tags in turn specify the position, display mode and other modifications of that information. The tag tree is a tree structure built from top to bottom in which each node corresponds to one tag. Text content lying between ">" and "<" is a text node. The text node containing the most text content is the largest text node, and the text node most similar to the content extracted from the <title> tag is the title text node. The selection criterion for the largest text node may be: the text node containing the most punctuation marks is the largest text node, where the punctuation marks counted include [.,! ]. The selection criterion for the title text node may be: the text node with the longest match, starting from position 0, against the content extracted from the <title> tag is the title text node.
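The two selection criteria above can be sketched over a plain list of text-node strings. The function names, the inclusion of full-width punctuation, the stripping of surrounding whitespace and the choice to return `None` when nothing matches are assumptions made for this example.

```python
# Punctuation counted when ranking text nodes; the full-width forms are an assumption.
PUNCTUATION = set(".,!，。！")


def punctuation_count(text: str) -> int:
    return sum(ch in PUNCTUATION for ch in text)


def pick_largest(texts):
    """Largest text node: the one containing the most punctuation marks."""
    return max(texts, key=punctuation_count, default=None)


def common_prefix_len(a: str, b: str) -> int:
    """Length of the match between a and b starting from position 0."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def pick_title(texts, title: str):
    """Title text node: the one with the longest prefix match against <title>."""
    title = title.strip()
    best = max(texts, key=lambda t: common_prefix_len(t.strip(), title), default=None)
    if best is None or common_prefix_len(best.strip(), title) == 0:
        return None
    return best


texts = [
    "Home | News | Sports",
    "Storm hits coast",
    "The storm arrived at dawn. Winds rose, roads closed, and power failed!",
    "Copyright 2012.",
]
print(pick_largest(texts))
print(pick_title(texts, "Storm hits coast - Example News"))
```

Counting punctuation rather than raw length favors running prose over long navigation strings, which tend to contain few sentence marks.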

Selection logic for the tree node:

If both the largest text node and the title text node exist, first search from the bottom up for the first common parent node of the largest text node and the title text node. If this parent node is not the root node, it is the candidate node. If there is no common parent node other than the root node, examine the level-1 div or table parent node (counting from the bottom up) that contains the largest text node. If the length of the text contained in this div/table parent node exceeds a preset proportion (for example 30%) of the text length of the whole webpage, this div/table parent node is the candidate node; if it does not exceed the preset proportion, the level-2 div/table parent node (counting from the bottom up) that contains the title text node is the candidate node.

If only the title text node exists, the level-2 div/table parent node (counting from the bottom up) of the title text node is the candidate node.

If only the largest text node exists, the level-1 div/table parent node (counting from the bottom up) of the largest text node is the candidate node.
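The selection logic above can be sketched as a single function over a tree with parent pointers. Everything here is an illustrative assumption: the `Node` class, treating the node passed as `root` as "the root node", measuring text length as the sum of text in a subtree, and returning `None` when the required ancestor does not exist.

```python
class Node:
    """Minimal tree node holding a tag name and its own text (hypothetical)."""

    def __init__(self, tag, text="", parent=None):
        self.tag, self.text, self.parent, self.children = tag, text, parent, []
        if parent is not None:
            parent.children.append(self)


def ancestors(node):
    """Yield the ancestors of node from the bottom up."""
    node = node.parent
    while node is not None:
        yield node
        node = node.parent


def total_text_len(node):
    return len(node.text) + sum(total_text_len(child) for child in node.children)


def nth_block_ancestor(node, n):
    """n-th div/table ancestor counting from the bottom up, or None."""
    blocks = [a for a in ancestors(node) if a.tag in ("div", "table")]
    return blocks[n - 1] if len(blocks) >= n else None


def lowest_common_ancestor(a, b):
    seen = {id(x) for x in ancestors(a)}
    return next((x for x in ancestors(b) if id(x) in seen), None)


def select_candidate(root, largest, title, ratio=0.3):
    if largest is not None and title is not None:
        common = lowest_common_ancestor(largest, title)
        if common is not None and common is not root:
            return common
        block = nth_block_ancestor(largest, 1)
        if block is not None and total_text_len(block) > ratio * total_text_len(root):
            return block
        return nth_block_ancestor(title, 2)
    if title is not None:
        return nth_block_ancestor(title, 2)
    if largest is not None:
        return nth_block_ancestor(largest, 1)
    return None


body = Node("body")
article = Node("div", parent=body)
heading = Node("h1", "Storm hits coast", article)
paragraph = Node("p", "The storm arrived at dawn. Winds rose, and power failed!", article)
print(select_candidate(body, paragraph, heading).tag)
```

In this small example the heading and the paragraph share the `div` as their first common parent, so that `div` becomes the candidate node without the 30% check being needed.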

In step S104, the valid tag set obtained is the set of tags that are helpful for extracting the main text. Processing each tag in the tag queue to obtain the valid tag set may proceed as follows: if the tag in the tag queue is a third tag or a fourth tag, it is saved directly into the valid tag set; if it is a fifth tag, then depending on the length of its corresponding text it is either saved into the valid tag set or merged into its parent tag; if it is any tag other than a third, fourth or fifth tag, it is merged directly into its parent tag and then re-inserted into the tag queue.

The third tags are root tags that can be directly identified as a webpage block and can be added directly to the webpage block pool. They mainly include "HEAD", "SCRIPT", "STYLE", "OBJECT", "FIELDSET", "FRAMESET", "IFRAME". The fourth tags affect how the webpage is displayed and change the text layout; if an HTML subtree contains several fourth tags, the likelihood that the subtree forms a block of its own increases. They mainly include "P", "UL", "OL", "DL", "DIR", "LI", "DT", "BLOCKQUOTE", "ADDRESS", "BR", "HR", "COL", "COLGROUP", "IMG", "MENU", "SELECT". The fifth tags are tags that usually each represent a webpage block, but that need to be merged with other nodes into one block when they contain very little content, or that in special cases contain no visible characters. When such a tag is encountered, it is therefore necessary to judge whether it qualifies as a webpage block on its own, for example by checking whether its text length reaches a threshold. If the threshold is reached, the tag is considered to qualify as a separate webpage block and can be added to the webpage block pool; if not, the tag is considered not yet to qualify and needs to be merged into its parent tag. The fifth tags mainly include "DIV", "TD", "TABLE", "FORM", "FIELDSET", "CENTER", "NOFRAMES", "NOSCRIPT", "PRE", "BODY", "HTML", etc. Note that "BODY" and "HTML" are also treated as fifth tags. The reason is to prevent text inside the webpage from being lost after blocking: even if something is missed, it will at least be captured by "HTML", which acts as the final catch-all.
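A simplified variant of the queue processing is sketched below. It is not the patent's exact procedure: instead of re-inserting tags into the queue after a merge, it builds the queue in post-order so that every parent is visited after its children have been merged into it. The `Node` class, the threshold of 20 characters, placing "FIELDSET" only in the third group, and keeping a parentless node with leftover text as a catch-all are further assumptions.

```python
from collections import deque

THIRD = {"head", "script", "style", "object", "fieldset", "frameset", "iframe"}
FOURTH = {"p", "ul", "ol", "dl", "dir", "li", "dt", "blockquote", "address",
          "br", "hr", "col", "colgroup", "img", "menu", "select"}
FIFTH = {"div", "td", "table", "form", "center", "noframes", "noscript",
         "pre", "body", "html"}


class Node:
    """Minimal tree node holding a tag name and its own text (hypothetical)."""

    def __init__(self, tag, text="", parent=None):
        self.tag, self.text, self.parent, self.children = tag, text, parent, []
        if parent is not None:
            parent.children.append(self)


def post_order(node):
    for child in node.children:
        yield from post_order(child)
    yield node


def collect_valid_tags(root, threshold=20):
    """Return the nodes kept as separate webpage blocks."""
    queue = deque(post_order(root))  # children come before their parents
    valid = []
    while queue:
        node = queue.popleft()
        if node.tag in THIRD or node.tag in FOURTH:
            valid.append(node)                      # saved directly
        elif node.tag in FIFTH and len(node.text.strip()) >= threshold:
            valid.append(node)                      # long enough to stand alone
        elif node.parent is not None:
            node.parent.text += node.text           # merge into the parent tag
        elif node.text.strip():
            valid.append(node)                      # top-level catch-all
    return valid


html = Node("html")
body = Node("body", parent=html)
nav = Node("div", "Home", body)
article = Node("div", parent=body)
Node("span", "A long enough article sentence.", article)
Node("p", "Para", article)

blocks = collect_valid_tags(html)
print([(node.tag, node.text) for node in blocks])
```

In this example the short navigation `div` is merged upward until it is caught by `html`, while the article `div` absorbs the text of its `span` child and then passes the length threshold.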

Different applications place somewhat different requirements on webpage blocking. For example, data mining of news pages requires webpage blocking, but for this kind of page it is often particularly necessary to extract the publication date and time of the news item, and this content is usually a short line of text between the headline and the body, which the blocking described above cannot extract as a separate webpage block. In this case some tags can be customized according to the user's needs. Specifically, the steps may be: receiving a custom instruction, the custom instruction being an add instruction; and adding the tag corresponding to the add instruction to the valid tag set according to the add instruction. The add instruction may specify an ordinary tag, such as "TITLE", or a regular expression; every third, fourth or fifth tag whose inner text satisfies the regular expression is then extracted as a separate webpage block.
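An add instruction of the kind described above can be sketched as follows. Representing blocks as `(tag, text)` tuples, the function name `apply_add` and its two keyword parameters are assumptions made for this illustration; the patent does not specify the form of the instruction.

```python
import re


def apply_add(valid, candidates, tag_name=None, pattern=None):
    """Return a copy of valid extended with candidates matching the add instruction.

    tag_name selects blocks by tag; pattern selects blocks whose text
    matches a regular expression. Both parameters are hypothetical.
    """
    regex = re.compile(pattern) if pattern else None
    result = list(valid)
    for block in candidates:
        tag, text = block
        by_tag = tag_name is not None and tag.upper() == tag_name.upper()
        by_text = regex is not None and regex.search(text) is not None
        if (by_tag or by_text) and block not in result:
            result.append(block)
    return result


valid = [("P", "Body text")]
candidates = [("DIV", "2012-12-25 10:30"), ("DIV", "Home"), ("TITLE", "Storm hits coast")]

# Keep the short publication-time line that blocking alone would have merged away.
print(apply_add(valid, candidates, pattern=r"\d{4}-\d{2}-\d{2}"))
```

A date pattern like the one used here is the news-page case mentioned in the text: the publication time is too short to pass a length threshold but is easy to recognize by its shape.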

In practice it may also be necessary to specifically remove certain webpage blocks. Accordingly, after step S104, the method may further include the step of: receiving a custom instruction, the custom instruction being a delete instruction, and deleting the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.
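A delete instruction can be sketched in the same spirit. Here blocks are again assumed to be `(tag, text)` tuples and the instruction is assumed to name a tag; both choices, and the function name `apply_delete`, are illustrative only.

```python
def apply_delete(valid, tag_name):
    """Return the valid tag set without blocks whose tag matches tag_name."""
    target = tag_name.upper()
    return [(tag, text) for tag, text in valid if tag.upper() != target]


valid_set = [("P", "Body text"), ("DIV", "Ad banner"), ("P", "More text")]
print(apply_delete(valid_set, "div"))
```

Because the function returns a new list, the original valid tag set stays available if several delete instructions need to be tried independently.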

It should be noted that the webpage block pool refers to the HTML code blocks that can be kept without further processing. Webpage blocks in the pool may be stored in the form of QuarkElement objects; the QuarkElement class contains the DomTree structure of the original HTML subtree and other related information. During the traversal described above, even if a webpage block is contained within a higher-level webpage block in the HTML structure, this containment relationship is eliminated in QuarkElement: all webpage blocks are independent of one another and none contains another.

In accordance with the above method for extracting the main text content of a webpage, the present invention also provides a system for extracting the main text content of a webpage. A specific example of the system is described in detail below.

Fig. 2 shows the structure of an embodiment of the system for extracting the main text content of a webpage according to the present invention. Depending on different considerations, a concrete implementation of the system may include everything shown in Fig. 2 or only a part of it.

First, the description takes as an example a system for extracting the main text content of a webpage that comprises an acquisition module 201, a filtering module 202, a tag tree generation module 203, a tag queue traversal module 204 and a main text determination module 205, wherein:

Acquisition module 201 is configured to obtain an HTML source file and convert the HTML source file into a character stream. The HTML source file may be obtained in any existing manner, which is not described in detail here. The character stream may be a UTF-8 encoded character stream: the text of the vast majority of webpages can be represented in the UTF-8 character set, and using one uniform encoding simplifies subsequent character processing. The encoding is not, however, limited to UTF-8;

Filtering module 202 is configured to remove invalid tags from the character stream. Removing invalid tags may include a process of removing noise tags, which speeds up subsequent processing and thus improves the efficiency of main text extraction. This includes deleting noise such as comments, scripts and the content of the "head" tag, which are present in the HTML source file but do not help main text extraction and instead interfere with it;

Tag tree generation module 203 is configured to convert the remaining tags into a tag tree and convert the tag tree into a tag queue. The conversion of the tag tree into the tag queue may use pre-order traversal, post-order traversal, or another method;

Tag queue traversal module 204 is configured to process each tag in the tag queue to obtain a valid tag set, where the valid tag set obtained is the set of tags that are helpful for extracting the main text;

Main text determination module 205 is configured to convert the valid tag set into text and return the text as the main text.

Accordingly, in the scheme of this embodiment, acquisition module 201 obtains the HTML source file and converts it into a character stream; filtering module 202 removes the invalid tags from the character stream; tag tree generation module 203 converts the remaining tags into a tag tree and converts the tag tree into a tag queue; tag queue traversal module 204 processes each tag in the tag queue until the queue is empty to obtain a valid tag set; and main text determination module 205 converts the valid tag set into text, which is returned as the main text. Because the present invention processes the webpage from which the main text is to be extracted at the level of HTML tags, judging the function of each tag from its name and attributes, the main text can be handled automatically. The scheme is highly general and has wide coverage; even if the webpage structure is complex and contains many kinds of interfering information, the main text of the webpage can still be extracted effectively and in a targeted manner, while little custom development is needed and maintainability is good.

In one embodiment, the filtering module 202 may remove first tags and second tags from the character stream. The first tags include tags that fine-tune how text is displayed, and the second tags include tags that do not contribute to page layout and are subordinate to other tags. The first tags and second tags are as described in the method embodiment above and are not described again here.

In one embodiment, a specific mode of operation of tag queue traversal module 204 is given. Tag queue traversal module 204 may traverse each tag in the tag queue: if the tag in the tag queue is a third tag or a fourth tag, it is saved directly into the valid tag set; if it is a fifth tag, then depending on the length of its corresponding text it is either saved into the valid tag set or merged into its parent tag; if it is any tag other than a third, fourth or fifth tag, it is merged directly into its parent tag and then re-inserted into the tag queue. The third, fourth and fifth tags are as described in the method embodiment above and are not described again here.

Different applications place somewhat different requirements on webpage blocking. For example, data mining of news pages requires webpage blocking, but for this kind of page it is often particularly necessary to extract the publication date and time of the news item, and this content is usually a short line of text between the headline and the body, which the blocking described above cannot extract as a separate webpage block. In this case some tags can be customized according to the user's needs. To this end, in one embodiment, the system for extracting the main text content of a webpage may further comprise a customization module 206 connected between the tag queue traversal module and the main text determination module. The customization module 206 comprises an adding unit 2061, configured to receive an add instruction and to add the tag corresponding to the add instruction to the valid tag set according to the add instruction.

In practice it may also be necessary to specifically remove certain webpage blocks. To this end, in one embodiment, the system for extracting the main text content of a webpage may also comprise a customization module 206 connected between the tag queue traversal module and the main text determination module. The customization module 206 comprises a deleting unit 2062, configured to receive a delete instruction and to delete the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.

The present invention was applied to extract the main text from the original webpage shown in Fig. 3, and the extraction result is shown in Fig. 4. As can be seen from Fig. 4, after processing, the page's guide section, navigation bar, advertising column and recommended information have all been filtered out, while the text information including the title, subtitle, author and news content has been retained in full. The present invention achieves a good extraction result, and at the same time its extraction efficiency is significantly higher than that of traditional approaches.

The embodiments described above represent only several implementations of the present invention. Their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the claims. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims

1. A method for extracting the main text content of a webpage, characterized in that it comprises the following steps:

removing invalid tags from the character stream;

processing each tag in the tag queue to obtain a valid tag set;

converting the valid tag set into text, and returning the text as the main text.

2. The method for extracting the main text content of a webpage according to claim 1, characterized in that removing invalid tags from the character stream comprises the step of:

removing first tags and second tags from the character stream, the first tags including tags that fine-tune how text is displayed, and the second tags including tags that do not contribute to page layout and are subordinate to other tags.

3. The method for extracting the main text content of a webpage according to claim 1, characterized in that processing each tag in the tag queue to obtain a valid tag set comprises the step of:

traversing each tag in the tag queue; if the tag in the tag queue is a third tag or a fourth tag, saving the third tag or fourth tag directly into the valid tag set; if the tag in the tag queue is a fifth tag, saving the fifth tag into the valid tag set or merging it into its parent tag depending on the length of the text corresponding to the fifth tag; if the tag in the tag queue is any tag other than a third tag, a fourth tag or a fifth tag, merging it directly into its parent tag and then re-inserting it into the tag queue.

4. The method for extracting the main text content of a webpage according to claim 3, characterized in that processing each tag in the tag queue to obtain a valid tag set further comprises the steps of:

receiving a custom instruction, the custom instruction being an add instruction;

adding the tag corresponding to the add instruction to the valid tag set according to the add instruction.

5. The method for extracting the main text content of a webpage according to claim 1, characterized in that, after processing each tag in the tag queue to obtain the valid tag set and before converting the valid tag set into text and returning the text as the main text, the method further comprises the steps of:

receiving a custom instruction, the custom instruction being a delete instruction;

deleting the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.

6. A system for extracting the main text content of a webpage, characterized in that it comprises:

a tag queue traversal module, configured to process each tag in the tag queue to obtain a valid tag set;

a main text determination module, configured to convert the valid tag set into text and return the text as the main text.

7. The system for extracting the main text content of a webpage according to claim 6, characterized in that the filtering module removes first tags and second tags from the character stream, the first tags including tags that fine-tune how text is displayed, and the second tags including tags that do not contribute to page layout and are subordinate to other tags.

8. The system for extracting the main text content of a webpage according to claim 6, characterized in that the tag queue traversal module traverses each tag in the tag queue; if the tag in the tag queue is a third tag or a fourth tag, the third tag or fourth tag is saved directly into the valid tag set; if the tag in the tag queue is a fifth tag, the fifth tag is saved into the valid tag set or merged into its parent tag depending on the length of the text corresponding to the fifth tag; if the tag in the tag queue is any tag other than a third tag, a fourth tag or a fifth tag, it is merged directly into its parent tag and then re-inserted into the tag queue.

9. The system for extracting the main text content of a webpage according to claim 6, characterized in that it further comprises a customization module connected between the tag queue traversal module and the main text determination module, the customization module comprising:

an adding unit, configured to receive an add instruction and to add the tag corresponding to the add instruction to the valid tag set according to the add instruction.

10. The system for extracting the main text content of a webpage according to claim 6 or 9, characterized in that it further comprises a customization module connected between the tag queue traversal module and the main text determination module, the customization module comprising:

a deleting unit, configured to receive a delete instruction and to delete the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.