CN103049536A - Webpage main text content extracting method and webpage text content extracting system - Google Patents

Webpage main text content extracting method and webpage text content extracting system

Info

Publication number
CN103049536A
Authority
CN
China
Prior art keywords
label
text
effective
web page
tally set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105701935A
Other languages
Chinese (zh)
Inventor
王海山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGZHOU VOSON MARKETING CONSULTING Co Ltd
Original Assignee
GUANGZHOU VOSON MARKETING CONSULTING Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGZHOU VOSON MARKETING CONSULTING Co Ltd
Priority to CN2012105701935A
Publication of CN103049536A
Legal status: Pending

Abstract

The invention provides a method and a system for extracting the main text content of a web page. The method includes the steps of: acquiring an HTML source file and converting it into a character stream; removing invalid tags from the character stream; converting the remaining tags into a tag tree and converting the tag tree into a tag queue; processing each tag in the tag queue to obtain a valid tag set; and converting the valid tag set into text and returning that text as the main text. The method and the system are highly general, cover a wide range of web pages, require little customized development and are easy to maintain. They can extract the main text of a web page effectively and in a well-targeted manner even if the page structure is complex and contains various kinds of interfering information.

Description

Method and system for extracting the main text content of a web page
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and a system for extracting the main text content of a web page.
Background
With the rapid development of the Internet, the amount of information on it is growing geometrically. People need to find the information they want in a massive body of information, and may need to further process and analyze what they obtain. However, much original content is accompanied by navigation links added for the convenience of browsing, advertisement links added for commercial interest, copyright information, and recommended links to related topics. This information is mixed into the web page and interferes with the user's reading of the subject content. How to extract the main text accurately and completely from a web page containing a large amount of noise has therefore become a research topic.
Two schemes are commonly used at present:
The first is to use an RSS feed file as the information source (RSS, Really Simple Syndication, is a format for describing and synchronizing website content). Because RSS feed files are normally written according to the RSS standard, the required information, such as the title, publication time and body content, can be separated out by simple XML parsing. RSS readers, for example, all work this way.
The second is to use the web pages of certain specific websites directly as the information source, and to develop a dedicated parser based on the coding characteristics of those pages in order to obtain the required information. Most news-reading clients in use today adopt this approach.
However, the first approach is limited because many websites do not provide an RSS feed, and many of those that do include only a summary in the feed file so as not to reduce traffic to the site itself. As a result, a great deal of information is excluded from the selectable range, and the information the user obtains may be incomplete.
The second approach entails a large amount of customized development. Its comparatively rigid reliance on recognizing page layout also brings a large amount of maintenance work, because the layout of the target websites keeps changing. The workload of this customized development and maintenance means that only a limited number of mainstream websites can be covered, so much information is likewise excluded from the selectable range.
Therefore, the problems to be solved in extracting the main content of web pages are narrow coverage and poor maintainability.
Summary of the invention
The object of the present invention is to provide a method and a system for extracting the main text content of a web page that offer wide coverage and good maintainability.
The object of the present invention is achieved through the following technical solutions:
A method for extracting the main text content of a web page comprises the following steps:
Obtaining an HTML source file, and converting the HTML source file into a character stream;
Removing the invalid tags from the character stream;
Converting the remaining tags into a tag tree, and converting the tag tree into a tag queue;
Processing each tag in the tag queue until the queue is empty, to obtain a valid tag set;
Converting the valid tag set into text, and returning the text as the main text.
A system for extracting the main text content of a web page comprises:
An acquisition module, used for obtaining an HTML source file and converting the HTML source file into a character stream;
A filtering module, used for removing the invalid tags from the character stream;
A tag tree generation module, used for converting the remaining tags into a tag tree and converting the tag tree into a tag queue;
A tag queue traversal module, used for processing each tag in the tag queue until the queue is empty, to obtain a valid tag set;
A main text determination module, used for converting the valid tag set into text and returning the text as the main text.
According to the above scheme of the invention, an HTML source file is obtained and converted into a character stream, the invalid tags in the character stream are removed, the remaining tags are converted into a tag tree, the tag tree is converted into a tag queue, each tag in the tag queue is processed to obtain a valid tag set, and the valid tag set is converted into text that is returned as the main text. Because the whole process handles the HTML source file purely at the level of HTML tags and does not rely on information from other levels, the scheme is highly general and has wide coverage. Even if the page structure is complex and contains various kinds of interfering information, the main text of the web page can still be extracted effectively and in a well-targeted manner. At the same time, little customized development is needed and maintainability is good.
Description of drawings
Fig. 1 is a schematic flow chart of an embodiment of the method of the present invention for extracting the main text content of a web page;
Fig. 2 is a schematic structural diagram of an embodiment of the system of the present invention for extracting the main text content of a web page;
Fig. 3 shows an original web page before the present invention is applied to extract its main text content;
Fig. 4 is a schematic diagram of the result of applying the present invention to extract the main text content of the web page.
Detailed description of the embodiments
The present invention is further described below with reference to embodiments and the accompanying drawings, but implementation of the present invention is not limited to these.
Fig. 1 is a schematic flow chart of an embodiment of the method of the present invention for extracting the main text content of a web page. As shown in Fig. 1, the method in this embodiment comprises the following steps:
Step S101: obtain an HTML source file and convert the HTML source file into a character stream, then go to step S102;
Step S102: remove the invalid tags from the character stream, then go to step S103;
Step S103: convert the remaining tags into a tag tree and convert the tag tree into a tag queue, then go to step S104;
Step S104: process each tag in the tag queue to obtain a valid tag set, then go to step S105;
Step S105: obtain text from the valid tag set and return it as the main text.
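Steps S101 to S105 form a linear pipeline in which each stage consumes the output of the previous one. The following minimal sketch only illustrates that data flow; the function name `extract_main_text` and the five stage parameters are hypothetical names introduced here, and the stages are passed in as plain callables rather than being the patent's actual modules.

```python
from typing import Callable, List


def extract_main_text(
    raw: bytes,
    decode: Callable[[bytes], str],                 # S101: bytes -> character stream
    strip_invalid: Callable[[str], str],            # S102: remove invalid tags
    build_queue: Callable[[str], List[str]],        # S103: tag tree -> tag queue
    select_valid: Callable[[List[str]], List[str]], # S104: queue -> valid tag set
    render: Callable[[List[str]], str],             # S105: valid tag set -> text
) -> str:
    """Run the five stages in order; each stage sees only the previous stage's output."""
    stream = decode(raw)
    cleaned = strip_invalid(stream)
    queue = build_queue(cleaned)
    valid = select_valid(queue)
    return render(valid)
```

Passing the stages as parameters keeps the sketch independent of any particular parser, so each stage can be replaced or tested on its own.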
Accordingly, in the scheme of this embodiment, an HTML source file is obtained and converted into a character stream, the invalid tags in the character stream are removed, the remaining tags are converted into a tag tree, the tag tree is converted into a tag queue, each tag in the tag queue is processed until the queue is empty to obtain a valid tag set, and the valid tag set is converted into text that is returned as the main text. Because the present invention processes the web page at the level of HTML tags, judging the function of each tag from its tag name and tag attributes, the main text can be handled automatically. The scheme is highly general and has wide coverage. Even if the page structure is complex and contains various kinds of interfering information, the main text of the web page can still be extracted effectively and in a well-targeted manner. At the same time, little customized development is needed and maintainability is good.
Each of the above steps is described in detail below.
First, in step S101, the HTML source file can be obtained in any existing manner, which is not described again here. The character stream may be a UTF-8 encoded character stream: the UTF-8 character set can represent the text of the vast majority of web pages, and a uniform encoding simplifies the subsequent character stream processing. The method is not, however, limited to the UTF-8 encoding.
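The conversion in step S101 can be sketched as follows. This is a minimal sketch, not the patent's implementation: the function name `to_character_stream` is hypothetical, and the charset detection is a simple assumption that looks for a `charset=` declaration near the start of the file and otherwise falls back to UTF-8.

```python
import re


def to_character_stream(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes into a Python str (the 'character stream')."""
    # Assumption: the charset is declared in the first 2 KB, e.g. <meta charset="gbk">.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r"charset\s*=\s*[\"']?([\w-]+)", head, flags=re.IGNORECASE)
    encoding = match.group(1) if match else fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # Unknown charset label: fall back instead of failing.
        return raw.decode(fallback, errors="replace")
```

Once decoded, the string can be re-encoded with `.encode("utf-8")` if a UTF-8 byte stream is needed by later stages.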
Then, in step S102, removing invalid tags may include a noise removal process, which speeds up subsequent processing and thus improves the efficiency of main text extraction. This includes deleting noise such as comments, scripts and the content of the "head" tag. Such noise blocks are present in the HTML source file but, far from helping to extract the main text, interfere with the extraction. Examples are comment blocks in which the page developer annotates the page source (<!--.*?-->) and script blocks used for auxiliary functions (<(no)?script.*?</(no)?script>).
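The noise removal described above can be sketched with regular expressions modeled on the two patterns quoted in the text. The name `strip_noise`, the pattern for the "head" block and the choice of flags are assumptions made for this sketch. Regular expressions are not a robust HTML parser; nested or malformed markup may need a real parser.

```python
import re

# Patterns modeled on the ones quoted in the text; DOTALL lets "." span newlines.
NOISE_PATTERNS = [
    re.compile(r"<!--.*?-->", re.DOTALL),                                    # comments
    re.compile(r"<(?:no)?script\b.*?</(?:no)?script\s*>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<head\b.*?</head\s*>", re.DOTALL | re.IGNORECASE),          # head content
]


def strip_noise(html: str) -> str:
    """Remove comment, script/noscript and head blocks from the character stream."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub("", html)
    return html
```

If the content of the <title> tag is needed later, for example to locate a title text node, it would have to be captured before the "head" block is removed.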
In one embodiment, removing invalid tags may also include removing first tags and second tags from the character stream. The first tags are mainly tags that make fine adjustments to how text is displayed, such as changing the typeface, color, font size or weight. Because their presence does not change the layout of the page, these tags generally do not affect page segmentation and do not help in extracting the main text, so they are generally removed first. The first tags generally include "A", "ABBR", "ACRONYM", "AREA", "B", "BASE", "BASEFONT", "BDO", "BIG", "BUTTON", "CAPTION", "CITE", "CODE", "DD", "DEL", "DFN", "EM", "FONT", "H1", "H2", "H3", "H4", "H5", "H6", "I", "INS", "KBD", "LABEL", "SMALL", "STRIKE", "STRONG", "SUB", "SUP", "Q", "S", "SAMP", "SPAN", "THEAD", "TFOOT", "TEXTAREA", "U", "TT", "VAR" and "O:SMARTTAGTYPE". The second tags are tags that do not help page layout and are subordinate to other tags. Because they generally do not appear on their own, their effect on page layout is reflected in the main tag to which they belong, so they can also be deleted during the removal of invalid tags in order to speed up subsequent processing. The second tags generally include "FRAME", "INPUT", "ISINDEX", "LEGEND", "LINK", "MAP", "META", "OPTION", "OPTGROUP", "PARAM", "TD", "TH", "TR", "TBODY" and "TITLE".
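One possible reading of "removing" the first and second tags is to delete only the tag markup and keep the enclosed text, so that text inside a tag such as "B" or "FONT" is not lost. The sketch below follows that assumption. The name `strip_tag_markup` is hypothetical, and the two sets are abbreviated subsets of the lists above, chosen for illustration.

```python
import re
from typing import Iterable

# Abbreviated, illustrative subsets of the first and second tag lists.
FIRST_TAGS = {"a", "b", "em", "font", "i", "span", "strong", "u"}
SECOND_TAGS = {"input", "link", "meta", "option", "param"}


def strip_tag_markup(html: str, names: Iterable[str]) -> str:
    """Delete opening and closing markup for the named tags, keeping inner text."""
    alternation = "|".join(sorted(map(re.escape, names), key=len, reverse=True))
    # \b stops "b" from matching <br> or <body>; [^>]* skips the attributes.
    pattern = re.compile(rf"</?(?:{alternation})\b[^>]*>", re.IGNORECASE)
    return pattern.sub("", html)
```

The pattern assumes attribute values do not contain a literal ">"; pages that violate this would need a real tokenizer.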
Next, in step S103, the remaining tags are converted into a tag tree. HTML (HyperText Markup Language) is an application of the Standard Generalized Markup Language, and parsing tools such as neko or htmlparser can conveniently represent HTML source code as a tag tree. Because some invalid tags were deleted in step S102, it is the remaining tags that are converted into the tag tree, and the tag tree is then converted into a tag queue. The conversion may use preorder traversal, postorder traversal or another method.
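Step S103 can be sketched with Python's standard `html.parser` module standing in for the neko or htmlparser tools named above. The `Node` and `TreeBuilder` classes, the `to_queue` function and the abbreviated `VOID_TAGS` set are all names assumed for this sketch. The queue here is built by preorder traversal, one of the options the text allows.

```python
from collections import deque
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Deque, List, Optional

VOID_TAGS = {"br", "hr", "img", "col", "input", "meta", "link"}  # never have children


@dataclass(eq=False)
class Node:
    tag: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)
    text: str = ""  # text found directly inside this tag


class TreeBuilder(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.text += text


def to_queue(root: Node) -> Deque[Node]:
    """Flatten the tag tree into a queue in preorder (parent before children)."""
    queue: Deque[Node] = deque()
    stack = [root]
    while stack:
        node = stack.pop()
        queue.append(node)
        stack.extend(reversed(node.children))  # keep document order
    return queue
```

A postorder queue, in which children come before their parent, could be produced from the same tree if later processing needs child text to reach the parent first.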
Because HTML is a formatting language, its text must be placed inside HTML tags, and the tags in turn specify properties such as the position and display mode of that text. The tag tree is a tree structure formed from the top down, in which each node corresponds to one tag. The text content between a ">" and the following "<" is a text node. The text node containing the most text content is the largest text node, and the text node most similar to the content extracted from the <title> tag is the title text node. The selection criterion for the largest text node may be that the text node containing the most punctuation marks is the largest text node, where the punctuation marks counted include the period, the comma and the exclamation mark. The selection criterion for the title text node may be that the text node with the longest match against the content extracted from the <title> tag, matching from position 0, is the title text node.
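The two selection criteria can be sketched on plain strings. The function names and the exact `PUNCTUATION` set are assumptions for this sketch; in particular, the text does not give the full set of punctuation marks that are counted, so the set below is only illustrative.

```python
from typing import List, Optional

# Illustrative set; the exact marks counted are an assumption.
PUNCTUATION = set(".,!?。，！？")


def punctuation_count(text: str) -> int:
    return sum(ch in PUNCTUATION for ch in text)


def largest_text_node(texts: List[str]) -> Optional[str]:
    """The text node with the most punctuation marks."""
    return max(texts, key=punctuation_count) if texts else None


def common_prefix_length(a: str, b: str) -> int:
    """Length of the match between a and b starting from position 0."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def title_text_node(texts: List[str], title: str) -> Optional[str]:
    """The text node with the longest match against the <title> content."""
    if not texts:
        return None
    best = max(texts, key=lambda t: common_prefix_length(t, title))
    return best if common_prefix_length(best, title) > 0 else None
```

Counting punctuation rather than raw length favors running prose over long navigation lists, which tend to contain many words but few sentence marks.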
The candidate node in the tree is selected as follows:
If both the largest text node and the title text node exist, first search upward for the first common parent node of the largest text node and the title text node. If this parent node is not the root node, it is the candidate node. If there is no common parent node other than the root node, examine the first-level (counting upward) div or table parent node containing the largest text node: if the length of the text contained in this div/table parent node exceeds a preset proportion (for example 30%) of the text length of the whole web page, this div/table parent node is the candidate node; if it does not exceed the preset proportion, the second-level (counting upward) div/table parent node containing the title text node is the candidate node.
If only the title text node exists, the second-level (counting upward) div/table parent node of the title text node is the candidate node.
If only the largest text node exists, the first-level (counting upward) div/table parent node of the largest text node is the candidate node.
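The selection logic above can be sketched on a small tree structure. The `Node` class and the helper names are hypothetical and exist only for this sketch; the 30% ratio is the example value given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    tag: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def text_length(node: Node) -> int:
    """Length of all text contained in a node and its descendants."""
    return len(node.text) + sum(text_length(c) for c in node.children)


def ancestors(node: Node) -> List[Node]:
    """Ancestors from nearest to farthest (bottom up)."""
    out = []
    while node.parent is not None:
        node = node.parent
        out.append(node)
    return out


def nth_block_ancestor(node: Node, n: int) -> Optional[Node]:
    """The n-th div/table ancestor counting upward, or None if there are fewer."""
    blocks = [a for a in ancestors(node) if a.tag in ("div", "table")]
    return blocks[n - 1] if len(blocks) >= n else None


def select_candidate(root, largest, title, ratio: float = 0.3) -> Optional[Node]:
    if largest is not None and title is not None:
        title_chain = ancestors(title)
        common = next((a for a in ancestors(largest) if a in title_chain), root)
        if common is not root:
            return common
        block = nth_block_ancestor(largest, 1)
        if block is not None and text_length(block) > ratio * text_length(root):
            return block
        return nth_block_ancestor(title, 2)
    if title is not None:
        return nth_block_ancestor(title, 2)
    if largest is not None:
        return nth_block_ancestor(largest, 1)
    return None
```

The function returns None when the required div/table ancestor does not exist; how that case should be handled is not stated in the text.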
In step S104, the valid tag set obtained is the set of tags that help in extracting the main text of the web page. Processing each tag in the tag queue to obtain the valid tag set may proceed as follows. If a tag in the tag queue is a third tag or a fourth tag, it is saved directly to the valid tag set. If a tag in the tag queue is a fifth tag, then, depending on the length of its corresponding text, it is either saved to the valid tag set or merged into its parent tag. If a tag in the tag queue is any tag other than a third, fourth or fifth tag, it is merged directly into its parent tag and then reinserted into the tag queue.
Here, the third tags are root tags that can be identified directly as a web page block and can be added to the web page block pool directly. They mainly include "HEAD", "SCRIPT", "STYLE", "OBJECT", "FIELDSET", "FRAMESET" and "IFRAME". The fourth tags affect the display of the web page and change the layout of the text; if an HTML subtree contains several fourth tags, the likelihood that this subtree forms a block of its own increases. They mainly include "P", "UL", "OL", "DL", "DIR", "LI", "DT", "BLOCKQUOTE", "ADDRESS", "BR", "HR", "COL", "COLGROUP", "IMG", "MENU" and "SELECT". The fifth tags usually each represent a web page block, but when one contains too little content it needs to be merged with other nodes into a single web page block, and in special cases it contains no visible characters at all. When such a tag is encountered, it is therefore necessary to judge whether it qualifies as a web page block on its own, for example by checking whether its text length reaches a threshold. If the threshold is reached, the tag is considered to qualify as a separate web page block and can be added to the web page block pool; if not, it is merged into its parent tag. The fifth tags mainly include "DIV", "TD", "TABLE", "FORM", "FIELDSET", "CENTER", "NOFRAMES", "NOSCRIPT", "PRE", "BODY" and "HTML". Note that "BODY" and "HTML" are also treated as fifth tags. This prevents text inside the page from being lost after segmentation: even if something is missed elsewhere, it will at least be captured by the "HTML" tag, which acts as a final catch-all.
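The classification rule can be sketched as follows. This is a simplified sketch under stated assumptions, not the patent's procedure: the `Tag` class, the function names and the `min_length` threshold of 20 characters are hypothetical; the queue is assumed to be in postorder so that merged text reaches a parent before the parent is examined, which lets the sketch omit the reinsertion step; and "FIELDSET", which appears in both the third and fifth lists, is treated as a third tag here.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional

THIRD_TAGS = {"head", "script", "style", "object", "fieldset", "frameset", "iframe"}
FOURTH_TAGS = {"p", "ul", "ol", "dl", "dir", "li", "dt", "blockquote", "address",
               "br", "hr", "col", "colgroup", "img", "menu", "select"}
FIFTH_TAGS = {"div", "td", "table", "form", "center", "noframes", "noscript",
              "pre", "body", "html"}


@dataclass(eq=False)
class Tag:
    name: str
    text: str = ""
    parent: Optional["Tag"] = field(default=None, repr=False)


def merge_into_parent(tag: Tag) -> None:
    """Append this tag's text to its parent's text, if it has a parent."""
    if tag.parent is not None and tag.text:
        tag.parent.text = (tag.parent.text + " " + tag.text).strip()


def build_valid_tag_set(queue: Deque[Tag], min_length: int = 20) -> List[Tag]:
    """Process tags until the queue is empty and return the valid tag set."""
    valid: List[Tag] = []
    while queue:
        tag = queue.popleft()
        if tag.name in THIRD_TAGS or tag.name in FOURTH_TAGS:
            valid.append(tag)            # saved directly
        elif tag.name in FIFTH_TAGS and len(tag.text) >= min_length:
            valid.append(tag)            # enough text to stand as its own block
        else:
            merge_into_parent(tag)       # short fifth tag or any other tag
    return valid
```

The threshold decides whether a container such as "DIV" stands alone as a block; a suitable value would depend on the pages being processed.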
Different applications place somewhat different requirements on page segmentation. For example, data mining on news web pages requires page segmentation, and for this kind of page it is often particularly necessary to extract the publication date and time of the news item. This content is usually a short run of text between the headline and the body, and the page segmentation described above cannot extract it as a separate web page block. In this case some tags can be customized according to the user's needs. The specific steps may be: receiving a custom instruction, and adding the tag corresponding to the custom instruction to the valid tag set according to the custom instruction. Here the custom instruction is an add instruction. The add instruction may specify an ordinary tag, such as "TITLE", or a regular expression; every third, fourth or fifth tag whose inner text satisfies the regular expression is extracted as a separate web page block.
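The regular-expression form of the add instruction can be sketched as follows. The function name `apply_add_instruction`, the representation of tags as `(tag_name, inner_text)` pairs and the date pattern in the usage note are assumptions made for this sketch.

```python
import re
from typing import List, Tuple

TagItem = Tuple[str, str]  # (tag name, inner text); a simplified stand-in for a tag


def apply_add_instruction(valid: List[TagItem],
                          candidates: List[TagItem],
                          pattern: str) -> List[TagItem]:
    """Return the valid tag set extended with candidates whose text matches pattern."""
    regex = re.compile(pattern)
    result = list(valid)  # leave the caller's list untouched
    for item in candidates:
        _, text = item
        if regex.search(text) and item not in result:
            result.append(item)
    return result
```

For a news page, a pattern such as `r"\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}"` would pick out a short line holding the publication date and time, the case described in the text above.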
In practice it may also be necessary to remove certain web page blocks. The method may therefore further include, after the above step S104, the step of: receiving a custom instruction, this custom instruction being a delete instruction, and deleting the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.
It should be noted that the web page block pool mentioned above holds the HTML code blocks that can be kept without further processing. The web page blocks in the pool may be stored as QuarkElement objects, and the QuarkElement class contains the DomTree structure of the original HTML subtree together with other related information. During the traversal described above, even if one web page block is contained in a higher-level web page block in the HTML structure, the containment relationship is eliminated in QuarkElement: all web page blocks are independent of one another and none contains another.
According to the method for the extraction Web page text content of the invention described above, the present invention also provides a kind of system that extracts the Web page text content, below is elaborated with regard to the concrete example of the system of extraction Web page text content of the present invention.
The structural representation of the system embodiment of extraction Web page text content of the present invention has been shown among Fig. 2.According to different Consideration, when the system of specific implementation extraction Web page text of the present invention content, can comprise whole shown in Fig. 2, also can only comprise wherein a part of shown in Fig. 2.
First, a system comprising an acquisition module 201, a filtering module 202, a tag tree generation module 203, a tag queue traversal module 204 and a main text determination module 205 is described as an example, in which:
The acquisition module 201 is used for obtaining an HTML source file and converting the HTML source file into a character stream. The HTML source file can be obtained in any existing manner, which is not described again here. The character stream may be a UTF-8 encoded character stream: the UTF-8 character set can represent the text of the vast majority of web pages, and a uniform encoding simplifies the subsequent character processing. The system is not, however, limited to the UTF-8 encoding;
The filtering module 202 is used for removing the invalid tags from the character stream. Removing invalid tags may include a noise removal process, which speeds up subsequent processing and thus improves the efficiency of main text extraction. This includes deleting noise such as comments, scripts and the content of the "head" tag. Such noise blocks are present in the HTML source file but, far from helping to extract the main text, interfere with the extraction;
The tag tree generation module 203 is used for converting the remaining tags into a tag tree and converting the tag tree into a tag queue. The conversion of the tag tree into the tag queue may use preorder traversal, postorder traversal or another method;
The tag queue traversal module 204 is used for processing each tag in the tag queue to obtain a valid tag set, where the valid tag set obtained is the set of tags that help in extracting the main text of the web page;
The main text determination module 205 is used for converting the valid tag set into text and returning the text as the main text.
Accordingly, in the scheme of this embodiment, the acquisition module 201 obtains an HTML source file and converts it into a character stream, the filtering module 202 removes the invalid tags from the character stream, the tag tree generation module 203 converts the remaining tags into a tag tree and converts the tag tree into a tag queue, the tag queue traversal module 204 processes each tag in the tag queue until the queue is empty to obtain a valid tag set, and the main text determination module 205 converts the valid tag set into text that is returned as the main text. Because the present invention processes the web page at the level of HTML tags, judging the function of each tag from its tag name and tag attributes, the main text can be handled automatically. The scheme is highly general and has wide coverage. Even if the page structure is complex and contains various kinds of interfering information, the main text of the web page can still be extracted effectively and in a well-targeted manner. At the same time, little customized development is needed and maintainability is good.
In one embodiment, the filtering module 202 may remove first tags and second tags from the character stream. The first tags include tags that make fine adjustments to how text is displayed, and the second tags include tags that do not help page layout and are subordinate to other tags. The first and second tags are as described in the method embodiment above and are not described again here.
In one embodiment, a specific mode of operation of the tag queue traversal module 204 is given. The tag queue traversal module 204 may traverse each tag in the tag queue. If a tag in the tag queue is a third tag or a fourth tag, it is saved directly to the valid tag set. If a tag in the tag queue is a fifth tag, then, depending on the length of its corresponding text, it is either saved to the valid tag set or merged into its parent tag. If a tag in the tag queue is any tag other than a third, fourth or fifth tag, it is merged directly into its parent tag and then reinserted into the tag queue. The third, fourth and fifth tags are as described in the method embodiment above and are not described again here.
Different applications place somewhat different requirements on page segmentation. For example, data mining on news web pages requires page segmentation, and for this kind of page it is often particularly necessary to extract the publication date and time of the news item. This content is usually a short run of text between the headline and the body, and the page segmentation described above cannot extract it as a separate web page block. In this case some tags can be customized according to the user's needs. To this end, in one embodiment, the system of the present invention may further comprise a customization module 206 connected between the tag queue traversal module and the main text determination module. The customization module 206 comprises an adding unit 2061, which is used for receiving an add instruction and adding the tag corresponding to the add instruction to the valid tag set according to the add instruction.
In practice it may also be necessary to remove certain web page blocks. To this end, in one embodiment, the system of the present invention may also comprise a customization module 206 connected between the tag queue traversal module and the main text determination module. The customization module 206 comprises a deleting unit 2062, which is used for receiving a delete instruction and deleting the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.
The present invention was applied to extract the main text from the original web page shown in Fig. 3, and the extraction result is shown in Fig. 4. As can be seen from Fig. 4, after processing, the guide section, navigation bar, advertisement column and recommended information of the web page have all been filtered out, while the text information including the title, subtitle, author and news content has been retained in full. The present invention thus achieves a good extraction result, and its extraction efficiency is significantly higher than that of traditional approaches.
The embodiments described above express only several implementations of the present invention. Their description is comparatively specific and detailed, but it should not therefore be interpreted as limiting the scope of the claims. It should be pointed out that a person of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims (10)

1. A method for extracting the main text content of a web page, characterized in that it comprises the following steps:
Obtaining an HTML source file, and converting the HTML source file into a character stream;
Removing the invalid tags from the character stream;
Converting the remaining tags into a tag tree, and converting the tag tree into a tag queue;
Processing each tag in the tag queue to obtain a valid tag set;
Converting the valid tag set into text, and returning the text as the main text.
2. The method for extracting the main text content of a web page according to claim 1, characterized in that removing the invalid tags from the character stream comprises the step of:
Removing first tags and second tags from the character stream, the first tags including tags that make fine adjustments to how text is displayed, and the second tags including tags that do not help page layout and are subordinate to other tags.
3. The method for extracting the main text content of a web page according to claim 1, characterized in that processing each tag in the tag queue to obtain a valid tag set comprises the step of:
Traversing each tag in the tag queue; if a tag in the tag queue is a third tag or a fourth tag, saving the third tag or fourth tag directly to the valid tag set; if a tag in the tag queue is a fifth tag, saving the fifth tag to the valid tag set or merging it into its parent tag according to the length of the text corresponding to the fifth tag; and if a tag in the tag queue is a tag other than the third, fourth and fifth tags, merging it directly into its parent tag and then reinserting it into the tag queue.
4. The method for extracting the main text content of a web page according to claim 3, characterized in that processing each tag in the tag queue to obtain a valid tag set further comprises the steps of:
Receiving a custom instruction, the custom instruction being an add instruction;
Adding the tag corresponding to the add instruction to the valid tag set according to the add instruction.
5. The method for extracting the main text content of a web page according to claim 1, characterized in that, after processing each tag in the tag queue to obtain a valid tag set and before converting the valid tag set into text and returning the text as the main text, the method further comprises the steps of:
Receiving a custom instruction, the custom instruction being a delete instruction;
Deleting the tag corresponding to the delete instruction from the valid tag set according to the delete instruction.
6. A system for extracting webpage main text content, characterized by comprising:
An acquisition module, configured to acquire an html source file and convert the html source file into a character stream;
A filtering module, configured to reject invalid tags in the character stream;
A tag tree generation module, configured to convert the remaining tags into a tag tree and convert the tag tree into a tag queue;
A tag queue traversal module, configured to process each tag in the tag queue to obtain a valid tag set;
A main text determination module, configured to convert the valid tag set into text and return it as the main text.
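As a non-limiting illustration, the five modules of claim 6 might be chained as in the sketch below, which uses only the Python standard library. All function and class names are hypothetical. Several simplifications are assumed: the html source is already available as a string, `script` and `style` stand in for the invalid tags, the tag tree is flattened breadth-first, and the traversal step simply keeps tags whose own text reaches an assumed `min_length` rather than implementing the full third/fourth/fifth tag rules.

```python
import re
from collections import deque
from html.parser import HTMLParser

INVALID = ("script", "style")                          # assumed invalid tags
VOID = {"br", "img", "hr", "input", "meta", "link"}    # tags without a closing tag


class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children, self.text = [], ""


def acquire(html_source):
    """Acquisition module: the source string is treated as the character stream."""
    return html_source


def filter_invalid(stream):
    """Filtering module: drop assumed-invalid elements together with their content."""
    pattern = r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(INVALID)
    return re.sub(pattern, "", stream, flags=re.S | re.I)


class TreeBuilder(HTMLParser):
    """Tag tree generation module, first half: build a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.name != tag:
            node = node.parent
        if node is not None:          # ignore stray closing tags
            self.current = node.parent

    def handle_data(self, data):
        piece = " ".join(data.split())
        if piece:
            self.current.text = (self.current.text + " " + piece).strip()


def tree_to_queue(root):
    """Tag tree generation module, second half: flatten the tree breadth-first."""
    pending, order = deque([root]), []
    while pending:
        node = pending.popleft()
        order.append(node)
        pending.extend(node.children)
    return order


def traverse(queue, min_length=20):
    """Traversal module (simplified stand-in): keep tags with enough own text."""
    return [node for node in queue if len(node.text) >= min_length]


def to_text(valid_tags):
    """Main text determination module: join the text of the valid tags."""
    return "\n".join(node.text for node in valid_tags)


def extract_main_text(html_source, min_length=20):
    builder = TreeBuilder()
    builder.feed(filter_invalid(acquire(html_source)))
    builder.close()
    return to_text(traverse(tree_to_queue(builder.root), min_length))
```

Keeping each module as a separate function mirrors the module boundaries in the claim, so any single stage can be replaced without touching the others.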
7. The system for extracting webpage main text content according to claim 6, characterized in that the filtering module rejects first tags and second tags in the character stream, the first tags comprising tags used to fine-tune how text is displayed, and the second tags comprising tags that have no effect on page layout and are attached to other tags.
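As a non-limiting illustration, the rejection of first tags and second tags in claim 7 might be sketched with regular expressions as shown below. The concrete tag names in `FIRST_TAGS` and `SECOND_TAGS` are guesses made for this sketch and are not taken from the claims. The sketch further assumes that first tags lose only their markup, so that their text stays with the enclosing tag, whereas second tags are dropped together with their content.

```python
import re

FIRST_TAGS = ("b", "i", "u", "em", "strong", "font")  # assumed text-styling tags
SECOND_TAGS = ("script", "style")                     # assumed non-layout tags


def strip_markup(stream, names):
    """Remove opening and closing markup for the given tags, keeping inner text."""
    pattern = r"</?(?:%s)\b[^>]*>" % "|".join(names)
    return re.sub(pattern, "", stream, flags=re.I)


def drop_elements(stream, names):
    """Remove the given elements together with everything between their tags."""
    pattern = r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(names)
    return re.sub(pattern, "", stream, flags=re.I | re.S)


def reject_invalid_tags(stream):
    """Apply both rejections to the character stream."""
    stream = drop_elements(stream, SECOND_TAGS)
    return strip_markup(stream, FIRST_TAGS)


cleaned = reject_invalid_tags(
    '<p>Hello <b>bold</b> and <font color="red">red</font>'
    '<script>x = 1</script></p>'
)
```

The word boundary `\b` after the tag name prevents `<b>` from also matching `<body>` or `<br>`. Regular expressions remain fragile on malformed markup, so a real implementation may prefer to filter during parsing.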
8. The system for extracting webpage main text content according to claim 6, characterized in that the tag queue traversal module traverses each tag in the tag queue: if the tag is a third tag or a fourth tag, the tag is saved directly into the valid tag set; if the tag is a fifth tag, the tag is either saved into the valid tag set or merged into its corresponding parent tag, depending on the length of the text corresponding to the fifth tag; and if the tag is any tag other than the third, fourth, and fifth tags, the tag is merged directly into its corresponding parent tag and then reinserted into the tag queue.
9. The system for extracting webpage main text content according to claim 6, characterized by further comprising a customization module connected between the tag queue traversal module and the main text determination module, the customization module comprising:
An adding unit, configured to receive an add instruction and, according to the add instruction, add the tag corresponding to the add instruction to the valid tag set.
10. The system for extracting webpage main text content according to claim 6 or 9, characterized by further comprising a customization module connected between the tag queue traversal module and the main text determination module, the customization module comprising:
A deleting unit, configured to receive a delete instruction and, according to the delete instruction, delete the tag corresponding to the delete instruction from the valid tag set.
CN2012105701935A 2012-11-01 2012-12-25 Webpage main text content extracting method and webpage text content extracting system Pending CN103049536A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105701935A CN103049536A (en) 2012-11-01 2012-12-25 Webpage main text content extracting method and webpage text content extracting system

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201210431251.6 2012-11-01
CN201210431251 2012-11-01
CN2012105701935A CN103049536A (en) 2012-11-01 2012-12-25 Webpage main text content extracting method and webpage text content extracting system

Publications (1)

Publication Number Publication Date
CN103049536A true CN103049536A (en) 2013-04-17

Family

ID=48062177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105701935A Pending CN103049536A (en) 2012-11-01 2012-12-25 Webpage main text content extracting method and webpage text content extracting system

Country Status (1)

Country Link
CN (1) CN103049536A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020124020A1 (en) * 2001-03-01 2002-09-05 International Business Machines Corporation Extracting textual equivalents of multimedia content stored in multimedia files
CN1896992A (en) * 2006-06-15 2007-01-17 Ut斯达康通讯有限公司 Method and device for analyzing XML file based on applied customization
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN102591612A (en) * 2011-12-27 2012-07-18 厦门市美亚柏科信息股份有限公司 General webpage text extraction method based on punctuation continuity and system thereof

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309961B (en) * 2013-05-30 2015-07-15 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
CN103309961A (en) * 2013-05-30 2013-09-18 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
CN104573097B (en) * 2015-01-30 2018-07-24 湖南蚁坊软件有限公司 A method of extraction Web page text
CN104573097A (en) * 2015-01-30 2015-04-29 湖南蚁坊软件有限公司 Method for extracting webpage content
CN106547806A (en) * 2015-09-23 2017-03-29 阿里巴巴集团控股有限公司 Page loading method and device
CN106202579A (en) * 2016-08-26 2016-12-07 乐视控股(北京)有限公司 Web page text extraction process method and device, server, terminal
CN106960057A (en) * 2017-04-05 2017-07-18 上海威固信息技术有限公司 A kind of method that Web page text is extracted based on information density
CN110020385A (en) * 2017-09-29 2019-07-16 甲骨文国际公司 System and method for extracting website characteristic
CN110020385B (en) * 2017-09-29 2023-09-15 甲骨文国际公司 System and method for extracting website characteristics
CN108491536A (en) * 2018-03-30 2018-09-04 北京智慧正安科技有限公司 Legal provision extracting method, device and computer readable storage medium
CN110874428A (en) * 2019-11-11 2020-03-10 汉口北进出口服务有限公司 Structured data extraction device and method for e-commerce page and readable storage medium
CN112069063A (en) * 2020-08-27 2020-12-11 苏州浪潮智能科技有限公司 Method for obtaining label ID of designated component by dojo framework and automatic testing method
CN112069063B (en) * 2020-08-27 2022-08-12 苏州浪潮智能科技有限公司 Method for obtaining label ID of designated component by dojo framework and automatic testing method

Similar Documents

Publication Publication Date Title
CN103049536A (en) Webpage main text content extracting method and webpage text content extracting system
CN102184189B (en) Webpage core block determining method based on DOM (Document Object Model) node text density
CN102253979B (en) Vision-based web page extracting method
CN101727461B (en) Method for extracting content of web page
CN101251855B (en) Equipment, system and method for cleaning internet web page
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN103166981B (en) A kind of radio web page code-transferring method and device
CN101650715B (en) Method and device for screening links on web pages
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN102163213B (en) Voice browsing method and browser
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN101609399B (en) Intelligent website development system based on modeling and method thereof
CN102270206A (en) Method and device for capturing valid web page contents
CN102306201B (en) Method and system for analyzing webpage title
CN102117289B (en) Method and device for extracting comment content from webpage
CN105677638B (en) Web information abstracting method
CN101872350A (en) Web page text extracting method and device thereof
CN106446072A (en) Webpage content processing method and apparatus
CN108733813A (en) Information extracting method, system towards BBS forum Web pages contents and medium
CN101582074A (en) Method for extracting data of DeepWeb response webpage
CN109492177A (en) A kind of web page release method based on web page semantics structure
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN105740355B (en) Webpage context extraction method and device based on aggregation text density
CN105589918B (en) A kind of method and device for extracting page info
CN103761312B (en) Information extraction system and method for multi-recording webpage

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20170201

C20 Patent right or utility model deemed to be abandoned or is abandoned