CN103020129A - Text content extraction method and text content extraction device - Google Patents

Text content extraction method and text content extraction device

Info

Publication number
CN103020129A
Authority
CN
China
Prior art keywords
module
link
text
content
modules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012104699406A
Other languages
Chinese (zh)
Other versions
CN103020129B (en)
Inventor
叶伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201210469940.6A
Publication of CN103020129A
Priority to PCT/CN2013/080666 (WO2013178193A2)
Application granted
Publication of CN103020129B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Abstract

The invention discloses a text content extraction method and a text content extraction device. The method comprises the following steps: breaking up an input HTML (Hypertext Markup Language) web page into a plurality of modules, determining the position score of each module according to its position in the web page layout, and calculating the text length of each module; extracting the link addresses contained in the modules, counting the most frequently used character content in the link addresses, marking the link addresses that contain this character content as effective links, and marking the link addresses that do not contain it as ineffective links; and determining the comprehensive score of each module according to the formula comprehensive score = position score × (text length + character length of the effective links) / character length of the ineffective links, and identifying the modules whose comprehensive scores exceed a set threshold as content modules. With the method provided by the invention, redundant information in the non-content parts of the web page can be effectively removed, and the effective content of the web page can be extracted more accurately.

Description

Text content extraction method and device
Technical field
The present invention relates to the field of communication technology, and in particular to a text content extraction method and device.
Background
With the rapid development of Internet technology, browsing web pages has gradually become the main way people obtain information, and text accounts for most of the information on the pages they encounter. Extracting the text of a page effectively is therefore important. If all of the text is extracted, it will inevitably be mixed with unnecessary content such as advertisements and navigation information, which is usually repeated in large quantities and is not what the user is interested in or needs. Large amounts of repeated and invalid information also reduce the accuracy of text clustering and text classification and increase the workload of content retrieval. Moreover, page typesetting and layout vary widely between web pages, so dividing a page by module or by position alone rarely yields accurate, effective text.
At present, text content is extracted by decomposing the input web page into a plurality of modules and deciding whether each module is a content module from its comprehensive score, calculated as: comprehensive score = position score × text length / link character length. This calculation is still not accurate enough to separate content reliably. How to provide a text extraction method that extracts text content accurately is therefore a technical problem in urgent need of a solution.
Summary of the invention
The invention provides a text content extraction method and device to solve the problem that the text content extraction methods used in the prior art cannot extract text content accurately.
To solve the above problem, the invention adopts the following technical solution:
In one aspect, the invention provides a text content extraction method, comprising:
decomposing the input Hypertext Markup Language (HTML) web page into a plurality of modules, determining the position score of each module according to its position in the page layout, and calculating the text length of each module;
extracting the link addresses contained in each module, counting the most frequently used character content in all of the link addresses other than protocol characters, marking each link address that contains the most frequently used character content as an effective link, and marking each link address that does not contain it as an invalid link;
determining the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length of the module + character length of the effective links in the module) / character length of the invalid links in the module, and judging the modules whose comprehensive score exceeds a set threshold to be content modules.
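For comparison, the prior-art score described in the background and the score used by the method can be written side by side. The symbols are notation assumed here for illustration only; they do not appear in the source.

```latex
% Assumed notation: S = comprehensive score, P = position score,
% T = text length of the module,
% E = character length of the effective links in the module,
% I = character length of the invalid links in the module,
% L = character length of all links in the module.
\[
S_{\text{prior}} = \frac{P \times T}{L}
\qquad\text{versus}\qquad
S = \frac{P \times (T + E)}{I}
\]
```

Read this way, effective links move from the denominator to the numerator, so links that point to related content raise a module's score rather than lowering it.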
Further, in the method for the invention, use Table label or Div label that the html web page of input is decomposed into a plurality of modules.
Further, in the method for the invention, can also continue to decompose and do not occur the situation that label mixes if decompose the module that obtains, then the module after decomposing be continued to decompose.
Further, in the method for the invention, when the effective link of mark and invalid link, the word length that each link of unified calculation is interior; Perhaps, when determining the integrate score of each module, calculate respectively the interior word length of each link that each module comprises.
Further, in the method for the invention, the text size that calculates each module specifically comprises: for each module, extract the html tag of module, obtain the text message that comprises in the respective modules according to described html tag, calculate the length of text information, obtain the text size of respective modules;
Further, in the method for the invention, go out the chained address of each module by the achor tag extraction.
In another aspect, the invention also provides a text content extraction device, comprising:
a web page processing unit, configured to decompose the input Hypertext Markup Language (HTML) web page into a plurality of modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;
a mark processing unit, configured to extract the link addresses contained in each module, count the most frequently used character content in all of the link addresses other than protocol characters, mark each link address that contains the most frequently used character content as an effective link, and mark each link address that does not contain it as an invalid link;
a content extraction unit, configured to determine the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length of the module + character length of the effective links in the module) / character length of the invalid links in the module, and to judge the modules whose comprehensive score exceeds a set threshold to be content modules.
Further, in the device of the invention, the web page processing unit is specifically configured to use Table tags or Div tags to decompose the input HTML web page into a plurality of modules.
Further, in the device of the invention, the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring, and if so, to decompose the module further.
Further, in the device of the invention, the mark processing unit is also configured to calculate the character length within each link in a single pass when marking the effective links and invalid links; alternatively, the content extraction unit is also configured to calculate the character length within each link contained in a module separately for each module when determining the comprehensive score of that module.
Further, in the device of the invention, the web page processing unit is specifically configured, for each module, to extract the HTML tags of the module, obtain the text information contained in the module according to the HTML tags, and calculate the length of that text information to obtain the text length of the module.
Further, in the device of the invention, the mark processing unit is specifically configured to extract the link addresses of each module by means of the anchor tag.
Compared with the prior art, the beneficial effects of the invention are as follows:
The method and device of the invention use the ratio of the sum of the plain text length and the effective link character length to the invalid link character length. They can therefore extract the content of an HTML web page more accurately and remove redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification and automatic summarization.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text content extraction method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the page layout in an embodiment of the invention;
Fig. 3 is a detailed flowchart of the text content extraction method provided by an embodiment of the invention;
Fig. 4 is a structural block diagram of a text content extraction device provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
To solve the problem that the text content extraction methods used in the prior art cannot extract text content accurately, the embodiments of the invention provide a text content extraction method and device.
As shown in Fig. 1, the text content extraction method provided by the embodiment of the invention specifically comprises:
Step S101: decompose the input HTML web page into a plurality of modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module.
In this step, Table tags or Div tags are preferably used to decompose the input HTML web page into a plurality of modules.
Further, in this step, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the module is decomposed further. Tag mixing is explained as follows. The mainstream page layout approaches currently fall into two kinds, dividing the page structure with the <Table> tag or with the <Div> tag. When page content is edited, however, these two tags may contain each other: a page laid out with <Table> may contain <Div> tags, and a page laid out with <Div> may likewise contain <Table> tags. Tag mixing can also refer to tags that control structure (such as <Table> and <h1>) being used together with tags that control display (such as <font> and <b>), which makes correction and data division difficult. Because the invention is concerned with dividing modules, tag mixing here mainly means the mixed use of the <Table> and <Div> tags.
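The recursive decomposition described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class and function names are hypothetical, only table and div tags are tracked, and "tag mixing" is interpreted as a layout tag of the other kind being nested directly inside a module, which is one possible reading of the text.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"table", "div"}

class Node:
    """One layout element (table or div); other tags are not tracked."""
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []  # nested layout nodes only
        self.text = []      # text outside any nested layout node

class LayoutTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            node = Node(tag, attrs)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Unbalanced closing tags are simply ignored in this sketch.
        if tag in LAYOUT_TAGS and len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text.append(data)

def has_mixed_tags(node):
    """Assumed reading of tag mixing: a child layout tag differs from its parent."""
    return any(child.tag != node.tag for child in node.children)

def decompose(node):
    """Split recursively; stop when nothing is nested or tags are mixed."""
    if node.tag != "root" and (not node.children or has_mixed_tags(node)):
        return [node]
    modules = []
    for child in node.children:
        modules.extend(decompose(child))
    return modules

html = """
<div id="page">
  <div id="nav"><a href="/">Home</a></div>
  <div id="main">Article text</div>
  <div id="side"><table><tr><td>Ad</td></tr></table></div>
</div>
"""
builder = LayoutTreeBuilder()
builder.feed(html)
builder.close()
modules = decompose(builder.root)
print([m.attrs.get("id") for m in modules])
```

In this example the outer div is split into its three child divs, and the third is kept whole because a table is nested inside it. A simplification of the sketch is that text sitting directly in a module that gets split further is not carried over to its children.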
Further, in this step, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text information contained in the module according to the HTML tags, and calculating the length of that text information to obtain the text length of the module.
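A minimal sketch of the text length calculation, assuming the module is available as an HTML string. The names are hypothetical, and two choices are assumptions of the sketch rather than statements in the source: script and style content is skipped, and runs of whitespace are collapsed before the length is measured.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects the visible text of a module, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def text_length(module_html):
    collector = TextCollector()
    collector.feed(module_html)
    collector.close()
    # Collapse whitespace so that markup indentation does not inflate the length.
    text = " ".join("".join(collector.parts).split())
    return len(text)

print(text_length("<div><b>Hello</b>, <font>world</font>!</div>"))
```

Measuring in characters rather than bytes keeps the result independent of the page encoding.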
Step S102: extract the link addresses contained in each module, count the most frequently used character content in all of the link addresses other than protocol characters, mark each link address that contains the most frequently used character content as an effective link, and mark each link address that does not contain it as an invalid link.
In this step, the link addresses of each module are preferably extracted by means of the anchor tag.
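Extracting link addresses through the anchor tag might look like the sketch below. The names are hypothetical, and returning the anchor text together with each href is an assumption made here so that a character length can be computed later.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects (href, anchor_text) pairs from <a href="..."> elements."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without an href are not links
                self._href = href
                self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

def extract_links(module_html):
    collector = AnchorCollector()
    collector.feed(module_html)
    collector.close()
    return collector.links

sample = ('<div><a href="http://www.example.com/a.html">First</a> text '
          '<a name="x">no href</a></div>')
print(extract_links(sample))
```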
Step S103: determine the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length of the module + character length of the effective links in the module) / character length of the invalid links in the module, and judge the modules whose comprehensive score exceeds a set threshold to be content modules.
The character length of the effective links and the character length of the invalid links used in this step are preferably calculated in a single pass when the effective links and invalid links are marked; they can, of course, also be calculated separately for each module when the comprehensive score of that module is determined.
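The score of step S103 can be sketched as a small function. The function name, the sample numbers and the threshold value are hypothetical. The source does not say what happens when a module has no invalid links, so flooring the divisor at 1 is an assumption of this sketch.

```python
def comprehensive_score(position_score, text_len, effective_link_len, invalid_link_len):
    # Assumption: avoid division by zero when a module has no invalid links.
    divisor = max(invalid_link_len, 1)
    return position_score * (text_len + effective_link_len) / divisor

# Hypothetical modules: a central article and a link-heavy sidebar.
article = comprehensive_score(1.0, 800, 40, 20)
sidebar = comprehensive_score(0.5, 30, 0, 120)

THRESHOLD = 5.0  # hypothetical value; the source only says the threshold is preset
content = [name for name, score in (("article", article), ("sidebar", sidebar))
           if score > THRESHOLD]
print(article, sidebar, content)
```

With these numbers the article scores 42.0 and the sidebar 0.125, so only the article exceeds the threshold.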
To set out the implementation of the method more clearly, the method is further described below with reference to Figs. 2 and 3, as follows:
The text content extraction method provided by the embodiment of the invention is an improved text content extraction method based on HTML features that divides text content more accurately and reasonably. In the embodiment, the page layout is divided into content modules and non-content modules. As shown in Fig. 2, a content module is the content part of the web page, while a non-content module generally displays information such as navigation, banners, copyright notices or advertisements. The goal of the scheme is to decompose the HTML web page accurately and extract the content modules from it. Each module obtained by decomposition is given a score according to its position in the page layout: a module at the focus of the user's line of sight scores higher, and other modules score lower. If the character length of a module's invalid links makes up too large a proportion of the module, the module probably shows advertisements or navigation information.
The text content extraction method based on HTML features described in the embodiment of the invention specifically comprises:
Step 1: use tags to decompose the input HTML web page into a plurality of modules.
In this step, Table or Div tags are preferably used to decompose the input HTML web page into a plurality of modules. The embodiment uses the Table tag and the Div tag, the two tags used for layout, to decompose modules partly because they are the tags used for page layout, and partly because doing so reduces the complexity of parsing the web page: other tags such as span and br are not processed, which greatly speeds up parsing and reduces the consumption of system resources.
Step 2: if a module obtained in step 1 can be decomposed further and no mixing of Table and Div tags occurs, send the module back to step 1 for further decomposition.
Step 3: assign each input module a position score according to its position in the layout. The specific score for each position in the layout is set in advance; the basic principle is that the more a module lies where the user's attention is concentrated, the higher its weight and the higher its position score.
Step 4: calculate the text length of each module.
Step 5: count the most frequently used character content in the link addresses of all modules. Link addresses that belong to content related to this web page necessarily share some character content, whereas link addresses such as advertisements do not contain that shared content. The most frequently used character content can therefore be used to distinguish effective link addresses (links to content related to the web page) from invalid link addresses (advertisements and other links unrelated to the web page content).
In this step, characters common to all URLs (uniform resource locators), such as the protocol characters www and http, are excluded from the statistics when the most frequently used character content is counted.
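One way to sketch the statistic of step 5 is shown below. The source does not specify the granularity of "character content", so treating the host name as the unit being counted is an assumption of this sketch, as are the function name and the sample URLs. The scheme and a leading "www." are dropped to mirror the exclusion of protocol characters.

```python
from collections import Counter
from urllib.parse import urlsplit

def most_frequent_host(links):
    """Return the host shared by the most link addresses, or None if there is none."""
    counts = Counter()
    for link in links:
        host = urlsplit(link).hostname or ""  # relative links have no host
        if host.startswith("www."):
            host = host[4:]                   # drop the common "www" prefix
        if host:
            counts[host] += 1
    return counts.most_common(1)[0][0] if counts else None

links = [
    "http://www.example.com/news/1.html",
    "http://www.example.com/news/2.html",
    "http://example.com/sports/3.html",
    "http://ads.adnetwork.invalid/click?id=7",
]
key = most_frequent_host(links)
print(key)
```

Here three of the four addresses share the host example.com, so that string becomes the character content used to tell effective links from invalid ones.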
Step 6: mark the link addresses that contain the character content counted in step 5 as effective links, and calculate the character length within each effective link.
Step 7: mark the link addresses that do not contain the character content counted in step 5 as invalid links, and calculate the character length within each invalid link.
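Steps 6 and 7 can be sketched together. The source does not state whether the character length of a link means the length of its anchor text or of its address; this sketch assumes the anchor text. The function name and sample data are hypothetical.

```python
def link_lengths(links, key):
    """Sum anchor-text lengths of effective and invalid links.

    links: iterable of (href, anchor_text) pairs.
    key:   the most frequently used character content from step 5.
    """
    effective_len = invalid_len = 0
    for href, text in links:
        if key and key in href:
            effective_len += len(text)   # effective link
        else:
            invalid_len += len(text)     # invalid link
    return effective_len, invalid_len

links = [
    ("http://www.example.com/news/2.html", "Related story"),
    ("http://ads.adnetwork.invalid/click?id=7", "Buy now"),
]
effective_len, invalid_len = link_lengths(links, "example.com")
print(effective_len, invalid_len)
```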
Step 8: obtain the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length in the module + character length in the effective links) / character length in the invalid links. A module whose comprehensive score is higher than the set threshold is considered a content module.
Step 9: compare the comprehensive score from step 8 with the preset threshold (the lower limit that the comprehensive score of a content module should reach). If the score is higher than the threshold, the content of the module is considered to be the text content that needs to be extracted.
Based on the above principles, a specific example is described below with reference to Fig. 3. After a web page is obtained, it is taken as the input. In step 1, if the input web page can be decomposed into a plurality of modules using Table tags and Div tags, it is decomposed. Step 2 judges whether a module obtained by decomposition can be decomposed further; if so, the flow returns to step 1, otherwise it proceeds to step 3. Step 3 removes all HTML tags from the module passed on by step 2 to obtain plain text and calculates the length of that text. Step 4 uses the anchor tag to extract all links and counts the most frequently used character content in the link addresses of all modules. Step 5 calculates the character lengths of the links that contain and that do not contain the character content counted in step 4, and marks them as effective links and invalid links respectively. Step 6 calculates the comprehensive score of each module using the formula: comprehensive score = position score × (text length + effective link character length) / invalid link character length. Modules whose comprehensive score is below the threshold are deleted in step 7, and modules whose comprehensive score is above the threshold are output in step 8.
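The flow of Fig. 3 from step 3 onward can be sketched end to end as below, starting from modules that have already been split and given position scores. Everything named here is hypothetical. The sketch also rests on several assumptions that the source leaves open: the host name stands in for the "most frequently used character content", link character length is measured on anchor text, and the divisor is floored at 1.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlsplit

class ModuleScanner(HTMLParser):
    """Collects plain text and (href, anchor_text) pairs from one module."""
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
        self._href, self._anchor = None, []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self._href, self._anchor = href, []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._anchor.append(data)

def scan(module_html):
    scanner = ModuleScanner()
    scanner.feed(module_html)
    scanner.close()
    return len(" ".join("".join(scanner.text).split())), scanner.links

def host_of(href):
    host = urlsplit(href).hostname or ""
    return host[4:] if host.startswith("www.") else host

def extract_content(modules, threshold):
    """modules: list of (position_score, module_html). Returns content module HTML."""
    scanned = [(pos, html) + scan(html) for pos, html in modules]
    hosts = Counter()
    for _, _, _, links in scanned:
        for href, _ in links:
            if host_of(href):
                hosts[host_of(href)] += 1
    key = hosts.most_common(1)[0][0] if hosts else None
    kept = []
    for pos, html, text_len, links in scanned:
        effective = sum(len(t) for href, t in links if key and key in href)
        invalid = sum(len(t) for href, t in links if not (key and key in href))
        score = pos * (text_len + effective) / max(invalid, 1)
        if score > threshold:
            kept.append(html)
    return kept

modules = [
    (0.5, '<div><a href="http://ads.adnetwork.invalid/1">Cheap flights today</a>'
          '<a href="http://ads.adnetwork.invalid/2">Win a prize now</a></div>'),
    (1.0, '<div>A long article body about the topic at hand. '
          '<a href="http://www.example.com/more">Read more</a> '
          '<a href="http://www.example.com/next">Next page</a></div>'),
    (0.3, '<div><a href="http://www.example.com/">Home</a> '
          '<a href="http://partner.invalid/x">Partner site link</a></div>'),
]
kept = extract_content(modules, threshold=5.0)
print(len(kept))
```

With the hypothetical threshold of 5.0, only the article module is kept; the advertisement block and the navigation block both score below 1.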
In summary, the method described in the embodiment of the invention uses the ratio of the sum of the plain text length and the effective link character length to the invalid link character length. It can therefore extract the content of an HTML web page more accurately and remove redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification and automatic summarization.
As shown in Fig. 4, the embodiment of the invention also provides a text content extraction device, which specifically comprises:
a web page processing unit 410, configured to decompose the input HTML web page into a plurality of modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;
a mark processing unit 420, configured to extract the link addresses contained in each module, count the most frequently used character content in all of the link addresses other than protocol characters, mark each link address that contains the most frequently used character content as an effective link, and mark each link address that does not contain it as an invalid link;
a content extraction unit 430, configured to determine the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length of the module + character length of the effective links in the module) / character length of the invalid links in the module, and to judge the modules whose comprehensive score exceeds a set threshold to be content modules.
Based on the above framework, the specific way in which each unit implements its function is given below:
In the embodiment of the invention, the web page processing unit 410 specifically uses Table tags or Div tags to decompose the input HTML web page into a plurality of modules; and, for each module, extracts the HTML tags of the module, obtains the text information contained in the module according to the HTML tags, and calculates the length of that text information to obtain the text length of the module.
Further, the web page processing unit 410 is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring, and if so, to decompose the module further.
In the embodiment of the invention, the mark processing unit 420 is also configured to calculate the character length within each link in a single pass when marking the effective links and invalid links; alternatively, the content extraction unit 430 calculates the character length within each link contained in a module separately for each module when determining the comprehensive score of that module.
Further, in the embodiment of the invention, the mark processing unit 420 is specifically configured to extract the link addresses of each module by means of the anchor tag.
In summary, the device of the invention uses the ratio of the sum of the plain text length and the effective link character length to the invalid link character length. It can therefore extract the content of an HTML web page more accurately and remove redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification and automatic summarization.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (10)

1. A text content extraction method, characterized by comprising:
decomposing the input Hypertext Markup Language (HTML) web page into a plurality of modules, determining the position score of each module according to its position in the page layout, and calculating the text length of each module;
extracting the link addresses contained in each module, counting the most frequently used character content in all of the link addresses other than protocol characters, marking each link address that contains the most frequently used character content as an effective link, and marking each link address that does not contain it as an invalid link;
determining the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length of the module + character length of the effective links in the module) / character length of the invalid links in the module, and judging the modules whose comprehensive score exceeds a set threshold to be content modules.
2. The method of claim 1, characterized in that, in the method, Table tags or Div tags are used to decompose the input HTML web page into a plurality of modules.
3. The method of claim 2, characterized in that, in the method, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the module is decomposed further.
4. The method of claim 1, characterized in that, in the method, the character length within each link is calculated in a single pass when the effective links and invalid links are marked; alternatively, the character length within each link contained in a module is calculated separately for each module when the comprehensive score of that module is determined.
5. The method of any one of claims 1 to 4, characterized in that,
in the method, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text information contained in the module according to the HTML tags, and calculating the length of that text information to obtain the text length of the module;
in the method, the link addresses of each module are extracted by means of the anchor tag.
6. A text content extraction device, characterized by comprising:
a web page processing unit, configured to decompose the input Hypertext Markup Language (HTML) web page into a plurality of modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;
a mark processing unit, configured to extract the link addresses contained in each module, count the most frequently used character content in all of the link addresses other than protocol characters, mark each link address that contains the most frequently used character content as an effective link, and mark each link address that does not contain it as an invalid link;
a content extraction unit, configured to determine the comprehensive score of each module according to comprehensive score of the module = position score of the module × (text length of the module + character length of the effective links in the module) / character length of the invalid links in the module, and to judge the modules whose comprehensive score exceeds a set threshold to be content modules.
7. The device of claim 6, characterized in that the web page processing unit is specifically configured to use Table tags or Div tags to decompose the input HTML web page into a plurality of modules.
8. The device of claim 7, characterized in that the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing occurring, and if so, to decompose the module further.
9. The device of claim 6, characterized in that,
the mark processing unit is also configured to calculate the character length within each link in a single pass when marking the effective links and invalid links;
alternatively, the content extraction unit is also configured to calculate the character length within each link contained in a module separately for each module when determining the comprehensive score of that module.
10. The device of any one of claims 6 to 9, characterized in that,
the web page processing unit is specifically configured, for each module, to extract the HTML tags of the module, obtain the text information contained in the module according to the HTML tags, and calculate the length of that text information to obtain the text length of the module;
the mark processing unit is specifically configured to extract the link addresses of each module by means of the anchor tag.
CN201210469940.6A 2012-11-20 2012-11-20 Text content extraction method and device Active CN103020129B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210469940.6A CN103020129B (en) 2012-11-20 2012-11-20 Text content extraction method and device
PCT/CN2013/080666 WO2013178193A2 (en) 2012-11-20 2013-08-01 Text content extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210469940.6A CN103020129B (en) 2012-11-20 2012-11-20 Text content extraction method and device

Publications (2)

Publication Number Publication Date
CN103020129A true CN103020129A (en) 2013-04-03
CN103020129B CN103020129B (en) 2015-11-18

Family

ID=47968733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210469940.6A Active CN103020129B (en) 2012-11-20 2012-11-20 Text content extraction method and device

Country Status (2)

Country Link
CN (1) CN103020129B (en)
WO (1) WO2013178193A2 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013178193A2 (en) * 2012-11-20 2013-12-05 中兴通讯股份有限公司 Text content extraction method and device
CN105320734A (en) * 2015-07-14 2016-02-10 中国互联网络信息中心 Web page core content extraction method
CN106326445A (en) * 2016-08-26 2017-01-11 武汉大学 Method for evaluating webpage contents based on sensing information quantity
CN106528504A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Data screening method and device for social application
CN107766419A (en) * 2017-09-08 2018-03-06 广州汪汪信息技术有限公司 TextRank document summarization method and device based on threshold denoising
CN109063996A (en) * 2018-07-23 2018-12-21 长沙知了信息科技有限公司 Information processing method and device for multi-user collaborative editing
CN110377810A (en) * 2019-06-25 2019-10-25 浙江大学 Classification method of mobile terminal web pages

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10922366B2 (en) * 2018-03-27 2021-02-16 International Business Machines Corporation Self-adaptive web crawling and text extraction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020124020A1 (en) * 2001-03-01 2002-09-05 International Business Machines Corporation Extracting textual equivalents of multimedia content stored in multimedia files
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102622382A (en) * 2011-03-14 2012-08-01 北京小米科技有限责任公司 Webpage rearranging method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702160B (en) * 2009-10-28 2013-04-17 深圳市龙视传媒有限公司 Method for acquiring internet subject information and device thereof
CN102479181B (en) * 2010-11-22 2015-10-07 中国电信股份有限公司 Web page text extraction method and device based on DIV position
CN103020129B (en) * 2012-11-20 2015-11-18 中兴通讯股份有限公司 Text content extraction method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020124020A1 (en) * 2001-03-01 2002-09-05 International Business Machines Corporation Extracting textual equivalents of multimedia content stored in multimedia files
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102622382A (en) * 2011-03-14 2012-08-01 北京小米科技有限责任公司 Webpage rearranging method
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013178193A2 (en) * 2012-11-20 2013-12-05 中兴通讯股份有限公司 Text content extraction method and device
WO2013178193A3 (en) * 2012-11-20 2014-01-23 中兴通讯股份有限公司 Text content extraction method and device
CN105320734A (en) * 2015-07-14 2016-02-10 中国互联网络信息中心 Web page core content extraction method
WO2017008448A1 (en) * 2015-07-14 2017-01-19 中国互联网络信息中心 Method for extracting core content of web page
CN105320734B (en) * 2015-07-14 2019-02-22 中国互联网络信息中心 Web page core content extraction method
CN106528504A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Data screening method and device for social application
CN106326445A (en) * 2016-08-26 2017-01-11 武汉大学 Method for evaluating webpage contents based on sensing information quantity
CN106326445B (en) * 2016-08-26 2019-09-17 武汉大学 Method for evaluating webpage contents based on sensing information quantity
CN107766419A (en) * 2017-09-08 2018-03-06 广州汪汪信息技术有限公司 TextRank document summarization method and device based on threshold denoising
CN109063996A (en) * 2018-07-23 2018-12-21 长沙知了信息科技有限公司 Information processing method and device for multi-user collaborative editing
CN110377810A (en) * 2019-06-25 2019-10-25 浙江大学 Classification method of mobile terminal web pages
CN110377810B (en) * 2019-06-25 2022-04-08 浙江大学 Classification method of mobile terminal web pages

Also Published As

Publication number Publication date
CN103020129B (en) 2015-11-18
WO2013178193A2 (en) 2013-12-05
WO2013178193A3 (en) 2014-01-23

Similar Documents

Publication Publication Date Title
CN103020129B (en) Text content extraction method and device
US8819028B2 (en) System and method for web content extraction
CN102200971B (en) Method and equipment for realizing webpage content previewing
CN101093487A (en) Method for extracting content of text based on HTML characteristics
CN102541874B (en) Webpage text content extracting method and device
EP2687997A1 (en) Method for rearranging web page
CA2755427A1 (en) Web translation with display replacement
CN104063401B (en) Method and apparatus for merging webpage pattern addresses
CN103577466A (en) Method and device for displaying webpage content in browser
CN101727461A (en) Method for extracting content of web page
US9449114B2 (en) Removing non-substantive content from a web page by removing its text-sparse nodes and removing high-frequency sentences of its text-dense nodes using sentence hash value frequency across a web page collection
CN101976260A (en) Visual label and method for generating webpage by using visual label
EP1880312A2 (en) System and method for providing data formatting
CN103166981A (en) Wireless webpage transcoding method and device
CN103870486A (en) Webpage type confirming method and device
CN107153716A (en) Webpage content extracting method and device
CN112463152A (en) Webpage adaptation method and device based on AST
CN112650905A (en) Anti-crawler method and device based on label, computer equipment and storage medium
CN106776800A (en) Page generation method, apparatus and system for the AngularJS framework
CN102768663A (en) Method and device for extracting webpage title and information processing system
CN102629252A (en) Method and device for prompting information
US8656371B2 (en) System and method of report representation
EP3602352A1 (en) Transformation of marked-up content into a file format that enables automated browser based pagination
US9773182B1 (en) Document data classification using a noise-to-content ratio
CN110633081A (en) Page generation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant