CN103020129B - Text content extraction method and device - Google Patents

Text content extraction method and device

Info

Publication number
CN103020129B
Authority
CN
China
Prior art keywords
module
link
content
text
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210469940.6A
Other languages
Chinese (zh)
Other versions
CN103020129A (en)
Inventor
叶伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201210469940.6A
Publication of CN103020129A
Priority to PCT/CN2013/080666 (WO2013178193A2)
Application granted
Publication of CN103020129B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Abstract

The invention discloses a method and a device for extracting text content. The method comprises: decomposing an input HTML web page into multiple modules, determining a position score for each module according to its position in the page layout, and calculating the text length of each module; extracting the link addresses contained in each module, identifying the character content that occurs most frequently across all link addresses, marking each link address that contains this character content as a valid link, and marking each link address that does not contain it as an invalid link; and determining a composite score for each module according to composite score = position score × (text length + text length of valid links) / text length of invalid links, and judging each module whose composite score exceeds a set threshold to be a content module. The method effectively removes redundant non-content information from a web page and extracts the page's actual content more accurately.

Description

Text content extraction method and device
Technical field
The present invention relates to the field of communication technology, and in particular to a method and a device for extracting text content.
Background
With the rapid development of Internet technology, browsing web pages has gradually become the main way people obtain information, and text makes up the majority of the page information they encounter. Extracting the text of a page effectively is therefore important. If all of the text on a page is extracted, the result inevitably includes much unnecessary content, such as advertisements and navigation information. This information is usually repeated many times and is not what the user is interested in or needs. A large amount of repeated and invalid information also reduces the accuracy of text clustering and text classification and increases the workload of content retrieval. Moreover, page typesetting and layout vary widely between web pages, so dividing a page by module or by position alone rarely yields accurate, useful text.
At present, text content is extracted by decomposing the input web page into multiple modules and calculating a composite score for each module to determine whether it is a content module. The composite score is calculated as: composite score = position score × text length / link text length. This calculation is still not accurate enough and cannot divide content precisely. How to provide a text extraction method that extracts text content accurately is therefore a technical problem that urgently needs to be solved.
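The prior-art scoring rule quoted above can be written compactly. The symbols are labels chosen here for illustration and are not notation from the patent: $P$ is the position score of a module, $T$ is its text length, and $L$ is the text length of all links in the module.

```latex
S_{\text{prior}} = P \times \frac{T}{L}
```

As a purely illustrative calculation with made-up values, $P = 2$, $T = 600$ and $L = 150$ give $S_{\text{prior}} = 2 \times 600 / 150 = 8$. Because $L$ lumps every link together, this formula penalizes a module whose links all point to related pages exactly as much as a module whose links are all advertisements.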
Summary of the invention
The invention provides a method and a device for extracting text content, in order to solve the problem that prior-art text content extraction methods cannot extract text content accurately.
To solve the above problem, the invention adopts the following technical solution:
In one aspect, the invention provides a method for extracting text content, comprising:
decomposing an input Hypertext Markup Language (HTML) web page into multiple modules, determining the position score of each module according to its position in the page layout, and calculating the text length of each module;
extracting the link addresses contained in each module, identifying the character content that, apart from protocol characters, occurs most frequently across all link addresses, marking each link address that contains this most frequent character content as a valid link, and marking each link address that does not contain it as an invalid link;
determining the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and judging each module whose composite score exceeds a set threshold to be a content module.
Further, in the method of the invention, Table tags or Div tags are used to decompose the input HTML web page into multiple modules.
Further, in the method of the invention, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the decomposed module continues to be decomposed.
Further, in the method of the invention, the text length of each link is calculated uniformly when the valid and invalid links are marked; or, the text length of each link contained in a module is calculated separately for that module when its composite score is determined.
Further, in the method of the invention, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text contained in the module according to the HTML tags, and calculating the length of that text to obtain the text length of the module.
Further, in the method of the invention, the link addresses of each module are extracted through anchor tags.
In another aspect, the invention also provides a text content extraction device, comprising:
a web page processing unit, configured to decompose an input Hypertext Markup Language (HTML) web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;
a marking unit, configured to extract the link addresses contained in each module, identify the character content that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link;
a content extraction unit, configured to determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to judge each module whose composite score exceeds a set threshold to be a content module.
Further, in the device of the invention, the web page processing unit is specifically configured to decompose the input HTML web page into multiple modules using Table tags or Div tags.
Further, in the device of the invention, the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing and, if so, to continue decomposing the decomposed module.
Further, in the device of the invention, the marking unit is also configured to calculate the text length of each link uniformly when marking valid and invalid links; or, the content extraction unit is also configured to calculate the text length of each link contained in a module separately for that module when determining its composite score.
Further, in the device of the invention, the web page processing unit is specifically configured to, for each module, extract the HTML tags of the module, obtain the text contained in the module according to the HTML tags, and calculate the length of that text to obtain the text length of the module.
Further, in the device of the invention, the marking unit is specifically configured to extract the link addresses of each module through anchor tags.
Compared with the prior art, the invention has the following beneficial effects:
The method and device of the invention use the ratio of the sum of the plain-text length and the valid-link text length to the invalid-link text length. They can therefore extract the content of an HTML web page more accurately and eliminate redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification, and automatic summarization.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for extracting text content provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the page layout in an embodiment of the invention;
Fig. 3 is a detailed flowchart of the method for extracting text content provided by an embodiment of the invention;
Fig. 4 is a structural block diagram of a text content extraction device provided by an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
To solve the problem that prior-art text content extraction methods cannot extract text content accurately, the embodiments of the invention provide a method and a device for extracting text content.
As shown in Fig. 1, a method for extracting text content provided by an embodiment of the invention specifically comprises:
Step S101: decompose the input HTML web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module.
In this step, Table tags or Div tags are preferably used to decompose the input HTML web page into multiple modules.
Further, in this step, if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the decomposed module continues to be decomposed. Tag mixing is explained as follows. Current mainstream page layouts fall into two kinds: the page structure is laid out with either <Table> tags or <Div> tags. When page content is edited, however, these two tags may contain each other: a page laid out with <Table> may contain <Div> tags, and a page laid out with <Div> may likewise contain <Table> tags. Tag mixing can also refer to tags that control structure (such as <Table> and <h1>) being used together with tags that control presentation (such as <font> and <b>), which makes error correction and data division difficult. Because the invention needs to divide the page into modules, tag mixing here mainly refers to the mixed use of <Table> and <Div> tags.
Further, in this step, calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text contained in the module according to the HTML tags, and calculating the length of that text to obtain the text length of the module.
Step S102: extract the link addresses contained in each module, identify the character content that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link.
In this step, the link addresses of each module are preferably extracted through anchor tags.
Step S103: determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and judge each module whose composite score exceeds a set threshold to be a content module.
The text lengths of the valid links and invalid links used in this step are preferably calculated uniformly when the links are marked as valid or invalid; they can of course also be calculated separately for each module when its composite score is determined.
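The scoring and threshold test of step S103 can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function names, the dictionary keys, and the sample numbers are hypothetical, and the patent does not say what happens when a module has no invalid links, so the sketch simply floors the denominator at 1 to avoid division by zero.

```python
def composite_score(position_score, text_len, valid_link_len, invalid_link_len):
    """position score x (text length + valid-link text length) / invalid-link text length."""
    # Assumption (not in the patent): treat a zero invalid-link length as 1
    # so that modules without invalid links do not cause a division by zero.
    denominator = max(invalid_link_len, 1)
    return position_score * (text_len + valid_link_len) / denominator


def select_content_modules(modules, threshold):
    """Return (name, score) for every module whose score exceeds the threshold.

    `modules` is a list of dicts with the hypothetical keys
    name, position_score, text_len, valid_link_len, invalid_link_len.
    """
    selected = []
    for m in modules:
        score = composite_score(
            m["position_score"],
            m["text_len"],
            m["valid_link_len"],
            m["invalid_link_len"],
        )
        if score > threshold:
            selected.append((m["name"], score))
    return selected
```

With made-up inputs, a central module with position score 3, text length 1200, valid-link length 60 and invalid-link length 20 scores 3 × 1260 / 20 = 189, while a sidebar with position score 1, text length 40, valid-link length 10 and invalid-link length 200 scores 0.25, so a threshold of 10 keeps only the first.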
To explain the implementation of the method more clearly, the method is further described below with reference to Figs. 2 and 3, as follows:
The method provided by the embodiments of the invention is an improved text content extraction method based on HTML features that divides text content more accurately and reasonably. In the embodiments, the page layout is divided into content modules and non-content modules, as shown in Fig. 2. A content module is the content part of the web page, whereas a non-content module is generally used to display information such as navigation, banners, copyright notices, or advertisements. The goal of the scheme is to decompose the HTML web page accurately and extract the content modules from it. Each module obtained by decomposition is given a score according to its position in the page layout: a module at the focus of the user's line of sight receives a higher score, and other modules receive lower scores. If the invalid-link text length is too large relative to the module, the module is probably displaying advertisements or navigation information.
The text content extraction method based on HTML features described in the embodiments of the invention specifically comprises:
Step 1: use tags to decompose the input HTML web page into multiple modules.
In this step, Table or Div tags are preferably used to decompose the input HTML web page into multiple modules. The embodiments use these two layout tags to split the page because, on the one hand, they are the tags used for page layout and, on the other hand, using only them reduces the complexity of parsing the page: other tags such as span and br are not processed, which greatly speeds up parsing and reduces the system resources consumed.
Step 2: if a module decomposed in step 1 can be decomposed further and no mixing of Table and Div tags occurs, send the module back to step 1 to continue decomposition.
Step 3: assign each input module a position score according to its position in the layout. The score for each position in the layout is set in advance. The basic principle is that the closer a module is to where the user's visual attention concentrates, the higher its weight and therefore its position score.
Step 4: calculate the text length of each module.
Step 5: identify the character content that occurs most frequently in the link addresses of all modules. Link addresses that belong to content related to this web page necessarily share some identical character content, whereas link addresses such as those of advertisements do not contain it. The most frequent character content can therefore be used to distinguish valid link addresses (links to content related to the web page) from invalid link addresses (links unrelated to the page content, such as advertisements).
In this step, characters shared by all URLs (uniform resource locators), such as the common protocol characters www and http, are excluded from the count.
Step 6: mark the link addresses that contain the character content identified in step 5 as valid links, and calculate the text length of each valid link.
Step 7: mark the link addresses that do not contain the character content identified in step 5 as invalid links, and calculate the text length of each invalid link.
Step 8: calculate the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of its valid links) / text length of its invalid links. A module whose composite score is higher than the set threshold is considered a content module.
Step 9: compare the composite score from step 8 with the preset threshold (the lower limit of the composite score that a content module should reach). If the score is higher than the threshold, the content of the module is considered text content that needs to be extracted.
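Steps 5 to 7 can be sketched in Python with the standard `urllib.parse` and `collections` modules. The patent does not define precisely what the "character content" of a link address is, so this sketch makes a loud assumption: it uses the host name, with the scheme and a leading `www.` removed, as the unit that is counted. It also assumes that relative links are resolved against the page URL and that the text length of a link means the length of its anchor text. The names `site_key`, `classify_links`, and `link_text_length` are hypothetical.

```python
from collections import Counter
from urllib.parse import urljoin, urlsplit


def site_key(url):
    """Host name without scheme, port, or leading 'www.' (assumed 'character content')."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify_links(links, base_url):
    """Split (href, anchor_text) pairs into (valid, invalid) lists.

    A link is valid when its resolved address contains the most frequent
    site key found across all links; otherwise it is invalid.
    """
    # Assumption: relative hrefs belong to the page's own site, so resolve them.
    resolved = [(urljoin(base_url, href), text) for href, text in links]
    counts = Counter(key for key in (site_key(u) for u, _ in resolved) if key)
    if not counts:
        return [], resolved  # no usable host names: nothing can be marked valid
    top_key = counts.most_common(1)[0][0]
    # Literal substring test, following the "contains the character content" wording.
    valid = [(u, t) for u, t in resolved if top_key in u]
    invalid = [(u, t) for u, t in resolved if top_key not in u]
    return valid, invalid


def link_text_length(links):
    """Total anchor-text length of the given links, ignoring whitespace."""
    return sum(len("".join(text.split())) for _, text in links)
```

Counting whole host names is only one possible reading. It treats subdomains of the same site as different keys, and the substring test will also accept an advertising URL that merely embeds the site's host name in its query string, so a production system would likely need a more careful rule.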
Based on the principle stated above, a concrete example is described below, as shown in Fig. 3. After a web page is obtained, it is taken as the input. In step (1), if Table tags and Div tags can be used to decompose the input web page into multiple modules, the web page is decomposed. Step (2) judges whether a decomposed module can be decomposed further; if it can, the flow returns to step (1) to continue decomposition, otherwise it enters step (3). Step (3) strips all HTML tags from the module passed on by step (2) to obtain plain text, and calculates the length of this text. Step (4) uses anchor tags to extract all links and identifies the character content that occurs most frequently in the link addresses of all modules. Step (5) calculates the text length of the links that contain and that do not contain the character content identified in step (4), and marks them as valid links and invalid links respectively. Step (6) calculates the composite score of each module using the formula: composite score = position score × (text length + valid-link text length) / invalid-link text length. In step (7), modules whose composite score is lower than the threshold are deleted; modules whose composite score is higher than the threshold enter step (8) and are output.
In summary, the method described in the embodiments of the invention uses the ratio of the sum of the plain-text length and the valid-link text length to the invalid-link text length. It can therefore extract the content of an HTML web page more accurately and eliminate redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification, and automatic summarization.
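The decomposition loop at the start of this flow (split by Table and Div tags, then keep splitting while no tag mixing occurs) can be sketched with Python's standard `html.parser`. This is one possible reading, not the patented implementation. The names `Block`, `LayoutTreeBuilder`, `decompose`, and `split_into_modules` are hypothetical. The sketch assumes that a block is mixed when it contains a layout tag of the other kind at any depth, that only matching close tags end a block, and that text lying directly in a block that gets split is not assigned to any module.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = ("table", "div")


class Block:
    """One <table> or <div> element and the layout blocks nested inside it."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text not inside a nested layout block

    def descendant_tags(self):
        tags = set()
        for child in self.children:
            tags.add(child.tag)
            tags |= child.descendant_tags()
        return tags

    def text_length(self):
        """Characters of text in this block and its nested blocks, without whitespace."""
        own = len("".join("".join(self.text).split()))
        return own + sum(child.text_length() for child in self.children)


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree of layout blocks; all other tags (span, br, ...) are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            block = Block(tag, self._current)
            self._current.children.append(block)
            self._current = block

    def handle_endtag(self, tag):
        if tag in LAYOUT_TAGS and self._current.tag == tag:
            self._current = self._current.parent

    def handle_data(self, data):
        self._current.text.append(data)


def decompose(block):
    """Return the modules under `block`, splitting only while tags are not mixed."""
    mixed = block.tag != "root" and bool(block.descendant_tags() - {block.tag})
    if not block.children or mixed:
        return [block]  # cannot, or must not, be split further
    modules = []
    for child in block.children:
        modules.extend(decompose(child))
    return modules


def split_into_modules(html):
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    return decompose(builder.root)
```

For a page such as `<div><div>Nav</div><div>Article body text</div></div><table><tr><td><div>Ad</div></td></tr></table>`, the outer `div` is split into its two inner `div` modules, while the `table` is kept whole because it contains a `div`.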
As shown in Fig. 4, an embodiment of the invention also provides a text content extraction device, which specifically comprises:
a web page processing unit 410, configured to decompose the input HTML web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;
a marking unit 420, configured to extract the link addresses contained in each module, identify the character content that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link;
a content extraction unit 430, configured to determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to judge each module whose composite score exceeds a set threshold to be a content module.
Based on the above framework, specific implementations of each unit are given below:
In the embodiments of the invention, the web page processing unit 410 specifically uses Table tags or Div tags to decompose the input HTML web page into multiple modules and, for each module, extracts the HTML tags of the module, obtains the text contained in the module according to the HTML tags, and calculates the length of that text to obtain the text length of the module.
Further, the web page processing unit 410 is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing and, if so, to continue decomposing the decomposed module.
In the embodiments of the invention, the marking unit 420 is also configured to calculate the text length of each link uniformly when marking valid and invalid links; or, the content extraction unit 430 calculates the text length of each link contained in a module separately for that module when determining its composite score.
Further, in the embodiments of the invention, the marking unit 420 is specifically configured to extract the link addresses of each module through anchor tags.
In summary, the device of the invention uses the ratio of the sum of the plain-text length and the valid-link text length to the invalid-link text length. It can therefore extract the content of an HTML web page more accurately and eliminate redundant information such as advertisements, which greatly reduces the workload of the subsequent word segmentation stage and improves the accuracy of text clustering, text classification, and automatic summarization.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (10)

1. A method for extracting text content, characterized by comprising:
decomposing an input Hypertext Markup Language (HTML) web page into multiple modules, determining the position score of each module according to its position in the page layout, and calculating the text length of each module;
extracting the link addresses contained in each module, identifying the character content that, apart from protocol characters, occurs most frequently across all link addresses, marking each link address that contains this most frequent character content as a valid link, and marking each link address that does not contain it as an invalid link;
determining the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and judging each module whose composite score exceeds a set threshold to be a content module.
2. The method of claim 1, characterized in that Table tags or Div tags are used to decompose the input HTML web page into multiple modules.
3. The method of claim 2, characterized in that if a module obtained by decomposition can be decomposed further and no tag mixing occurs, the decomposed module continues to be decomposed.
4. The method of claim 1, characterized in that the text length of each link is calculated uniformly when the valid and invalid links are marked; or, the text length of each link contained in a module is calculated separately for that module when its composite score is determined.
5. The method according to any one of claims 1 to 4, characterized in that:
calculating the text length of each module specifically comprises: for each module, extracting the HTML tags of the module, obtaining the text contained in the module according to the HTML tags, and calculating the length of that text to obtain the text length of the module;
the link addresses of each module are extracted through anchor tags.
6. A text content extraction device, characterized by comprising:
a web page processing unit, configured to decompose an input Hypertext Markup Language (HTML) web page into multiple modules, determine the position score of each module according to its position in the page layout, and calculate the text length of each module;
a marking unit, configured to extract the link addresses contained in each module, identify the character content that, apart from protocol characters, occurs most frequently across all link addresses, mark each link address that contains this most frequent character content as a valid link, and mark each link address that does not contain it as an invalid link;
a content extraction unit, configured to determine the composite score of each module according to composite score of a module = position score of the module × (text length of the module + text length of the valid links in the module) / text length of the invalid links in the module, and to judge each module whose composite score exceeds a set threshold to be a content module.
7. The device of claim 6, characterized in that the web page processing unit is specifically configured to decompose the input HTML web page into multiple modules using Table tags or Div tags.
8. The device of claim 7, characterized in that the web page processing unit is also configured to judge whether a module obtained by decomposition can be decomposed further without tag mixing and, if so, to continue decomposing the decomposed module.
9. The device of claim 6, characterized in that:
the marking unit is also configured to calculate the text length of each link uniformly when marking valid and invalid links;
or, the content extraction unit is also configured to calculate the text length of each link contained in a module separately for that module when determining its composite score.
10. The device according to any one of claims 6 to 9, characterized in that:
the web page processing unit is specifically configured to, for each module, extract the HTML tags of the module, obtain the text contained in the module according to the HTML tags, and calculate the length of that text to obtain the text length of the module;
the marking unit is specifically configured to extract the link addresses of each module through anchor tags.
CN201210469940.6A 2012-11-20 2012-11-20 A kind of method for extracting content of text and device Active CN103020129B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210469940.6A CN103020129B (en) 2012-11-20 2012-11-20 A kind of method for extracting content of text and device
PCT/CN2013/080666 WO2013178193A2 (en) 2012-11-20 2013-08-01 Text content extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210469940.6A CN103020129B (en) 2012-11-20 2012-11-20 A kind of method for extracting content of text and device

Publications (2)

Publication Number Publication Date
CN103020129A CN103020129A (en) 2013-04-03
CN103020129B true CN103020129B (en) 2015-11-18

Family

ID=47968733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210469940.6A Active CN103020129B (en) 2012-11-20 2012-11-20 A kind of method for extracting content of text and device

Country Status (2)

Country Link
CN (1) CN103020129B (en)
WO (1) WO2013178193A2 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020129B (en) * 2012-11-20 2015-11-18 中兴通讯股份有限公司 A kind of method for extracting content of text and device
CN105320734B (en) * 2015-07-14 2019-02-22 中国互联网络信息中心 A kind of web page core content extracting method
CN106528504A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Data screening method and device for social application
CN106326445B (en) * 2016-08-26 2019-09-17 武汉大学 A kind of web page contents evaluation method based on heat transfer agent amount
CN107766419B (en) * 2017-09-08 2021-08-31 广州汪汪信息技术有限公司 Threshold denoising-based TextRank document summarization method and device
US10922366B2 (en) * 2018-03-27 2021-02-16 International Business Machines Corporation Self-adaptive web crawling and text extraction
CN109063996A (en) * 2018-07-23 2018-12-21 长沙知了信息科技有限公司 The information processing method and device of multi-user collaborative editor
CN110377810B (en) * 2019-06-25 2022-04-08 浙江大学 Classification method of mobile terminal web pages

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102622382A (en) * 2011-03-14 2012-08-01 北京小米科技有限责任公司 Webpage rearranging method

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US20020124020A1 (en) * 2001-03-01 2002-09-05 International Business Machines Corporation Extracting textual equivalents of multimedia content stored in multimedia files
CN101702160B (en) * 2009-10-28 2013-04-17 深圳市龙视传媒有限公司 Method for acquiring internet subject information and device thereof
CN102479181B (en) * 2010-11-22 2015-10-07 中国电信股份有限公司 Web page text extraction method and device based on DIV position
CN103020129B (en) * 2012-11-20 2015-11-18 中兴通讯股份有限公司 A kind of method for extracting content of text and device

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101093487A (en) * 2006-06-22 2007-12-26 上海新纳广告传媒有限公司 Method for extracting content of text based on HTML characteristics
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102622382A (en) * 2011-03-14 2012-08-01 北京小米科技有限责任公司 Webpage rearranging method
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof

Also Published As

Publication number Publication date
CN103020129A (en) 2013-04-03
WO2013178193A2 (en) 2013-12-05
WO2013178193A3 (en) 2014-01-23

Similar Documents

Publication Publication Date Title
CN103020129B (en) A kind of method for extracting content of text and device
CN102541874B (en) Webpage text content extracting method and device
US8819028B2 (en) System and method for web content extraction
CN101093487A (en) Method for extracting content of text based on HTML characteristics
EP2687997A1 (en) Method for rearranging web page
CN103577466A (en) Method and device for displaying webpage content in browser
CN103761317A (en) Multithreading asynchronous rendering system and method
CN103166981A (en) Wireless webpage transcoding method and device
CN101976260A (en) Visual label and method for generating webpage by using visual label
CN103870486A (en) Webpage type confirming method and device
US9449114B2 (en) Removing non-substantive content from a web page by removing its text-sparse nodes and removing high-frequency sentences of its text-dense nodes using sentence hash value frequency across a web page collection
CN107153716A (en) Webpage content extracting method and device
CN106776800A (en) The page generation method of AngularJS frameworks, apparatus and system
CN102629252A (en) Method and device for prompting information
CN102768663A (en) Method and device for extracting webpage title and information processing system
CA2602749C (en) System and method of report representation
CN105069043A (en) Paging reading method and system for web data information
EP3602352A1 (en) Transformation of marked-up content into a file format that enables automated browser based pagination
US20090307578A1 (en) Top down chinese character display on a computing device
CN105550279A (en) Vision-based list page identification method
CN113139145B (en) Page generation method and device, electronic equipment and readable storage medium
CN104216868A (en) Adaptation method and device for document display format
CN103927363A (en) Browser grid display method and system and browser client
CN105808644A (en) Method and device for determining text node
Jalali Applied principles of criticism in comparative children's literature

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant