CN103116592A - Formatting output method of webpage content - Google Patents

Formatting output method of webpage content

Info

Publication number
CN103116592A
Authority
CN
China
Prior art keywords
node
text
buffer zone
html
formatting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100091177A
Other languages
Chinese (zh)
Inventor
黄靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KUNSHAN MAIKESITAI TECHNOLOGY Co Ltd
Original Assignee
KUNSHAN MAIKESITAI TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KUNSHAN MAIKESITAI TECHNOLOGY Co Ltd
Priority to CN2012100091177A
Publication of CN103116592A
Legal status: Pending

Abstract

The invention discloses a formatting output method for web page content. The method outputs the content of a specified region of a web page as formatted text, adding indentation, carriage returns and the like at appropriate places so that the output text is displayed as closely as possible to the way it appears in a browser. The method comprises the steps of: parsing the HTML (HyperText Markup Language) page and converting the HTML into standard XML (Extensible Markup Language) format; reserving a string buffer in advance; generating a DOM tree structure; traversing all nodes of the tree structure and appending the text content to the string buffer; after all nodes have been traversed, the content of the string buffer is the final formatted text. With this method, the text of a web page is output in a form closer to its form in the browser, giving the user a better reading experience.

Description

A formatting output method for web page content
Technical field
The present invention relates to network information output technology, and in particular to a formatting output method for Internet web page content.
Background art
As Internet information technology has developed and matured, online information has reached countless households. Even in an age when the Internet seems able to do everything, people still need to print web page content for reading. For content that is better suited to reading online, such as electronic newspapers and electronic journals born on the web, how can a page suitable for paper be produced by printing? This has long been a problem.
Formatting the content of a specific region of a web page into plain text, so that the user receives text resembling its form in the browser, has become a problem that technicians need to solve.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a formatting output method for web page content that can complete the formatted output of web page content quickly and easily, while also improving the quality of the output file and the user's reading experience.
To achieve the above object, the method provided by the invention locates the relevant region of the web page and parses it into a tree structure; then, according to the (display) meaning of each tag, it uses two stacks to determine where a carriage return needs to be added and where a paragraph begins and indentation is needed. The technical scheme of the invention is realized as follows:
A. Parse the HTML page and convert the HTML into standard XML format;
B. First reserve a string buffer (StringBuffer), generate a document tree structure (DOMTree), traverse all nodes of the tree structure, and append the text content to the string buffer. After the traversal is complete, the content of the string buffer is the final formatted text.
Further, step A is refined as follows:
A1. Analyze the original document, supply the tags required by the HTML standard and any missing closing tags, and correct non-standard tag attribute syntax, so that a non-well-formed HTML page becomes well-formed;
A2. Filter out <SCRIPT>, <STYLE>, <SELECT>, <INPUT> and other tags irrelevant to text display.
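Steps A1 and A2 can be illustrated with a minimal sketch built on Python's standard `html.parser` module. It is an illustration under stated assumptions, not the patent's implementation: the names `Node`, `TreeBuilder` and `parse_html` are hypothetical, the void-element list is a partial one chosen for the example, and the only repair performed is closing elements that are still open when an ancestor's end tag arrives.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "style", "select", "input"}      # tags named in step A2
VOID = {"br", "hr", "img", "meta", "link", "input"}   # partial list, assumed


class Node:
    """Hypothetical tree node: an element, or '#text' for a text node."""
    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.open_elements = [self.root]  # stack of elements not yet closed
        self.skip_depth = 0               # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        if tag not in VOID:
            self.open_elements.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            if tag not in VOID and self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth:
            return
        # Close the nearest matching open element; elements left open
        # inside it (missing end tags) are closed along with it.
        for i in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[i].tag == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth or not text:
            return
        self.open_elements[-1].children.append(Node("#text", text))


def parse_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

In this sketch an unclosed <p> followed by another <p> is nested rather than closed, so it is weaker than a full HTML tidy step; it only shows the kind of clean-up and filtering that step A describes.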
Preferably, step A first locates the useful region of the page, i.e. the body text region of the web page.
Further, step B parses the well-formed HTML page into a document tree and traverses the tree to find the text nodes in it.
Further, step B is refined as follows:
B1. The tree structure can be traversed using a stack; a second stack is created to record the parent nodes of the node currently being visited, referred to below as the parent-node stack; and a string buffer is prepared to receive characters;
B2. When one of <address>, <blockquote>, <div>, <dl>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <ol>, <table>, <tr>, <ul>, <p>, <br>, or a non-HTML tag is encountered, this indicates the start of a paragraph, and a carriage return is appended to the string buffer;
B3. When a #TEXT node is encountered, its text is appended to the string buffer;
B4. When the visit to a node ends and the node has child nodes, the node is pushed onto the parent-node stack;
B5. When the visit to a node ends and the node has neither child nodes nor a right sibling, the parent-node stack is checked, the parent node is popped from it and visited again, and it is processed according to B2;
B6. When the visit to a node ends, and the node is one that was popped from the parent-node stack and revisited, and it has no right sibling, its own parent node is popped, visited again, and processed according to B2.
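Steps B1 to B6 can be sketched as an iterative preorder traversal with an explicit parent-node stack. This is a minimal sketch under assumptions: `Node`, `BLOCK_TAGS` and `format_text` are hypothetical names, each stack entry also stores the index of the next child to visit (a bookkeeping detail the text does not specify), and the "non-HTML tag" case of B2 is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "#text" for text nodes
    text: str = ""
    children: list = field(default_factory=list)


# Tags listed in step B2 (the "non-HTML tag" case is not handled here).
BLOCK_TAGS = {"address", "blockquote", "div", "dl", "h1", "h2", "h3", "h4",
              "h5", "h6", "ol", "table", "tr", "ul", "p", "br"}


def format_text(root, block_tags=BLOCK_TAGS):
    buf = []              # the string buffer of B1
    parent_stack = []     # entries: [parent node, index of next child]
    node = root
    while node is not None:
        if node.tag in block_tags:        # B2: a paragraph starts here
            buf.append("\n")
        elif node.tag == "#text":         # B3: append the text
            buf.append(node.text)
        if node.children:                 # B4: push the node, then descend
            parent_stack.append([node, 1])
            node = node.children[0]
            continue
        node = None
        while parent_stack:               # B5/B6: no children left to visit
            parent, nxt = parent_stack[-1]
            if nxt < len(parent.children):    # a right sibling exists
                parent_stack[-1][1] = nxt + 1
                node = parent.children[nxt]
                break
            parent_stack.pop()                # revisit the parent ...
            if parent.tag in block_tags:      # ... and apply B2 to it again
                buf.append("\n")
    return "".join(buf)
```

Passing `block_tags={"p"}` reproduces the plain StringBuffer results of the worked examples in the embodiment, which treat div as a non-B2 node even though B2 lists it; the default set follows the B2 list as written.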
Preferably, when a carriage return is to be appended in step B2, the last character of the string buffer is checked; if it is a carriage return, or the string buffer currently contains no characters, no carriage return is appended, so as to avoid appending too many carriage returns and spoiling the display format of the article.
Preferably, step B3 checks the last character of the string buffer; if it is a carriage return, or the string buffer currently contains no characters, indentation is added at the start of the text, which improves the presentation of the formatted text.
Preferably, in step B, text inside <strong> and <b> tags is recorded as bold, which improves the presentation of the formatted text.
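The two buffer refinements (suppressing redundant carriage returns and indenting text at the start of a line) can be sketched as a small wrapper around the string buffer. The class name `OptStringBuffer` is borrowed from the worked examples in the embodiment; its methods and the choice of a tab for indentation are assumptions made for illustration, and the bold handling for <strong> and <b> is not covered.

```python
class OptStringBuffer:
    """String buffer with the preferred B2/B3 refinements (a sketch)."""

    def __init__(self, indent="\t"):
        self.parts = []
        self.indent = indent

    def _at_line_start(self):
        # True when the buffer is empty or ends with a carriage return.
        return not self.parts or self.parts[-1].endswith("\n")

    def append_newline(self):
        # Refined B2: skip the carriage return if one would be redundant.
        if not self._at_line_start():
            self.parts.append("\n")

    def append_text(self, text):
        # Refined B3: indent text that begins a new line.
        if not text:
            return
        if self._at_line_start():
            self.parts.append(self.indent)
        self.parts.append(text)

    def value(self):
        return "".join(self.parts)
```

A caller would invoke `append_newline()` wherever B2 applies and `append_text()` wherever B3 applies, leaving the traversal itself unchanged.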
The web page content output method provided by the invention outputs the text of a web page in a format closer to that shown in a browser, giving the user a better reading experience.
Embodiment
The following embodiment describes the present invention in detail. The method traverses the tree in preorder, uses a stack to record the node currently being visited, and, after all of a node's descendants have been traversed, visits that node again to decide whether a carriage return should be added. For example, the parent node P of the text aaaaaa in Fragment B is revisited after aaaaaa, and a carriage return is added at that point.
Fragment A
Figure BSA00000656824200031
The corresponding tree structure is:
Figure BSA00000656824200041
First, a string buffer, StringBuffer, is prepared.
The string buffer with the preferred handling is denoted OptStringBuffer.
\n denotes the carriage return character and \t denotes indentation.
The parent-node stack is denoted ParentStack.
1. Visit div. div is neither a B2 node nor a text node.
It has child nodes, so div is pushed onto the stack.
ParentStack: div
2. Visit p (first). p is a B2 node, so a carriage return is appended.
StringBuffer: \n
Preferably, the buffer currently contains no characters, so no carriage return is appended.
OptStringBuffer: (empty)
p has a child node, so p is pushed onto the stack.
ParentStack: div p
3. Visit the text node aaaaaa.
StringBuffer: \naaaaaa
Preferably, the buffer currently contains no characters, so indentation is added.
OptStringBuffer: \taaaaaa
The text node has no child nodes.
ParentStack: div p
4. The text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and visited again. ParentStack: div
p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\n
OptStringBuffer: \taaaaaa\n
5. p (first) has a right sibling, so the traversal continues.
Visit p (second). p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\n\n
Preferably, the last character of the buffer is a carriage return, so no carriage return is appended.
OptStringBuffer: \taaaaaa\n
p has a child node, so p is pushed onto the stack.
ParentStack: div p
6. Visit the text node bbbbbb.
StringBuffer: \naaaaaa\n\nbbbbbb
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer: \taaaaaa\n\tbbbbbb
The text node has no child nodes.
ParentStack: div p
7. The text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and visited again. ParentStack: div
p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\n\nbbbbbb\n
OptStringBuffer: \taaaaaa\n\tbbbbbb\n
ParentStack: div
8. p (second) has a right sibling, so the traversal continues.
Visit p (third). p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\n\nbbbbbb\n\n
Preferably, the last character of the buffer is a carriage return, so no carriage return is appended.
OptStringBuffer: \taaaaaa\n\tbbbbbb\n
p has a child node, so p is pushed onto the stack.
ParentStack: div p
9. Visit the text node cccccc.
StringBuffer: \naaaaaa\n\nbbbbbb\n\ncccccc
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc
The text node has no child nodes.
ParentStack: div p
10. The text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and visited again. ParentStack: div
p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\n\nbbbbbb\n\ncccccc\n
OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n
ParentStack: div
11. p (third) was taken from ParentStack and has no right sibling,
so ParentStack is consulted: div is popped and visited again.
ParentStack: (empty)
div is neither a B2 node nor a text node.
Final result
StringBuffer: \naaaaaa\n\nbbbbbb\n\ncccccc\n
OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n
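The Fragment A trace can be checked with a short recursive sketch that computes both buffers. Recursion replaces the explicit parent-node stack here (the call stack plays that role), so this is an equivalent formulation for checking the results rather than the stack-based procedure itself. `Node` and `render` are hypothetical names, the tree is written out by hand from the trace because the figure is not reproduced, and p is the only tag passed as a B2 node because the trace treats div as a non-B2 node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "#text" for text nodes
    text: str = ""                # assumed non-empty for text nodes
    children: list = field(default_factory=list)


def render(node, block_tags, optimized, out=None):
    out = [] if out is None else out

    def at_line_start():
        return not out or out[-1].endswith("\n")

    def newline():
        # The optimized buffer skips redundant carriage returns.
        if not (optimized and at_line_start()):
            out.append("\n")

    if node.tag == "#text":
        if optimized and at_line_start():
            out.append("\t")          # indent text that starts a line
        out.append(node.text)
    elif node.tag in block_tags:
        newline()                     # first visit (B2)
    for child in node.children:
        render(child, block_tags, optimized, out)
    if node.children and node.tag in block_tags:
        newline()                     # revisit after all descendants
    return "".join(out)


def p(text):
    return Node("p", children=[Node("#text", text)])


fragment_a = Node("div", children=[p("aaaaaa"), p("bbbbbb"), p("cccccc")])
plain = render(fragment_a, {"p"}, optimized=False)
opt = render(fragment_a, {"p"}, optimized=True)
```

Under these assumptions `plain` equals the final StringBuffer value of the trace and `opt` equals the final OptStringBuffer value.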
Fragment B
Figure BSA00000656824200061
The corresponding tree structure is:
Figure BSA00000656824200062
Note that here bbbbbb lacks the enclosing <p> layer it has in Fragment A, but it is still an independent paragraph.
1. Visit div. div is neither a B2 node nor a text node.
It has child nodes, so div is pushed onto the stack.
ParentStack: div
2. Visit p (first). p is a B2 node, so a carriage return is appended.
StringBuffer: \n
Preferably, the buffer currently contains no characters, so no carriage return is appended.
OptStringBuffer: (empty)
p has a child node, so p is pushed onto the stack.
ParentStack: div p
3. Visit the text node aaaaaa.
StringBuffer: \naaaaaa
Preferably, the buffer currently contains no characters, so indentation is added.
OptStringBuffer: \taaaaaa
The text node has no child nodes.
ParentStack: div p
4. The text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and visited again. ParentStack: div
p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\n
OptStringBuffer: \taaaaaa\n
5. p (first) has a right sibling, so the traversal continues.
6. Visit the text node bbbbbb.
StringBuffer: \naaaaaa\nbbbbbb
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer: \taaaaaa\n\tbbbbbb
The text node has no child nodes.
ParentStack: div
7. The text node has no child nodes but has a right sibling,
so the traversal continues.
Visit p (second). p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\nbbbbbb\n
Preferably, the last character of the buffer is not a carriage return, so the carriage return is appended.
OptStringBuffer: \taaaaaa\n\tbbbbbb\n
p has a child node, so p is pushed onto the stack.
ParentStack: div p
8. Visit the text node cccccc.
StringBuffer: \naaaaaa\nbbbbbb\ncccccc
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc
The text node has no child nodes.
ParentStack: div p
9. The text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and visited again. ParentStack: div
p is a B2 node, so a carriage return is appended.
StringBuffer: \naaaaaa\nbbbbbb\ncccccc\n
OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n
ParentStack: div
10. p (second) was taken from ParentStack and has no right sibling,
so ParentStack is consulted: div is popped and visited again.
ParentStack: (empty)
div is neither a B2 node nor a text node.
Final result
StringBuffer: \naaaaaa\nbbbbbb\ncccccc\n
OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n
It can be seen that Fragment A and Fragment B differ slightly. In Fragment A, each P node is a paragraph. In Fragment B, bbbbbb is not inside a P node, yet it is still a paragraph, because the content immediately before it and the content immediately after it are each a paragraph.
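The observation that both fragments yield the same optimized output can also be illustrated without building a tree. The sketch below swaps in a different technique from the one described: it reacts to parser events from Python's standard `html.parser` instead of traversing a DOM tree with a parent-node stack. The HTML strings are reconstructed from the described tree structures (the original figures are not reproduced), `StreamingFormatter` and `format_html` are hypothetical names, and only p is treated as a B2 tag, as in the traces.

```python
from html.parser import HTMLParser


class StreamingFormatter(HTMLParser):
    """Event-driven variant: block start/end tags act like B2 visits."""

    def __init__(self, block_tags=("p",)):
        super().__init__(convert_charrefs=True)
        self.block_tags = set(block_tags)
        self.out = []

    def _at_line_start(self):
        return not self.out or self.out[-1] == "\n"

    def _newline(self):
        if not self._at_line_start():     # avoid redundant carriage returns
            self.out.append("\n")

    def handle_starttag(self, tag, attrs):
        if tag in self.block_tags:        # first visit of a block node
            self._newline()

    def handle_endtag(self, tag):
        if tag in self.block_tags:        # corresponds to the revisit
            self._newline()

    def handle_data(self, data):
        text = data.strip()
        if text:
            if self._at_line_start():
                self.out.append("\t")     # indent text that starts a line
            self.out.append(text)


def format_html(html, block_tags=("p",)):
    formatter = StreamingFormatter(block_tags)
    formatter.feed(html)
    formatter.close()
    return "".join(formatter.out)


fragment_a = "<div><p>aaaaaa</p><p>bbbbbb</p><p>cccccc</p></div>"
fragment_b = "<div><p>aaaaaa</p>bbbbbb<p>cccccc</p></div>"
same = format_html(fragment_a) == format_html(fragment_b)
```

With these inputs `same` is true: the bare text bbbbbb ends up on its own indented line because the paragraphs on either side of it each contribute a carriage return.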

Claims (10)

1. A formatting output method for web page content, characterized in that the method comprises the following steps:
A. parsing the HTML page and converting the HTML into standard XML format;
B. first reserving a string buffer (StringBuffer), generating a document tree structure (DOMTree), traversing all nodes of the tree structure, and appending the text content to the string buffer; after the traversal is complete, the content of the string buffer is the final formatted text.
2. The method according to claim 1, characterized in that, in step A, the original document is analyzed, the tags required by the HTML standard and any missing closing tags are supplied, and non-standard tag attribute syntax is corrected, so that a non-well-formed HTML page becomes well-formed.
3. The method according to claim 2, characterized in that <SCRIPT>, <STYLE>, <SELECT>, <INPUT> and other tags irrelevant to text display are filtered out.
4. The method according to claim 1, characterized in that step B parses the well-formed HTML page into a document tree and traverses the tree to find the text nodes in it.
5. The method according to claim 4, characterized in that the tree structure can be traversed using a stack, a second stack is created to record the parent nodes of the node currently being visited, referred to below as the parent-node stack, and a string buffer is prepared to receive characters.
6. The method according to claim 5, characterized in that, when one of <address>, <blockquote>, <div>, <dl>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <ol>, <table>, <tr>, <ul>, <p>, <br>, or a non-HTML tag is encountered, this indicates the start of a paragraph, and a carriage return is appended to the string buffer.
7. The method according to claim 6, characterized in that, when a #TEXT node is encountered, its text is appended to the string buffer.
8. The method according to claim 7, characterized in that, when the visit to a node ends and the node has child nodes, the node is pushed onto the parent-node stack.
9. The method according to claim 8, characterized in that, when the visit to a node ends and the node has neither child nodes nor a right sibling, the parent-node stack is checked, the parent node is popped from it and visited again, and it is processed according to B2.
10. The method according to claim 9, characterized in that, when the visit to a node ends, and the node is one that was popped from the parent-node stack and revisited, and it has no right sibling, its own parent node is popped, visited again, and processed according to B2.
CN2012100091177A 2012-01-13 2012-01-13 Formatting output method of webpage content Pending CN103116592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100091177A CN103116592A (en) 2012-01-13 2012-01-13 Formatting output method of webpage content

Publications (1)

Publication Number Publication Date
CN103116592A true CN103116592A (en) 2013-05-22

Family

ID=48414969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100091177A Pending CN103116592A (en) 2012-01-13 2012-01-13 Formatting output method of webpage content

Country Status (1)

Country Link
CN (1) CN103116592A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470728A (en) * 2007-12-25 2009-07-01 北京大学 Method and device for automatically abstracting text of Chinese news web page
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320697A (en) * 2014-08-01 2016-02-10 北京龙源创新信息技术有限公司 Method for realizing magazine data storage standard
RU2610585C2 (en) * 2015-03-31 2017-02-13 Общество С Ограниченной Ответственностью "Яндекс" Method and system for modifying text in document
US10762279B2 (en) 2015-03-31 2020-09-01 Yandex Europe Ag Method and system for augmenting text in a document
CN107622106A (en) * 2017-09-13 2018-01-23 五八有限公司 Reminding method and device when a kind of page can not render
CN110377884A (en) * 2019-06-13 2019-10-25 北京百度网讯科技有限公司 Document analytic method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130522