CN103116592A - Formatting output method of webpage content - Google Patents
- Publication number: CN103116592A
- Application number: CN2012100091177A
- Authority: CN (China)
- Prior art keywords: node, text, buffer zone, html, formatting
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method for formatted output of webpage content. The method outputs the content of a specified region of a webpage as formatted plain text, adding indentation, carriage returns, and the like at the appropriate places so that the output text appears as close as possible to its rendering in a browser. The method comprises: parsing a HyperText Markup Language (HTML) page and converting the HTML into standard Extensible Markup Language (XML) format; reserving a string buffer in advance; generating a DOM tree structure; traversing all nodes of the tree; and appending the text content to the string buffer. After all nodes have been traversed, the content of the string buffer is the final formatted text. The method outputs webpage text in a form closer to its browser rendering and gives the user a better reading experience.
Description
Technical field
The present invention relates to a technique for outputting network information, and in particular to a method for formatted output of Internet webpage content.
Background art
With the development and growing maturity of Internet information technology, network information has reached a vast number of households. Even in an age when the Internet seems all-powerful, users may still need to print web content for reading. Electronic newspapers and electronic journals that were created for the web are better suited to being read online; how to generate pages suitable for paper media when printing them has long been an open problem.
Formatting the content of a specific region of a webpage into plain text, so that the user receives text whose layout resembles that shown in the browser, has therefore become a problem for those skilled in the art to solve.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a method for formatted output of webpage content that completes the formatted output quickly and conveniently, improves the quality of the output file, and improves the user's reading experience.
To achieve the above purpose, the method provided by the invention locates the webpage region and parses it into a tree structure and then, according to the (display) meaning of each tag, uses two stacks to determine where a carriage return needs to be added and where the beginning of a paragraph needs indentation. The technical solution of the invention is implemented as follows:
A. Parse the HTML page and convert the HTML into standard XML format;
B. First reserve a string buffer (StringBuffer), generate a document tree structure (DOMTree), traverse all nodes of the tree, and append the text content to the string buffer. After the traversal is complete, the content of the string buffer is the final formatted text.
Further, step A is refined as:
A1. Analyze the original document, supply the tags required by the HTML standard as well as any missing closing tags, and fix non-standard tag attribute syntax, so that an HTML page that is not well-formed becomes well-formed;
A2. Filter out <SCRIPT>, <STYLE>, <SELECT>, <INPUT>, and other tags irrelevant to text display.
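Steps A1 and A2 can be sketched as follows. This is a minimal illustration, assuming Python's standard `html.parser` module as a stand-in for the parser (the source does not name one). The names `Normalizer`, `to_well_formed`, and `VOID_TAGS` are hypothetical, and unclosed elements are simply closed when an ancestor closes or at end of input, which is cruder than a browser's recovery rules.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

SKIP_TAGS = {"script", "style", "select", "input"}         # tags named in step A2
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}   # assumption: elements with no end tag


class Normalizer(HTMLParser):
    """Re-emit sloppy HTML as well-formed markup, dropping non-display tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []         # pieces of the well-formed output
        self.open_tags = []   # elements opened but not yet closed
        self.skip_depth = 0   # > 0 while inside a filtered element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Quote every attribute value; a bare attribute reuses its own name as value.
        attr_text = "".join(f" {k}={quoteattr(v or k)}" for k, v in attrs)
        if tag in VOID_TAGS:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if tag not in VOID_TAGS and self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag not in self.open_tags:
            return  # inside a filtered element, or a stray end tag
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data))

    def close(self):
        super().close()  # flushes any buffered text first
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def to_well_formed(html):
    normalizer = Normalizer()
    normalizer.feed(html)
    normalizer.close()
    return "".join(normalizer.out)
```

For example, `to_well_formed('<div><p class=x>aaa<br>bbb</div>')` quotes the attribute, self-closes `<br>`, and supplies the missing `</p>`.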
Preferably, step A first locates the useful region of the page, i.e. the body-text region of the webpage.
Further, step B parses the well-formed HTML page into a document tree and traverses the tree to find its text nodes.
Further, step B is refined as:
B1. The tree traversal may use one stack structure; create another stack to record the parent nodes of the currently visited node, hereinafter called the parent-node stack, and prepare a string buffer to receive characters;
B2. When <address>, <blockquote>, <div>, <dl>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <ol>, <table>, <tr>, <ul>, <p>, <br>, or a non-HTML tag is encountered, this indicates the beginning of a paragraph, and a carriage return is appended to the string buffer;
B3. When a #TEXT node is encountered, its text is appended to the string buffer;
B4. When the visit to a node ends and the node has child nodes, the node is pushed onto the parent-node stack;
B5. When the visit to a node ends and the node has neither child nodes nor a right sibling, the parent-node stack is checked, the parent node is popped from it, and that node is revisited and processed according to B2;
B6. When the visit to a node ends, and the node is one that was popped from the parent-node stack and revisited, and it has no right sibling, its parent node is popped, revisited, and processed according to B2.
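Steps B1 to B6 can be sketched as a pre-order traversal with an explicit parent-node stack. This is an illustrative sketch, not the patent's implementation: it assumes the input is already well-formed (step A), uses Python's `xml.dom.minidom` as the DOM, and omits the "non-HTML tag" case of B2. The names `format_text` and `PARAGRAPH_TAGS` are hypothetical.

```python
from xml.dom import minidom

# Tags listed in step B2; passed as a parameter so the set can be adjusted.
PARAGRAPH_TAGS = frozenset({
    "address", "blockquote", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6",
    "ol", "table", "tr", "ul", "p", "br",
})


def format_text(xml_text, paragraph_tags=PARAGRAPH_TAGS):
    """Traverse the DOM in pre-order, appending text and carriage returns."""
    buffer = []         # string buffer (B1)
    parent_stack = []   # parent-node stack (B1)

    def is_paragraph(node):
        return (node.nodeType == node.ELEMENT_NODE
                and node.tagName.lower() in paragraph_tags)

    node = minidom.parseString(xml_text).documentElement
    while node is not None:
        if node.nodeType == node.TEXT_NODE:
            buffer.append(node.data)        # B3: append the text
        elif is_paragraph(node):
            buffer.append("\n")             # B2: first visit starts a paragraph
        if node.firstChild is not None:
            parent_stack.append(node)       # B4: remember the parent
            node = node.firstChild
            continue
        # B5/B6: no right sibling, so pop the parent and revisit it.
        while node is not None and node.nextSibling is None:
            node = parent_stack.pop() if parent_stack else None
            if node is not None and is_paragraph(node):
                buffer.append("\n")         # B2 applied on the revisit
        if node is not None:
            node = node.nextSibling         # continue with the right sibling
    return "".join(buffer)
```

Each paragraph tag contributes one carriage return when first visited and another when revisited, so adjacent paragraphs produce doubled carriage returns in this unrefined form.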
Preferably, when appending a carriage return in step B2, check the last character of the string buffer: if it is a carriage return, or if the string buffer currently contains no characters, do not append, so as to avoid appending too many carriage returns and spoiling the display format of the article.
Preferably, step B3 checks the last character of the string buffer: if it is a carriage return, or if the string buffer currently contains no characters, indentation is added at the beginning of the text, which improves the presentation of the formatted text.
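The two preferred refinements amount to small guards around the buffer appends. A minimal sketch, assuming the buffer is a Python list of non-empty strings and that indentation is a single tab; the helper names `append_break` and `append_text` are hypothetical.

```python
def append_break(buf):
    """B2 refinement: skip the carriage return if the buffer is empty
    or already ends with one."""
    if buf and not buf[-1].endswith("\n"):
        buf.append("\n")


def append_text(buf, text, indent="\t"):
    """B3 refinement: indent text that starts the buffer or follows
    a carriage return."""
    if not text:
        return  # keep empty strings out of the buffer
    if not buf or buf[-1].endswith("\n"):
        buf.append(indent)
    buf.append(text)
```

Calling `append_break` twice in a row leaves only one carriage return, which is what keeps consecutive paragraph tags from producing blank lines.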
Preferably, in step B, the characters inside <strong> and <b> tags are recorded as bold, which improves the presentation of the formatted text.
The webpage content output method provided by the invention outputs the text of a webpage in a format closer to that of a browser, giving the user a better reading experience.
Embodiment
The embodiments below describe the invention in detail. The method performs a pre-order traversal of the tree, uses a stack structure to record the currently visited nodes, and, after all descendants of a node have been traversed, revisits that node to decide whether to add a carriage return. For example, in the tree structure of fragment B, the parent node p of aaaaaa needs a carriage return appended after aaaaaa when it is revisited.
Fragment A
The corresponding tree structure is a div node whose three p child nodes contain the text nodes aaaaaa, bbbbbb, and cccccc.
First prepare a string buffer, StringBuffer.
The string buffer with the preferred (optimized) processing is denoted OptStringBuffer.
\n denotes the carriage return character and \t denotes indentation.
Parent-node stack ParentStack:
1 Visit div. div is neither a node in B2 nor a text node.
It has child nodes, so div is pushed onto the stack.
ParentStack:div
2 Visit p (first). p is a node in B2, so a carriage return is appended.
StringBuffer:\n
Preferably, the buffer currently has no characters, so no carriage return is appended.
OptStringBuffer:
p has child nodes, so p is pushed onto the stack.
ParentStack:div p
3 Visit text node aaaaaa.
StringBuffer:\naaaaaa
Preferably, the buffer currently has no characters, so indentation is added.
OptStringBuffer:\taaaaaa
The text node has no child nodes.
ParentStack:div p
4 Backtrack: the text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and revisited. ParentStack:div
p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\n
OptStringBuffer:\taaaaaa\n
5 p (first) has a right sibling, so traversal of the tree continues.
Visit p (second). p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\n\n
Preferably, the last character of the buffer is a carriage return, so no carriage return is appended.
OptStringBuffer:\taaaaaa\n
p has child nodes, so p is pushed onto the stack.
ParentStack:div p
6 Visit text node bbbbbb.
StringBuffer:\naaaaaa\n\nbbbbbb
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer:\taaaaaa\n\tbbbbbb
The text node has no child nodes.
ParentStack:div p
7 Backtrack: the text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and revisited. ParentStack:div
p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\n\nbbbbbb\n
OptStringBuffer:\taaaaaa\n\tbbbbbb\n
ParentStack:div
8 p (second) has a right sibling, so traversal of the tree continues.
Visit p (third). p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\n\nbbbbbb\n\n
Preferably, the last character of the buffer is a carriage return, so no carriage return is appended.
OptStringBuffer:\taaaaaa\n\tbbbbbb\n
p has child nodes, so p is pushed onto the stack.
ParentStack:div p
9 Visit text node cccccc.
StringBuffer:\naaaaaa\n\nbbbbbb\n\ncccccc
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer:\taaaaaa\n\tbbbbbb\n\tcccccc
The text node has no child nodes.
ParentStack:div p
10 Backtrack: the text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and revisited. ParentStack:div
p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\n\nbbbbbb\n\ncccccc\n
OptStringBuffer:\taaaaaa\n\tbbbbbb\n\tcccccc\n
ParentStack:div
11 p (third) was popped from ParentStack and has no right sibling,
so ParentStack is consulted; div is popped and revisited.
ParentStack:
div is neither a node in B2 nor a text node.
Final result
StringBuffer:\naaaaaa\n\nbbbbbb\n\ncccccc\n
OptStringBuffer:\taaaaaa\n\tbbbbbb\n\tcccccc\n
Fragment B
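The corresponding tree structure is a div node whose children are, in order, a p node containing aaaaaa, the text node bbbbbb, and a p node containing cccccc.
Note that here bbbbbb lacks the enclosing <p> layer it has in fragment A, but it is still an independent paragraph.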
1 Visit div. div is neither a node in B2 nor a text node.
It has child nodes, so div is pushed onto the stack.
ParentStack:div
2 Visit p (first). p is a node in B2, so a carriage return is appended.
StringBuffer:\n
Preferably, the buffer currently has no characters, so no carriage return is appended.
OptStringBuffer:
p has child nodes, so p is pushed onto the stack.
ParentStack:div p
3 Visit text node aaaaaa.
StringBuffer:\naaaaaa
Preferably, the buffer currently has no characters, so indentation is added.
OptStringBuffer:\taaaaaa
The text node has no child nodes.
ParentStack:div p
4 Backtrack: the text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and revisited. ParentStack:div
p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\n
OptStringBuffer:\taaaaaa\n
5 p (first) has a right sibling, so traversal of the tree continues.
Visit text node bbbbbb.
StringBuffer:\naaaaaa\nbbbbbb
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer:\taaaaaa\n\tbbbbbb
The text node has no child nodes.
ParentStack:div
7 Backtrack: the text node has no child nodes but has a right sibling,
so traversal of the tree continues.
Visit p (second). p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\nbbbbbb\n
Preferably, the last character of the buffer is not a carriage return, so a carriage return is appended.
OptStringBuffer:\taaaaaa\n\tbbbbbb\n
p has child nodes, so p is pushed onto the stack.
ParentStack:div p
8 Visit text node cccccc.
StringBuffer:\naaaaaa\nbbbbbb\ncccccc
Preferably, the last character of the buffer is a carriage return, so indentation is added.
OptStringBuffer:\taaaaaa\n\tbbbbbb\n\tcccccc
The text node has no child nodes.
ParentStack:div p
9 Backtrack: the text node has neither child nodes nor a right sibling,
so ParentStack is consulted.
p is popped and revisited. ParentStack:div
p is a node in B2, so a carriage return is appended.
StringBuffer:\naaaaaa\nbbbbbb\ncccccc\n
OptStringBuffer:\taaaaaa\n\tbbbbbb\n\tcccccc\n
ParentStack:div
10 p (second) was popped from ParentStack and has no right sibling,
so ParentStack is consulted; div is popped and revisited.
ParentStack:
div is neither a node in B2 nor a text node.
Final result
StringBuffer:\naaaaaa\nbbbbbb\ncccccc\n
OptStringBuffer:\taaaaaa\n\tbbbbbb\n\tcccccc\n
As can be seen, fragment A and fragment B differ slightly. In fragment A, each p node is exactly one paragraph. In fragment B, bbbbbb is not inside a p node, yet it is still output as its own paragraph, because the nodes immediately before and after it are each a paragraph.
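The two walkthroughs can be reproduced with a compact sketch. Several things here are assumptions: the markup of the two fragments is reconstructed from the traces (the original listings are not present in the text), only p is treated as a paragraph tag because the walkthroughs treat div as not belonging to B2, and recursion stands in for the explicit parent-node stack, with the return from a recursive call playing the role of popping and revisiting a node. The names `render`, `FRAGMENT_A`, and `FRAGMENT_B` are hypothetical.

```python
from xml.dom import minidom

# Reconstructed from the walkthroughs; the original listings are an assumption.
FRAGMENT_A = "<div><p>aaaaaa</p><p>bbbbbb</p><p>cccccc</p></div>"
FRAGMENT_B = "<div><p>aaaaaa</p>bbbbbb<p>cccccc</p></div>"


def render(xml_text, paragraph_tags=frozenset({"p"}), optimized=False):
    """Return the StringBuffer content, or OptStringBuffer if optimized=True."""
    buf = []

    def at_line_start():
        return not buf or buf[-1].endswith("\n")

    def add_break():
        # Optimized mode drops a carriage return that would start the
        # buffer or directly follow another carriage return.
        if not (optimized and at_line_start()):
            buf.append("\n")

    def visit(node):
        if node.nodeType == node.TEXT_NODE:
            if node.data:
                if optimized and at_line_start():
                    buf.append("\t")        # indent the start of a paragraph
                buf.append(node.data)
            return
        is_paragraph = node.nodeName.lower() in paragraph_tags
        if is_paragraph:
            add_break()                     # first visit
        for child in node.childNodes:
            visit(child)
        if is_paragraph and node.hasChildNodes():
            add_break()                     # revisit after all descendants

    visit(minidom.parseString(xml_text).documentElement)
    return "".join(buf)


if __name__ == "__main__":
    for name, fragment in (("A", FRAGMENT_A), ("B", FRAGMENT_B)):
        print(name, repr(render(fragment)), repr(render(fragment, optimized=True)))
```

Under these assumptions the unoptimized buffers differ between the two fragments, while the optimized buffers come out identical, which matches the final results listed for each walkthrough.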
Claims (10)
1. A method for formatted output of webpage content, characterized in that the method comprises the following steps:
A. Parse the HTML page and convert the HTML into standard XML format;
B. First reserve a string buffer (StringBuffer), generate a document tree structure (DOMTree), traverse all nodes of the tree, and append the text content to the string buffer; after the traversal is complete, the content of the string buffer is the final formatted text.
2. The method according to claim 1, characterized in that in step A the original document is analyzed, the tags required by the HTML standard and any missing closing tags are supplied, and non-standard tag attribute syntax is fixed, so that an HTML page that is not well-formed becomes well-formed.
3. The method according to claim 2, characterized in that <SCRIPT>, <STYLE>, <SELECT>, <INPUT>, and other tags irrelevant to text display are filtered out.
4. The method according to claim 1, characterized in that step B parses the well-formed HTML page into a document tree and traverses the tree to find its text nodes.
5. The method according to claim 4, characterized in that the tree traversal may use one stack structure, another stack is created to record the parent nodes of the currently visited node, hereinafter called the parent-node stack, and a string buffer is prepared to receive characters.
6. The method according to claim 5, characterized in that when <address>, <blockquote>, <div>, <dl>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <ol>, <table>, <tr>, <ul>, <p>, <br>, or a non-HTML tag is encountered, this indicates the beginning of a paragraph, and a carriage return is appended to the string buffer.
7. The method according to claim 6, characterized in that when a #TEXT node is encountered, its text is appended to the string buffer.
8. The method according to claim 7, characterized in that when the visit to a node ends and the node has child nodes, the node is pushed onto the parent-node stack.
9. The method according to claim 8, characterized in that when the visit to a node ends and the node has neither child nodes nor a right sibling, the parent-node stack is checked, the parent node is popped from it, and that node is revisited and processed according to B2.
10. The method according to claim 9, characterized in that when the visit to a node ends, and the node is one that was popped from the parent-node stack and revisited, and it has no right sibling, its parent node is popped, revisited, and processed according to B2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100091177A CN103116592A (en) | 2012-01-13 | 2012-01-13 | Formatting output method of webpage content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100091177A CN103116592A (en) | 2012-01-13 | 2012-01-13 | Formatting output method of webpage content |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103116592A (en) | 2013-05-22 |
Family
ID=48414969
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012100091177A (Pending), CN103116592A (en) | 2012-01-13 | 2012-01-13 | Formatting output method of webpage content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103116592A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320697A (en) * | 2014-08-01 | 2016-02-10 | 北京龙源创新信息技术有限公司 | Method for realizing magazine data storage standard |
RU2610585C2 (en) * | 2015-03-31 | 2017-02-13 | Общество С Ограниченной Ответственностью "Яндекс" | Method and system for modifying text in document |
CN107622106A (en) * | 2017-09-13 | 2018-01-23 | 五八有限公司 | Reminding method and device when a kind of page can not render |
CN110377884A (en) * | 2019-06-13 | 2019-10-25 | 北京百度网讯科技有限公司 | Document analytic method, device, computer equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101470728A (en) * | 2007-12-25 | 2009-07-01 | 北京大学 | Method and device for automatically abstracting text of Chinese news web page |
CN101727461A (en) * | 2008-10-13 | 2010-06-09 | 中国科学院计算技术研究所 | Method for extracting content of web page |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105320697A (en) * | 2014-08-01 | 2016-02-10 | 北京龙源创新信息技术有限公司 | Method for realizing magazine data storage standard |
RU2610585C2 (en) * | 2015-03-31 | 2017-02-13 | Общество С Ограниченной Ответственностью "Яндекс" | Method and system for modifying text in document |
US10762279B2 (en) | 2015-03-31 | 2020-09-01 | Yandex Europe Ag | Method and system for augmenting text in a document |
CN107622106A (en) * | 2017-09-13 | 2018-01-23 | 五八有限公司 | Reminding method and device when a kind of page can not render |
CN110377884A (en) * | 2019-06-13 | 2019-10-25 | 北京百度网讯科技有限公司 | Document analytic method, device, computer equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C12 | Rejection of a patent application after its publication | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20130522 |