CN103116592A

CN103116592A - Formatting output method of webpage content

Info

Publication number: CN103116592A
Application number: CN2012100091177A
Authority: CN
Inventors: 黄靖
Original assignee: KUNSHAN MAIKESITAI TECHNOLOGY Co Ltd
Current assignee: KUNSHAN MAIKESITAI TECHNOLOGY Co Ltd
Priority date: 2012-01-13
Filing date: 2012-01-13
Publication date: 2013-05-22

Abstract

The invention discloses a formatted output method for web page content. The method outputs the content of a specified region of a web page as formatted plain text, adding indentation, carriage returns and the like at the appropriate places so that the output text is displayed as closely as possible to how it appears in a browser. The method comprises the steps of parsing a hypertext markup language (HTML) page, converting the HTML into standard extensible markup language (XML) format, reserving a string buffer in advance, generating a DOM tree structure, traversing all nodes of the tree structure, and appending the text content to the string buffer; once all nodes have been traversed, the content of the string buffer is the final formatted text. With this method, the text of a web page is output in a form closer to its form in the browser, which gives the user a better reading experience.

Description

A formatted output method for web page content

Technical field

The present invention relates to a technique for exporting network information, and in particular to a formatted output method for Internet web page content.

Background technology

With the development and growing maturity of Internet information technology, network information has spread to countless households. Although this is an era in which the Internet seems all-powerful, users may still need to print web page content in order to read it. Electronic newspapers and electronic journals were created for the web, and their content is better suited to reading in a network environment; how can pages suitable for paper media be generated from them by printing? This has long been a problem.

Outputting the content of a specific region of a web page as formatted plain text, so that the user receives text that resembles its form in the browser, has become a problem that those skilled in the art need to solve.

Summary of the invention

In view of this, the main purpose of the present invention is to provide a formatted output method for web page content that completes the formatted output of web page content quickly and conveniently, improves the quality of the output file, and improves the user's reading experience.

To achieve the above purpose, the method provided by the invention finds the web page region and parses it into a tree structure, then, according to the (display) meaning of each tag, uses two stacks to determine at which points a carriage return needs to be added and at which paragraph beginnings indentation needs to be added. The technical scheme of the present invention is realized as follows:

A. Parse the HTML page and convert the HTML into standard XML format;

B. First reserve a string buffer (StringBuffer), generate a document tree structure (DOMTree) and traverse all nodes of this tree structure, appending the text content to the string buffer. After the traversal is complete, the content of the string buffer is the final formatted text.

Further, step A is refined as follows:

A1. Analyze the original document, supply the tags required by the HTML standard as well as any missing closing tags, and correct nonstandard tag attribute syntax, so that an HTML page that is not well-formed becomes well-formed;

A2. Filter out <SCRIPT>, <STYLE>, <SELECT>, <INPUT> and other markup irrelevant to text display.
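Steps A1 and A2 can be illustrated with a minimal sketch built on Python's standard `html.parser` module. This is not the patent's implementation: the names `Node`, `TreeBuilder`, `parse`, `SKIPPED` and `VOID` are hypothetical, and the repair logic only supplies missing closing tags when an enclosing element closes. A production HTML tidy tool handles many more cases, such as implicit closing of `<p>`.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "style", "select", "input"}      # step A2: irrelevant to text display
VOID = {"br", "hr", "img", "input", "meta", "link"}   # elements that never have an end tag

class Node:
    """Hypothetical minimal tree node; tag is "#text" for text nodes."""
    def __init__(self, tag, text=None):
        self.tag = tag
        self.text = text
        self.children = []

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.open = [self.root]    # elements that are currently open
        self.skip_depth = 0        # > 0 while inside a filtered element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)
        self.open[-1].children.append(node)
        if tag not in VOID:
            self.open.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            if tag not in VOID and self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth:
            return
        # Step A1 (partial): closing an outer element also closes any inner
        # elements whose end tags are missing. Unmatched end tags are ignored.
        for i in range(len(self.open) - 1, 0, -1):
            if self.open[i].tag == tag:
                del self.open[i:]
                break

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        self.open[-1].children.append(Node("#text", data.strip()))

def parse(html):
    """Return the root of a cleaned tree for the given HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `parse` on a fragment that contains a `<script>` block and an unclosed `<b>` yields a tree without the script and with the `<b>` element closed at its parent's end tag.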

Preferably, step A first finds the useful region of the page, i.e. the body text region of the web page.

Further, step B parses the well-formed HTML page into a document tree and traverses this tree to find the text nodes in it.

Further, step B is refined as follows:

B1. The tree structure can be traversed using a stack; create another stack that records the parent nodes of the currently visited node, hereinafter called the parent node stack, and prepare a string buffer ready to receive characters;

B2. When <address>, <blockquote>, <div>, <dl>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <ol>, <table>, <tr>, <ul>, <p>, <br> or a non-HTML tag is encountered, this indicates the beginning of a paragraph, and a carriage return needs to be appended to the string buffer;

B3. When a #TEXT node is encountered, append its text to the string buffer;

B4. When the visit to a node ends and the node has child nodes, push the node onto the parent node stack;

B5. When the visit to a node ends and the node has neither child nodes nor a right sibling node, check the parent node stack, pop the parent node from it, visit that node again, and process it according to B2;

B6. When the visit to a node ends, and the node is one that was popped from the parent node stack and visited again and has no right sibling node, pop its parent node, visit that node again, and process it according to B2.
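Steps B1 to B6 can be sketched as an iterative pre-order traversal with an explicit parent node stack. The code below is an illustration under stated assumptions, not the patent's implementation: `Node`, `format_text` and `BLOCK_TAGS` are hypothetical names, each stack entry also stores the index of the next child to visit (a detail the text does not specify), and the "non-HTML tag" case of B2 is not modelled.

```python
# Tags from step B2 that mark the beginning of a paragraph.
BLOCK_TAGS = {"address", "blockquote", "div", "dl", "h1", "h2", "h3", "h4",
              "h5", "h6", "ol", "table", "tr", "ul", "p", "br"}

class Node:
    """Hypothetical minimal DOM node; tag is "#text" for text nodes."""
    def __init__(self, tag, text=None, children=None):
        self.tag = tag
        self.text = text
        self.children = children or []

def format_text(root, block_tags=BLOCK_TAGS):
    buf = []            # B1: the string buffer
    parent_stack = []   # B1: parent node stack of (node, index of next child)
    node = root
    while node is not None:
        # First visit of the node.
        if node.tag == "#text":
            buf.append(node.text)             # B3
        elif node.tag in block_tags:
            buf.append("\n")                  # B2
        if node.children:
            parent_stack.append((node, 1))    # B4
            node = node.children[0]
            continue
        # Leaf node: move to the right sibling if there is one; otherwise
        # pop parents and visit them again (B5, B6).
        node = None
        while parent_stack:
            parent, i = parent_stack[-1]
            if i < len(parent.children):
                parent_stack[-1] = (parent, i + 1)
                node = parent.children[i]
                break
            parent_stack.pop()
            if parent.tag in block_tags:
                buf.append("\n")              # second visit, handled as in B2
    return "".join(buf)
```

The embodiment's trace does not append a carriage return for `div`, so reproducing its unoptimized StringBuffer requires passing a reduced tag set such as `{"p"}`; with the full B2 list an extra carriage return appears at the start and end.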

Preferably, when a carriage return is to be appended in step B2, check the last character in the string buffer; if it is a carriage return, or if the string buffer currently contains no characters, do not append another one, so as to avoid appending too many carriage returns and spoiling the display format of the article.

Preferably, step B3 checks the last character in the string buffer; if it is a carriage return, or if the string buffer currently contains no characters, indentation is added at the beginning of the text, which improves the presentation of the formatted text.

Preferably, in step B the characters inside <strong> and <b> tags are recorded as needing bold, which improves the presentation of the formatted text.

The web page content output method provided by the invention outputs the text of a web page in a format closer to that of a browser, giving the user a better reading experience.
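The two preferred checks on the last buffer character, for steps B2 and B3, fit in two small helpers. The sketch below assumes the buffer is a Python list of string pieces and that indentation is a single tab; `append_newline` and `append_text` are hypothetical names. The bold handling for `<strong>` and `<b>` is not covered.

```python
def append_newline(buf):
    """Preferred B2: skip the carriage return if the buffer is empty
    or already ends with one."""
    if buf and not buf[-1].endswith("\n"):
        buf.append("\n")

def append_text(buf, text):
    """Preferred B3: indent the text if it starts the buffer
    or follows a carriage return."""
    if not buf or buf[-1].endswith("\n"):
        buf.append("\t")
    buf.append(text)

# Event sequence for two consecutive <p> elements.
buf = []
append_newline(buf)          # first visit of p (first): buffer empty, nothing added
append_text(buf, "aaaaaa")   # indented because the buffer was empty
append_newline(buf)          # second visit of p (first)
append_newline(buf)          # first visit of p (second): suppressed duplicate
append_text(buf, "bbbbbb")   # indented because it follows a carriage return
append_newline(buf)          # second visit of p (second)
result = "".join(buf)
```

Keeping the checks inside the append helpers means the traversal itself does not need to know whether the optimized or the plain buffer is being filled.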

Embodiment

The embodiment below describes the present invention in detail. The method of the present invention traverses the tree in pre-order, uses a stack to record the currently visited nodes, and, after all descendants of a node have been traversed, visits that node again to decide whether to add a carriage return. For example, the parent node p of the text aaaaaa in the tree structure of fragment B needs a carriage return added after aaaaaa when it is visited again.

Fragment A

The corresponding tree structure is:

First prepare a string buffer, StringBuffer.

The string buffer with the preferred (optimized) handling is denoted OptStringBuffer.

We use \n to denote the carriage return character and \t to denote indentation.

Parent node stack ParentStack:

1 Visit div. div is neither a node in B2 nor a text node.

It has child nodes, so div is pushed onto the stack.

ParentStack: div

2 Visit p (first). p is a node in B2, so append a carriage return.

StringBuffer: \n

Preferably, the buffer currently contains no characters, so no carriage return is appended.

OptStringBuffer:

p has child nodes, so p is pushed onto the stack.

ParentStack: div p

3 Visit text node aaaaaa.

StringBuffer: \naaaaaa

Preferably, the buffer currently contains no characters, so indentation is added.

OptStringBuffer: \taaaaaa

It has no child nodes.

ParentStack: div p

4 The text node has neither child nodes nor a right sibling node,

so access ParentStack,

pop p, and visit it again. ParentStack: div

p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\n

OptStringBuffer: \taaaaaa\n

5 p (first) has a right sibling node, so continue traversing the tree.

Visit p (second). p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\n\n

Preferably, the last character of the buffer is a carriage return, so no carriage return is appended.

OptStringBuffer: \taaaaaa\n

p has child nodes, so p is pushed onto the stack.

ParentStack: div p

6 Visit text node bbbbbb.

StringBuffer: \naaaaaa\n\nbbbbbb

Preferably, the last character of the buffer is a carriage return, so indentation is added.

OptStringBuffer: \taaaaaa\n\tbbbbbb

It has no child nodes.

ParentStack: div p

7 The text node has neither child nodes nor a right sibling node,

so access ParentStack,

pop p, and visit it again. ParentStack: div

p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\n\nbbbbbb\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n

ParentStack: div

8 p (second) has a right sibling node, so continue traversing the tree.

Visit p (third). p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\n\nbbbbbb\n\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n

p has child nodes, so p is pushed onto the stack.

ParentStack: div p

9 Visit text node cccccc.

StringBuffer: \naaaaaa\n\nbbbbbb\n\ncccccc

OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc

It has no child nodes.

ParentStack: div p

10 The text node has neither child nodes nor a right sibling node,

so access ParentStack,

pop p, and visit it again. ParentStack: div

p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\n\nbbbbbb\n\ncccccc\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n

ParentStack: div

11 p (third) was taken from ParentStack and has no right sibling node,

so access ParentStack, pop div, and visit it again.

ParentStack:

div is neither a node in B2 nor a text node.

Final result

StringBuffer: \naaaaaa\n\nbbbbbb\n\ncccccc\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n

Fragment B

The corresponding tree structure is:

Note that here bbbbbb has one fewer layer of <p> than in fragment A, but it is itself still an independent paragraph.

1 Visit div. div is neither a node in B2 nor a text node.

It has child nodes, so div is pushed onto the stack.

ParentStack: div

2 Visit p (first). p is a node in B2, so append a carriage return.

StringBuffer: \n

OptStringBuffer:

p has child nodes, so p is pushed onto the stack.

ParentStack: div p

3 Visit text node aaaaaa.

StringBuffer: \naaaaaa

Preferably, the buffer currently contains no characters, so indentation is added.

OptStringBuffer: \taaaaaa

It has no child nodes.

ParentStack: div p

4 The text node has neither child nodes nor a right sibling node,

so access ParentStack,

pop p, and visit it again. ParentStack: div

p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\n

OptStringBuffer: \taaaaaa\n

5 p (first) has a right sibling node, so continue traversing the tree.

6 Visit text node bbbbbb.

StringBuffer: \naaaaaa\nbbbbbb

OptStringBuffer: \taaaaaa\n\tbbbbbb

It has no child nodes.

ParentStack: div

7 The text node has no child nodes but has a right sibling node,

so continue traversing the tree.

Visit p (second). p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\nbbbbbb\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n

p has child nodes, so p is pushed onto the stack.

ParentStack: div p

8 Visit text node cccccc.

StringBuffer: \naaaaaa\nbbbbbb\ncccccc

OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc

It has no child nodes.

ParentStack: div p

9 The text node has neither child nodes nor a right sibling node,

so access ParentStack,

pop p, and visit it again. ParentStack: div

p is a node in B2, so append a carriage return.

StringBuffer: \naaaaaa\nbbbbbb\ncccccc\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n

ParentStack: div

10 p (second) was taken from ParentStack and has no right sibling node,

so access ParentStack, pop div, and visit it again.

ParentStack:

div is neither a node in B2 nor a text node.

Final result

StringBuffer: \naaaaaa\nbbbbbb\ncccccc\n

OptStringBuffer: \taaaaaa\n\tbbbbbb\n\tcccccc\n

It can be seen that fragment A and fragment B differ slightly. In fragment A, each p node is exactly one paragraph. In fragment B, bbbbbb is not inside a p node, yet it still forms a paragraph of its own, because the content immediately before it and the content immediately after it are each a paragraph.
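The comparison of the two fragments can be checked end to end with a short script. It is a sketch under several assumptions: the markup of fragments A and B is reconstructed from the traces above, since the original figures are not reproduced in this text; it uses Python's `xml.etree.ElementTree` and plain recursion in place of the explicit parent node stack; and the names `to_text`, `FRAGMENT_A` and `FRAGMENT_B` are hypothetical. ElementTree stores text in the `text` and `tail` attributes rather than in separate text nodes, so the walk reads both.

```python
import xml.etree.ElementTree as ET

# Paragraph-starting tags from step B2 (non-HTML tags are not modelled).
BLOCK_TAGS = {"address", "blockquote", "div", "dl", "h1", "h2", "h3", "h4",
              "h5", "h6", "ol", "table", "tr", "ul", "p", "br"}

# Assumed markup, reconstructed from the step-by-step traces.
FRAGMENT_A = "<div><p>aaaaaa</p><p>bbbbbb</p><p>cccccc</p></div>"
FRAGMENT_B = "<div><p>aaaaaa</p>bbbbbb<p>cccccc</p></div>"

def _newline(buf):
    # Preferred B2: no carriage return at the start or after another one.
    if buf and buf[-1] != "\n":
        buf.append("\n")

def _text(buf, text):
    # Preferred B3: indent text that starts the buffer or follows a carriage return.
    text = text.strip() if text else ""
    if not text:
        return
    if not buf or buf[-1] == "\n":
        buf.append("\t")
    buf.append(text)

def _walk(elem, buf):
    is_block = elem.tag in BLOCK_TAGS
    if is_block:
        _newline(buf)              # first visit
    _text(buf, elem.text)          # text before the first child
    for child in elem:
        _walk(child, buf)
        _text(buf, child.tail)     # text between this child and the next
    if is_block:
        _newline(buf)              # second visit, after all descendants

def to_text(xml_source):
    buf = []
    _walk(ET.fromstring(xml_source), buf)
    return "".join(buf)

print(repr(to_text(FRAGMENT_A)))
print(repr(to_text(FRAGMENT_B)))
```

Both calls produce the OptStringBuffer value shown in the final results. Because the preferred checks suppress duplicate carriage returns, including `div` in the tag set does not change the optimized output.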

Claims

1. A formatted output method for web page content, characterized in that the method comprises the following steps:

A. Parse the HTML page and convert the HTML into standard XML format;

B. First reserve a string buffer (StringBuffer), generate a document tree structure (DOMTree) and traverse all nodes of this tree structure, appending the text content to the string buffer; after the traversal is complete, the content of the string buffer is the final formatted text.

2. The method according to claim 1, characterized in that in step A the original document is analyzed, the tags required by the HTML standard and any missing closing tags are supplied, and nonstandard tag attribute syntax is corrected, so that an HTML page that is not well-formed becomes well-formed.

3. The method according to claim 2, characterized in that <SCRIPT>, <STYLE>, <SELECT>, <INPUT> and other markup irrelevant to text display are filtered out.

4. The method according to claim 1, characterized in that in step B the well-formed HTML page is parsed into a document tree, and this tree is traversed to find the text nodes in it.

5. The method according to claim 4, characterized in that the tree structure can be traversed using a stack, another stack is created to record the parent nodes of the currently visited node, hereinafter called the parent node stack, and a string buffer is prepared ready to receive characters.

6. The method according to claim 5, characterized in that when <address>, <blockquote>, <div>, <dl>, <h1>, <h2>, <h3>, <h4>, <h5>, <h6>, <ol>, <table>, <tr>, <ul>, <p>, <br> or a non-HTML tag is encountered, this indicates the beginning of a paragraph, and a carriage return needs to be appended to the string buffer.

7. The method according to claim 6, characterized in that when a #TEXT node is encountered, its text is appended to the string buffer.

8. The method according to claim 7, characterized in that when the visit to a node ends and the node has child nodes, the node is pushed onto the parent node stack.

9. The method according to claim 8, characterized in that when the visit to a node ends and the node has neither child nodes nor a right sibling node, the parent node stack is checked, the parent node is popped from it, that node is visited again, and it is processed according to B2.

10. The method according to claim 9, characterized in that when the visit to a node ends, and the node is one that was popped from the parent node stack and visited again and has no right sibling node, its parent node is popped, that node is visited again, and it is processed according to B2.