CN111475679A

CN111475679A - HTML document processing method, page display method and device

Info

Publication number: CN111475679A
Application number: CN201910069208.1A
Authority: CN
Inventors: 许阳寅
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-01-24
Filing date: 2019-01-24
Publication date: 2020-07-31
Anticipated expiration: 2039-01-24
Also published as: CN111475679B

Abstract

The HTML document processing method includes: obtaining a text stream containing only text by separating the tags contained in an HTML document; constructing an index tree by parsing the tags and the text in the HTML document, the index tree including one or more nodes, each node including data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment; obtaining a style set, which is the set of styles corresponding to the nodes in the index tree; and storing the text stream, the index tree, and the style set in association.

Description

HTML document processing method, page display method and device

Technical Field

The invention relates to an HTML document processing method, a page display method, and a device.

Background

HyperText Markup Language (HTML for short) is a markup language designed for pages and other information that can be viewed in a web browser or reader. Code written in accordance with HTML syntax is an HTML document. The structure of an HTML document includes a "head" portion (Head) that provides information about the page and a "body" portion (Body) that provides the specific content of the page.

A web browser or reader converts an HTML document into a page by loading and parsing the document. DOM is an acronym for Document Object Model. Existing browsers (including browsers used on mobile devices) and readers parse the tags and text in an HTML document into a DOM tree, where each node of the tree is either an HTML tag or text associated with an HTML tag.

However, because the prior art merges all tags and text in the HTML document into one DOM tree, the tags and the text are not independent of each other. Once a style of the page, such as the font or line spacing, changes, the HTML document must be re-parsed to generate a new DOM tree, resulting in significant processing overhead.

Disclosure of Invention

In view of the above, it would be desirable to provide a new HTML document processing method, page display method, and device capable of parsing HTML documents in a more flexible manner and structure, thereby reducing processing overhead.

According to one aspect of the invention, an HTML document processing method is provided, including: obtaining a text stream containing only text by separating the tags contained in an HTML document; constructing an index tree by parsing the tags and the text in the HTML document, wherein the index tree includes one or more nodes, and each node includes data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment; obtaining a style set, which is the set of styles of the text segments corresponding to the nodes in the index tree; and storing the text stream, the index tree, and the style set in association.

In addition, in the method according to an embodiment of the present invention, at least the text stream is stored in a non-volatile storage unit outside the memory.

In addition, in the method according to an embodiment of the present invention, the step of storing the text stream, the index tree, and the style set in association further includes: taking the data included in each node of the index tree as a unit, and sorting the units to form an index array; and storing the index array in a non-volatile storage unit outside the memory.

In addition, in the method according to an embodiment of the present invention, the step of sorting the data included in each node to form an index array further includes: sorting in ascending order of the left end point of the interval indicated by the data included in each node to obtain a first array; and, for two units in the first array having the same left end point, further sorting the two units in descending order of the right end point of the interval to obtain the index array.

In addition, in the method according to an embodiment of the present invention, the step of storing the index array in a non-volatile storage unit outside the memory further includes: performing compression on the index array; and storing the compressed index array in a nonvolatile storage unit outside the memory.

In addition, in the method according to an embodiment of the present invention, the step of storing the text stream, the index tree, and the style set in association further includes: serializing the style set into a style array of a specific format that the non-volatile storage unit can store; and storing the style array in the non-volatile storage unit.

According to another aspect of the present invention, there is provided a page display method including: in response to an instruction to display a page, performing layout processing of the page based on an index tree, a style set, and a text stream corresponding to the page; and displaying the page after the layout processing. The index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page. The text stream is obtained by separating the tags contained in the HTML document and contains only text. The index tree includes one or more nodes, and each node includes data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment. The style set is the set of styles of the text segments corresponding to the nodes in the index tree. The text stream, the index tree, and the style set are stored in association.

In addition, in the method according to the embodiment of the invention, at least the text stream is stored in a non-volatile storage unit outside the memory, and the step of performing layout processing of the page based on the index tree, the style set, and the text stream corresponding to the page further includes: retrieving the text stream corresponding to the HTML document in the non-volatile storage unit and loading only a part of the text stream into the memory; and performing layout processing of the page based on the index tree, the style set, and that part of the text stream.

In addition, in the method according to an embodiment of the present invention, the index tree is converted into an index array and stored in a non-volatile storage unit outside the memory, and the style set is serialized into a style array and stored in the non-volatile storage unit outside the memory, and the method further includes: loading the index array corresponding to the page into the memory, and restoring the index tree corresponding to the index array based on the index array; and loading the style array corresponding to the page into the memory, and deserializing it into the style set.

In addition, in the method according to an embodiment of the present invention, the step of restoring the index tree corresponding to the index array based on the index array further includes: constructing in memory an index tree with [0, ∞, 0) as the root node, and setting a pointer P to point to the current node of the index tree; reading numbers from the index array sequentially, three at a time as one unit, and setting a pointer T to point to the first unit, where the pointer T represents the node currently to be placed; determining whether the following condition is met: the left end point of the text segment interval of the node pointed to by T is greater than or equal to the left end point of the text segment interval of the node pointed to by P, and the right end point of the text segment interval of the node pointed to by T is less than or equal to the right end point of the text segment interval of the node pointed to by P; if so, inserting the node pointed to by T into the index tree as a child node of the current node pointed to by P and, if the current node already has child nodes, determining whether the new node is placed to the left or to the right of the existing child nodes by comparing their text segment intervals; if not, pointing P to the parent node of the current node and repeating the determination until the node pointed to by T meets the condition; and then moving P to the node pointed to by T and pointing T to the next unit in the index array, until the index array is exhausted.
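The restoration procedure described above can be illustrated with a minimal Python sketch. The `Node` class and the `restore_index_tree` function are hypothetical names introduced only for illustration; the sketch assumes the index array holds non-negative integers in groups of three and was sorted by ascending left end point, then descending right end point.

```python
import math


class Node:
    """One index-tree node: half-open interval [start, end) plus a style number."""

    def __init__(self, start, end, style):
        self.start = start
        self.end = end
        self.style = style
        self.parent = None
        self.children = []


def restore_index_tree(index_array):
    """Rebuild the tree from a flat array of (start, end, style) units."""
    if len(index_array) % 3 != 0:
        raise ValueError("index array length must be a multiple of 3")

    root = Node(0, math.inf, 0)   # the [0, inf, 0) root node
    p = root                      # pointer P: current node
    for i in range(0, len(index_array), 3):
        start, end, style = index_array[i:i + 3]
        t = Node(start, end, style)   # pointer T: node to be placed

        # Climb towards the root until P's interval contains T's interval.
        while not (p.start <= t.start and t.end <= p.end):
            p = p.parent

        # Insert T among P's children, keeping them ordered by start point.
        pos = len(p.children)
        while pos > 0 and p.children[pos - 1].start > t.start:
            pos -= 1
        p.children.insert(pos, t)
        t.parent = p

        p = t   # P moves to the node just placed
    return root


root = restore_index_tree([0, 12, 0, 0, 6, 1, 6, 11, 2])
top = root.children[0]
```

For the example array, `top` is the node [0, 12, 0) and its children are [0, 6, 1) and [6, 11, 2), in that order.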

In addition, the method according to an embodiment of the present invention further includes: modifying the style set in response to an instruction to change a style of the page; performing a layout process of a page based on the modified style set, the index tree, and the text stream; and displaying the page after the typesetting processing.

According to another aspect of the present invention, there is provided an HTML document processing device including: a text stream obtaining means for obtaining a text stream containing only text by separating the tags contained in an HTML document; an index tree constructing means for constructing an index tree by parsing the tags and the text in the HTML document, the index tree including one or more nodes, each node including data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment; a style set obtaining means for obtaining a style set, which is the set of styles of the text segments corresponding to the nodes in the index tree; and a storage means for storing the text stream, the index tree, and the style set in association.

In addition, in the apparatus according to an embodiment of the present invention, the storage device includes a memory and a nonvolatile storage unit, and at least the text stream is stored in the nonvolatile storage unit outside the memory.

In addition, the apparatus according to an embodiment of the present invention further includes an index tree conversion device for taking the data included in each node of the index tree as a unit, sorting the units to form an index array, and storing the index array in a non-volatile storage unit outside the memory.

In addition, the apparatus according to an embodiment of the present invention further includes a compression device for compressing the index array, wherein the compressed index array is stored in a non-volatile storage unit outside the memory.

In addition, the apparatus according to an embodiment of the present invention further includes a style set conversion device for serializing the style set into a style array of a specific format that can be stored in the non-volatile storage unit, wherein the style array is stored in the non-volatile storage unit.

According to another aspect of the present invention, there is provided a page display apparatus including: a storage device, including a memory and a non-volatile storage unit, for storing a text stream, an index tree, and a style set obtained by preprocessing an HTML document corresponding to a page, wherein the text stream is obtained by separating the tags included in the HTML document and contains only text, the index tree includes one or more nodes, each node includes data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment, the tags are used to mark the styles of their corresponding text segments, and the style set is the set of styles of the text segments corresponding to the nodes in the index tree; a layout processing device for performing layout processing of the page based on the index tree, the style set, and the text stream corresponding to the page in response to an instruction to display the page; and a display device for displaying the page processed by the layout processing device.

In addition, in the apparatus according to an embodiment of the present invention, at least the text stream is stored in a non-volatile storage unit outside the memory, and the apparatus further includes a text stream loading device for retrieving the text stream corresponding to the HTML document in the non-volatile storage unit in response to an instruction to display a page and loading only a part of the text stream into the memory, wherein the layout processing device is further configured to perform layout processing of the page based on the index tree, the style set, and that part of the text stream.

In addition, in the apparatus according to an embodiment of the present invention, the index tree is converted into an index array and stored in a non-volatile storage unit outside the memory, and the style set is serialized into a style array and stored in the non-volatile storage unit outside the memory, and the apparatus further includes: an index tree recovery device for loading the index array corresponding to the page into the memory and restoring the index tree corresponding to the index array based on the index array; and a style set recovery device for loading the style array corresponding to the page into the memory and deserializing it into the style set.

In addition, the apparatus according to an embodiment of the present invention further includes: a style set modification means for modifying the style set in response to an instruction to change a style of the page, wherein the layout processing means is further configured to perform layout processing of the page based on the modified style set, the index tree, and the text stream.

According to another aspect of the present invention, there is provided a computer-readable recording medium having stored thereon a computer program which, when executed by a processor, implements the steps of: obtaining a text stream containing only text by separating the tags contained in an HTML document; constructing an index tree by parsing the tags and the text in the HTML document, the index tree including one or more nodes, each node including data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment; obtaining a style set, which is the set of styles of the text segments corresponding to the nodes in the index tree; and storing the text stream, the index tree, and the style set in association.

Also, according to another aspect of the present invention, there is provided a computer-readable recording medium having stored thereon a computer program which, when executed by a processor, implements the steps of: performing layout processing of a page based on an index tree, a style set, and a text stream corresponding to the page; and displaying the layout-processed page. The index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page. The text stream is obtained by separating the tags contained in the HTML document and contains only text. The index tree includes one or more nodes, and each node includes data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment. The style set is the set of styles of the text segments corresponding to the nodes in the index tree.

In the HTML document processing method, the page display method, the devices, and the medium according to the embodiments of the present invention, separating the text from the tags splits the single DOM tree of the prior art into three data structures that are stored separately, so that changes to the HTML document can be handled more flexibly. For example, when parameters such as font, line spacing, or word spacing are adjusted, the HTML document does not need to be re-parsed; the style is adjusted simply by modifying the style set on the basis of the existing index tree and text stream, which reduces system overhead and achieves efficient layout processing.

Drawings

FIG. 1 is a schematic diagram illustrating an application environment for an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a specific process of an HTML document processing method according to an embodiment of the invention;

FIG. 3 is a diagram illustrating an index tree constructed by an HTML document processing method according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating, as a comparative example, a DOM tree constructed in accordance with the HTML document processing method of the prior art;

FIG. 5 illustrates a signal flow diagram corresponding to the HTML document processing method in accordance with the present invention;

FIG. 6 shows a schematic diagram of a reader to which the present invention may be applied;

FIG. 7 shows a schematic diagram of a browser to which the present invention may be applied;

FIG. 8 is a flowchart illustrating a specific process of a page display method according to an embodiment of the present invention;

FIGS. 9(A)-9(D) are diagrams illustrating one possible implementation of restoring an index tree based on an index array;

FIG. 10 is a functional block diagram illustrating a configuration of an HTML document processing device according to an embodiment of the present invention;

FIG. 11 is a functional block diagram illustrating a configuration of a page display apparatus according to an embodiment of the present invention;

FIG. 12 illustrates an HTML document processing device as one example of a hardware entity in accordance with the present invention;

FIG. 13 shows a page display device according to the present invention as an example of a hardware entity; and

FIG. 14 illustrates a schematic diagram of a computer-readable recording medium according to an embodiment of the present invention.

Detailed Description

Various preferred embodiments of the present invention will be described below with reference to the accompanying drawings. The following description with reference to the accompanying drawings is provided to assist in understanding the exemplary embodiments of the invention as defined by the claims and their equivalents. It includes various specific details to assist understanding, but they are to be construed as merely illustrative. Accordingly, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. Also, in order to make the description clearer and simpler, a detailed description of functions and configurations well known in the art will be omitted.

As shown in FIG. 1, a server 10 is connected to a plurality of terminal devices 20 via a network 30. The terminal devices 20 may be terminals used by users who wish to view pages. For example, a terminal device 20 may include the HTML document processing device described below and may also include the page display device described below. The terminal may be a smart terminal, such as a smartphone, PDA (Personal Digital Assistant), desktop computer, notebook computer, or tablet computer, or another type of terminal. The server 10 is the server corresponding to the web address of a page to be displayed by a browser, or the server corresponding to the content of a page to be displayed by a reader.

Next, various embodiments of the present invention will be described.

First, an HTML document processing method according to an embodiment of the present invention will be described with reference to FIG. 2. The HTML document processing method is a parsing preprocessing step performed before page display. As shown in FIG. 2, the HTML document processing method includes the following steps.

First, in step S201, a text stream containing only text is obtained by separating the tags contained in the HTML document.

For example, given an HTML document fragment:

<p><i>WeRead</i><b>Rocks</b>!</p>

where p, i, and b are tags representing the styles of the corresponding text. For example, the style of text may include, without limitation, font, color, line spacing, and the like.

Then, through the processing of step S201, specifically by removing the tags p, i, and b from the HTML document fragment and retaining the text corresponding to each tag (i.e., the text contained within the tag), the text stream "WeReadRocks!" is obtained after the tags (p, i, b) are separated.
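Step S201 can be sketched with Python's standard `html.parser` module. This is only an illustration, not the patented implementation: the class name `TextStreamExtractor` and the function `extract_text_stream` are hypothetical, and the fragment markup is an assumption reconstructed from the tags p, i, b and the text "WeReadRocks!".

```python
from html.parser import HTMLParser


class TextStreamExtractor(HTMLParser):
    """Keeps character data only; every tag is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)


def extract_text_stream(html):
    """Return the text of an HTML fragment with all tags separated out."""
    parser = TextStreamExtractor()
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.parts)


# Assumed markup for the example fragment.
fragment = "<p><i>WeRead</i><b>Rocks</b>!</p>"
text_stream = extract_text_stream(fragment)
```

Here `text_stream` holds the 12 characters "WeReadRocks!", so the character offsets 0 to 12 can be used by the index tree to address text segments.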

Then, in step S202, an index tree is constructed by parsing the tags and the text in the HTML document, wherein the index tree includes one or more nodes, each node includes data indicating the text segment in the text that corresponds to the node and data indicating the style of that text segment, and each tag is used to mark the style of its corresponding text segment.

For example, still taking the HTML document fragment above, the index tree may be constructed as follows. First, for the tag p, the corresponding text segment is the entire text, and the start point and end point of that segment are represented by two numbers, in this example [0, 12). In addition, a number (e.g., "0") is needed to represent the style of the text segment corresponding to the tag p. By parsing the tag p and its text, one node [0, 12, 0) of the index tree is obtained. The same process is then repeated for the tag i and the tag b, yielding two further nodes, [0, 6, 1) and [6, 11, 2). Because the text segment interval of the node for the tag p is the largest, that node serves as the root node, and the other two nodes become its left and right child nodes, as shown in FIG. 3. Each node of the index tree is thus represented by three numbers: the first two indicate the start point and end point of the corresponding text segment, and the third indicates its style.

In addition, still taking the HTML document fragment above as an example, FIG. 4 shows, as a comparative example, a DOM tree constructed according to a prior-art HTML document processing method. As FIG. 4 shows, the prior art merges all tags and text in the HTML document into one DOM tree. In contrast, referring back to FIG. 3, the index tree constructed according to the HTML document processing method of the present invention contains only data that associates text segments with styles; it contains neither the text itself nor the specific content of any style.

Next, in step S203, a style set is obtained, which is a set of styles corresponding to each node in the index tree.

For example, the style set may be obtained based on the CSS (Cascading Style Sheets) associated with the HTML document. For the fragment above, the CSS may be, for example:

p { color: blue; } (the style represented by the p tag is a blue font);

i { font-style: italic; color: black; } (the style represented by the i tag is italic with a black font);

b { font-weight: bold; color: red; } (the style represented by the b tag is bold with a red font).

Here, the tags p, i, and b can also be regarded as selectors, which select the style used to display the corresponding text segment, while { color: blue; }, { font-style: italic; color: black; }, and { font-weight: bold; color: red; } are style blocks, which describe the specific content of each style.

Thus, the CSS itself is an array of (selector, style block) pairs. When the index tree is constructed, each selector is matched against the tree nodes to determine whether its style block belongs to a node. Therefore, the selector information need not be retained when saving; only the style blocks are retained, and each node of the index tree records the position of its style block (for example, the third number, 0, 1, or 2, in the nodes of FIG. 3).

Based on the index tree constructed in step S202, such as the index tree shown in FIG. 3, and based on the CSS associated with the HTML document, the following style set may be obtained:

[{color: blue;}, {font-style: italic; color: black;}, {font-weight: bold; color: red;}].

It can be seen that, in the case of the index tree shown in FIG. 3, an array containing three style blocks is obtained as the style set. The 0th element of the array, {color: blue;}, is the specific content of the style indicated by 0 in the index tree; the 1st element, {font-style: italic; color: black;}, is the specific content of the style indicated by 1; and the 2nd element, {font-weight: bold; color: red;}, is the specific content of the style indicated by 2.
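The relationship between the three data structures can be sketched in Python as follows. The representation of style blocks as dictionaries and the helper `styled_segments` are assumptions made for illustration. The sketch also shows why a style change needs no re-parsing: only the style set is modified, while the text stream and index nodes stay untouched.

```python
text_stream = "WeReadRocks!"
index_nodes = [(0, 12, 0), (0, 6, 1), (6, 11, 2)]   # (start, end, style number)
style_set = [
    {"color": "blue"},
    {"font-style": "italic", "color": "black"},
    {"font-weight": "bold", "color": "red"},
]


def styled_segments(text, nodes, styles):
    """Pair each node's text segment with a copy of its style block."""
    return [(text[start:end], dict(styles[style])) for start, end, style in nodes]


segments = styled_segments(text_stream, index_nodes, style_set)

# A style change touches the style set only; nothing is parsed again.
style_set[2]["color"] = "green"
restyled = styled_segments(text_stream, index_nodes, style_set)
```

After the change, the segment "Rocks" carries the new color in `restyled`, while `text_stream` and `index_nodes` are exactly the objects used before.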

Finally, in step S204, the text stream, the index tree, and the style set are stored in association.

It should be noted that although the steps of acquiring the text stream, the index tree, and the style set are shown in chronological order in fig. 2, the present invention is not limited thereto, and the present invention is not intended to specifically limit the chronological order of the steps of acquiring the text stream, the index tree, and the style set. For example, the index tree and the style set may be obtained first, and then the text stream may be obtained, in a different order than shown in fig. 2. Furthermore, the above steps may be performed in parallel, in addition to being performed sequentially in chronological order.

In the HTML document processing method according to the embodiment of the invention, the HTML document and the associated CSS can be obtained from a ZIP-format file obtained from a server. By separating the text from the tags, the single DOM tree of the prior art is split into three data structures that are stored independently, so that changes to the HTML document, such as style adjustments, can be handled more flexibly.

Here, it should be noted that, in the HTML document processing method according to the embodiment of the present invention, the parsed text stream, index tree, and style set are stored separately as three data structures, but the storage locations of the text stream, index tree, and style set are not particularly limited.

However, as an alternative implementation, in the HTML document processing method according to the embodiment of the present invention, the storage locations of the text stream, the index tree, and the style set may also be specifically defined.

In particular, the present invention is applicable to readers, web browsers, and the like. Of course, the invention is not limited thereto. Besides browsers and readers, the invention can be similarly applied to any other application scenario that involves page browsing, for example, news and social network applications that provide a page browsing function.

Fig. 6 shows a schematic diagram of a reader to which the present invention can be applied. In fig. 6, three pages are shown, respectively. Where page 601 is a start page when the user starts the reader application, page 602 is a cover and a brief introduction of a certain book selected by the user, and page 603 is a page displayed when the user reads the book.

In the scenario in which the present invention is applied to a reader (as shown in FIG. 6), when a user wishes to read a book, the terminal device shown in FIG. 1 sends a request to the server corresponding to the reader and receives from the server an HTML document corresponding to the book. The terminal device first stores the received HTML document in a non-volatile storage unit and then loads it into memory for parsing. Because the pages of a book are fixed, the text stream, index tree, and style set corresponding to all chapters (as described later, one HTML document corresponds to one chapter, i.e., multiple pages, of a book) can be parsed in advance and stored in a non-volatile storage unit outside the memory. When a page is to be displayed, the corresponding text stream, index tree, and style set are loaded into memory for layout processing.

For memory, information can be represented using complex types, but for non-volatile storage (e.g., a magnetic disk), information can be represented using only a single type; for example, only a byte stream can be stored. For example, a group of arrays that can be described in memory as [(1, 2, 3), (4, 5, 6)] can be described in a non-volatile storage unit only as [1, 2, 3, 4, 5, 6]. Therefore, when the index tree constructed in step S202 is stored in the non-volatile storage unit, the data included in each node of the index tree needs to be sorted to obtain an array that the non-volatile storage unit can store.
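The difference between the structured in-memory form and the flat stored form can be sketched as follows. The function names `flatten` and `unflatten` are hypothetical; the fixed unit width of three matches the index-tree nodes.

```python
def flatten(units):
    """Turn [(1, 2, 3), (4, 5, 6)] into [1, 2, 3, 4, 5, 6]."""
    return [number for unit in units for number in unit]


def unflatten(flat, width=3):
    """Regroup a flat list into units of `width` numbers each."""
    if len(flat) % width != 0:
        raise ValueError("flat array length must be a multiple of the unit width")
    return [tuple(flat[i:i + width]) for i in range(0, len(flat), width)]


flat = flatten([(1, 2, 3), (4, 5, 6)])
units = unflatten(flat)
```

Because every unit has the same width, no separators need to be stored: the grouping can be recovered from position alone.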

Specifically, the step S204 of storing the text stream, the index tree, and the style set in association may further include: taking the data included in each node of the index tree as a unit, and sorting the units to form an index array; and storing the index array in a non-volatile storage unit outside the memory.

The index tree shown in FIG. 3 is still used as an example for explanation. The three nodes are [0, 12, 0), [0, 6, 1), and [6, 11, 2). The data included in each node is taken as one unit (for example, 0, 12, 0 is one unit), and the data of the units is extracted and sorted. Note that the numbers within a unit cannot be reordered; each unit takes part in the arrangement as a whole.

As one possible implementation, the units may be placed in arbitrary order. For example, the three units [0, 12, 0), [0, 6, 1), [6, 11, 2) may be arranged in arbitrary order to form the index array [0, 6, 1, 0, 12, 0, 6, 11, 2].

However, as another possible implementation, considering the processing overhead of subsequently restoring the index tree, the units may be sorted according to a specific rule. For example, the number indicating the style in each unit does not affect the sorting; only the two numbers indicating the start point and end point of the text segment are considered. Of course, the present invention is not limited thereto and does not exclude the case where the number indicating the style in each unit also participates in the sorting, if necessary.

As an example, the sorting may follow these rules. First, the units are sorted by the left end point of the interval (i.e., the start point of the text segment) in ascending order, giving the array [0, 12, 0, 0, 6, 1, 6, 11, 2]. Then, the two units with the same left end point, namely [0, 12, 0) and [0, 6, 1), are further sorted by the right end point of the interval (i.e., the end point of the text segment) in descending order, giving the final index array [0, 12, 0, 0, 6, 1, 6, 11, 2].
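The sorting rule above can be written as a single compound sort key. The function name `to_index_array` is hypothetical; the style number is deliberately left out of the key, as in the example.

```python
def to_index_array(nodes):
    """Sort (start, end, style) units and flatten them into one array.

    Units are ordered by ascending start point; ties are broken by
    descending end point, so an enclosing segment precedes the segments
    it contains.
    """
    ordered = sorted(nodes, key=lambda unit: (unit[0], -unit[1]))
    return [number for unit in ordered for number in unit]


index_array = to_index_array([(0, 6, 1), (6, 11, 2), (0, 12, 0)])
```

With this ordering a parent always appears before its children, which is what allows the tree to be rebuilt in a single pass over the array.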

In addition, in view of efficient utilization of storage space, as a possible implementation, the step of storing the index array in a non-volatile storage unit outside the memory may further include: and compressing the index array, and storing the compressed index array in a nonvolatile storage unit outside the memory.

Specifically, the following process is performed in sequence for each number in the index array: it is determined whether the number is less than 65535; if the number is less than 65535, a first number of bytes (e.g., 1 to 2 bytes) is used for storage, and if the number is not less than 65535, a second number of bytes (e.g., 4 bytes) is used for storage.
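A minimal sketch of such a variable-width encoding is given below. The specification does not say how a decoder tells a short value from a long one, so this sketch assumes that the 2-byte value 65535 is reserved as an escape marker announcing a following 4-byte value; the marker, the little-endian layout, and the function names `compress` and `decompress` are all assumptions made for illustration. Note that under this assumption a large value occupies the 2-byte marker plus 4 bytes.

```python
import struct
from typing import List

# Assumption: 0xFFFF (65535) marks "a 4-byte value follows".
ESCAPE = 0xFFFF


def compress(index_array: List[int]) -> bytes:
    """Encode non-negative integers below 2**32 with a variable width."""
    out = bytearray()
    for n in index_array:
        if n < ESCAPE:
            out += struct.pack("<H", n)       # small value: 2 bytes
        else:
            out += struct.pack("<H", ESCAPE)  # escape marker
            out += struct.pack("<I", n)       # large value: 4 bytes
    return bytes(out)


def decompress(data: bytes) -> List[int]:
    """Inverse of compress()."""
    result = []
    pos = 0
    while pos < len(data):
        (n,) = struct.unpack_from("<H", data, pos)
        pos += 2
        if n == ESCAPE:
            (n,) = struct.unpack_from("<I", data, pos)
            pos += 4
        result.append(n)
    return result


packed = compress([0, 12, 0, 70000, 65535, 3])
restored = decompress(packed)
```

Reserving 65535 as the marker is one possible reason for a threshold of 65535 rather than 65536, but this reading is not confirmed by the text.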

Of course, in the case of performing compression processing when storing an index array, when restoring an index tree, it is accordingly necessary to first perform decompression on the index array and then perform restoration processing of the index tree.

In addition to the transformation of the index tree, in order to accommodate the difference between the memory and the non-volatile storage unit, the step of storing the text stream, the index tree, and the style set in association further comprises: serializing the style set into a style array of a specific format that the non-volatile storage unit can store; and storing the style array in the non-volatile storage unit. For example, the style set may be serialized into a style array in JSON format and saved.
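As a rough sketch of this serialization, the style set could be written as a JSON array whose position corresponds to the style number. The concrete style properties and the function names below are hypothetical; the specification only states that JSON may be used.

```python
import json
from typing import Dict

# Hypothetical style set: style number -> style properties.
style_set: Dict[int, dict] = {
    0: {"font-size": "16px"},
    1: {"font-weight": "bold"},
    2: {"font-style": "italic"},
}


def serialize_styles(styles: Dict[int, dict]) -> str:
    """Turn the style set into a JSON array indexed by style number."""
    # Assumes style numbers are exactly 0 .. len(styles) - 1.
    return json.dumps([styles[i] for i in range(len(styles))])


def deserialize_styles(text: str) -> Dict[int, dict]:
    """Rebuild the style set from the JSON style array."""
    return dict(enumerate(json.loads(text)))


style_array = serialize_styles(style_set)
round_trip = deserialize_styles(style_array)
```

The same pair of functions covers both directions: `serialize_styles` when storing, and `deserialize_styles` when the style array is later loaded back into memory.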

In contrast to the index tree and the style set, the text stream does not need to perform conversion, but is directly saved as plain text data in the nonvolatile storage unit as it is.

Further, fig. 7 shows a schematic view of a browser to which the present invention can be applied. In fig. 7, a user may type a web page address in a text box 701 and, in response, the web page content corresponding to the web page address is displayed in a page 702.

In the scenario of the present invention applied to a browser (as shown in fig. 7), when a user wishes to open a page, a request may be sent to a corresponding server by, for example, inputting a web address, and an HTML document corresponding to the page may be received from the server.

In the above, an HTML document processing method according to an embodiment of the present invention, i.e., a parsing preprocessing method performed before a page is displayed, has been described with reference to the accompanying drawings. Next, a page display method corresponding to the HTML document processing method described above will be described with reference to fig. 8. The page corresponds to one HTML document; a text stream, an index tree, and a style set are obtained by preprocessing the HTML document and are stored in association, wherein at least the text stream is stored in a non-volatile storage unit outside a memory. As shown in fig. 8, the page display method includes the following steps.

First, in step S801, in response to an instruction to display a page, layout processing of the page is performed based on an index tree, a style set, and a text stream corresponding to the page. The index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page. The text stream is obtained by separating the tags contained in the HTML document and contains only text. The index tree includes one or more nodes, and each node includes data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment, wherein the tags are used to mark the style of the text segment corresponding thereto. The style set is a set of styles of the text segments corresponding to each node in the index tree. The text stream, the index tree, and the style set are stored in association.

In addition, as an alternative embodiment, at least the text stream is stored in a non-volatile storage unit outside the memory. Correspondingly, the step S801 of performing the layout processing of the page based on the index tree, the style set, and the text stream corresponding to the page may further include: in response to an instruction to display a page, retrieving the text stream corresponding to the HTML document in the non-volatile storage unit and loading only a part of the text stream into the memory; and performing the layout processing of the page based on the index tree, the style set, and the part of the text stream.

For example, in the application scenario of a reader, one chapter (multiple pages) of a book corresponds to one HTML document; that is, the parsed text stream, index tree, and style set all correspond to one chapter (multiple pages). When a page is displayed, the HTML document corresponding to the chapter containing the page needs to be retrieved in the non-volatile storage unit to obtain the text stream.

Since the proportion of text content in most HTML documents accessed by mobile devices tends to be large, memory overhead can be reduced by maintaining in memory only a buffer of the relevant part of the text (e.g., two pages or two screens), and the memory overhead does not increase linearly as the text content grows.
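Loading only a window of the text stream can be sketched as follows. To keep character offsets directly convertible to byte offsets, this sketch assumes the text stream is saved in a fixed-width encoding (UTF-32); the encoding, the file name, the sample text, and the function names are all assumptions for illustration and are not specified in the source.

```python
import os
import tempfile

# Assumption: fixed-width UTF-32 so that a character offset maps to a
# byte offset by simple multiplication.
BYTES_PER_CHAR = 4


def save_text_stream(path: str, text: str) -> None:
    """Write the plain text stream to non-volatile storage."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-32-le"))


def load_text_window(path: str, start: int, end: int) -> str:
    """Read only the characters in [start, end) into memory."""
    with open(path, "rb") as f:
        f.seek(start * BYTES_PER_CHAR)
        raw = f.read((end - start) * BYTES_PER_CHAR)
    return raw.decode("utf-32-le")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chapter.txt")  # hypothetical file name
    save_text_stream(path, "Hello world!")   # hypothetical 12-character text
    window = load_text_window(path, 6, 11)

print(window)  # world
```

Because the read size depends only on the requested window, the amount of text held in memory stays bounded however long the chapter is.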

It should be noted here that, in an application scenario of the browser, after the HTML document is parsed to obtain the text stream, the index tree, and the style set, only the text stream is stored in a non-volatile storage unit outside the memory. In other words, the index tree and the style set are retained in the memory and thus are not converted, so that only the required text needs to be loaded into the memory when performing the layout processing, and the layout processing of the page is performed based on the index tree, the style set, and that part of the text.

In the application scenario of the reader, by contrast, the HTML document is parsed in advance to obtain the text stream, the index tree, and the style set, and all of them are stored in the non-volatile storage unit outside the memory. In order to adapt to the different storage characteristics of the non-volatile storage unit, both the index tree and the style set need to be converted.

Therefore, in this case, before step S801, the method further includes a step of restoring the index tree and the style set. Specifically, before step S801, the method further includes: loading an index array corresponding to the page into a memory, and recovering an index tree corresponding to the index array based on the index array; and loading the style array corresponding to the page into a memory, and deserializing into the style set.

Still using the example above to describe, as one possible implementation, the restoration process of the index tree may include the following steps.

Step one: the index array [0, 12, 0, 0, 6, 1, 6, 11, 2] is read from the non-volatile storage unit into the memory.

Step two: a tree with [0, ∞, 0) as the root node is constructed in the memory. Similar to the structure described above, the first two numbers represent the start point and the end point of the text segment, and the third number represents the style corresponding to the text segment and does not influence the judgment in step four below. A pointer P is then given that points to the current node on the index tree. Since there is only one root node in the current index tree, the pointer P points to the root node.

Step three: the data is read sequentially from the index array, three numbers at a time, as one unit that is closed on the left and open on the right, corresponding to the array structure described above. First, a pointer T is given to point to the first unit [0, 12, 0), where the pointer T represents the node currently to be placed. Fig. 9(A) shows the index tree at this time, in which the position of the node to which the pointer T points has not yet been determined. As described above, the third number in each unit represents only the style corresponding to the text segment; it is unrelated to the interval of the text segment and therefore irrelevant to the subsequent judgment. Note that in fig. 9(A) to 9(D), the part inside the box represents the part of the index tree that has already been formed. The node pointed to by the pointer T is outside the box since its position in the index tree has not yet been determined.

Step four: judging whether the following conditions are met: the left end point of the text segment interval of the node pointed by the pointer T is more than or equal to the left end point of the text segment interval of the node pointed by the pointer P, and the right end point of the text segment interval of the node pointed by the pointer T is less than or equal to the right end point of the text segment interval of the node pointed by the pointer P.

If the judgment result is yes, in other words, if the text segment interval of the node pointed to by the pointer T is included in the text segment interval of the node pointed to by the pointer P, the node pointed to by the pointer T is inserted into the index tree as a child node of the current node pointed to by the pointer P. If the current node already has a child node, whether the node to be inserted is placed on the right side or the left side of the existing child node is decided by comparing the text segment intervals of the node to be inserted and the existing child node.

On the other hand, if the judgment result is negative, the P pointer points to the father node of the current node, and the judgment of the fourth step is repeated until the T node meets the condition.

Then, pointer P is pointed to pointer T, i.e.: the current node pointed to by pointer P is moved to the node pointed to by pointer T and pointer T is pointed to the next cell in the index array until the index array is empty.

Fig. 9(B) shows the case where the node corresponding to the first unit has been inserted into the index tree and the next unit is read. Since the first unit [0, 12, 0) satisfies the condition, it is inserted into the index tree as a child of the root node. At the same time, the pointer P points to [0, 12, 0) and the pointer T points to the next node to be placed, [0, 6, 1).

Fig. 9(C) shows the case where the node corresponding to the second unit is placed, and fig. 9(D) shows the case where the node corresponding to the third unit is placed. In fig. 9(C), the node currently to be placed, [6, 11, 2), pointed to by the pointer T is not included in the node [0, 6, 1) pointed to by the pointer P, and thus the pointer P is changed to point to the parent node [0, 12, 0) of the current node, as shown in fig. 9(D). Then, the node [0, 12, 0) pointed to by the pointer P and the node [6, 11, 2) pointed to by the pointer T are compared; this time the above condition is satisfied, and the node is inserted into the index tree. Since the interval of the node lies to the right of that of the node [0, 6, 1), it is inserted to the right of the node [0, 6, 1).
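The restoration steps described above can be sketched in Python as follows. This is an illustrative sketch rather than the claimed implementation: the `Node` class, the helper names, and the use of `math.inf` for the root's right end point are assumptions, and the index array is assumed to be sorted by the rule given earlier (left end point ascending, right end point descending).

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    start: int
    end: float  # float only so that the root can use math.inf
    style: int
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def contains(outer: Node, inner: Node) -> bool:
    """Step four: is inner's interval inside outer's interval?"""
    return inner.start >= outer.start and inner.end <= outer.end


def restore_index_tree(index_array: List[int]) -> Node:
    root = Node(0, math.inf, 0)  # step two: root [0, inf, 0)
    p = root                     # pointer P: current node
    for i in range(0, len(index_array), 3):
        start, end, style = index_array[i:i + 3]
        t = Node(start, end, style)  # pointer T: node to be placed
        # Move P up to its parent until P's interval contains T's.
        while not contains(p, t):
            p = p.parent
        t.parent = p
        # Place T among existing children by comparing interval starts.
        pos = 0
        while pos < len(p.children) and p.children[pos].start <= t.start:
            pos += 1
        p.children.insert(pos, t)
        p = t  # P now points to the node just placed
    return root


def to_tuple(node: Node):
    """Nested-tuple view of the tree, convenient for inspection."""
    return (node.start, node.end, node.style,
            [to_tuple(c) for c in node.children])


tree = restore_index_tree([0, 12, 0, 0, 6, 1, 6, 11, 2])
```

For the example array, the result is the root with a single child [0, 12, 0), which in turn has the children [0, 6, 1) and [6, 11, 2) from left to right, matching fig. 9(D) as described.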

Then, in step S802, the page after the layout processing is displayed.

In addition, during the page display process, the user may adjust the parameters of font, line height, width, and the like. In the invention, because the style set is analyzed and stored independently, the index tree and the text stream are kept unchanged, only the style set is modified, and the typesetting display can be executed based on the modified style set and the original index tree and the text stream.

Therefore, after the step S802, the page display method according to the present invention may further include: modifying the style set in response to an instruction for changing the style of the page; performing the layout processing of the page based on the modified style set, the original text stream, and the original index tree; and displaying the page after the layout processing. Thus, when adjusting parameters such as the font, line spacing, and word spacing, the style can be adjusted merely by modifying the style set, based on the existing index tree and text stream, without re-parsing the HTML document, thereby reducing the system overhead and realizing efficient layout processing.
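The benefit of keeping the style set separate can be illustrated with a deliberately simplified sketch. Here `layout` is only a stand-in that pairs each text segment with its resolved style (a real layout engine would do far more), the index data is kept as a flat list of units rather than a tree, and all sample values are hypothetical.

```python
from typing import Dict, List, Tuple

text_stream = "Hello world!"                        # hypothetical text
index_units = [(0, 12, 0), (0, 6, 1), (6, 11, 2)]   # (start, end, style number)
style_set: Dict[int, dict] = {
    0: {"font-size": 16},
    1: {"font-weight": "bold"},
    2: {"font-style": "italic"},
}


def layout(text: str,
           units: List[Tuple[int, int, int]],
           styles: Dict[int, dict]) -> List[Tuple[str, dict]]:
    """Stand-in for layout: pair each text segment with its style."""
    return [(text[s:e], styles[k]) for s, e, k in units]


before = layout(text_stream, index_units, style_set)

# The user changes the font size: only the style set is modified.
style_set[0] = {**style_set[0], "font-size": 20}

# The text stream and the index data are reused unchanged.
after = layout(text_stream, index_units, style_set)
```

Neither `text_stream` nor `index_units` is touched between the two calls, which mirrors the point that no re-parsing of the document is needed.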

In the above, the HTML document processing method and the page display method according to the embodiment of the present invention have been described in detail with reference to fig. 1 to 9(A) to 9(D). Next, an HTML document processing apparatus corresponding to the HTML document processing method described above will be described with reference to fig. 10.

As shown in fig. 10, the HTML document processing device 1000 includes text stream acquisition means 1001, index tree construction means 1002, style set acquisition means 1003, and storage means 1004. The HTML document processing device 1000 may be one of the constituent elements of the terminal device 20 described above with reference to fig. 1.

The text stream acquisition means 1001 obtains a text stream containing only text by separating tags contained in the HTML document.

The index tree construction means 1002 constructs an index tree by parsing the tags and the text in the HTML document, wherein the index tree comprises one or more nodes, each node comprises data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment, and the tags are used for marking the style of the text segment corresponding thereto.

The style set acquisition means 1003 acquires a style set, which is a set of specific contents of a style corresponding to each node in the index tree.

The storage 1004 stores the text stream, the index tree, and the style set in association.

For example, when parameters such as font, line spacing, and word spacing are adjusted, the HTML document does not need to be re-parsed; the style can be adjusted merely by modifying the style set, based on the existing index tree and text stream, which reduces the system overhead and realizes efficient layout processing.

Here, it should be noted that, in the HTML document processing device according to the embodiment of the present invention, the text stream, the index tree, and the style set obtained by the text stream acquisition means 1001, the index tree construction means 1002, and the style set acquisition means 1003 are stored separately as three data structures, but the storage locations of the text stream, the index tree, and the style set are not particularly limited.

However, as an alternative implementation, in the HTML document processing device according to an embodiment of the present invention, the storage locations of the text stream, the index tree, and the style set may also be specifically defined.

For example, the storage 1004 may include a memory and a non-volatile storage unit, and at least the text stream is stored in the non-volatile storage unit outside the memory.

In particular, the present invention is applicable to readers, web browsers, and the like. In a scenario in which the present invention is applied to a reader, when a user desires to read a certain book, the terminal device shown in fig. 1 is operable to transmit a request to a server corresponding to the reader and to receive an HTML document corresponding to the book from the server. The terminal device first stores the received HTML document in a non-volatile storage unit and then loads it into a memory for parsing. Since the pages in a book are fixed, the text streams, index trees, and style sets corresponding to all chapters may be parsed in advance and stored in a non-volatile storage unit outside the memory; when a certain page needs to be displayed, the text stream, index tree, and style set corresponding thereto are then loaded into the memory to perform the layout processing. That is, in this case, the processing by the text stream acquisition means 1001, the index tree construction means 1002, and the style set acquisition means 1003 is performed in advance to obtain the text streams, index trees, and style sets, which are stored in association. Since the parsing has already been completed, the time taken to display a page can be reduced.

In memory, information can be represented using complex types, but in non-volatile storage (e.g., magnetic disks) information can only be represented using a single type; for example, only a byte stream can be stored. For example, a set of arrays in memory can be described as [(1, 2, 3), (4, 5, 6)], whereas in a non-volatile storage unit, the set of arrays can only be described as [1, 2, 3, 4, 5, 6]. Therefore, when the index tree constructed by the index tree construction means 1002 is stored in the non-volatile storage unit, it is necessary to sort the data included in each node in the index tree to obtain one array that the non-volatile storage unit can store.
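The difference between the two representations can be made concrete with a small sketch; the function names `flatten` and `regroup` and the fixed unit width of three are assumptions for illustration.

```python
from typing import List, Tuple


def flatten(units: List[Tuple[int, ...]]) -> List[int]:
    """In-memory grouped form -> flat form suitable for storage."""
    return [n for unit in units for n in unit]


def regroup(flat: List[int], width: int = 3) -> List[Tuple[int, ...]]:
    """Flat form read back from storage -> grouped units of fixed width."""
    if len(flat) % width != 0:
        raise ValueError("flat array length is not a multiple of the unit width")
    return [tuple(flat[i:i + width]) for i in range(0, len(flat), width)]


flat = flatten([(1, 2, 3), (4, 5, 6)])
units = regroup(flat)
print(flat)   # [1, 2, 3, 4, 5, 6]
print(units)  # [(1, 2, 3), (4, 5, 6)]
```

Regrouping works only because every unit has the same known width, which is why the numbers within a unit must stay together and in order when the array is formed.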

Therefore, the HTML document processing device 1000 may further include an index tree transformation means 1005 (shown in dotted lines in the figure because it is an optional component) for sorting the data included in each node in the index tree, taking the data included in each node as a unit, to form an index array, wherein the data in one unit is treated as a whole and its internal order is not changed during sorting, and wherein the index array is stored in a non-volatile storage unit outside the memory.

In addition, in view of the efficient utilization of the storage space, as a possible implementation, the HTML document processing device 1000 may further include a compression unit 1006 (shown in dotted lines because it is an optional component) for performing compression on the index array.

In addition to the index tree transformation means, in order to accommodate the difference between the memory and the non-volatile storage unit, the HTML document processing device may further include a style set transformation means 1007 (shown in dotted lines in the figure because it is an optional component) for serializing the style set into a style array of a specific format that the non-volatile storage unit can store.

In addition, in the scenario of the application of the present invention to a browser, when a user wishes to open a page, a request may be sent to a corresponding server by, for example, inputting a web address, and an HTML document corresponding to the page may be received from the server.

Next, a page display apparatus according to an embodiment of the present invention will be described with reference to fig. 11. It corresponds to the page display method described hereinabove and is used in cooperation with the HTML document processing apparatus 1000 described hereinabove. As shown in fig. 11, the page display apparatus 1100 includes storage means 1101, layout processing means 1102, and display means 1103. The page display apparatus 1100 may be the terminal apparatus 20 described hereinabove with reference to fig. 1 or one of the components thereof.

The storage means 1101 includes a memory and a non-volatile storage unit, and stores a text stream, an index tree, and a style set, which are obtained by preprocessing an HTML document corresponding to a page. The text stream is obtained by separating the tags included in the HTML document and contains only text. The index tree includes one or more nodes, and each node includes data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment, wherein the tags are used to mark the style of the text segment corresponding thereto. The style set is a set of styles of the text segments corresponding to each node in the index tree. The text stream, the index tree, and the style set are stored in association.

The layout processing device 1102 is configured to perform layout processing of a page based on the index tree, the style set, and the text stream.

The display device 1103 is used for displaying the page processed by the typesetting processing device.

Corresponding to the HTML document processing device according to the embodiment of the present invention, in the page display device, layout display of pages can be performed more flexibly based on the text stream, the index tree, and the style set stored as three independent data structures.

In addition, as an alternative embodiment, at least the text stream is stored in a non-volatile storage unit outside the memory. Accordingly, the page display device 1100 may further include a text stream loading means 1104 (shown in dotted lines in the drawing because it is an optional component) for retrieving the text stream corresponding to the HTML document in the non-volatile storage unit in response to an instruction to display a page, and loading only a part of the text stream into the memory. For example, in an application scenario of a reader, one chapter (multiple pages) of a book corresponds to one HTML document; that is, the parsed text stream, index tree, and style set all correspond to one chapter (multiple pages). When a page is displayed, it is necessary to retrieve, in the non-volatile storage unit, the HTML document corresponding to the chapter containing the page to obtain the text stream.

Since the proportion of text content in most HTML documents accessed by mobile devices tends to be large, memory overhead can be reduced by maintaining in memory only a buffer of the relevant part of the text (e.g., two pages or two screens), and the memory overhead does not grow linearly as the text content grows.

Therefore, in this case, the page display apparatus may further include: an index tree recovery unit 1105 (shown in dotted lines since it is an optional component) for loading the index array corresponding to the page into the memory and recovering the index tree corresponding to the index array based on the index array and a second predetermined rule; and a style set restoring unit 1106 (shown in dotted lines since it is an optional component) for loading the style array corresponding to the page into the memory and deserializing it into the style set.

Of course, corresponding to the HTML document processing device, the page display apparatus may further include a decompression device 1107 for decompressing the index array before restoring the index tree in the case where the index array has been compressed.

Therefore, the page display apparatus may further include: a style set modification means 1107 (shown in dotted lines as it is an optional component) for modifying the style set in response to an instruction to change the style of the page, wherein the layout processing means 1102 adjusts the layout of the page based on the modified style set.

In addition, it should be noted herein that, since the specific processes of each device in the HTML document processing apparatus and the page display apparatus completely correspond to the HTML document processing method and the page display method described above, the specific details of each process are not described herein to avoid redundancy.

An example of an HTML document processing device according to the present invention as a hardware entity is shown in fig. 12. The HTML document processing device includes a processor 1201, a memory 1202, and at least one external communication interface 1203. The processor 1201, the memory 1202, and the external communication interface 1203 are all connected by a bus 1204.

The processor 1201 for data processing may be implemented by a microprocessor, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or a Field-Programmable Gate Array (FPGA). The memory 1202 may include operation instructions, which may be computer-executable code, to implement the steps in the flow of the HTML document processing method according to the embodiment of the present invention.

Fig. 13 shows an example of a page display device according to the present invention as a hardware entity. The page display device includes a processor 1301, a memory 1302, a display 1303, and at least one external communication interface 1304. The processor 1301, the memory 1302, the display 1303, and the external communication interface 1304 are all connected via a bus 1305.

The processor 1301 for data processing may be implemented by a microprocessor, a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or a Field-Programmable Gate Array (FPGA). The memory 1302 may include operation instructions, which may be computer-executable code, to implement the steps in the flow of the HTML document processing method according to the embodiment of the present invention.

Fig. 14 illustrates a schematic diagram of a computer-readable recording medium according to an embodiment of the present invention. As illustrated in fig. 14, a computer-readable recording medium 1400 according to an embodiment of the present invention has computer program instructions 1401 stored thereon. When the computer program instructions 1401 are executed by a processor, the HTML document processing method or the page display method according to an embodiment of the present invention described with reference to the above drawings is performed.

In the HTML document processing method, the page display method, the device, and the medium according to the embodiment of the invention, first, by separating text and tags, the DOM tree of the prior art is split into three data structures that are stored independently, so that changes relating to the HTML document can be handled more flexibly.

It should be noted that, in the present specification, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Finally, it should be noted that the series of processes described above includes not only processes performed in time series in the order described herein, but also processes performed in parallel or individually, rather than in time series.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus a necessary hardware platform, and may also be implemented entirely by software. With this understanding in mind, all or part of the technical solutions of the present invention that contribute over the background art can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments of the present invention.

The present invention has been described in detail, and the principle and embodiments of the present invention are explained herein by using specific examples, which are only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. An HTML document processing method, comprising:

obtaining a text stream containing only text by separating tags contained in an HTML document;

constructing an index tree by parsing the tags and text in the HTML document, the index tree including one or more nodes, and each node including data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment;

obtaining a style set, which is a set of styles of text segments corresponding to each node in the index tree; and

storing the text stream, the index tree, and the style set in association.

2. The method of claim 1, wherein the text stream is stored in a non-volatile storage unit outside of memory.

3. The method of claim 2, wherein the step of storing the text stream, the index tree, and the style set in association further comprises:

taking the data included in each node in the index tree as a unit, and sorting the data included in each node to form an index array; and

storing the index array in a non-volatile storage unit outside the memory.

4. The method of claim 3, wherein the step of sorting the data included in each node to form an index array further comprises:

sequencing according to the order from small to large of the left end point of the interval indicated by the data included in each node to obtain a first array;

for two units with the same left end point of the interval in the first array, further sorting the two units according to the order from large to small of the right end point of the interval to obtain the index array.

5. The method of claim 3, wherein storing the index array in a non-volatile storage unit outside of memory further comprises:

performing compression on the index array; and

storing the compressed index array in a non-volatile storage unit outside the memory.

6. The method of claim 2, wherein the step of storing the text stream, the index tree, and the style set in association further comprises:

serializing the style set into a style array of a specific format which can be stored by the non-volatile storage unit; and

storing the style array in the non-volatile storage unit.

7. A page display method, comprising:

in response to an instruction to display a page, performing layout processing of the page based on an index tree, a style set, and a text stream corresponding to the page, the index tree, the style set, and the text stream being obtained in advance by processing an HTML document corresponding to the page, the text stream being obtained by separating tags included in the HTML document and containing only text, the index tree including one or more nodes, and each node including data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment, the style set being a set of styles of the text segments corresponding to each node in the index tree, and the text stream, the index tree, and the style set being stored in association; and

displaying the page after the layout processing.

8. The method of claim 7, wherein at least the text stream is stored in a non-volatile storage unit outside of memory, and the step of performing layout processing of the page based on the index tree, style set, and text stream corresponding to the page further comprises:

retrieving a text stream corresponding to the HTML document in the non-volatile storage unit and loading only a part of the text stream into memory; and

performing layout processing of the page based on the index tree, the style set, and the part of the text stream.

9. The method of claim 7, wherein the index tree is transformed into an index array and stored in a non-volatile storage unit outside of memory, the style set is serialized into a style array and stored in a non-volatile storage unit outside of memory, and wherein the method further comprises:

loading an index array corresponding to the page into a memory, and recovering an index tree corresponding to the index array based on the index array; and

loading the style array corresponding to the page into a memory, and deserializing it into the style set.

10. The method of claim 9, wherein the step of recovering an index tree corresponding to the index array based on the index array further comprises:

constructing, in memory, an index tree with ([0, ∞), 0) as the root node, and setting a pointer P to point to the current node on the index tree;

reading the index array sequentially in units of three numbers, and setting a pointer T to point to the first unit, wherein the pointer T represents the node currently to be placed;

determining whether the following condition is met: the left endpoint of the text segment interval of the node pointed to by the pointer T is greater than or equal to the left endpoint of the text segment interval of the node pointed to by the pointer P, and the right endpoint of the text segment interval of the node pointed to by the pointer T is less than or equal to the right endpoint of the text segment interval of the node pointed to by the pointer P;

if the condition is met, inserting the node pointed to by the pointer T into the index tree as a child node of the current node pointed to by the pointer P, and, if the current node already has a child node, determining whether the node to be inserted is placed to the left or to the right of the existing child node by comparing the text segment intervals of the node to be inserted and the existing child node;

if the condition is not met, pointing the pointer P to the parent node of the current node and repeating the above determination until the node pointed to by the pointer T meets the condition; and

moving the pointer P to the node pointed to by the pointer T and pointing the pointer T to the next unit in the index array, until the index array is exhausted.
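
The recovery steps of claim 10 can be sketched as follows. This is a non-authoritative illustration: the `Node` class, the function name `restore_index_tree`, the interpretation of each unit as (left endpoint, right endpoint, style index), and the sample index array are assumptions; only the containment test and the pointer movements follow the steps recited above.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    start: float
    end: float
    style: int
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def restore_index_tree(index_array):
    root = Node(0, math.inf, 0)      # root node ([0, inf), 0)
    p = root                         # pointer P: current node on the tree
    for i in range(0, len(index_array), 3):
        start, end, style = index_array[i:i + 3]
        t = Node(start, end, style)  # pointer T: node to be placed
        # Climb towards the root until P's interval contains T's interval.
        while not (t.start >= p.start and t.end <= p.end):
            p = p.parent
        # Insert T as a child of P, keeping siblings ordered by interval.
        pos = 0
        while pos < len(p.children) and p.children[pos].start <= t.start:
            pos += 1
        t.parent = p
        p.children.insert(pos, t)
        p = t                        # P moves to the node just placed
    return root


# Hypothetical index array: three top-level segments, one with a nested segment.
tree = restore_index_tree([0, 5, 1, 5, 21, 2, 11, 20, 3, 21, 38, 2])
top_level = [(c.start, c.end, c.style) for c in tree.children]
nested = [(c.start, c.end, c.style) for c in tree.children[1].children]
```

Because the root interval is [0, ∞), every non-negative interval is contained in it, so the upward climb always terminates at the root at the latest.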

11. The method of claim 7, further comprising:

modifying the style set in response to an instruction to change a style of the page;

performing layout processing of the page based on the modified style set, the index tree, and the text stream; and

displaying the page after the layout processing.
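
The style change of claim 11 can be pictured with the following minimal sketch, in which only the style set is modified and the index nodes and text stream are reused unchanged. The data layout (nodes as `(start, end, style_index)` tuples, styles as dictionaries) and the toy `layout` function are hypothetical stand-ins for the actual layout processing.

```python
text_stream = "TitleBody text."
# Hypothetical leaf nodes: (left endpoint, right endpoint, index into the style set).
index_nodes = [(0, 5, 0), (5, 15, 1)]
style_set = [{"font-size": 24}, {"font-size": 16}]


def layout(nodes, styles, text):
    """Toy layout step: pair each text segment with a copy of its resolved style."""
    return [(text[start:end], dict(styles[k])) for start, end, k in nodes]


before = layout(index_nodes, style_set, text_stream)

# Instruction to change the page style: only the style set is touched.
style_set[1]["font-size"] = 20

after = layout(index_nodes, style_set, text_stream)
```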

12. An HTML document processing device comprising:

text stream acquisition means for acquiring a text stream containing only text by separating the tags contained in the HTML document;

index tree construction means for constructing an index tree by parsing the tags and the text in the HTML document, the index tree including one or more nodes, and each node including data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment;

style set obtaining means for obtaining a style set which is a set of styles of text segments corresponding to each node in the index tree; and

a storage device for storing the text stream, the index tree, and the style set in association.
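
The three outputs named in claim 12 can be illustrated with Python's standard `html.parser` module. This is a sketch under simplifying assumptions, not the claimed implementation: the input is assumed to be well-formed with every tag explicitly closed, the tag name is used as a stand-in for the resolved style of a node, and the class and function names (`IndexBuilder`, `process`) are hypothetical.

```python
from html.parser import HTMLParser


class IndexBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []   # pieces of the tag-free text stream
        self.length = 0        # number of characters emitted so far
        self.styles = []       # style set (assumption: one style per tag name)
        self.root = {"start": 0, "end": None, "style": None, "children": []}
        self.stack = [self.root]

    def _style_id(self, key):
        if key not in self.styles:
            self.styles.append(key)
        return self.styles.index(key)

    def handle_starttag(self, tag, attrs):
        node = {"start": self.length, "end": None,
                "style": self._style_id(tag), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()["end"] = self.length   # close the segment interval

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def process(html):
    builder = IndexBuilder()
    builder.feed(html)
    builder.close()
    builder.root["end"] = builder.length
    return "".join(builder.text_parts), builder.root, builder.styles


text, tree, styles = process("<h1>Title</h1><p>First <b>bold</b> text.</p>")
```

In this sample, `text` contains no tags, each node in `tree` records the half-open character interval of its text segment, and `styles` plays the role of the style set that the nodes refer to by index.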

13. A page display device, comprising:

a storage device including a memory and a non-volatile storage unit for storing a text stream, an index tree, and a style set that are obtained by preprocessing an HTML document corresponding to a page, wherein the text stream is obtained by separating the tags contained in the HTML document and contains only text, the index tree includes one or more nodes, each node includes data indicating a text segment in the text corresponding to the node and data indicating a style of the text segment, and the style set is a set of styles of the text segments corresponding to the nodes in the index tree;

typesetting processing means for performing, in response to an instruction to display a page, typesetting processing of the page based on an index tree, a style set, and a text stream corresponding to the page; and

a display device for displaying the page processed by the typesetting processing means.

14. A computer-readable recording medium storing thereon a computer program that, when executed by a processor, performs a process of:

storing the text stream, the index tree, and the style set in association.

15. A computer-readable recording medium storing thereon a computer program that, when executed by a processor, performs a process of:

and displaying the page after the typesetting processing.