CN111475679B - HTML document processing method, page display method and equipment - Google Patents


Info

Publication number
CN111475679B
Authority
CN
China
Prior art keywords
text
node
style
index tree
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910069208.1A
Other languages
Chinese (zh)
Other versions
CN111475679A (en
Inventor
许阳寅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910069208.1A
Publication of CN111475679A
Application granted
Publication of CN111475679B
Legal status: Active

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

An HTML document processing method, a page display method, and a device are disclosed. The HTML document processing method comprises the following steps: obtaining a text stream containing only text by separating out the tags contained in the HTML document; constructing an index tree by parsing the tags and text in the HTML document, wherein the index tree includes one or more nodes, and each node includes data indicating the text segment corresponding to the node in the text and data indicating the style of that text segment; obtaining a style set, which is the set of styles corresponding to the nodes in the index tree; and storing the text stream, the index tree, and the style set in association.

Description

HTML document processing method, page display method and equipment
Technical Field
The invention relates to an HTML document processing method, a page display method and equipment.
Background
Hypertext Markup Language (HTML) is a markup language designed for pages and other information that can be viewed in a web browser or reader. Code written in accordance with HTML syntax is an HTML document. The structure of an HTML document includes a head section (Head), which provides information about the page, and a body section (Body), which provides the specific content of the page. For example, the specific content of the page may include text and tags indicating the display style of the text. The display style may include, but is not limited to, font, color, line spacing, and the like.
A web browser or reader converts an HTML document into a page by loading and parsing the document. DOM is an abbreviation for Document Object Model. Existing browsers (including those used by mobile devices) and readers parse the tags and text in an HTML document into a DOM tree, where each node of the tree represents an HTML tag or the text associated with an HTML tag. The tree structure precisely describes the relationships among the tags and text in the HTML document.
However, since tags and text in an HTML document are all fused into one DOM tree in the related art, independence between tags and text is lacking. In this case, once the style of the page of the HTML document such as font, line spacing, etc. changes, it will be necessary to re-parse the HTML document to generate a new DOM tree, resulting in a large processing overhead.
Disclosure of Invention
In view of the above, it is desirable to provide a new HTML document processing method, page display method, and apparatus capable of parsing an HTML document in a more flexible manner and structure, thereby reducing processing overhead.
According to an aspect of the present invention, there is provided an HTML document processing method including: obtaining a text stream containing only text by separating out the tags contained in the HTML document; constructing an index tree by parsing the tags and text in the HTML document, wherein the index tree includes one or more nodes, and each node includes data indicating the text segment corresponding to the node in the text and data indicating the style of that text segment; obtaining a style set, which is the set of styles of the text segments corresponding to the nodes in the index tree; and storing the text stream, the index tree, and the style set in association.
In addition, in the method according to an embodiment of the present invention, at least the text stream is stored in a non-volatile storage unit outside the memory.
In addition, in the method according to an embodiment of the present invention, the step of storing the text stream, the index tree, and the style set in association further includes: ordering the data included in each node in the index tree by taking the data included in each node in the index tree as a unit to form an index array; and storing the index array in a non-volatile storage unit outside the memory.
In addition, in the method according to the embodiment of the present invention, the step of ordering the data included in each node to form an index array further includes: sorting the units in ascending order of the left end point of the interval indicated by the data of each node, to obtain a first array; and, for any two units in the first array whose intervals have the same left end point, further sorting those units in descending order of the right end point of the interval, to obtain the index array.
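The two-level ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_index_array` is hypothetical, each node is assumed to be a `(left, right, style)` triple, and the result is assumed to be flattened into a single list of numbers.

```python
def to_index_array(nodes):
    """Order (left, right, style) units and flatten them into one array.

    Assumed ordering: ascending left end point; for equal left end
    points, descending right end point (so enclosing nodes come first).
    """
    ordered = sorted(nodes, key=lambda n: (n[0], -n[1]))
    return [value for node in ordered for value in node]


# Nodes of the example index tree, deliberately given out of order.
nodes = [(6, 11, 2), (0, 6, 1), (0, 12, 0)]
index_array = to_index_array(nodes)
print(index_array)  # [0, 12, 0, 0, 6, 1, 6, 11, 2]
```

With this ordering, a node always appears before the nodes it encloses, which is what allows the tree to be rebuilt later in a single pass over the array.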
In addition, in the method according to an embodiment of the present invention, the step of storing the index array in a non-volatile storage unit outside the memory further includes: performing compression on the index array; and storing the compressed index array in a nonvolatile storage unit outside the memory.
In addition, in the method according to an embodiment of the present invention, the step of storing the text stream, the index tree, and the style set in association further includes: serializing the style set into a style array in a specific format which can be stored by the nonvolatile storage unit; and storing the style array in the nonvolatile memory unit.
According to another aspect of the present invention, there is provided a page display method including: in response to an instruction to display a page, performing typesetting processing of the page based on an index tree, a style set, and a text stream corresponding to the page, wherein the index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page, the text stream is obtained by separating a tag contained in the HTML document and contains only text, the index tree includes one or more nodes, and each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment, the style set is a set of styles of the text segment corresponding to each node in the index tree, and the text stream, the index tree, and the style set are stored in association; and displaying the typeset page.
In addition, in the method according to the embodiment of the present invention, at least the text stream is stored in a non-volatile storage unit outside the memory, and the step of performing typesetting processing of the page based on the index tree, the style set, and the text stream corresponding to the page further includes: retrieving a text stream corresponding to the HTML document in the nonvolatile storage unit, and loading only a part of the text stream into a memory; and performing typesetting processing of pages based on the index tree, the style set and the partial text stream.
In addition, in the method according to an embodiment of the present invention, the index tree is transformed into an index array and stored in a non-volatile storage unit outside the memory, and the style set is serialized into a style array and stored in the non-volatile storage unit outside the memory, wherein the method further comprises: loading the index array corresponding to the page into the memory, and recovering the index tree corresponding to the index array based on the index array; and loading the style array corresponding to the page into the memory, and deserializing the style array into the style set.
In addition, in the method according to the embodiment of the present invention, the step of restoring the index tree corresponding to the index array based on the index array further includes: constructing, in the memory, an index tree taking [0, ] and [0 ] as a root node, and pointing a pointer P at the current node on the index tree; reading the index array sequentially with three numbers as one unit, and pointing a pointer T at the first unit, where the pointer T represents the node currently to be placed; judging whether the following condition is satisfied: the left end point of the text segment interval of the node pointed to by the pointer T is greater than or equal to the left end point of the text segment interval of the node pointed to by the pointer P, and the right end point of the text segment interval of the node pointed to by the pointer T is less than or equal to the right end point of the text segment interval of the node pointed to by the pointer P; if the condition is satisfied, inserting the node pointed to by the pointer T into the index tree as a child node of the current node pointed to by the pointer P, and, if the current node already has child nodes, determining whether the node is placed to the right or to the left of the existing child nodes by comparing the text segment interval of the node to be inserted with those of the existing child nodes; if the condition is not satisfied, pointing the pointer P at the parent node of the current node and repeating the judgment until the node pointed to by the pointer T satisfies the condition; and then moving the pointer P to the node pointed to by the pointer T and pointing the pointer T at the next unit in the index array, until the index array is exhausted.
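The pointer-based restoration above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the names `Node`, `contains`, and `restore_index_tree` are hypothetical, and because the root interval is not legible in the text, the sketch assumes a sentinel root that covers every possible interval. It also assumes a well-formed index array whose length is a multiple of three.

```python
import math


class Node:
    def __init__(self, start, end, style, parent=None):
        self.start, self.end, self.style = start, end, style
        self.parent = parent
        self.children = []


def contains(outer, inner):
    """True if inner's interval lies within outer's interval."""
    return inner.start >= outer.start and inner.end <= outer.end


def restore_index_tree(index_array):
    # Assumption: a sentinel root spanning [0, infinity) with no style.
    root = Node(0, math.inf, None)
    p = root  # pointer P: current node
    for i in range(0, len(index_array), 3):
        t = Node(*index_array[i:i + 3])  # pointer T: node to be placed
        # Climb towards the root until P's interval encloses T's.
        while not contains(p, t):
            p = p.parent
        t.parent = p
        # Keep siblings ordered by the start of their intervals.
        pos = 0
        while pos < len(p.children) and p.children[pos].start <= t.start:
            pos += 1
        p.children.insert(pos, t)
        p = t  # P moves to the node just placed
    return root


top = restore_index_tree([0, 12, 0, 0, 6, 1, 6, 11, 2]).children[0]
print((top.start, top.end, top.style))                     # (0, 12, 0)
print([(c.start, c.end, c.style) for c in top.children])   # [(0, 6, 1), (6, 11, 2)]
```

Because the array is ordered so that enclosing nodes precede enclosed ones, each unit is placed after at most a short climb from the previously placed node.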
In addition, the method according to the embodiment of the invention further comprises the following steps: modifying the set of styles in response to instructions to change the styles of the pages; performing typesetting processing of pages based on the modified style set, the index tree, and the text stream; and displaying the typeset page.
According to another aspect of the present invention, there is provided an HTML document processing apparatus comprising: text stream obtaining means for obtaining a text stream containing only text by separating tags contained in an HTML document; index tree construction means for constructing an index tree including one or more nodes by parsing tags and text in the HTML document, and each node including data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment; style set obtaining means for obtaining a style set, which is a set of styles of text segments corresponding to each node in the index tree; and storage means for storing the text stream, the index tree, and the style set in association.
In addition, in the device according to the embodiment of the present invention, the storage means includes a memory and a nonvolatile storage unit, and at least the text stream is stored in the nonvolatile storage unit outside the memory.
In addition, the apparatus according to the embodiment of the present invention further includes: index tree conversion means for ordering the data included in each node in the index tree, taking the data included in each node as a unit, to form an index array, wherein the index array is stored in a nonvolatile storage unit outside the memory.
In addition, the apparatus according to the embodiment of the present invention further includes: compression means for compressing the index array, wherein the compressed index array is stored in a nonvolatile storage unit outside the memory.
In addition, the apparatus according to the embodiment of the present invention further includes: style set conversion means for serializing the style set into a style array in a specific format that can be stored in the nonvolatile storage unit, wherein the style array is stored in the nonvolatile storage unit.
According to another aspect of the present invention, there is provided a page display device including: storage means, including a memory and a nonvolatile storage unit, for storing a text stream, an index tree, and a style set obtained by preprocessing an HTML document corresponding to a page, wherein the text stream is obtained by separating out the tags contained in the HTML document and contains only text, the index tree includes one or more nodes, and each node includes data indicating the text segment corresponding to the node in the text and data indicating the style of that text segment, wherein each tag is used for labeling the style of the text segment corresponding to the tag, and the style set is the set of styles of the text segments corresponding to the nodes in the index tree; typesetting processing means for executing typesetting processing of a page based on the index tree, the style set, and the text stream corresponding to the page in response to an instruction to display the page; and display means for displaying the page processed by the typesetting processing means.
In addition, in the device according to the embodiment of the present invention, at least the text stream is stored in a non-volatile storage unit outside the memory, and the device further includes: text stream loading means for retrieving the text stream corresponding to the HTML document in the nonvolatile storage unit in response to an instruction to display a page, and loading only a part of the text stream into the memory, wherein the typesetting processing means is further configured to perform typesetting processing of the page based on the index tree, the style set, and the part of the text stream.
In addition, in the apparatus according to the embodiment of the present invention, the index tree is transformed into an index array and stored in a non-volatile storage unit outside the memory, and the style set is serialized into a style array and stored in the non-volatile storage unit outside the memory, wherein the apparatus further includes: index tree restoring means for loading the index array corresponding to the page into the memory and restoring the index tree corresponding to the index array based on the index array; and style set restoring means for loading the style array corresponding to the page into the memory and deserializing it into the style set.
In addition, the apparatus according to the embodiment of the present invention further includes: a style set modifying means for modifying the style set in response to an instruction to change a style of the page, wherein the typesetting processing means is further configured to perform typesetting processing of the page based on the modified style set, the index tree, and the text stream.
According to another aspect of the present invention, there is provided a computer-readable recording medium having stored thereon a computer program which, when executed by a processor, realizes the steps of: obtaining a text stream containing only text by separating out the tags contained in the HTML document; constructing an index tree by parsing the tags and text in the HTML document, wherein the index tree includes one or more nodes, and each node includes data indicating the text segment corresponding to the node in the text and data indicating the style of that text segment; obtaining a style set, which is the set of styles of the text segments corresponding to the nodes in the index tree; and storing the text stream, the index tree, and the style set in association.
In addition, according to another aspect of the present invention, there is provided a computer-readable recording medium having stored thereon a computer program which, when executed by a processor, realizes the steps of: in response to an instruction to display a page, performing typesetting processing of the page based on an index tree, a style set, and a text stream corresponding to the page, wherein the index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page, the text stream is obtained by separating a tag contained in the HTML document and contains only text, the index tree includes one or more nodes, and each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment, the style set being a set of styles of the text segment corresponding to each node in the index tree; and displaying the typeset page.
In the HTML document processing method, the page display method, the device, and the medium according to the embodiments of the present invention, the single DOM tree of the prior art is split into three data structures that are stored separately by separating text from tags, so that changes affecting the HTML document can be handled more flexibly. For example, when parameters such as font, line spacing, and word spacing are adjusted during reading, the style can be adjusted by modifying only the style set, on the basis of the existing index tree and text stream, without re-parsing the HTML document. This reduces system overhead and enables efficient typesetting processing.
Drawings
FIG. 1 is a schematic diagram illustrating an application environment of an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a specific procedure of an HTML document processing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating an index tree constructed by an HTML document processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a DOM tree constructed according to an HTML document processing method of the related art as a comparative example;
FIG. 5 shows a signal flow diagram corresponding to the HTML document processing method according to the present invention;
FIG. 6 shows a schematic diagram of a reader to which the present invention may be applied;
FIG. 7 shows a schematic diagram of a browser to which the present invention may be applied;
FIG. 8 is a flowchart illustrating a specific process of a page display method according to an embodiment of the present invention;
FIGS. 9(A) through 9(D) are diagrams illustrating one possible implementation of recovering an index tree based on an index array;
FIG. 10 is a functional block diagram illustrating a configuration of an HTML document processing apparatus according to an embodiment of the present invention;
FIG. 11 is a functional block diagram illustrating a configuration of a page display device according to an embodiment of the present invention;
FIG. 12 shows an HTML document processing device according to the present invention as one example of a hardware entity;
FIG. 13 shows a page display device according to the present invention as an example of a hardware entity; and
FIG. 14 shows a schematic view of a computer-readable recording medium according to an embodiment of the present invention.
Detailed Description
Various preferred embodiments of the present invention will be described below with reference to the accompanying drawings. The following description is provided with reference to the accompanying drawings to assist in the understanding of the exemplary embodiments of the invention as defined by the claims and their equivalents. It includes various specific details that aid in understanding, but they are to be considered exemplary only. Accordingly, those skilled in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. Moreover, a detailed description of functions and configurations well known in the art will be omitted for the sake of clarity and conciseness of the present specification.
First, an application environment of an embodiment of the present invention will be briefly described. As shown in fig. 1, the server 10 is connected to a plurality of terminal devices 20 through a network 30. The plurality of terminal devices 20 may be terminals used by users who wish to view a page. For example, the terminal device 20 may include the HTML document processing device described below, and may also include the page display device described below. The terminal may be a smart terminal, such as a smart phone, PDA (personal digital assistant), desktop computer, notebook computer, or tablet computer, but may also be another type of terminal. The server 10 is a server corresponding to the web address of a page to be displayed by a browser, or a server corresponding to page content to be displayed by a reader. For example, when a user wishes to view a page, an HTML document may be acquired from the corresponding server, and the corresponding page is then viewed by parsing the HTML document on the terminal device side. The network 30 may be any type of wired or wireless network, such as the Internet. It should be appreciated that the number of terminal devices 20 shown in fig. 1 is illustrative and not limiting.
Next, various embodiments of the present invention will be described.
First, an HTML document processing method according to an embodiment of the present invention will be described with reference to fig. 2. The HTML document processing method is a parsing preprocessing step performed before page display. As shown in fig. 2, the HTML document processing method includes the following steps.
First, in step S201, a text stream containing only text is obtained by separating tags contained in an HTML document.
For example, given an HTML document fragment:
<p><i>WeRead</i><b>Rocks</b>!</p>
where p, i, and b are tags representing the styles of the corresponding text. For example, the style of text may include, but is not limited to, font, color, line spacing, and the like.
Then, through the processing of step S201, specifically by removing the tags p, i, and b from the HTML document fragment and retaining the text corresponding to each tag (i.e., the text contained in the tag), the text stream WeReadRocks! is obtained.
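Step S201 can be illustrated with Python's standard `html.parser` module. This is a sketch only, assuming a well-formed fragment; the class name `TextStreamExtractor` and the function name `extract_text_stream` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser


class TextStreamExtractor(HTMLParser):
    """Collects only character data and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def extract_text_stream(html):
    parser = TextStreamExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


text_stream = extract_text_stream("<p><i>WeRead</i><b>Rocks</b>!</p>")
print(text_stream)  # WeReadRocks!
```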
Then, in step S202, an index tree is constructed by parsing the tags and text in the HTML document, where each tag annotates the style of its corresponding text segment. The index tree includes one or more nodes, and each node includes data indicating the text segment corresponding to the node in the text and data indicating the style of that text segment.
For example, still taking the above HTML document fragment as an example, the index tree may be constructed as follows. First, the text segment corresponding to the tag p is the entire text, and the start point and end point of the storage location of that text segment are indicated by two numbers, in this example [0, 12). In addition, a number (e.g., "0") is required to represent the style of the text segment corresponding to the tag p. Then, by parsing the tag p and its corresponding text, one node [0, 12, 0) of the index tree is obtained. The same processing is then repeated for the tag i and the tag b, which yields two further nodes, [0, 6, 1) and [6, 11, 2). Since the text segment interval of the node corresponding to the tag p is the largest, that node is taken as the root node. The other two nodes become the left and right child nodes of the root node, ordered by their text segment intervals. Fig. 3 shows an example of an index tree constructed by step S202. As shown in fig. 3, the index tree includes one root node [0, 12, 0) and two child nodes [0, 6, 1) and [6, 11, 2). In the tree structure shown in fig. 3, the interval of the text segment corresponding to a node is represented in left-closed, right-open form. The first two numbers represent the start and end of the storage location of the corresponding text segment, and the last number indicates the style of the text segment. In the example shown in fig. 3, the styles of the text segments corresponding to the tags p, i, and b in the HTML document correspond to the three numbers 0, 1, and 2. As described later, 0, 1, and 2 are the indexes, in the style set, of the styles of the text segments corresponding to the tags p, i, and b, respectively.
In addition, within a node of the index tree, the order of the numbers indicating the start and end points of the storage location of the text segment corresponding to the tag and the number indicating the style of that text segment is not limited to the order shown in fig. 3. For example, the three nodes in the index tree may also be represented as [0, 0, 12), [1, 0, 6), [2, 6, 11), or as [0, 0, 12), [0, 1, 6), [6, 2, 11).
Further, still taking the HTML document fragment described above as an example, fig. 4 shows, as a comparative example, the DOM tree constructed by the HTML document processing method of the related art. As can be seen from fig. 4, in the prior art the tags and text in an HTML document are all fused into one DOM tree. In contrast, referring back to fig. 3, the index tree constructed by the HTML document processing method of the present invention contains only data that associates text segments with styles; it does not contain the text or the style content itself.
Next, in step S203, a style set is obtained, which is a set of styles corresponding to each node in the index tree.
For example, the style set may be obtained based on the CSS (Cascading Style Sheets) associated with the HTML document. CSS is used to define the specific content of a style. For example, for the HTML document fragment given above, the specific styles defined by the CSS may be:
p { color: blue; } (the style represented by the p tag is blue text);
i { font-style: italic; color: black; } (the style represented by the i tag is italic, black text);
b { font-weight: bold; color: red; } (the style represented by the b tag is bold, red text).
Here the tags p, i, and b also serve as selectors, which select the style in which the corresponding text segment is displayed, while { color: blue; }, { font-style: italic; color: black; }, and { font-weight: bold; color: red; } are style blocks, which describe the specific content of each style.
Thus, the CSS itself is an array of "selector, style block" pairs. When the index tree is constructed, each selector is matched against the tree nodes to determine whether its style block belongs to a given node. Therefore, when saving, the selector information need not be retained; only the style blocks are kept, and each node of the index tree records the position of the style block that belongs to it (e.g., the third number, 0, 1, or 2, of each node in the example of fig. 3).
Based on the index tree constructed in step S202, such as the index tree shown in fig. 3, and based on the CSS associated with the HTML document, the following style set can be obtained:
[{color:blue;},{font-style:italic;color:black;},{font-weight:bold;color:red;}].
It can be seen that, for the index tree shown in fig. 3, an array containing three style blocks is obtained as the style set. The 0th element of the array, {color: blue;}, is the specific content of the style indicated by 0 in the index tree; the 1st element, {font-style: italic; color: black;}, is the specific content of the style indicated by 1 in the index tree; and the 2nd element, {font-weight: bold; color: red;}, is the specific content of the style indicated by 2 in the index tree.
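The relationship between the CSS rules, the style set, and the style numbers stored in the index tree can be sketched as follows. The function name `build_style_set` and the use of dictionaries for style blocks are illustrative assumptions; real CSS selector matching is considerably more involved than the one-tag-per-selector case shown here.

```python
# CSS rules as (selector, style block) pairs, mirroring the example above.
css_rules = [
    ("p", {"color": "blue"}),
    ("i", {"font-style": "italic", "color": "black"}),
    ("b", {"font-weight": "bold", "color": "red"}),
]


def build_style_set(rules):
    """Drop the selectors, keep the blocks, and record each block's position."""
    style_set = []
    index_of = {}
    for selector, block in rules:
        index_of[selector] = len(style_set)
        style_set.append(dict(block))
    return style_set, index_of


style_set, index_of = build_style_set(css_rules)

# An index-tree node stores only the position of its style block.
node_b = (6, 11, index_of["b"])
print(node_b)                # (6, 11, 2)
print(style_set[node_b[2]])  # {'font-weight': 'bold', 'color': 'red'}
```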
Finally, in step S204, the text stream, the index tree and the style set are stored in association.
It should be noted that although fig. 2 shows the steps of acquiring the text stream, the index tree, and the style set in a particular chronological order, the present invention does not limit the order of these steps. For example, instead of the order shown in fig. 2, the index tree and the style set may be acquired first, and the text stream acquired afterwards. Furthermore, the above steps may be performed in parallel rather than sequentially.
Fig. 5 shows a signal flow diagram corresponding to the HTML document processing method according to the present invention. In the HTML document processing method according to the embodiment of the present invention, an HTML document and its associated CSS can be obtained from a ZIP file obtained from a server, and the single DOM tree of the related art is split into three data structures that are stored separately by separating text from tags, so that changes affecting the HTML document can be handled more flexibly. For example, when parameters such as font, line spacing, and word spacing are adjusted during reading, the style can be adjusted by modifying only the style set, on the basis of the existing index tree and text stream, without re-parsing the HTML document. This reduces system overhead and enables efficient typesetting processing.
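The benefit described above, restyling without re-parsing, can be illustrated with the three structures from the running example. This is a simplified sketch: `layout_input` is a hypothetical stand-in for whatever the typesetting step consumes, and the `line-height` value is an arbitrary illustrative setting.

```python
text_stream = "WeReadRocks!"
index_nodes = [(0, 12, 0), (0, 6, 1), (6, 11, 2)]
style_set = [
    {"color": "blue"},
    {"font-style": "italic", "color": "black"},
    {"font-weight": "bold", "color": "red"},
]


def layout_input(text, nodes, styles):
    """Pair each node's text segment with a copy of its current style block."""
    return [(text[start:end], dict(styles[style])) for start, end, style in nodes]


before = layout_input(text_stream, index_nodes, style_set)

# The reader changes the line spacing: only the style set is modified.
# The text stream and the index nodes are reused as they are.
for block in style_set:
    block["line-height"] = "1.8"

after = layout_input(text_stream, index_nodes, style_set)
print(before[1])  # ('WeRead', {'font-style': 'italic', 'color': 'black'})
print(after[1])   # ('WeRead', {'font-style': 'italic', 'color': 'black', 'line-height': '1.8'})
```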
Here, in the HTML document processing method according to the embodiment of the present invention, the parsed text stream, index tree, and style set are stored separately as three data structures, but their storage locations are not particularly limited. For example, the text stream, index tree, and style set may all be stored in memory. Alternatively, the text stream, index tree, and style set may all be stored in a nonvolatile storage unit. Alternatively, some of them may be stored in the memory and the others in the nonvolatile storage unit.
However, as an alternative implementation, in the HTML document processing method according to an embodiment of the present invention, the storage locations of the text stream, the index tree, and the style set may be specifically defined. For example, at least the text stream is stored in a non-volatile storage unit external to the memory.
In particular, the present invention is applicable to readers, web browsers, and the like. Of course, the present invention is not limited thereto. Any other application scenario involving page browsing, such as news or social networking pages with a page browsing function, may similarly apply the present invention.
Fig. 6 shows a schematic diagram of a reader to which the invention can be applied. In fig. 6, three pages are shown, respectively. Where page 601 is the start page when the user starts the reader application, page 602 is the cover and brief introduction of a book selected by the user, and page 603 is the page that is displayed when the user reads the book.
In a scenario where the present invention is applied to a reader (as shown in fig. 6), when a user wishes to read a certain book, the terminal device shown in fig. 1 may be operated to send a request to a server corresponding to the reader and receive an HTML document corresponding to the book from the server. The terminal equipment firstly stores the received HTML document in a nonvolatile storage unit, and then loads the HTML document into a memory for analysis. Since the pages in a book are fixed, the text streams, index trees and style sets corresponding to all chapters (as described later, one chapter (multiple pages) of a book corresponds to one HTML document) can be parsed out in advance, stored in a non-volatile storage unit outside the memory, and then loaded into the memory when a certain page needs to be displayed to perform typesetting processing. That is, in this case, after the text stream, the index tree, and the style set are obtained through the processing of steps S201 to S203, these three parts are stored in association in a nonvolatile storage unit outside the memory. Therefore, the time taken to parse the HTML document can be reduced when displaying the page, and the loading speed at the time of page display can be increased. Also, when performing a page display (e.g., page 603 in fig. 6), only a portion of text associated with the current page display may be loaded into memory (details of which will be described below), and memory overhead may be further reduced.
In memory, information may be represented using complex types, but in a non-volatile storage unit (e.g., a disk), information can be represented using only a single type, e.g., stored as a byte stream. For example, a group of arrays that can be described in memory as [(1, 2, 3), (4, 5, 6)] can only be described in a non-volatile storage unit as [1, 2, 3, 4, 5, 6]. Therefore, in the case of storing the index tree constructed in step S202 in the non-volatile storage unit, it is necessary to sort the data included in each node in the index tree to obtain one array that can be stored in the non-volatile storage unit.
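The difference between the two representations can be illustrated with a short sketch. The helper name `regroup` and the fixed unit width of three are assumptions chosen to match the three-number nodes used in this description.

```python
from itertools import chain

# In-memory form: one tuple per unit.
nested = [(1, 2, 3), (4, 5, 6)]

# Storable form: a single flat sequence of numbers.
flat = list(chain.from_iterable(nested))


def regroup(values, width=3):
    """Rebuild fixed-width units from the flat sequence (hypothetical helper)."""
    return [tuple(values[i:i + width]) for i in range(0, len(values), width)]
```

Because every unit has the same width, the flat sequence can be regrouped without storing any extra delimiters.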
Specifically, the step S204 of storing the text stream, the index tree, and the style set in association may further include: ordering the data included in each node in the index tree by taking the data included in each node in the index tree as a unit to form an index array; and storing the index array in a non-volatile storage unit outside the memory.
The index tree shown in fig. 3 is again taken as an example. Its three nodes are [0, 12, 0), [0, 6, 1), and [6, 11, 2). The data included in each node is taken as a unit, for example, 0, 12, 0 is one unit, and the units are extracted and ordered. Note that the numbers within one unit are treated as a whole, and their internal order must not be changed.
As a possible implementation, the units may be ordered randomly. For example, the three units [0, 12, 0), [0, 6, 1), [6, 11, 2) may be randomly ordered to form an index array [0, 6, 1, 0, 12, 0, 6, 11, 2].
However, in view of the processing overhead in subsequently restoring the index tree, as another possible implementation, the units may be ordered according to a specific rule. For example, the number indicating the style in each unit does not affect the ordering, and only the two numbers indicating the start and end points of a text segment are considered. Of course, the present invention is not limited thereto, and does not exclude the case where the numbers indicating styles also participate in the sorting, if necessary.
As an example, the ordering may follow these rules: first, the units are sorted by the left end point of the interval (i.e., the start point of the text segment) in ascending order, giving the array [0, 12, 0, 0, 6, 1, 6, 11, 2]. Then, units with the same left end point, i.e., [0, 12, 0) and [0, 6, 1), are further sorted by the right end point of the interval (i.e., the end point of the text segment) in descending order, resulting in the final index array [0, 12, 0, 0, 6, 1, 6, 11, 2].
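The ordering rule just described (left end point ascending, then right end point descending, style number ignored) can be sketched as below. The function name `to_index_array` is an assumption introduced for the example.

```python
def to_index_array(units):
    """Order units by (start ascending, end descending) and flatten them.

    Each unit is a (start, end, style) triple that is kept intact;
    the style number plays no part in the ordering.
    """
    ordered = sorted(units, key=lambda u: (u[0], -u[1]))
    return [number for unit in ordered for number in unit]


# The three nodes of the example tree, deliberately given out of order.
units = [(0, 6, 1), (6, 11, 2), (0, 12, 0)]
index_array = to_index_array(units)
```

With this ordering a parent interval always precedes the intervals it contains, which is what allows the tree to be rebuilt later in a single pass.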
Furthermore, in view of efficient use of storage space, as a possible implementation, the step of storing the index array in a non-volatile storage unit outside the memory may further include: and compressing the index array, and storing the compressed index array in a nonvolatile storage unit outside the memory.
For example, the index array may be compressed using variable-length coding. Since most HTML documents are relatively small, the numbers in the index tree nodes that represent the start and end points of text segments will also be small (especially in the nodes on the left-hand side), in which case the number of bytes used to store them can be reduced to achieve compression. Specifically, each number in the index array is processed in turn: it is determined whether the number is less than 65535. If so, a first number of bytes (e.g., 1 to 2 bytes) is used to save it. Otherwise, i.e., if the number is greater than or equal to 65535, a second number of bytes (e.g., 4 bytes) is used. Here, the first number is smaller than the second number. The storage space is therefore smaller than when a fixed number of bytes is used to hold every number in the index array.
Of course, in the case of performing the compression processing when storing the index array, when restoring the index tree, it is accordingly necessary to perform decompression on the index array first and then perform the restoration processing of the index tree.
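One way such a variable-length scheme could work is sketched below. The description does not say how a decoder distinguishes short values from long ones, so this sketch assumes that the 2-byte value 65535 is reserved as an escape marker followed by a 4-byte value; that marker, the little-endian layout, and the names `compress` and `decompress` are all assumptions, not details taken from the method itself.

```python
import struct

ESCAPE = 0xFFFF  # assumption: 65535 is reserved to mean "a 4-byte value follows"


def compress(numbers):
    """Encode each number in 2 bytes when it is below 65535, else escape + 4 bytes."""
    out = bytearray()
    for n in numbers:
        if n < ESCAPE:
            out += struct.pack("<H", n)
        else:
            out += struct.pack("<HI", ESCAPE, n)
    return bytes(out)


def decompress(data):
    """Inverse of compress(): recover the list of numbers from the byte string."""
    numbers, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from("<H", data, pos)
        pos += 2
        if n == ESCAPE:
            (n,) = struct.unpack_from("<I", data, pos)
            pos += 4
        numbers.append(n)
    return numbers


index_array = [0, 12, 0, 0, 6, 1, 6, 11, 2]
packed = compress(index_array)
```

For the example array every number is small, so `packed` takes 18 bytes instead of the 36 bytes a fixed 4-byte encoding would need; `decompress(packed)` must be applied before the index tree is restored.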
In addition to the conversion of the index tree, in order to accommodate the differences between the memory and the non-volatile storage unit, the step of storing the text stream, the index tree and the style set in association further comprises: serializing the style set into a style array in a specific format that can be stored by the non-volatile storage unit; and storing the style array in the non-volatile storage unit. For example, the style set may be serialized into a style array saved in JSON format.
In contrast to the index tree and the style set, the text stream needs no conversion and is saved directly, as plain text data, in the non-volatile storage unit.
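A minimal sketch of serializing a style set to a JSON array and reading it back is given here. The property names, the function names, and the assumption that style ids are the consecutive integers 0..n-1 (so that the array position can stand for the id) are illustrative only.

```python
import json

# Hypothetical style set: style id -> style properties.
style_set = {
    0: {"font-size": 16, "line-height": 1.5},
    1: {"font-weight": "bold"},
    2: {"font-style": "italic"},
}


def serialize_styles(styles):
    """Store the styles as a JSON array whose position is the style id."""
    return json.dumps([styles[i] for i in range(len(styles))])


def deserialize_styles(payload):
    """Rebuild the id -> style mapping from the JSON array."""
    return dict(enumerate(json.loads(payload)))


payload = serialize_styles(style_set)
restored = deserialize_styles(payload)
```

The resulting `payload` is a plain string and can be written to a file as is.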
In addition, fig. 7 shows a schematic view of a browser to which the present invention can be applied. In fig. 7, a user may type a web page address in text box 701 and in response thereto display web page content corresponding to the web page address in page 702.
In a scenario where the present invention is applied to a browser (as shown in fig. 7), when a user wishes to open a certain page, a request may be sent to a corresponding server by, for example, inputting a web address, and an HTML document corresponding to the page may be received from the server. The terminal equipment firstly stores the received HTML document in a nonvolatile storage unit, and then loads the HTML document into a memory for analysis. Because the content of the web page is uncertain, the HTML document is re-parsed each time a new page is desired to be opened. In this case, the parsed index tree and style set are directly stored in the memory, and only the text stream is stored in the non-volatile storage unit outside the memory. Also, when performing page display, only a portion of text related to the current page display (e.g., the content displayed in page 702 in FIG. 7) may be loaded into memory (the specific details of which will be described below). This is particularly useful when the text content of the page is large or applied to resource-constrained terminal devices, which can significantly reduce memory overhead.
In the above, the HTML document processing method according to the embodiment of the present invention, that is, the parsing preprocessing method performed before page display is described with reference to the accompanying drawings. Next, a page display method corresponding to the HTML document processing method described above will be described with reference to fig. 8. Wherein the page corresponds to an HTML document, a text stream, an index tree and a style set are obtained by preprocessing said HTML document, and said text stream, index tree and style set are stored in association, wherein at least said text stream is stored in a non-volatile storage unit outside the memory, as described above. As shown in fig. 8, the page display method includes the following steps.
First, in step S801, in response to an instruction to display a page, typesetting processing of the page is performed based on an index tree, a style set, and a text stream corresponding to the page, which are obtained in advance by processing an HTML document corresponding to the page, the text stream being obtained by separating a tag contained in the HTML document and containing only text, the index tree including one or more nodes, and each node including data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment, wherein the tag is used to annotate a style of the text segment corresponding thereto, the style set being a set of styles of the text segment corresponding to each node in the index tree, and the text stream, the index tree, and the style set being stored in association.
In addition, as an alternative embodiment, at least the text stream is stored in a non-volatile storage unit outside the memory, and in correspondence therewith, the step S801 of performing typesetting processing of pages based on the index tree, style set, and text stream corresponding to the pages may further include retrieving the text stream corresponding to the HTML document in the non-volatile storage unit and loading only a part of the text stream into the memory in response to an instruction to display a page; and performing typesetting processing of pages based on the index tree, the style set and the partial text stream.
For example, in the application scenario of the reader, one chapter (multiple pages) of a book corresponds to one HTML document, that is, the parsed text stream, index tree, and style set all correspond to one chapter (multiple pages). When a certain page is displayed, an HTML document corresponding to a chapter including the page needs to be retrieved in a nonvolatile storage unit to obtain a text stream. However, as previously mentioned, the resulting text stream here is a text stream corresponding to the entire chapter, which would cause excessive memory overhead if it were loaded entirely into memory. Therefore, the text of the entire chapter does not need to be loaded into the memory, and only the text of the current display page and the previous page or the next page (two pages in total) need to be loaded into the memory. In addition, in the application scenario of the web browser, if the length of one page is long (e.g., exceeds the screen size, and scrolling is required), only the text corresponding to the display screen size and the text of the previous screen or the next screen (two screens in total) may be loaded into the memory.
Since the text content in most HTML documents for access by mobile devices tends to be very large, by only maintaining a buffer of partially related text (e.g., two pages or screens) in memory to load the content, the memory overhead can be reduced and does not grow linearly with the increase in text content.
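Loading only a window of the text stream can be sketched as follows. To keep seeking simple, this sketch assumes the text stream is saved in a fixed-width encoding (UTF-32, 4 bytes per character) so that a character offset maps directly to a byte offset; the encoding choice, the file name, and the function names are assumptions, not part of the described method.

```python
import os
import tempfile

BYTES_PER_CHAR = 4  # assumption: UTF-32 so character offsets map to byte offsets


def save_text_stream(path, text):
    """Write the text stream to non-volatile storage as fixed-width characters."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-32-le"))


def load_window(path, start, end):
    """Read only characters [start, end) without loading the rest of the file."""
    with open(path, "rb") as f:
        f.seek(start * BYTES_PER_CHAR)
        return f.read((end - start) * BYTES_PER_CHAR).decode("utf-32-le")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chapter.txt")
    save_text_stream(path, "Hello world")
    window = load_window(path, 6, 11)  # e.g. the text of the current page
```

The amount of memory used depends on the size of the window (for instance two pages or two screens), not on the length of the whole chapter.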
It should be noted here that, in the application scenario of the browser, only the text stream is stored in the non-volatile storage unit outside the memory after the HTML document is parsed to obtain the text stream, the index tree, and the style set. In other words, the index tree and the style set remain in the memory and undergo no conversion. Therefore, when performing the typesetting processing, it is only necessary to load the necessary text into the memory and perform the typesetting processing of the page based on the index tree, the style set, and that portion of the text.
In the application scenario of the reader, since the HTML document is parsed in advance to obtain the text stream, the index tree, and the style set, and all of them are stored in the non-volatile storage unit outside the memory, the index tree and the style set need to be converted in order to adapt to the different storage characteristics of the non-volatile storage unit. That is, as described above, the index tree is transformed into an index array and the style set is serialized into a style array, and both arrays are stored in a non-volatile storage unit outside the memory.
In this case, therefore, the method further includes a step of restoring the index tree and the style set before step S801. Specifically, before step S801, the method further includes: loading the index array corresponding to the page into the memory, and restoring the index tree from the index array; and loading the style array corresponding to the page into the memory, and deserializing the style array into the style set.
As one possible implementation, still described using the example above, the recovery process of the index tree may include the following steps.
Step one: the index array [0, 12,0,0,6,1,6, 11,2] is read from the nonvolatile storage unit to the memory.
Step two: a tree with [0, ∞, 0) as the root node is constructed in the memory, where, as in the structure described above, the first two numbers represent the start and end of a text segment, and the third number represents the style corresponding to the text segment and does not affect the judgment in step four below. A pointer P is then assigned to the current node of the index tree. Since the index tree currently contains only the root node, pointer P points to the root node.
Step three: the index array is read sequentially, three numbers at a time, each group forming a left-closed, right-open unit corresponding to the node structure described above. First, a pointer T is assigned to the first unit [0, 12, 0), where pointer T represents the node currently to be placed. Fig. 9 (A) shows the index tree at this time, in which the position of the node pointed to by pointer T has not yet been determined. As described above, the third number in each unit indicates only the style to which the text segment corresponds; it is unrelated to the text segment interval and therefore plays no part in the subsequent judgment. Note that in fig. 9 (A) to 9 (D), the portion within the box represents the part of the index tree already formed. The node pointed to by pointer T is located outside the box since its position in the index tree has not yet been determined.
Step four: judging whether the following conditions are satisfied: the left end point of the text segment section of the node pointed by the pointer T is larger than or equal to the left end point of the text segment section of the node pointed by the pointer P, and the right end point of the text segment section of the node pointed by the pointer T is smaller than or equal to the right end point of the text segment section of the node pointed by the pointer P.
If the determination result is yes, in other words, if the text segment interval of the node pointed to by pointer T is contained in the text segment interval of the node pointed to by pointer P, the node pointed to by pointer T is inserted into the index tree as a child node of the current node pointed to by pointer P. If the current node already has a child node, whether the node to be inserted is placed to the left or to the right of the existing child node is decided by comparing the text segment interval of the node to be inserted with that of the existing child node.
On the other hand, if the judgment result is no, the P pointer is pointed to the father node of the current node, and the judgment of the step four is repeated until the T node meets the condition.
Then, pointer P is pointed to pointer T, i.e.: the current node pointed to by pointer P is moved to the node pointed to by pointer T and pointer T is pointed to the next element in the index array until the index array is empty.
Fig. 9 (B) shows a case where the node corresponding to the first unit is inserted into the index tree and the next unit is read. Since the first unit [0, 12, 0) satisfies the condition, it is inserted into the index tree as a child node of the root node. At the same time, pointer P is pointed to [0, 12, 0) and pointer T is pointed to the next node to be placed, [0, 6, 1).
Fig. 9 (C) shows a case where the node corresponding to the second unit is placed, and fig. 9 (D) shows a case where the node corresponding to the third unit is placed. In fig. 9 (C), the current node to be placed, [6, 11, 2), pointed to by pointer T is not contained in the node [0, 6, 1) pointed to by pointer P, and thus pointer P is changed to point to the parent node [0, 12, 0) of the current node, as shown in fig. 9 (D). Then, the node [0, 12, 0) pointed to by pointer P is compared with the node [6, 11, 2) pointed to by pointer T; this time the above condition is met, and the node is inserted into the index tree. Since the interval of this node lies to the right of node [0, 6, 1), it is inserted to the right of node [0, 6, 1).
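Steps one to four above can be sketched in a few lines. This is an illustrative sketch only: the `Node` class and `restore_index_tree` function are names introduced here, and the root is assumed to span [0, ∞) so that every unit is contained in it. The index array is assumed to be already decompressed and ordered as described earlier.

```python
import math


class Node:
    """One index-tree node: text segment [start, end) plus a style id."""

    def __init__(self, start, end, style, parent=None):
        self.start, self.end, self.style = start, end, style
        self.parent = parent
        self.children = []

    def contains(self, other):
        # Step four: other's interval lies within this node's interval.
        return self.start <= other.start and other.end <= self.end


def restore_index_tree(index_array):
    root = Node(0, math.inf, 0)  # assumption: the root covers all text
    p = root                     # pointer P: current node in the tree
    for i in range(0, len(index_array), 3):
        t = Node(*index_array[i:i + 3])  # pointer T: node to be placed
        while not p.contains(t):         # climb until T fits inside P
            p = p.parent
        t.parent = p
        # Keep siblings ordered left to right by their start offset.
        position = sum(1 for child in p.children if child.start <= t.start)
        p.children.insert(position, t)
        p = t                            # P moves to the node just placed
    return root


root = restore_index_tree([0, 12, 0, 0, 6, 1, 6, 11, 2])
top = root.children[0]
```

For the example array, `top` is the node [0, 12, 0) and its children are [0, 6, 1) followed by [6, 11, 2), matching fig. 9 (D).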
Then, in step S802, the typeset processed page is displayed.
In addition, during the page display process, the user may adjust parameters such as the font, line height, and page width and height. In the invention, because the style set is parsed and stored independently, the index tree and the text stream can be kept unchanged while only the style set is modified, and typesetting and display can then be executed based on the modified style set and the original index tree and text stream.
Therefore, the page display method according to the present invention may further include, after step S802: modifying the style set in response to an instruction to change the style of the page; executing typesetting processing of the page based on the modified style set, the original text stream and the original index tree; and displaying the typeset page. Thus, when parameters such as the font, line spacing, or word spacing are adjusted during reading, the HTML document does not need to be parsed again; the style can be adjusted by modifying only the style set, on the basis of the existing index tree and text stream, thereby reducing system overhead and realizing efficient typesetting processing.
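The effect of changing only the style set can be illustrated with a toy example. The `typeset` function below merely pairs each text segment with its style and stands in for a real layout engine; it, the sample data, and the property names are assumptions made for the illustration.

```python
text_stream = "Hello world"
index_nodes = [(0, 11, 0), (0, 6, 1), (6, 11, 2)]  # (start, end, style id)
style_set = {
    0: {"font-size": 16},
    1: {"font-weight": "bold"},
    2: {"font-style": "italic"},
}


def typeset(text, nodes, styles):
    """Toy stand-in for layout: pair each text segment with a copy of its style."""
    return [(text[start:end], dict(styles[sid])) for start, end, sid in nodes]


before = typeset(text_stream, index_nodes, style_set)

# The user enlarges the font: only the style set is touched.
style_set[0]["font-size"] = 20
after = typeset(text_stream, index_nodes, style_set)
```

The second call reuses the same `text_stream` and `index_nodes`; no HTML is parsed again.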
Hereinabove, the HTML document processing method and the page display method according to the embodiment of the present invention have been described in detail with reference to fig. 1 to 9 (a) -9 (D). Next, an HTML document processing apparatus corresponding to the HTML document processing method described above will be described with reference to fig. 10.
As shown in fig. 10, the HTML document processing apparatus 1000 includes a text stream obtaining device 1001, an index tree constructing device 1002, a style set obtaining device 1003, and a storage device 1004. The HTML document processing apparatus 1000 may be an integral part of the terminal apparatus 20 described above with reference to fig. 1.
The text stream acquiring means 1001 acquires a text stream containing only text by separating tags contained in an HTML document.
The index tree constructing means 1002 constructs an index tree including one or more nodes by parsing tags and text in the HTML document, wherein the tags are used for annotating a style of a text segment corresponding thereto, and each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment.
The style set acquisition means 1003 acquires a style set, which is a set of specific contents of a style corresponding to each node in the index tree.
The storage 1004 stores the text stream, the index tree, and the style set in association with each other.
In the HTML document processing apparatus according to the embodiment of the invention, the single DOM tree of the related art is split into three data structures that are stored separately by separating the text from the tags, so that changes to the HTML document can be handled more flexibly. For example, when parameters such as the font, line spacing, or word spacing are adjusted during reading, the style can be adjusted by modifying only the style set, on the basis of the existing index tree and text stream and without re-parsing the HTML document, so that system overhead is reduced and efficient typesetting processing is realized.
Here, in the HTML document processing apparatus according to the embodiment of the present invention, the text stream, the index tree, and the style set obtained by the text stream obtaining means 1001, the index tree constructing means 1002, and the style set obtaining means 1003 are individually stored as three data structures, but the storage positions of the text stream, the index tree, and the style set are not particularly limited. For example, the text stream, index tree, and style set may all be stored in memory. Alternatively, the text stream, index tree, and style set may all be stored in a nonvolatile memory unit. Alternatively, a part of them may be stored in the memory, and another part may be stored in the nonvolatile memory unit.
However, as an alternative implementation, in the HTML document processing apparatus according to an embodiment of the present invention, the storage locations of the text stream, the index tree, and the style set may be specifically defined. For example, at least the text stream is stored in a non-volatile storage unit external to the memory.
For example, the storage 1004 may include memory and non-volatile storage units, and at least the text stream is stored in the non-volatile storage units outside the memory.
In particular, the present invention is applicable to readers, web browsers, and the like. In a case where the present invention is applied to a reader, when a user wishes to read a certain book, the terminal device shown in fig. 1 may be operated to send a request to a server corresponding to the reader and receive an HTML document corresponding to the book from the server. The terminal equipment firstly stores the received HTML document in a nonvolatile storage unit, and then loads the HTML document into a memory for analysis. Because the pages in the book are fixed, the text streams, index trees and style sets corresponding to all chapters can be resolved in advance and stored in a nonvolatile storage unit outside the memory, and then the text streams, index trees and style sets corresponding to a certain page are loaded into the memory when the page is required to be displayed so as to execute typesetting processing. That is, in this case, after the text stream, the index tree, and the style set are obtained by the processing of the text stream obtaining means 1001, the index tree constructing means 1002, and the style set obtaining means 1003, these three parts are stored in association in a nonvolatile storage unit outside the memory. Therefore, the time taken to parse the HTML document can be reduced when displaying the page, and the loading speed at the time of page display can be increased. Also, when performing page display, only a portion of text related to the current page display may be loaded into the memory (details will be described later), and memory overhead may be further reduced.
For memories, information may be represented using complex types, but for non-volatile storage units (e.g., disks), information may be represented using only a single type, e.g., a byte stream may be stored. For example, a set of arrays in memory can be described as [ (1, 2, 3), (4, 5, 6) ], whereas in a non-volatile memory cell, only the set of arrays can be described as [1,2,3,4,5,6]. Therefore, in the case of storing the index tree constructed by the index tree constructing apparatus 1002 in the nonvolatile memory unit, it is necessary to sort the data included in each node in the index tree to obtain one array that can be stored in the nonvolatile memory unit.
Accordingly, the HTML document processing apparatus 1000 may further include: index tree converting means 1005 (shown in broken lines in the figure because it is an optional component) for sorting the data included in each node in the index tree, with the data included in each node as a unit, to form an index array, wherein the data in one unit is treated as a whole and its internal order is not changed during sorting. The index array is stored in a non-volatile storage unit outside the memory.
In addition, in view of effective utilization of the storage space, as a possible embodiment, the HTML document processing apparatus 1000 may further include: compression means 1006 (shown in broken lines in the figure because it is an optional component) for performing compression on the index array. The compressed index array is stored in a non-volatile storage unit outside the memory.
For example, the index array may be compressed using variable-length coding. Since most HTML documents are relatively small, the numbers in the index tree nodes that represent the start and end points of text segments will also be small (especially in the nodes on the left-hand side), in which case the number of bytes used to store them can be reduced to achieve compression. Specifically, each number in the index array is processed in turn: it is determined whether the number is less than 65535. If so, a first number of bytes (e.g., 1 to 2 bytes) is used to save it. Otherwise, i.e., if the number is greater than or equal to 65535, a second number of bytes (e.g., 4 bytes) is used. Here, the first number is smaller than the second number. The storage space is therefore smaller than when a fixed number of bytes is used to hold every number in the index array.
Of course, in the case of performing the compression processing when storing the index array, when restoring the index tree, it is accordingly necessary to perform decompression on the index array first and then perform the restoration processing of the index tree.
In addition to the index tree conversion means, in order to accommodate the difference between the memory and the non-volatile storage unit, the HTML document processing apparatus may further include: style set conversion means 1007 (shown in broken lines in the figure because it is an optional component) for serializing the style set into a style array of a specific format that can be stored by the non-volatile storage unit. The style array is stored in the non-volatile storage unit. For example, the style set may be serialized into a style array saved in JSON format.
In contrast to the index tree and the style set, the text stream needs no conversion and is saved directly, as plain text data, in the non-volatile storage unit.
In addition, in a case where the present invention is applied to a browser, when a user wishes to open a certain page, a request may be transmitted to a corresponding server by, for example, inputting a web address, and an HTML document corresponding to the page may be received from the server. The terminal equipment firstly stores the received HTML document in a nonvolatile storage unit, and then loads the HTML document into a memory for analysis. Because the content of the web page is uncertain, the HTML document is re-parsed each time a new page is desired to be opened. In this case, the parsed index tree and style set are directly stored in the memory, and only the text stream is stored in the non-volatile storage unit outside the memory. When the page display is executed, only a part of text related to the current page display may be loaded into the memory. This is particularly useful when the text content of the page is substantial, which can significantly reduce memory overhead.
Next, a page display device according to an embodiment of the present invention will be described with reference to fig. 11. The page display device corresponds to the page display method described above and is used in conjunction with the HTML document processing device 1000 described above. As shown in fig. 11, the page display device 1100 includes a storage means 1101, a layout processing means 1102, and a display means 1103. The page display device 1100 may be the terminal device 20 or one of the components described above with reference to fig. 1.
The storage 1101 includes a memory and a nonvolatile storage unit, and stores, in association, a text stream, an index tree, and a style set obtained by preprocessing the HTML document corresponding to a page. The text stream is obtained by separating the tags contained in the HTML document and contains only text, wherein each tag is used to annotate the style of the text segment corresponding thereto. The index tree includes one or more nodes, each node including data indicating the text segment corresponding to the node in the text and data indicating the style of that text segment. The style set is the set of styles of the text segments corresponding to the nodes in the index tree.
The typesetting processing means 1102 is for performing typesetting processing of pages based on the index tree, the style set, and the text stream.
The display device 1103 is configured to display the page processed by the typesetting processing device.
In correspondence with the HTML document processing apparatus according to the embodiment of the present invention, in the page display apparatus, typesetting display of pages can be performed more flexibly based on the text stream, the index tree, and the style set held as three independent data structures.
In addition, as an alternative embodiment, at least the text stream is stored in a nonvolatile storage unit outside the memory, and in correspondence therewith, the page display device 1100 may further include: text stream loading means 1104 (shown in broken lines in the figure because it is an optional component) for retrieving the text stream corresponding to the HTML document in the nonvolatile storage unit and loading only a part of the text stream into the memory in response to an instruction to display a page. For example, in the application scenario of the reader, one chapter (multiple pages) of a book corresponds to one HTML document, that is, the parsed text stream, index tree, and style set all correspond to one chapter (multiple pages). When a certain page is displayed, the data corresponding to the chapter including the page needs to be retrieved in the nonvolatile storage unit to obtain the text stream. However, as previously mentioned, the resulting text stream is the text stream of the entire chapter, and loading it entirely into memory would cause excessive memory overhead. Thus, the text of the entire chapter need not be loaded into the memory; only the text of the currently displayed page and of the previous or next page (two pages in total) needs to be loaded. Likewise, in the application scenario of the web browser, if a page is long (e.g., exceeds the screen size so that scrolling is required), only the text corresponding to the display screen size and the text of the previous screen or the next screen (two screens in total) may be loaded into the memory. The typesetting processing means 1102 is further configured to perform typesetting processing of pages based on the index tree, the style set, and the partial text stream.
Since the text content of most HTML documents accessed by mobile devices tends to be very large, keeping only a buffer of the relevant part of the text (e.g., two pages or two screens) in the memory reduces memory overhead, and the overhead no longer grows linearly with the amount of text content.
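The partial loading described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the text stream is stored on disk in a fixed-width encoding (UTF-32-LE) so that a character offset maps directly to a byte offset, and that every page holds a fixed number of characters; in practice page boundaries would come from the typesetting result. The names `load_text_window`, `load_two_pages`, and `BYTES_PER_CHAR` are hypothetical and do not appear in the embodiments.

```python
import os
import tempfile

# Assumption: the text stream is stored as UTF-32-LE, i.e. 4 bytes per character.
BYTES_PER_CHAR = 4


def load_text_window(path, start_char, char_count):
    """Read only char_count characters, starting at start_char, from the stored stream."""
    with open(path, "rb") as f:
        f.seek(start_char * BYTES_PER_CHAR)        # skip everything before the window
        raw = f.read(char_count * BYTES_PER_CHAR)  # read just the window
    return raw.decode("utf-32-le")


def load_two_pages(path, page_index, chars_per_page):
    """Load the current page and the next page (two pages in total)."""
    return load_text_window(path, page_index * chars_per_page, 2 * chars_per_page)


# Demo: a 50-page "chapter" in which every page is exactly 9 characters long.
chapter = "".join(f"[page {i:02d}]" for i in range(50))
with tempfile.TemporaryDirectory() as tmp:
    stream_path = os.path.join(tmp, "chapter.txt32")
    with open(stream_path, "wb") as f:
        f.write(chapter.encode("utf-32-le"))
    window = load_two_pages(stream_path, page_index=3, chars_per_page=9)

print(window)  # [page 03][page 04]
```

Whatever the chapter length, only the two-page window is held in memory, which is the property the paragraph above relies on.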
It should be noted here that, in the application scenario of the browser, only the text stream is stored in the nonvolatile storage unit outside the memory after the HTML document is parsed to obtain the text stream, the index tree, and the style set. In other words, the index tree and the style set remain in the memory and undergo no conversion. When performing the typesetting processing, it is therefore only necessary to load the required part of the text into the memory and to typeset the page based on the index tree, the style set, and that part of the text.
In the application scenario of the reader, the HTML document is parsed in advance to obtain the text stream, the index tree, and the style set, and all three are stored in the nonvolatile storage unit outside the memory. The index tree and the style set therefore need to be converted to suit the storage characteristics of the nonvolatile storage unit. That is, as described above, the index tree is transformed into an index array and the style set is serialized into a style array, and both arrays are stored in the nonvolatile storage unit outside the memory.
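As a non-authoritative sketch of this conversion, the following code flattens an index tree into an index array and serializes a style set. It follows the ordering stated in the claims (ascending left end point; for equal left end points, descending right end point) and stores each node as a unit of three numbers. The `Node` class, the half-open `[start, end)` interval convention, the omission of the root node from the array, and the use of JSON as the "specific format" are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    start: int      # left end point of the text segment interval
    end: int        # right end point (assumed exclusive)
    style_id: int   # position of the node's style in the style set
    children: list = field(default_factory=list)


def tree_to_index_array(root):
    """Flatten all descendants of root into [start, end, style_id, ...]."""
    units = []
    stack = list(root.children)  # assumption: the root itself is not stored
    while stack:
        node = stack.pop()
        units.append((node.start, node.end, node.style_id))
        stack.extend(node.children)
    # Ascending left end point; ties broken by descending right end point,
    # so an enclosing segment always precedes the segments it contains.
    units.sort(key=lambda u: (u[0], -u[1]))
    return [number for unit in units for number in unit]


def serialize_style_set(style_set):
    """Serialize the style set to bytes (JSON chosen only for illustration)."""
    return json.dumps(style_set, sort_keys=True).encode("utf-8")


root = Node(0, 20, 0, [
    Node(0, 10, 1, [Node(0, 4, 2), Node(5, 10, 3)]),
    Node(10, 20, 1),
])
index_array = tree_to_index_array(root)
style_bytes = serialize_style_set(
    [{"font_size": 16}, {"font_size": 24}, {"bold": True}, {"italic": True}]
)
print(index_array)  # [0, 10, 1, 0, 4, 2, 5, 10, 3, 10, 20, 1]
```

Both results are flat sequences, so they can be written to a file or database record without any pointer structure.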
Thus, in this case, the page display device may further include: index tree restoration means 1105 (shown in broken lines in the figure because it is an optional component) for loading the index array corresponding to the page into the memory and restoring the index tree corresponding to the index array based on the index array and a second predetermined rule; and style set restoring means 1106 (shown in broken lines in the figure because it is an optional component) for loading the style array corresponding to the page into the memory and deserializing it into the style set.
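The restoration of the index tree can be sketched as follows, based on the pointer procedure recited in the claims: a pointer P marks the current node, each three-number unit T is tested for containment in P, and P climbs to its parent until T fits. This is a minimal sketch, not the embodiment itself. The `Node` class, the function names, and the choice of a root node covering `[0, text_length]` with style 0 are assumptions. Because the array is assumed to be sorted by ascending left end point, appending each new child on the right keeps siblings in text order.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    start: int
    end: int
    style_id: int
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list = field(default_factory=list)


def restore_index_tree(index_array, text_length):
    """Rebuild the tree from units of three numbers (start, end, style_id)."""
    root = Node(0, text_length, 0)  # assumption: root spans the whole text
    p = root                        # pointer P: the current node
    for i in range(0, len(index_array), 3):
        start, end, style_id = index_array[i:i + 3]
        t = Node(start, end, style_id)  # pointer T: the node to be placed
        # Climb until the interval of T lies inside the interval of P.
        while not (t.start >= p.start and t.end <= p.end):
            if p.parent is None:
                raise ValueError("unit does not fit inside the root interval")
            p = p.parent
        t.parent = p
        p.children.append(t)  # sorted input: the new child goes on the right
        p = t                 # move P to the node just placed
    return root


def as_tuple(node):
    """Nested-tuple view of a subtree, convenient for inspection."""
    return (node.start, node.end, node.style_id,
            [as_tuple(child) for child in node.children])


tree = restore_index_tree([0, 10, 1, 0, 4, 2, 5, 10, 3, 10, 20, 1], text_length=20)
print(as_tuple(tree))
```

Each unit is placed after at most a short climb toward the root, so the tree is rebuilt in a single pass over the array.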
Of course, the page display device may further include decompression means 1107 for decompressing the index array before restoring the index tree in the case of compressing the index array, corresponding to the HTML document processing means.
In addition, during page display, the user may adjust parameters such as font, line height, width, and height. In the present invention, because the style set is parsed and stored independently, the index tree and the text stream can remain unchanged: only the style set is modified, and typesetting display is performed based on the modified style set together with the original index tree and text stream.
Accordingly, the page display device may further include style set modification means 1107 (shown in broken lines in the figure because it is an optional component) for modifying the style set in response to an instruction to change the style of the page. The typesetting processing means 1102 then adjusts the typesetting of the page based on the modified style set.
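A minimal sketch of such a style change is given below. It assumes that each style is a dictionary with a `font_size` entry and that nodes are `(start, end, style_id)` records; these representations, and the names `scale_font_size` and `styled_runs`, are hypothetical. The point illustrated is that only the style set is replaced, while the text stream and the index records are reused unchanged.

```python
def scale_font_size(style_set, factor):
    """Return a new style set with every font size scaled; the input is not mutated."""
    return [{**style, "font_size": round(style["font_size"] * factor, 2)}
            for style in style_set]


def styled_runs(text, nodes, style_set):
    """Pair each text segment with its style, as a layout step might consume it."""
    return [(text[start:end], style_set[style_id])
            for start, end, style_id in nodes]


text = "TitleBody text"            # text stream (unchanged by restyling)
nodes = [(0, 5, 0), (5, 14, 1)]    # index records (unchanged by restyling)
styles = [{"font_size": 24, "bold": True}, {"font_size": 16, "bold": False}]

larger = scale_font_size(styles, 1.25)  # user asks for a larger font
runs = styled_runs(text, nodes, larger)
print(runs[0])  # ('Title', {'font_size': 30.0, 'bold': True})
```

No re-parsing of the HTML document is involved: the same `text` and `nodes` are passed to the layout step with the new style set.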
In addition, it is to be noted here that, since the specific processing of each device in the HTML document processing apparatus and the page display apparatus corresponds entirely to the HTML document processing method and the page display method described above, specific details of each processing are not described here for the sake of avoiding redundancy. Those skilled in the art will appreciate that the processing in the HTML document processing method and the page display method described above can be applied to each device in the HTML document processing apparatus and the page display apparatus entirely similarly.
An HTML document processing apparatus according to the present invention is shown in fig. 12 as one example of a hardware entity. The HTML processing device includes a processor 1201, a memory 1202, and at least one external communication interface 1203. The processor 1201, the memory 1202 and the external communication interface 1203 are all connected via a bus 1204.
The processor 1201, which performs data processing, may be implemented as a microprocessor, a central processing unit (CPU), a digital signal processor (DSP), or a field-programmable gate array (FPGA). The memory 1202 contains operation instructions, which may be computer-executable code, and the steps in the flow of the HTML document processing method of the above-described embodiment of the present invention are implemented by these operation instructions.
A page display device according to the present invention is shown in fig. 13 as an example of a hardware entity. The page display device includes a processor 1301, a memory 1302, a display 1303, and at least one external communication interface 1304. The processor 1301, memory 1302, display 1303 and external communication interface 1304 are all connected by a bus 1305.
The processor 1301, which performs data processing, may be implemented as a microprocessor, a central processing unit (CPU), a digital signal processor (DSP), or a field-programmable gate array (FPGA). The memory 1302 contains operation instructions, which may be computer-executable code, and each step in the flow of the page display method according to the embodiment of the present invention described above is implemented by these operation instructions.
Fig. 14 shows a schematic diagram of a computer-readable recording medium according to an embodiment of the present invention. As shown in fig. 14, a computer-readable recording medium 1400 according to an embodiment of the present invention has stored thereon computer program instructions 1401. When the computer program instructions 1401 are executed by a processor, the HTML document processing method or the page display method according to the embodiment of the present invention described with reference to the above drawings is performed.
Thus far, an HTML document processing method, a page display method, an apparatus, and a medium according to embodiments of the present invention have been described in detail with reference to fig. 1 to 14. In these embodiments, the single DOM tree of the prior art is split, by separating text from tags, into three data structures that are stored separately, so that changes affecting the HTML document can be handled more flexibly. For example, when parameters such as the font, line spacing, or word spacing are adjusted during reading, the style can be adjusted by modifying only the style set, on the basis of the existing index tree and text stream and without re-parsing the HTML document, which reduces system overhead and achieves efficient typesetting processing. In addition, when page display is performed, only the part of the text related to the current page display is loaded into the memory. This is particularly useful when the text content of the page is large or when the method is applied to terminal devices with limited resources: memory overhead can be greatly reduced, which prevents stuttering caused by excessive memory occupation and even failure to complete page typesetting because of insufficient memory.
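To make the separation into three data structures concrete, the following sketch uses Python's standard `html.parser` module to split a small, well-formed HTML fragment into a text stream, a list of per-element interval records, and a style set. It is an illustration under stated assumptions rather than the parsing procedure of the embodiments: a "style" is reduced to the pair of tag name and inline `style` attribute, the interval records are kept as a flat list (their nesting is implied by interval containment), and unclosed or void tags are not handled.

```python
from html.parser import HTMLParser


class Splitter(HTMLParser):
    """Separates text from tags and records one interval per closed element."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # pieces of the text stream
        self.length = 0       # number of characters emitted so far
        self.styles = []      # style set: unique (tag, inline style) pairs
        self.nodes = []       # interval records: (start, end, style_id)
        self._open = []       # stack of currently open elements

    def _style_id(self, tag, attrs):
        style = (tag, dict(attrs).get("style") or "")
        if style not in self.styles:
            self.styles.append(style)
        return self.styles.index(style)

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, self.length, self._style_id(tag, attrs)))

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, start, style_id = self._open.pop()
            self.nodes.append((start, self.length, style_id))

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def split_html(html):
    parser = Splitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts), parser.nodes, parser.styles


text, nodes, styles = split_html("<p><b>Hi</b> there</p>")
print(text)    # Hi there
print(nodes)   # [(0, 2, 1), (0, 8, 0)]
print(styles)  # [('p', ''), ('b', '')]
```

The three results can then be stored and updated independently, which is what allows a style change or a partial text load to leave the other two structures untouched.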
It should be noted that in this specification the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
Finally, it is also to be noted that the above-described series of processes includes not only processes performed in time series in the order described herein, but also processes performed in parallel or separately, not in time series.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software plus the necessary hardware platform, but may of course also be implemented entirely in software. With such understanding, all or part of the technical solution of the present invention contributing to the background art may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the embodiments or some parts of the embodiments of the present invention.
The present invention has been described in detail above. Specific examples have been used herein to explain the principles and embodiments of the invention, and the description of the above embodiments is intended only to facilitate understanding of the method of the invention and its core concepts. Meanwhile, those of ordinary skill in the art may, in accordance with the ideas of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. An HTML document processing method, comprising:
obtaining a text stream containing only text by separating tags contained in the HTML document;
constructing, by parsing tags and text in the HTML document, an index tree that includes one or more nodes, wherein each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment;
obtaining a style set, which is a set of styles of text segments corresponding to each node in the index tree; and
storing the text stream, the index tree, and the style set in association.
2. The method of claim 1, wherein the text stream is stored in a non-volatile storage unit external to memory.
3. The method of claim 2, wherein the step of storing the text stream, the index tree, and the style set in association further comprises:
ordering, with the data included in each node of the index tree taken as one unit, the data included in the nodes of the index tree to form an index array; and
storing the index array in a nonvolatile storage unit outside the memory.
4. The method of claim 3, wherein the step of ordering the data included by the nodes to form an index array further comprises:
sorting the units in ascending order of the left end points of the intervals indicated by the data included in each node, to obtain a first array; and
for two units in the first array whose intervals have the same left end point, further sorting the units in descending order of the right end points of their intervals, to obtain the index array.
5. The method of claim 3, wherein storing the index array in a non-volatile storage unit outside of memory further comprises:
compressing the index array; and
storing the compressed index array in a nonvolatile storage unit outside the memory.
6. The method of claim 2, wherein the step of storing the text stream, the index tree, and the style set in association further comprises:
serializing the style set into a style array in a specific format that the nonvolatile storage unit can store; and
storing the style array in the nonvolatile storage unit.
7. A page display method, comprising:
in response to an instruction to display a page, performing typesetting processing of the page based on an index tree, a style set, and a text stream corresponding to the page, wherein the index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page, the text stream is obtained by separating a tag contained in the HTML document and contains only text, the index tree includes one or more nodes, and each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment, the style set being a set of styles of the text segment corresponding to each node in the index tree; and
displaying the typeset page.
8. The method of claim 7, wherein at least the text stream is stored in a non-volatile storage unit outside a memory, and the step of performing typesetting processing of pages based on the index tree, style set, and text stream corresponding to the pages further comprises:
retrieving the text stream corresponding to the HTML document from the nonvolatile storage unit, and loading only a part of the text stream into the memory; and
performing typesetting processing of the page based on the index tree, the style set, and the partial text stream.
9. The method of claim 7, wherein the index tree is transformed into an index array and stored in a non-volatile storage unit outside of memory, the style set is serialized into a style array and stored in a non-volatile storage unit outside of memory, and wherein the method further comprises:
loading an index array corresponding to the page into a memory, and restoring an index tree corresponding to the index array based on the index array; and
loading the style array corresponding to the page into the memory, and deserializing the style array into the style set.
10. The method of claim 9, wherein recovering an index tree corresponding to the index array based on the index array further comprises:
constructing in a memory an index tree taking [0, ] and [0 ] as a root node, and assigning a pointer P that points to a current node on the index tree;
reading numbers from the index array sequentially, three numbers at a time as one unit, and assigning a pointer T to the first unit, wherein the pointer T represents the node currently to be placed;
judging whether the following condition is satisfied: the left end point of the text segment interval of the node pointed to by the pointer T is greater than or equal to the left end point of the text segment interval of the node pointed to by the pointer P, and the right end point of the text segment interval of the node pointed to by the pointer T is less than or equal to the right end point of the text segment interval of the node pointed to by the pointer P;
if the judgment result is yes, inserting the node pointed to by the pointer T into the index tree as a child node of the current node pointed to by the pointer P, and, if the current node already has a child node, determining whether the node to be inserted is placed to the right or to the left of the existing child node by comparing the text segment intervals of the node to be inserted and the existing child node;
if the judgment result is no, pointing the pointer P to the parent node of the current node, and repeating the above judgment until the node pointed to by the pointer T satisfies the condition; and
moving the pointer P to the node pointed to by the pointer T, and pointing the pointer T to the next unit in the index array, until the index array is empty.
11. The method of claim 7, further comprising:
modifying the set of styles in response to instructions to change the styles of the pages;
performing typesetting processing of pages based on the modified style set, the index tree, and the text stream; and
displaying the typeset page.
12. An HTML document processing apparatus comprising:
text stream obtaining means for obtaining a text stream containing only text by separating tags contained in an HTML document;
index tree construction means for constructing an index tree including one or more nodes by parsing tags and text in the HTML document, and each node including data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment;
style set obtaining means for obtaining a style set, which is a set of styles of text segments corresponding to each node in the index tree; and
storage means for storing the text stream, the index tree, and the style set in association.
13. A page display device, comprising:
a storage device including a memory and a nonvolatile storage unit for storing a text stream obtained by preprocessing an HTML document corresponding to a page, an index tree, and a style set, wherein the text stream is obtained by separating tags contained in the HTML document and contains only text, the index tree includes one or more nodes, and each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment, and the style set is a set of styles of the text segment corresponding to each node in the index tree;
typesetting processing means for performing typesetting processing of a page based on an index tree, a style set, and a text stream corresponding to the page in response to an instruction to display the page; and
display means for displaying the page processed by the typesetting processing means.
14. A computer-readable recording medium for storing thereon a computer program which, when executed by a processor, performs the following processes:
obtaining a text stream containing only text by separating tags contained in the HTML document;
constructing, by parsing tags and text in the HTML document, an index tree that includes one or more nodes, wherein each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment;
obtaining a style set, which is a set of styles of text segments corresponding to each node in the index tree; and
storing the text stream, the index tree, and the style set in association.
15. A computer-readable recording medium for storing thereon a computer program which, when executed by a processor, performs the following processes:
In response to an instruction to display a page, performing typesetting processing of the page based on an index tree, a style set, and a text stream corresponding to the page, wherein the index tree, the style set, and the text stream are obtained in advance by processing an HTML document corresponding to the page, the text stream is obtained by separating a tag contained in the HTML document and contains only text, the index tree includes one or more nodes, and each node includes data indicating a text segment corresponding to the node in the text and data indicating a style of the text segment, the style set being a set of styles of the text segment corresponding to each node in the index tree; and
displaying the typeset page.
CN201910069208.1A 2019-01-24 2019-01-24 HTML document processing method, page display method and equipment Active CN111475679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910069208.1A CN111475679B (en) 2019-01-24 2019-01-24 HTML document processing method, page display method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910069208.1A CN111475679B (en) 2019-01-24 2019-01-24 HTML document processing method, page display method and equipment

Publications (2)

Publication Number Publication Date
CN111475679A CN111475679A (en) 2020-07-31
CN111475679B true CN111475679B (en) 2023-06-23

Family

ID=71743613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910069208.1A Active CN111475679B (en) 2019-01-24 2019-01-24 HTML document processing method, page display method and equipment

Country Status (1)

Country Link
CN (1) CN111475679B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699642B (en) * 2020-12-31 2023-03-28 医渡云(北京)技术有限公司 Index extraction method and device for complex medical texts, medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8051372B1 (en) * 2007-04-12 2011-11-01 The New York Times Company System and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents
CN103329122A (en) * 2011-01-18 2013-09-25 苹果公司 Storage of a document using multiple representations
CN103635897A (en) * 2011-06-23 2014-03-12 微软公司 Dynamically updating a running page

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100677429B1 (en) * 2005-02-01 2007-02-02 엘지전자 주식회사 Method for processing user interface in mobile communication terminal
US8266151B2 (en) * 2009-10-30 2012-09-11 Oracle International Corporation Efficient XML tree indexing structure over XML content
US20150135061A1 (en) * 2013-11-08 2015-05-14 Qualcomm Incorporated Systems and methods for parallel traversal of document object model tree
US9965451B2 (en) * 2015-06-09 2018-05-08 International Business Machines Corporation Optimization for rendering web pages

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40026147; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant