CN106339381B - Information processing method and device - Google Patents

Publication number
CN106339381B
Authority
CN
China
Prior art keywords
node
type node
layout information
row
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510394321.9A
Other languages
Chinese (zh)
Other versions
CN106339381A (en)
Inventor
马莘权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201510394321.9A
Publication of CN106339381A
Application granted
Publication of CN106339381B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents

Abstract

The embodiment of the invention discloses an information processing method and device; the method comprises the following steps: acquiring and analyzing directory information, and generating a tree structure based on the directory information; wherein the tree structure comprises a plurality of nodes; the node represents an element in the directory information; analyzing nodes in the tree structure, and generating row and column layout information according to a first preset rule based on the attributes of the nodes; determining position information of text block nodes in the row and column layout information in the tree structure; and reading the text block nodes in the row-column layout information according to a second preset rule to generate a directory information list.

Description

Information processing method and device
Technical Field
The present invention relates to communications technologies, and in particular, to an information processing method and apparatus.
Background
With the development of network technology and intelligent terminals, more and more people use intelligent terminals (such as smart phones, tablet computers, etc.) to read information on the internet, including novels.
In the process of implementing the technical solution of the embodiment of the present application, the inventor of the present application finds at least the following technical problems in the related art:
when a third-party content provider (such as an XX novel network) provides webpage content, the entries in the Hypertext Markup Language (HTML) webpage text of a novel directory are placed out of reading order, so that anyone who converts the HTML webpage text of the novel directory directly obtains a disordered novel directory. The related art offers no effective solution for obtaining the correctly ordered directory of the novel directly from the HTML webpage text provided by the third-party content provider.
Disclosure of Invention
In order to solve the existing technical problem, embodiments of the present invention provide an information processing method and apparatus, which can implement correct extraction of directory information.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides an information processing method, which comprises the following steps:
acquiring and analyzing directory information, and generating a tree structure based on the directory information; wherein the tree structure comprises a plurality of nodes; the node represents an element in the directory information;
analyzing nodes in the tree structure, and generating row and column layout information according to a first preset rule based on the attributes of the nodes;
determining position information of text block nodes in the row and column layout information in the tree structure;
and reading the text block nodes in the row-column layout information according to a second preset rule to generate a directory information list.
In the foregoing solution, the generating row-column layout information according to a first preset rule based on the attribute of the node includes:
searching a first type node and a second type node taking the first type node as a parent node based on the attribute of the node;
generating row and column layout information based on the first type node and the second type node;
wherein the first type node has a layout branch attribute, that is, it divides the layout into rows; the second type node has a layout sublist attribute, that is, it divides a row into columns.
In the above solution, after the first type node and the second type node using the first type node as a parent node are searched based on the attribute of the node, the method further includes: searching a third type node taking the second type node as a parent node based on the attribute of the node; the third type node has a text branch attribute;
correspondingly, the generating row-column layout information based on the first type node and the second type node includes:
generating row-column layout information based on the first type node, the second type node, and the third type node.
In the above solution, the searching for the first type node and the second type node using the first type node as a parent node based on the attribute of the node includes:
searching a first type node from a leaf node to a parent node direction based on the attribute of the node, and generating row layout information based on the first type node;
searching a second type node from the first type node to a leaf node direction based on the attribute of the node, and generating column layout information based on the second type node; the column layout information is column layout information matched with the first type node.
In the above solution, after searching for a second type node from the first type node to a leaf node based on the attribute of the node, the method further includes:
searching a third type node which takes the second type node as a father node based on the attribute of the node, and generating virtual row layout information based on the third type node; the virtual row layout information is row layout information matched with the second type node.
In the foregoing solution, the generating row-column layout information based on the first type node and the second type node includes:
and generating row and column layout information based on the row layout information and the column layout information.
In the foregoing solution, the generating row-column layout information based on the first type node, the second type node, and the third type node includes:
generating row-column layout information based on the row layout information, the column layout information, and the virtual row layout information.
In the foregoing solution, the reading the text block nodes in the row-column layout information according to a second preset rule includes:
reading text block nodes in the row and column layout information according to the sequence of rows and columns; and reading the text block nodes in the same line according to the sequence of the text block nodes in the tree structure.
An embodiment of the present invention further provides an information processing apparatus, where the apparatus includes: a conversion unit, an analysis unit and a list generation unit; wherein:
the conversion unit is used for acquiring and analyzing the directory information and generating a tree structure based on the directory information; wherein the tree structure comprises a plurality of nodes; the node represents an element in the directory information;
the analysis unit is used for analyzing the nodes in the tree structure generated by the conversion unit and generating row and column layout information according to a first preset rule based on the attributes of the nodes; determining position information of text block nodes in the row and column layout information in the tree structure;
the list generating unit is used for reading the text block nodes in the row and column layout information generated by the analyzing unit according to a second preset rule to generate a directory information list.
In the foregoing solution, the analysis unit is configured to search for a first type node and a second type node using the first type node as a parent node based on an attribute of the node; generating row and column layout information based on the first type node and the second type node; wherein the first type node has a layout branch attribute; the second type node has a layout sublist attribute.
In the foregoing solution, the analysis unit is further configured to, after searching for a first type node and a second type node using the first type node as a parent node based on the attribute of the node, search for a third type node using the second type node as a parent node based on the attribute of the node; the third type node has a text branch attribute; and the analysis unit is further configured to generate row-column layout information based on the first type node, the second type node, and the third type node.
In the foregoing solution, the analyzing unit is configured to search for a first type node from a leaf node to a parent node based on the attribute of the node, and generate row layout information based on the first type node; searching a second type node from the first type node to a leaf node direction based on the attribute of the node, and generating column layout information based on the second type node; the column layout information is column layout information matched with the first type node.
In the foregoing solution, the analyzing unit is further configured to search a third type node using the second type node as a parent node based on the attribute of the node after searching the second type node from the first type node to the leaf node based on the attribute of the node, and generate virtual row layout information based on the third type node; the virtual row layout information is row layout information matched with the second type node.
In the foregoing solution, the analysis unit is configured to generate row-column layout information based on the row layout information and the column layout information.
In the foregoing solution, the analysis unit is configured to generate row and column layout information based on the row layout information, the column layout information, and the virtual row layout information.
In the above scheme, the list generating unit is configured to read text block nodes in the row and column layout information according to the sequence of rows and columns; and reading the text block nodes in the same line according to the sequence of the text block nodes in the tree structure.
According to the information processing method and device provided by the embodiment of the invention, the directory information is acquired and analyzed, and the tree structure is generated based on the directory information; wherein the tree structure comprises a plurality of nodes; the node represents an element in the directory information; analyzing nodes in the tree structure, and generating row and column layout information according to a first preset rule based on the attributes of the nodes; determining position information of text block nodes in the row and column layout information in the tree structure; and reading the text block nodes in the row-column layout information according to a second preset rule to generate a directory information list. By adopting the technical scheme of the embodiment of the invention, the directory information is converted into the tree structure, the row and column layout information is generated based on the attribute (namely the appearance sequence) of each node in the tree structure, and the position information of the text block nodes in the tree structure in the row and column layout information is determined; and reading the text block nodes in the row and column layout information based on the sequence of the rows and the columns so as to generate a directory information list with correct sequence. Therefore, the correctly sequenced and complete directory information list can be quickly and accurately obtained.
Drawings
FIG. 1 is a schematic flowchart of an information processing method according to a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating an information processing method according to a second embodiment of the present invention;
FIG. 3 is a diagram illustrating directory information according to a second embodiment of the present invention;
FIG. 4 is a diagram illustrating a tree structure according to a second embodiment of the present invention;
FIG. 5a is a diagram illustrating row layout information according to a second embodiment of the present invention;
FIG. 5b is a diagram illustrating column layout information according to a second embodiment of the present invention;
FIG. 5c is a schematic diagram of virtual row layout information in the second embodiment of the present invention;
FIG. 6 is a schematic diagram of a virtual layout of text block nodes according to a second embodiment of the present invention;
FIG. 7 is a schematic diagram of a configuration of an information processing apparatus according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
The embodiment of the invention provides an information processing method. Fig. 1 is a schematic flowchart of an information processing method according to a first embodiment of the present invention; as shown in fig. 1, the information processing method includes:
step 101: acquiring and analyzing directory information, and generating a tree structure based on the directory information; wherein the tree structure comprises a plurality of nodes; the node characterizes an element in the directory information.
The information processing method provided by the embodiment of the invention is applied to an information processing device. In practical application, the information processing device can be realized by a computer, a server or a server cluster, and the server or the server cluster can be a web page (WEB) server or a WEB server cluster. In this step, the information processing device acquires and analyzes the directory information, and generates a tree structure based on the directory information.
In this embodiment, the generating a tree structure based on the directory information includes: extracting elements in the directory information, and generating a tree structure based on the sequence of the elements in the directory information and the parent-child relationship among the elements. An element in the directory information may be any keyword or keyword set in the directory information.
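As a rough illustration of this step, the following Python sketch builds such a tree from HTML with the standard-library `html.parser` module and numbers each node in order of appearance. The names `Node`, `TreeBuilder`, `build_tree` and `preorder`, the set of void tags, and the placeholder chapter titles are assumptions made for the example; the patent does not prescribe a particular parser.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (assumed subset, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class Node:
    """One element or text block; `order` is its appearance order."""
    def __init__(self, name, order, parent=None, is_text=False):
        self.name, self.order = name, order
        self.parent, self.is_text = parent, is_text
        self.children = []
        if parent is not None:
            parent.children.append(self)

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root, self.stack, self.count = None, [], 0

    def _add(self, name, is_text=False):
        parent = self.stack[-1] if self.stack else None
        node = Node(name, self.count, parent, is_text)
        self.count += 1
        if self.root is None:
            self.root = node
        return node

    def handle_starttag(self, tag, attrs):
        node = self._add(tag.upper())
        if tag not in VOID_TAGS:
            self.stack.append(node)  # later nodes become its children

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest open element with this name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].name == tag.upper():
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self._add(text, is_text=True)  # a text block node

def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def preorder(node):
    """Yield nodes parent-first, which matches appearance order."""
    yield node
    for child in node.children:
        yield from preorder(child)

# Placeholder page: three columns, each holding two chapters split by <br><br>.
SAMPLE_HTML = """<html><body><table><tr>
<td>Chapter 1<br><br>Chapter 4</td>
<td>Chapter 2<br><br>Chapter 5</td>
<td>Chapter 3<br><br>Chapter 6</td>
</tr></table></body></html>"""

if __name__ == "__main__":
    for n in preorder(build_tree(SAMPLE_HTML)):
        print(f"{n.name}({n.order})")
```

Keeping a parent link and an appearance-order number on every node is a design choice of the sketch: both the upward search for row nodes and the final ordering of text blocks rely on them.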
Step 102: and analyzing the nodes in the tree structure, and generating row and column layout information according to a first preset rule based on the attributes of the nodes.
Here, as an embodiment, the generating row-column layout information according to a first preset rule based on the attribute of the node includes:
searching a first type node and a second type node taking the first type node as a parent node based on the attribute of the node;
generating row and column layout information based on the first type node and the second type node;
wherein the first type node has a layout branch attribute; the second type node has a layout sublist attribute.
Further, as another embodiment, after the searching for the first type node based on the attribute of the node and the second type node using the first type node as a parent node, the method further includes: searching a third type node taking the second type node as a parent node based on the attribute of the node; the third type node has a text branch attribute;
correspondingly, the generating row-column layout information based on the first type node and the second type node includes:
generating row-column layout information based on the first type node, the second type node, and the third type node.
In this step, when the tree structure (i.e., the directory information) does not include a third type node (i.e., a node representing a text branch attribute), the first row-column layout information generation manner is adopted; when the tree structure (i.e., the directory information) includes a third type node (i.e., a node representing a text branch attribute), the second row-column layout information generation manner is adopted.
In this step, as a specific implementation manner, the searching for the first type node and the second type node using the first type node as a parent node based on the attribute of the node includes:
searching a first type node from a leaf node to a parent node direction based on the attribute of the node, and generating row layout information based on the first type node;
searching a second type node from the first type node to a leaf node direction based on the attribute of the node, and generating column layout information based on the second type node; the column layout information is column layout information matched with the first type node.
Correspondingly, the generating row-column layout information based on the first type node and the second type node includes: and generating row and column layout information based on the row layout information and the column layout information.
Still further, after searching for a second type node from the first type node to a leaf node direction based on the attribute of the node, the method further comprises:
searching a third type node which takes the second type node as a father node based on the attribute of the node, and generating virtual row layout information based on the third type node; the virtual row layout information is row layout information matched with the second type node.
Correspondingly, the generating row-column layout information based on the first type node, the second type node and the third type node includes: generating row-column layout information based on the row layout information, the column layout information, and the virtual row layout information.
Specifically, the first type node may be a "TR" node, the second type node may be a "TD" node, and the third type node may be a "BR" node, although the node types are not limited to these. When the tree structure comprises these nodes, row and column layout information can be generated directly according to the attributes and the sequence of the nodes. When the first type node, the second type node, or the third type node is another kind of node, for example when the first type node is an "LI" node, the row-column layout information may be obtained through a pre-configured mapping configuration set. For example, the mapping configuration set may be represented by the following characters:
[Figures BDA0000754418240000071 and BDA0000754418240000081: mapping configuration set; images not reproduced]
When the number of directories is greater than or equal to 8, the directories {0,1,2,3,4,5,6,7} in the directory information are adjusted according to the order {0,4,1,5,2,6,3,7}, based on the mapping {0,4,1,5,2,6,3,7} in the mapping configuration set; by analogy, when the number of directories is equal to 7, the directories {0,1,2,3,4,5,6} in the directory information are adjusted according to the order {0,4,1,5,2,6,3}, based on the mapping {0,4,1,5,2,6,3} in the mapping configuration set.
The mapping configuration set may be preconfigured based on directory information provided by a third-party content provider, that is, the adjustment order in the mapping configuration set is preconfigured based on directory information provided by a third-party content provider.
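The reordering by a mapping configuration set can be sketched in Python as follows. The text does not state in which direction the mapping is applied, so this sketch assumes that position i of the result takes the entry found at source index mapping[i]; the names `MAPPING_CONFIG` and `reorder_directory`, the treatment of entries beyond the mapped range, and the placeholder entries are likewise assumptions.

```python
# Hypothetical mapping configuration set, keyed by the number of entries.
# Only the two mappings quoted in the text are included.
MAPPING_CONFIG = {
    8: [0, 4, 1, 5, 2, 6, 3, 7],
    7: [0, 4, 1, 5, 2, 6, 3],
}

def reorder_directory(entries):
    """Reorder entries with the mapping selected by the entry count.

    Assumption: result[i] = entries[mapping[i]]. Entries beyond the
    mapped range are appended unchanged, which is also an assumption.
    """
    size = min(len(entries), 8)  # "8 or more" uses the 8-entry mapping
    mapping = MAPPING_CONFIG.get(size)
    if mapping is None:
        return list(entries)  # no mapping configured for this count
    head = [entries[i] for i in mapping]
    return head + list(entries[len(mapping):])

if __name__ == "__main__":
    # Two columns of four entries, stored column by column.
    shuffled = ["ch1", "ch3", "ch5", "ch7", "ch2", "ch4", "ch6", "ch8"]
    print(reorder_directory(shuffled))
```

Under this reading, the two quoted mappings correspond to a two-column list stored column by column. The opposite reading (the entry at source index i moves to position mapping[i]) would instead fit a four-column layout, so the direction should be confirmed against the provider's actual pages.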
Step 103: and determining the position information of the text block nodes in the row and column layout information in the tree structure.
In this embodiment, the determining the position information of the text block nodes in the row and column layout information in the tree structure includes: determining the position information of the text block nodes in the row and column layout information based on the appearance sequence of the nodes in the tree structure. Specifically, the position information of the text block nodes in the row and column layout information is determined based on the appearance sequence of the text block nodes in the tree structure and the sequence between the text block nodes and the other attribute nodes (including layout branch attribute nodes, layout sublist attribute nodes and text branch attribute nodes).
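One way to picture this position assignment is a single pass over the nodes in appearance order, as in the Python sketch below. It assumes "TR", "TD" and "BR" as the three attribute node types and a flat, non-nested table; the `assign_positions` name, the tuple layout and the placeholder chapter titles are illustrative choices rather than part of the described method.

```python
def assign_positions(sequence):
    """Map each text block to (row, column, virtual_row, order).

    `sequence` lists (kind, text) pairs in appearance order, where kind
    is an element name or "TEXT". Nested tables are not handled.
    """
    row, col, vrow = -1, -1, 0
    positions = {}
    for order, (kind, text) in enumerate(sequence):
        if kind == "TR":        # layout branch: start a new row
            row, col, vrow = row + 1, -1, 0
        elif kind == "TD":      # layout sublist: start a new column
            col, vrow = col + 1, 0
        elif kind == "BR":      # text branch: start a new virtual row
            vrow += 1
        elif kind == "TEXT":
            positions[text] = (row, col, vrow, order)
    return positions

# Appearance-order sequence for a page with one row and three columns,
# each column holding two placeholder chapters separated by two BR nodes.
SEQUENCE = [("HTML", None), ("BODY", None), ("TABLE", None), ("TR", None)]
for first, second in [("Chapter 1", "Chapter 4"),
                      ("Chapter 2", "Chapter 5"),
                      ("Chapter 3", "Chapter 6")]:
    SEQUENCE += [("TD", None), ("TEXT", first),
                 ("BR", None), ("BR", None), ("TEXT", second)]

if __name__ == "__main__":
    for text, pos in assign_positions(SEQUENCE).items():
        print(text, pos)
```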
Step 104: and reading the text block nodes in the row-column layout information according to a second preset rule to generate a directory information list.
Here, the reading the text block node in the row-column layout information according to a second preset rule includes:
reading text block nodes in the row and column layout information according to the sequence of rows and columns; and reading the text block nodes in the same line according to the sequence of the text block nodes in the tree structure.
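The reading rule can be sketched in Python as a sort. The sketch assumes that each text block already carries its row index, its virtual row index inside the column, and its appearance order in the tree, and it interprets "the same line" as the same (row, virtual row) pair; the `TextBlock` type, the `read_directory` function and the placeholder titles are hypothetical.

```python
from typing import List, NamedTuple

class TextBlock(NamedTuple):
    text: str
    row: int          # index of the enclosing first type (row) node
    virtual_row: int  # line index inside the column, from text branches
    order: int        # appearance order of the node in the tree

def read_directory(blocks: List[TextBlock]) -> List[str]:
    """Read blocks line by line; within a line, follow tree order."""
    ordered = sorted(blocks, key=lambda b: (b.row, b.virtual_row, b.order))
    return [b.text for b in ordered]

if __name__ == "__main__":
    # Blocks listed in the order they appear in the page source.
    blocks = [
        TextBlock("Chapter 1", 0, 0, 5),
        TextBlock("Chapter 4", 0, 2, 8),
        TextBlock("Chapter 2", 0, 0, 10),
        TextBlock("Chapter 5", 0, 2, 13),
        TextBlock("Chapter 3", 0, 0, 15),
        TextBlock("Chapter 6", 0, 2, 18),
    ]
    print(read_directory(blocks))
```

Because columns appear left to right in the tree, sorting by appearance order within a line also orders the blocks by column.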
By adopting the technical scheme of the embodiment of the invention, the directory information is converted into the tree structure, the row and column layout information is generated based on the attribute (namely the appearance sequence) of each node in the tree structure, and the position information of the text block nodes in the tree structure in the row and column layout information is determined; and reading the text block nodes in the row and column layout information based on the sequence of the rows and the columns so as to generate a directory information list with correct sequence. Therefore, the correctly sequenced and complete directory information list can be quickly and accurately obtained.
Example two
The present embodiment takes a specific application scenario as an example to describe in detail the information processing method provided by the present embodiment. FIG. 2 is a flowchart illustrating an information processing method according to a second embodiment of the present invention; as shown in fig. 2, the information processing method includes:
step 201: and acquiring and analyzing the directory information, and converting the directory information into a tree structure.
Here, the generating a tree structure based on the directory information includes: and extracting elements in the directory information, and generating a tree structure based on the sequence of the elements in the directory information and the parent-child relationship among the elements.
FIG. 3 is a diagram illustrating directory information according to a second embodiment of the present invention. As shown in FIG. 3, the entries of the directory information provided by the third-party content provider are stored out of reading order, so when the nodes in the directory information are extracted directly in their order of appearance to generate the directory information list, the list shown in Table 1 is produced. As Table 1 shows, the generated directory information list is out of order.
Chapter I Mountain Edge Village
Chapter IV Bone-Refining Cliff
Chapter II Qing Niu Zhen
Chapter V Doctor Black
Chapter III Seven Door
Chapter VI Unknown Pithy Formula
TABLE 1
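The disorder in Table 1 can be reproduced with a small Python sketch that collects text in plain document order. The HTML below is a hypothetical page with the same shape as the example (one row, three columns, two entries per column) and placeholder chapter titles; `TextCollector` and `extract_in_document_order` are illustrative names.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect non-empty text chunks in the order they appear."""
    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)

def extract_in_document_order(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts

# Hypothetical page: the reader sees rows, but the source stores columns.
PAGE = """<html><body><table><tr>
<td>Chapter 1<br><br>Chapter 4</td>
<td>Chapter 2<br><br>Chapter 5</td>
<td>Chapter 3<br><br>Chapter 6</td>
</tr></table></body></html>"""

if __name__ == "__main__":
    # Chapters come out as 1, 4, 2, 5, 3, 6 rather than 1 to 6.
    print(extract_in_document_order(PAGE))
```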
Specifically, taking the directory information shown in FIG. 3 as an example, the elements in the directory information are the fields of the directory information, including "HTML", "BODY", "TABLE", "TR", "TD", "Chapter I Mountain Edge Village", and the like. A tree structure is generated based on the sequence in which the elements appear in the directory information and the parent-child relationship among the elements, as shown in Table 2; each element is a node in the tree structure, and the number in parentheses after each node represents the order in which the node appears in the tree structure. FIG. 4 is a schematic diagram of the tree structure in the embodiment of the present invention and shows the same tree more intuitively. According to the appearance sequence of the nodes in the tree structure, a preceding node may be referred to as a parent node of the succeeding nodes beneath it; for example, the TR (3) node is the parent node of the TD (4), TD (9) and TD (14) nodes, and accordingly the TD (4), TD (9) and TD (14) nodes may be referred to as leaf nodes of the TR (3) node. In this embodiment, the tree structure may be a Document Object Model (DOM) tree structure. Nodes in the tree structure that represent text content, such as the "Chapter I Mountain Edge Village (5)" node, are referred to as text block nodes in this embodiment; nodes that have a layout effect in the tree structure, such as TR nodes and TD nodes, may be referred to as attribute nodes in this embodiment.
HTML(0)
  BODY(1)
    TABLE(2)
      TR(3)
        TD(4)
          Chapter I Mountain Edge Village (5)
          BR(6)
          BR(7)
          Chapter IV Bone-Refining Cliff (8)
        TD(9)
          Chapter II Qing Niu Zhen (10)
          BR(11)
          BR(12)
          Chapter V Doctor Black (13)
        TD(14)
          Chapter III Seven Door (15)
          BR(16)
          BR(17)
          Chapter VI Unknown Pithy Formula (18)
TABLE 2
Step 202: searching a first type node from a leaf node to a parent node direction based on the attribute of the node, and generating row layout information based on the first type node; the first type node has a layout branch attribute.
Step 203: searching a second type node from the first type node to a leaf node direction based on the attribute of the node, and generating column layout information based on the second type node; the second type node has a layout sublist attribute; the column layout information is column layout information matched with the first type node.
Step 204: searching a third type node which takes the second type node as a father node based on the attribute of the node, and generating virtual row layout information based on the third type node; the third type node has a text branch attribute; the virtual row layout information is row layout information matched with the second type node.
Step 205: generating row-column layout information based on the row layout information, the column layout information, and the virtual row layout information.
In this embodiment, taking the directory information shown in FIG. 3 as an example, the tree structure shown in Table 2 or FIG. 4 is generated, and the tree structure includes a plurality of nodes. The nodes in the tree structure are analyzed; each node has a corresponding attribute, namely its function in the tree structure. On this basis, the first type node and the second type node, and where present the third type node, can be searched for in the tree structure; the first type node has a layout branch attribute, the second type node has a layout sublist attribute, and the third type node has a text branch attribute. Taking the tree structure shown in Table 2 or FIG. 4 as an example, the "TR" node is a node having a layout branch (row) attribute, which can be understood as dividing the page into blocks in the vertical direction; for example, when there are three "TR" nodes in the tree structure, the page is divided into three blocks in the vertical direction. The "TD" node is a node having a layout sublist (column) attribute, which can be understood as dividing the page into blocks in the horizontal direction. In this example, each "TD" node has one "TR" node as its parent node, that is, the "TD" node divides the block of its parent "TR" node; for example, in Table 2 or FIG. 4 there are three "TD" nodes below the TR (3) node, which means that the block formed by the TR (3) node is divided into three blocks in the horizontal direction.
Further, in this example, a "BR" node is a node having a text branch attribute: when a "BR" node appears among nodes representing text information, the text information before and after the "BR" node is displayed on separate lines. Each "BR" node can be considered to have one "TD" node as its parent node, that is, the "BR" node provides the text branch attribute within its parent "TD" node. In this example, the first type node is a "TR" node, the second type node is a "TD" node, and the third type node is a "BR" node; however, the first type node, the second type node, and the third type node are not limited to this example and may be other nodes, for example, the first type node may also be an "LI" node.
Specifically, taking the tree structure shown in Table 2 or FIG. 4 as an example, according to the sequence of appearance of nodes in the tree structure, a preceding node can be called a parent node and a succeeding node beneath it a leaf node, so the "TR (3)" node can be called a parent node of the "TD (4)" node; accordingly, the direction from the "TD (4)" node to the "TR (3)" node is the direction from the leaf node to the parent node. In this embodiment, the direction from the leaf node to the parent node is the direction from the nodes at the bottom of the tree structure toward their parent nodes. Taking the tree structure shown in Table 2 or FIG. 4 as an example, the search starts from the "Chapter IV Bone-Refining Cliff (8)" node, the "Chapter V Doctor Black (13)" node, and the "Chapter VI Unknown Pithy Formula (18)" node, and continues until the corresponding first type node, that is, the "TR (3)" node, is found. Furthermore, each first type node has a layout branch function, and each first type node can also be regarded as the root node of a subtree. Row layout information is generated based on the first type nodes found. In a specific implementation, the search can be performed with pre-configured keywords: when a node matching a configured keyword is found, that node is determined to be a first type node. FIG. 5a is a diagram illustrating row layout information according to a second embodiment of the present invention; as shown in FIG. 5a, based on the first type nodes (i.e., the nodes having the layout branch attribute), the page may be divided into several blocks in the vertical direction, presenting the virtual layout effect shown in FIG. 5a.
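The leaf-to-parent search with pre-configured keywords might look like the following Python sketch. The `Node` class, the `ROW_KEYWORDS` set and the `find_row_nodes` function are hypothetical, and the tree is rebuilt by hand with placeholder chapter titles and the node numbers of Table 2.

```python
class Node:
    """Tree node with a parent link; `order` is the appearance order."""
    def __init__(self, name, order, parent=None):
        self.name, self.order, self.parent = name, order, parent

# Pre-configured keywords naming first type nodes; "LI" could be added.
ROW_KEYWORDS = {"TR"}

def find_row_nodes(start_nodes, keywords=ROW_KEYWORDS):
    """Climb from each start node until a first type node is reached."""
    found = []
    for node in start_nodes:
        current = node.parent
        while current is not None and current.name not in keywords:
            current = current.parent
        # Keep each row node once, even if several leaves reach it.
        if current is not None and all(current is not f for f in found):
            found.append(current)
    return sorted(found, key=lambda n: n.order)

# Hand-built fragment of the example tree.
html = Node("HTML", 0)
body = Node("BODY", 1, html)
table = Node("TABLE", 2, body)
tr = Node("TR", 3, table)
td4, td9, td14 = Node("TD", 4, tr), Node("TD", 9, tr), Node("TD", 14, tr)
starts = [Node("Chapter 4", 8, td4),
          Node("Chapter 5", 13, td9),
          Node("Chapter 6", 18, td14)]

if __name__ == "__main__":
    print([(n.name, n.order) for n in find_row_nodes(starts)])  # [('TR', 3)]
```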
After the first type node is found, the search proceeds from the first type node toward the leaf nodes until the second type nodes are found. Taking the tree structure shown in Table 2 or Fig. 4 as an example, the search proceeds from the "TR (3)" node toward the leaf nodes until the second type nodes, namely the "TD (4)", "TD (9)", and "TD (14)" nodes, are found. Further, column layout information within the row layout is generated based on the second type nodes. It can be understood that the "TD (4)", "TD (9)", and "TD (14)" nodes divide the table row laid out by the "TR (3)" node into three columns; in this illustration, the generated row-column layout information can be understood as containing three columns of table content. Fig. 5b is a schematic diagram of column layout information in the second embodiment of the present invention. As shown in Fig. 5b, based on the first type nodes (i.e., the nodes having the layout row attribute) and the second type nodes (i.e., the nodes having the layout column attribute), the vertically distributed sub-blocks of the first type nodes may be subdivided into several blocks in the horizontal direction, presenting the virtual layout effect shown in Fig. 5b.
After the second type nodes are found, the search proceeds from each second type node toward the leaf nodes until the third type nodes are found. Taking the tree structure shown in Table 2 or Fig. 4 as an example, the search proceeds from each "TD" node toward the leaf nodes: from the "TD (4)" node, until the third type nodes "BR (6)" and "BR (7)" are found; from the "TD (9)" node, until the third type nodes "BR (11)" and "BR (12)" are found; and from the "TD (14)" node, until the third type nodes "BR (16)" and "BR (17)" are found. Further, virtual row layout information within the column layout is generated based on the third type nodes. It can be understood that, taking the "TD (4)" node as an example, the "BR (6)" and "BR (7)" nodes divide the column of the "TD (4)" node into three rows. Fig. 5c is a schematic diagram of virtual row layout information in the second embodiment of the present invention. As shown in Fig. 5c, based on the first type nodes (i.e., the nodes having the layout row attribute), the second type nodes (i.e., the nodes having the layout column attribute), and the third type nodes (i.e., the nodes having the text line-break attribute), the horizontally distributed sub-blocks of the second type nodes may be subdivided into several virtual text blocks in the vertical direction; in this embodiment, the virtual layout effect shown in Fig. 5c is presented based on all of the TR, TD, and BR nodes.
In a specific implementation, the column layout information and the virtual row layout information are not generated independently: the column layout information is generated based on the row layout information, and the virtual row layout information is generated based on the row layout information and the column layout information. That is, when the embodiment involves only the first type node and the second type node, the row-column layout information is obtained once the column layout information has been generated; when the embodiment involves the first type node, the second type node, and the third type node, the row-column layout information is obtained once the virtual row layout information has been generated.
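The nesting of row, column, and virtual row layout information can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `(label, children)` tuple representation, the shortened chapter labels, and the function names are hypothetical, and a flat table without nested tables is assumed.

```python
# Each node is a (label, children) pair; text block nodes have no children.
# Labels are shortened chapter titles used only for illustration.
TABLE = ("TABLE", [
    ("TR", [
        ("TD", [("Chapter 1", []), ("BR", []), ("BR", []), ("Chapter 4", [])]),
        ("TD", [("Chapter 2", []), ("BR", []), ("BR", []), ("Chapter 5", [])]),
        ("TD", [("Chapter 3", []), ("BR", []), ("BR", []), ("Chapter 6", [])]),
    ]),
])


def iter_nodes(node):
    """Yield every node in appearance (pre-order) order."""
    yield node
    for child in node[1]:
        yield from iter_nodes(child)


def virtual_rows(td_children):
    """Third type nodes (BR) split a column into virtual rows."""
    rows = [[]]
    for label, _ in td_children:
        if label == "BR":
            rows.append([])          # each BR starts a new virtual row
        else:
            rows[-1].append(label)   # text block in the current virtual row
    return rows


def build_layout(root):
    """Return rows -> columns -> virtual rows -> text block labels."""
    layout = []
    for label, children in iter_nodes(root):
        if label == "TR":            # first type node: one layout row
            columns = [virtual_rows(kids)   # second type node: one column
                       for tag, kids in children if tag == "TD"]
            layout.append(columns)
    return layout


layout = build_layout(TABLE)
print(len(layout), len(layout[0]), len(layout[0][0]))  # 1 3 3
print(layout[0][0])  # [['Chapter 1'], [], ['Chapter 4']]
```

The column list is built inside the row loop and the virtual rows inside each column, which mirrors the dependency described above: column layout depends on row layout, and virtual row layout depends on both.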
Step 206: Determine the position information, in the row-column layout information, of the text block nodes of the tree structure.
In this embodiment, determining the position information of the text block nodes in the row-column layout information includes: determining the position information of the text block nodes in the row-column layout information based on the order in which the nodes appear in the tree structure.
Fig. 6 is a schematic diagram of a virtual layout of text block nodes in the second embodiment of the present invention. As shown in Fig. 6, the position information of a text block node in the row-column layout information is determined based on the order in which the text block node appears in the tree structure and on its order relative to the other nodes having the layout row attribute, the layout column attribute, or the text line-break attribute. Specifically, the positions of the text block nodes in the row-column layout information may be as shown in Fig. 6.
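One way to derive such positions purely from appearance order is a single pass with three counters, sketched below in Python. The flat `(order, label)` list, the `(row, column, virtual row)` position tuple, and the function name are assumptions made for illustration; the sketch also assumes that text blocks occur only inside TD nodes and that tables are not nested.

```python
# Nodes of the example tree from TR (3) onward, as (appearance order, label).
NODES = [
    (3, "TR"),
    (4, "TD"), (5, "Chapter 1"), (6, "BR"), (7, "BR"), (8, "Chapter 4"),
    (9, "TD"), (10, "Chapter 2"), (11, "BR"), (12, "BR"), (13, "Chapter 5"),
    (14, "TD"), (15, "Chapter 3"), (16, "BR"), (17, "BR"), (18, "Chapter 6"),
]


def locate_text_blocks(nodes):
    """Map each text block's order to a (row, column, virtual row) position."""
    row = col = vrow = -1
    positions = {}
    for order, label in sorted(nodes):      # scan in appearance order
        if label == "TR":                   # new layout row, reset column
            row, col, vrow = row + 1, -1, -1
        elif label == "TD":                 # new column, reset virtual row
            col, vrow = col + 1, 0
        elif label == "BR":                 # new virtual row in this column
            vrow += 1
        else:                               # text block node
            positions[order] = (row, col, vrow)
    return positions


positions = locate_text_blocks(NODES)
print(positions[5], positions[8], positions[13])  # (0, 0, 0) (0, 0, 2) (0, 1, 2)
```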
Step 207: Read the text block nodes in the row-column layout information to generate a directory information list.
Here, reading the text block nodes in the row-column layout information includes: reading the text block nodes in the row-column layout information in row and column order, and reading the text block nodes in the same row in the order in which they appear in the tree structure.
Based on the directory contents read out in this reading order, the generated directory information list may be as shown in Table 4; a directory information list in the correct order is thus generated.
Chapter 1 Mountainside Village (5)
Chapter 2 Qingniu Town (10)
Chapter 3 Seven Doors (15)
Chapter 4 Bone-Refining Cliff (8)
Chapter 5 Doctor Black (13)
Chapter 6 Nameless Pithy Formula (18)
Table 4
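The reading order that yields Table 4 can be reproduced with a short Python sketch. The `TextBlock` record, its field names, and the sort key are assumptions chosen for illustration: blocks are ordered by layout row, then by virtual row, and blocks in the same row are read in their appearance order in the tree structure.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    order: int    # appearance order in the tree structure
    row: int      # layout row (from a TR node)
    column: int   # layout column (from a TD node)
    vrow: int     # virtual row inside the column (from BR nodes)
    text: str     # shortened chapter title, for illustration only


# Text blocks in tree order, which is not the intended reading order.
BLOCKS = [
    TextBlock(5, 0, 0, 0, "Chapter 1"),
    TextBlock(8, 0, 0, 2, "Chapter 4"),
    TextBlock(10, 0, 1, 0, "Chapter 2"),
    TextBlock(13, 0, 1, 2, "Chapter 5"),
    TextBlock(15, 0, 2, 0, "Chapter 3"),
    TextBlock(18, 0, 2, 2, "Chapter 6"),
]


def read_directory(blocks):
    """Read row by row; within a row, follow tree appearance order."""
    ordered = sorted(blocks, key=lambda b: (b.row, b.vrow, b.order))
    return [f"{b.text} ({b.order})" for b in ordered]


directory = read_directory(BLOCKS)
for entry in directory:
    print(entry)
```

The node numbers in the output follow the sequence 5, 10, 15, 8, 13, 18, matching the order of the entries in Table 4.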
With the technical solution of this embodiment of the present invention, the directory information is converted into a tree structure, row-column layout information is generated based on the attribute of each node in the tree structure (namely, its appearance order), and the position information of the text block nodes of the tree structure in the row-column layout information is determined; the text block nodes in the row-column layout information are then read in row and column order to generate a correctly ordered directory information list. A correctly ordered and complete directory information list can therefore be obtained quickly and accurately.
The information processing method according to the first and second embodiments of the present invention quickly and accurately restores the directory information list from out-of-order directory information provided by a third-party content provider (e.g., the XX novel network). The information processing method according to the embodiments of the present invention can therefore be applied to the following scenario: a user or technician browses an XX novel on the XX novel network and wants to make it into an electronic book. The directory information list of the XX novel is then obtained quickly and accurately through the information processing method; of course, the specific content can also be obtained based on the information processing method. An electronic book is then made based on the restored directory information list and the specific content corresponding to it, to facilitate browsing on a mobile terminal.
Example three
Fig. 5 is a schematic diagram of the structure of an information processing apparatus according to the third embodiment of the present invention. As shown in Fig. 5, the information processing apparatus includes: a conversion unit 51, an analysis unit 52, and a list generation unit 53; wherein:
the conversion unit 51 is configured to acquire and parse directory information and generate a tree structure based on the directory information; wherein the tree structure comprises a plurality of nodes, each node representing an element in the directory information;
the analysis unit 52 is configured to analyze the nodes in the tree structure generated by the conversion unit 51, generate row-column layout information according to a first preset rule based on the attributes of the nodes, and determine the position information of the text block nodes of the tree structure in the row-column layout information;
the list generation unit 53 is configured to read the text block nodes in the row-column layout information generated by the analysis unit 52 according to a second preset rule, so as to generate a directory information list.
The analysis unit 52 is configured to search, based on the attributes of the nodes, for a first type node and a second type node having the first type node as a parent node, and to generate row-column layout information based on the first type node and the second type node; wherein the first type node has a layout row attribute and the second type node has a layout column attribute.
Specifically, the analysis unit 52 is further configured to, after searching for the first type node and the second type node having the first type node as a parent node based on the attributes of the nodes, search for a third type node having the second type node as a parent node based on the attributes of the nodes, the third type node having a text line-break attribute; the analysis unit 52 is further configured to generate row-column layout information based on the first type node, the second type node, and the third type node.
Correspondingly, the analysis unit 52 is configured to generate row-column layout information based on the row layout information and the column layout information.
As another embodiment, the analysis unit 52 is further configured to, after searching for the first type node and the second type node having the first type node as a parent node based on the attributes of the nodes, search for a third type node having the second type node as a parent node based on the attributes of the nodes, the third type node having a text line-break attribute; the analysis unit 52 is further configured to generate row-column layout information based on the first type node, the second type node, and the third type node.
Specifically, the analysis unit 52 is configured to search for a first type node in the leaf-to-parent direction based on the attributes of the nodes, and generate row layout information based on the first type node; and to search for a second type node from the first type node toward the leaf nodes based on the attributes of the nodes, and generate column layout information based on the second type node, the column layout information being column layout information matched with the first type node.
Correspondingly, the analysis unit 52 is configured to generate row-column layout information based on the row layout information, the column layout information, and the virtual row layout information.
Specifically, the list generation unit 53 is configured to read the text block nodes in the row-column layout information in row and column order, and to read the text block nodes in the same row in the order in which they appear in the tree structure.
It should be understood by those skilled in the art that the functions of each processing unit in the information processing apparatus according to the embodiment of the present invention may be understood by referring to the description of the information processing method, and each processing unit in the information processing apparatus according to the embodiment of the present invention may be implemented by an analog circuit that implements the functions described in the embodiment of the present invention, or may be implemented by running software that implements the functions described in the embodiment of the present invention on an intelligent terminal.
Example four
An embodiment of the present invention further provides an information processing apparatus which, as shown in Fig. 5, includes: a conversion unit 51, an analysis unit 52, and a list generation unit 53; wherein:
the conversion unit 51 is configured to acquire and parse directory information and generate a tree structure based on the directory information; wherein the tree structure comprises a plurality of nodes, each node representing an element in the directory information;
the analysis unit 52 is configured to analyze the nodes in the tree structure generated by the conversion unit 51; search for a first type node in the leaf-to-parent direction based on the attributes of the nodes, and generate row layout information based on the first type node; search for a second type node from the first type node toward the leaf nodes based on the attributes of the nodes, and generate column layout information based on the second type node, the column layout information being column layout information matched with the first type node; search for a third type node having the second type node as a parent node based on the attributes of the nodes, and generate virtual row layout information based on the third type node, the virtual row layout information being row layout information matched with the second type node; wherein the first type node has a layout row attribute, the second type node has a layout column attribute, and the third type node has a text line-break attribute; the analysis unit 52 is further configured to generate row-column layout information according to the row layout information, the column layout information, and the virtual row layout information, and to determine the position information of the text block nodes of the tree structure in the row-column layout information;
the list generation unit 53 is configured to read the text block nodes in the row-column layout information generated by the analysis unit 52 according to a second preset rule, so as to generate a directory information list.
The list generation unit 53 is configured to read the text block nodes in the row-column layout information in row and column order, and to read the text block nodes in the same row in the order in which they appear in the tree structure.
Specifically, taking the directory information shown in Fig. 3 as an example, the elements in the directory information are the fields in the directory information, including "html", "body", "table", "TR", "TD", "Chapter 1 Mountainside Village", and the like. A tree structure is generated based on the order in which the elements appear in the directory information and the parent-child relationships between the elements, as shown in Table 2 above; each element is a node in the tree structure, and the number in parentheses after each node represents the order in which the node appears in the tree structure. More intuitively, the tree structure may be as shown in Fig. 4. Following the order in which the nodes appear in the tree structure, an earlier node may be referred to as the parent node of the later nodes beneath it; for example, the TR (3) node may be referred to as the parent node of the TD (4), TD (9), and TD (14) nodes, and accordingly the TD (4), TD (9), and TD (14) nodes may be referred to as leaf nodes of the TR (3) node. In this embodiment, the tree structure may be a DOM tree structure. Nodes in the tree structure that represent text content, such as the "Chapter 1 Mountainside Village (5)" node, are referred to as text block nodes in this embodiment; nodes that have a layout effect in the tree structure, such as the TR and TD nodes, may be referred to as attribute nodes in this embodiment.
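The conversion of directory markup into a tree numbered by appearance order can be sketched with Python's standard `html.parser` module. The markup string below is an assumption reconstructed from the node numbering quoted in the text (TR (3), TD (4), text (5), BR (6), and so on), with shortened chapter labels and numbering assumed to start at 0 for the "html" node; it is not the actual content of Fig. 3. The class name `OrderedTreeBuilder` and the dictionary node format are likewise hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br"}  # tags without a closing tag in this sketch


class OrderedTreeBuilder(HTMLParser):
    """Build a tree whose nodes carry their appearance order."""

    def __init__(self):
        super().__init__()
        self.counter = 0
        self.root = None
        self.stack = []

    def _new_node(self, label):
        node = {"label": label, "order": self.counter, "children": []}
        self.counter += 1
        if self.stack:
            self.stack[-1]["children"].append(node)
        elif self.root is None:
            self.root = node
        return node

    def handle_starttag(self, tag, attrs):
        node = self._new_node(tag.upper())
        if tag not in VOID_TAGS:
            self.stack.append(node)      # becomes parent of what follows

    def handle_endtag(self, tag):
        # Assumes well-formed markup: every end tag closes the current node.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:                         # text block node
            self._new_node(text)


def flatten(node):
    """Yield (order, label) pairs in appearance order."""
    yield node["order"], node["label"]
    for child in node["children"]:
        yield from flatten(child)


# Assumed markup, reconstructed from the node numbering in the text.
MARKUP = (
    "<html><body><table><tr>"
    "<td>Chapter 1<br><br>Chapter 4</td>"
    "<td>Chapter 2<br><br>Chapter 5</td>"
    "<td>Chapter 3<br><br>Chapter 6</td>"
    "</tr></table></body></html>"
)

builder = OrderedTreeBuilder()
builder.feed(MARKUP)
builder.close()

numbered = dict(flatten(builder.root))
print(numbered[3], numbered[4], numbered[8])  # TR TD Chapter 4
```

Under these assumptions the numbering agrees with the node labels used in the description: the TD nodes receive 4, 9, and 14, and the six text block nodes receive 5, 8, 10, 13, 15, and 18.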
In this embodiment, taking the directory information shown in Fig. 3 as an example, the conversion unit 51 generates the tree structure shown in Table 2 or Fig. 4, which includes a plurality of nodes. The analysis unit 52 analyzes the nodes in the tree structure; each node has a corresponding attribute, namely its function in the tree structure. Based on this, the analysis unit 52 may search the tree structure for the first type node and the second type node, and may further search for the third type node; the first type node has a layout row attribute, the second type node has a layout column attribute, and the third type node has a text line-break attribute. Taking the tree structure shown in Table 2 or Fig. 4 as an example, the "TR" node is a node having the layout row attribute; it can also be understood that "TR" nodes divide the page into blocks in the vertical direction. For example, when there are three "TR" nodes in the tree structure, the page is divided into three blocks in the vertical direction. The "TD" node is a node having the layout column attribute, and can also be understood as a node that divides the page into blocks in the horizontal direction. In this example, each "TD" node has one "TR" node as its parent node; that is, the "TD" node subdivides the block of its parent "TR" node. For example, in Table 2 or Fig. 4 there are three "TD" nodes under the TR (3) node, which means that the block delimited by the TR (3) node is divided into three blocks in the horizontal direction.
Further, in this illustration, a "BR" node is a node having a text line-break attribute: when a "BR" node appears within a node representing text information, the text before and after the "BR" node is displayed on separate lines. In this illustration, each "BR" node can be considered to have one "TD" node as its parent node; that is, the "BR" node divides the text under its parent "TD" node into lines. In this illustration, the first type node is a "TR" node, the second type node is a "TD" node, and the third type node is a "BR" node. The first, second, and third type nodes are not limited to this illustration and may be other nodes; for example, the first type node may also be an "LI" node.
Specifically, taking the tree structure shown in Table 2 or Fig. 4 as an example, and following the order in which nodes appear in the tree structure, an earlier node can be called the parent node of a later node beneath it, and the later node can be called a leaf node; the "TR (3)" node can thus be called the parent node of the "TD (4)" node. Accordingly, the direction from the "TD (4)" node to the "TR (3)" node is the leaf-to-parent direction. In this embodiment, the leaf-to-parent direction runs from the nodes at the bottom of the tree structure toward their parent nodes. Taking the tree structure shown in Table 2 or Fig. 4 as an example, the analysis unit 52 starts the search from the "Chapter 4 Bone-Refining Cliff (8)" node, the "Chapter 5 Doctor Black (13)" node, and the "Chapter 6 Nameless Pithy Formula (18)" node, and continues until the corresponding first type node, namely the "TR (3)" node, is found. Furthermore, each first type node has a layout row-division function, and each first type node can also be regarded as the root node of a subtree. Row layout information is generated based on the first type nodes found. In a specific implementation, the analysis unit 52 may search using preconfigured keywords: when a node matching a configured keyword is found, that node is determined to be a first type node. As shown in Fig. 5a, based on the first type nodes (i.e., the nodes having the layout row attribute), the page may be divided into several blocks in the vertical direction, presenting the virtual layout effect shown in Fig. 5a.
After finding the first type node, the analysis unit 52 searches from the first type node toward the leaf nodes until the second type nodes are found. Taking the tree structure shown in Table 2 or Fig. 4 as an example, the search proceeds from the "TR (3)" node toward the leaf nodes until the second type nodes, namely the "TD (4)", "TD (9)", and "TD (14)" nodes, are found. Further, column layout information within the row layout is generated based on the second type nodes. It can be understood that the "TD (4)", "TD (9)", and "TD (14)" nodes divide the table row laid out by the "TR (3)" node into three columns; in this illustration, the generated row-column layout information can be understood as containing three columns of table content. As shown in Fig. 5b, based on the first type nodes (i.e., the nodes having the layout row attribute) and the second type nodes (i.e., the nodes having the layout column attribute), the vertically distributed sub-blocks of the first type nodes may be subdivided into several blocks in the horizontal direction, presenting the virtual layout effect shown in Fig. 5b.
After finding the second type nodes, the analysis unit 52 searches from each second type node toward the leaf nodes until the third type nodes are found. Taking the tree structure shown in Table 2 or Fig. 4 as an example, the search proceeds from each "TD" node toward the leaf nodes: from the "TD (4)" node, until the third type nodes "BR (6)" and "BR (7)" are found; from the "TD (9)" node, until the third type nodes "BR (11)" and "BR (12)" are found; and from the "TD (14)" node, until the third type nodes "BR (16)" and "BR (17)" are found. Further, virtual row layout information within the column layout is generated based on the third type nodes. It can be understood that, taking the "TD (4)" node as an example, the "BR (6)" and "BR (7)" nodes divide the column of the "TD (4)" node into three rows. As shown in Fig. 5c, based on the first type nodes (i.e., the nodes having the layout row attribute), the second type nodes (i.e., the nodes having the layout column attribute), and the third type nodes (i.e., the nodes having the text line-break attribute), the horizontally distributed sub-blocks of the second type nodes may be subdivided into several virtual text blocks in the vertical direction; in this embodiment, the virtual layout effect shown in Fig. 5c is presented based on all of the TR, TD, and BR nodes.
The analysis unit 52 generates row-column layout information based on the row layout information, the column layout information, and the virtual row layout information; the result may be as shown in Table 3.
In a specific implementation, the column layout information and the virtual row layout information are not generated independently: the column layout information is generated based on the row layout information, and the virtual row layout information is generated based on the row layout information and the column layout information. That is, when the embodiment involves only the first type node and the second type node, the row-column layout information is obtained once the column layout information has been generated; when the embodiment involves the first type node, the second type node, and the third type node, the row-column layout information is obtained once the virtual row layout information has been generated.
With the technical solution of this embodiment of the present invention, the information processing apparatus converts the directory information into a tree structure, generates row-column layout information based on the attribute of each node in the tree structure (namely, its appearance order), and determines the position information of the text block nodes of the tree structure in the row-column layout information; the text block nodes in the row-column layout information are then read in row and column order to generate a correctly ordered directory information list. A correctly ordered and complete directory information list can therefore be obtained quickly and accurately.
It should be understood by those skilled in the art that the functions of each processing unit in the information processing apparatus according to the embodiment of the present invention may be understood by referring to the description of the information processing method, and each processing unit in the information processing apparatus according to the embodiment of the present invention may be implemented by an analog circuit that implements the functions described in the embodiment of the present invention, or may be implemented by running software that implements the functions described in the embodiment of the present invention on an intelligent terminal.
In the third and fourth embodiments of the present invention, the information processing apparatus may in practice be implemented by a computer, a server, or a server cluster; the server or server cluster may specifically be a Web server. The conversion unit 51, the analysis unit 52, and the list generation unit 53 in the information processing apparatus may in practice be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or a Field-Programmable Gate Array (FPGA) in the computer or the server.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (16)

1. An information processing method, characterized in that the method comprises:
acquiring and analyzing directory information, and generating a tree structure based on the sequence of the elements in the directory information and the parent-child relationship among the elements; wherein the tree structure comprises a plurality of nodes; the node represents an element in the directory information;
analyzing nodes in the tree structure, and searching a first type node and a second type node taking the first type node as a parent node based on the attribute of the nodes; wherein the first type node has a layout and row attribute, and the second type node has a layout and column attribute;
generating row-column layout information based on the first type node, the second type node and a mapping configuration set, wherein the mapping configuration set comprises an adjustment sequence of the directory information;
determining position information of text block nodes in the row and column layout information in the tree structure;
and reading the text block nodes in the row-column layout information according to a second preset rule to generate a directory information list.
2. The method of claim 1, wherein after searching for a first type node and a second type node having the first type node as a parent node based on the attributes of the nodes, the method further comprises: searching for a third type node having the second type node as a parent node based on the attributes of the nodes; the third type node has a text line-break attribute;
correspondingly, the generating row-column layout information based on the first type node and the second type node includes:
generating row-column layout information based on the first type node, the second type node, and the third type node.
3. The method of claim 1, wherein the searching for a first type node and a second type node having the first type node as a parent node based on the attribute of the node comprises:
searching a first type node from a leaf node to a parent node direction based on the attribute of the node, and generating row layout information based on the first type node;
searching a second type node from the first type node to a leaf node direction based on the attribute of the node, and generating column layout information based on the second type node; the column layout information is column layout information matched with the first type node.
4. The method of claim 3, wherein after searching for a second type node from the first type node to a leaf node direction based on the attribute of the node, the method further comprises:
searching for a third type node having the second type node as a parent node based on the attribute of the node, and generating virtual row layout information based on the third type node; the virtual row layout information is row layout information matched with the second type node.
5. The method of claim 3, wherein generating row-column layout information based on the first type node and the second type node comprises:
and generating row and column layout information based on the row layout information and the column layout information.
6. The method of claim 4, wherein generating row-column layout information based on the first type node, the second type node, and the third type node comprises:
generating row-column layout information based on the row layout information, the column layout information, and the virtual row layout information.
7. The method according to claim 1, wherein the reading the text block node in the row and column layout information according to a second preset rule comprises:
reading the text block nodes in the row-column layout information in row and column order; and reading the text block nodes in the same row in the order in which they appear in the tree structure.
8. An information processing apparatus, characterized in that the apparatus comprises: a conversion unit, an analysis unit, and a list generation unit; wherein:
the conversion unit is used for acquiring and analyzing the directory information and generating a tree structure based on the sequence of the elements in the directory information and the parent-child relationship among the elements; wherein the tree structure comprises a plurality of nodes; the node represents an element in the directory information;
the analysis unit is used for analyzing the nodes in the tree structure generated by the conversion unit and searching a first type node and a second type node taking the first type node as a parent node based on the attributes of the nodes; wherein the first type node has a layout and row attribute, and the second type node has a layout and column attribute; generating row-column layout information based on the first type node, the second type node and a mapping configuration set, wherein the mapping configuration set comprises an adjustment sequence of the directory information; determining position information of text block nodes in the row and column layout information in the tree structure;
the list generating unit is used for reading the text block nodes in the row and column layout information generated by the analyzing unit according to a second preset rule to generate a directory information list.
9. The apparatus according to claim 8, wherein the analyzing unit is further configured to search, based on the attribute of the node, for a third type node whose parent node is the second type node, after searching for the first type node and the second type node whose parent node is the first type node; the third type node has a text branch attribute; and the analyzing unit is further configured to generate row-column layout information based on the first type node, the second type node, and the third type node.
10. The apparatus of claim 8, wherein the analyzing unit is configured to search for a first type node from a leaf node to a parent node direction based on the attribute of the node, and generate row layout information based on the first type node; searching a second type node from the first type node to a leaf node direction based on the attribute of the node, and generating column layout information based on the second type node; the column layout information is column layout information matched with the first type node.
11. The apparatus according to claim 10, wherein the analyzing unit is further configured to search, based on the attribute of the node, for a third type node whose parent node is the second type node, after searching for the second type node from the first type node to a leaf node direction, and to generate virtual row layout information based on the third type node; the virtual row layout information is row layout information matched with the second type node.
12. The apparatus of claim 10, wherein the analyzing unit is configured to generate row and column layout information based on the row layout information and the column layout information.
13. The apparatus according to claim 11, wherein the analyzing unit is configured to generate row and column layout information based on the row layout information, the column layout information, and the virtual row layout information.
14. The apparatus according to claim 8, wherein the list generating unit is configured to read the text block nodes in the row and column layout information according to the sequence of rows and columns; and to read the text block nodes in the same row according to the sequence of the text block nodes in the tree structure.
15. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the information processing method of any one of claims 1 to 7 when executing the executable instructions.
16. A computer-readable storage medium, characterized in that the storage medium has stored therein executable instructions that, when executed, implement the information processing method of any one of claims 1 to 7.
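The search and reading order recited in claims 3 to 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the attribute values `"row"`, `"column"`, `"text_line"` and `"text"`, and all function names are hypothetical, and the mapping configuration set of claim 8 is omitted. The sketch also assumes that a column holds either text blocks directly or text-line children, not both.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)  # identity comparison, so parent links are never compared recursively
class Node:
    kind: str  # hypothetical attribute: "row", "column", "text_line", "text" or "other"
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, *kids: "Node") -> "Node":
        """Attach child nodes in order and return self for chaining."""
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def leaves(node: Node) -> Iterator[Node]:
    """Yield leaf nodes in tree (depth-first) order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def find_rows(root: Node) -> List[Node]:
    """Climb from each leaf toward the root and collect the nearest "row" node."""
    rows: List[Node] = []
    for leaf in leaves(root):
        node: Optional[Node] = leaf
        while node is not None and node.kind != "row":
            node = node.parent
        if node is not None and node not in rows:
            rows.append(node)
    return rows


def build_layout(root: Node) -> List[List[List[Node]]]:
    """Return rows -> columns -> virtual rows, found by node attribute."""
    layout = []
    for row in find_rows(root):
        cells = []
        for col in (c for c in row.children if c.kind == "column"):
            virtual_rows = [v for v in col.children if v.kind == "text_line"]
            # A column without text-line children is treated as one virtual row.
            cells.append(virtual_rows or [col])
        layout.append(cells)
    return layout


def read_directory(root: Node) -> List[str]:
    """Read text blocks row by row, column by column, then in tree order."""
    items: List[str] = []
    for row in build_layout(root):
        for column in row:
            for virtual_row in column:
                items.extend(n.text for n in leaves(virtual_row) if n.kind == "text")
    return items


# Example directory tree: two rows; the second column of row 1 has two text lines.
root = Node("other").add(
    Node("row").add(
        Node("column").add(Node("text", "News"), Node("text", "Sports")),
        Node("column").add(
            Node("text_line").add(Node("text", "Music")),
            Node("text_line").add(Node("text", "Video")),
        ),
    ),
    Node("row").add(Node("column").add(Node("text", "Games"))),
)
items = read_directory(root)
print(items)  # ['News', 'Sports', 'Music', 'Video', 'Games']
```

In this sketch the row nodes are located bottom-up from the leaves, while columns and virtual rows are located top-down from each row, mirroring the two search directions named in claim 3.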
CN201510394321.9A 2015-07-07 2015-07-07 Information processing method and device Active CN106339381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510394321.9A CN106339381B (en) 2015-07-07 2015-07-07 Information processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510394321.9A CN106339381B (en) 2015-07-07 2015-07-07 Information processing method and device

Publications (2)

Publication Number Publication Date
CN106339381A CN106339381A (en) 2017-01-18
CN106339381B CN106339381B (en) 2020-11-06

Family

ID=57826407

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510394321.9A Active CN106339381B (en) 2015-07-07 2015-07-07 Information processing method and device

Country Status (1)

Country Link
CN (1) CN106339381B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368237A (en) * 2017-07-19 2017-11-21 环球智达科技(北京)有限公司 Layout method based on user interface presentation
CN111857718B (en) * 2020-07-29 2024-04-09 网易(杭州)网络有限公司 List editing method, device, equipment and storage medium
CN116976286B (en) * 2023-09-22 2024-02-27 北京紫光芯能科技有限公司 Method and device for text layout, electronic equipment and storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN101464874A (en) * 2007-12-17 2009-06-24 金宝电子(上海)有限公司 Method for representing electronic dictionary catalog data by XML

Non-Patent Citations (1)

Title
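Internet-based agricultural information resource collection system; Zhao Yang et al.; Journal of Agricultural Mechanization Research; 2008-10-31 (No. 10); pp. 139-141 *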

Also Published As

Publication number Publication date
CN106339381A (en) 2017-01-18

Similar Documents

Publication Publication Date Title
KR102170929B1 (en) User keyword extraction device, method, and computer-readable storage medium
US10565293B2 (en) Synchronizing DOM element references
US9015657B2 (en) Systems and methods for developing and delivering platform adaptive web and native application content
US20150278359A1 (en) Method and apparatus for generating a recommendation page
CN101950312B (en) Method for analyzing webpage content of internet
JP5930496B2 (en) Method and apparatus for acquiring structured information in layout file
CN106897251B (en) Rich text display method and device
CN104572668B (en) Method and apparatus based on multiple pattern file generated Merge Styles files
US20140289612A1 (en) Merging web page style addresses
WO2013106595A2 (en) Processing store visiting data
US8290925B1 (en) Locating product references in content pages
CN106339381B (en) Information processing method and device
CN103970898A (en) Method and device for extracting information based on multistage rule base
CN106933916B (en) JSON character string processing method and device
CN109710224B (en) Page processing method, device, equipment and storage medium
CN110209780B (en) Question template generation method and device, server and storage medium
CN106033444B (en) Text content clustering method and device
CN112650529B (en) System and method for configurable generation of mobile terminal APP codes
US20130318133A1 (en) Techniques to manage universal file descriptor models for content files
CN103914479A (en) Resource request matching method and device
CN104484449A (en) Web page text extraction method and web page text extraction device
CN115065945B (en) Short message link generation method and device, electronic equipment and storage medium
CN103440231A (en) Equipment and method for comparing texts
EP4322025A1 (en) Method and apparatus for searching for clipping template
CN104991920A (en) Label generation method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant