CN114637505A - Page content extraction method and device - Google Patents

Info

Publication number
CN114637505A
Authority
CN
China
Prior art keywords
element node
node
information block
dom tree
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011488109.6A
Other languages
Chinese (zh)
Inventor
何熠皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxinjunhe Beijing Technology Co ltd
Original Assignee
Guoxinjunhe Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guoxinjunhe Beijing Technology Co ltd
Priority to CN202011488109.6A
Publication of CN114637505A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/38 Creation or generation of source code for implementing user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a page content extraction method and device. The method comprises the following steps: acquiring a target DOM tree corresponding to a page; starting from the root element node of the target DOM tree, performing rendering area deduplication on the element nodes in the target DOM tree to obtain a deduplication DOM tree; extracting the visual features of each element node in the deduplication DOM tree; and parsing and executing a preset interface template, and extracting the page content of the page from the visual features through the interface template, wherein content extraction conditions for matching the page content are set in the interface template. Because visual features are comparatively stable, locating element nodes by them is easier to understand and less prone to positioning failure; and because the interface template extracts the page content declaratively, the extraction is convenient and easy to operate.

Description

Page content extraction method and device
Technical Field
The invention relates to the technical field of internet, in particular to a page content extraction method and device.
Background
Page content extraction is the process of parsing useful data out of an HTML (HyperText Markup Language) page, which is unformatted data, and converting that data into formatted data.
At present, page content extraction mainly uses a CSS (Cascading Style Sheets) selector or an XPath (XML Path Language) path to locate element nodes in the DOM (Document Object Model) tree corresponding to an HTML page, and then extracts the page content corresponding to those element nodes. However, because the DOM tree of a page is subject to change, locating with a CSS selector or an XPath path easily fails; in addition, page styles are complex and vary widely, so the coding complexity is high.
Disclosure of Invention
The main purpose of the embodiments of the invention is to provide a page content extraction method and device, so as to solve the prior-art problem that locating with a CSS selector or an XPath path easily fails because the DOM tree of a page is subject to change.
In view of the above technical problems, the embodiments of the present invention are solved by the following technical solutions:
An embodiment of the invention provides a page content extraction method, which comprises the following steps: acquiring a target DOM tree corresponding to a page; starting from the root element node of the target DOM tree, performing rendering area deduplication on the element nodes in the target DOM tree to obtain a deduplication DOM tree; extracting the visual features of each element node in the deduplication DOM tree; and parsing and executing a preset interface template, and extracting the page content of the page from the visual features through the interface template, wherein content extraction conditions for matching the page content are set in the interface template.
Before extracting the visual features of the element nodes of the display class in the target DOM tree, the method further comprises the following steps: analyzing the page into a DOM tree; determining a target element node in the DOM tree; acquiring a target DOM tree in the DOM tree; and the root element node of the target DOM tree is the target element node.
Wherein, performing rendering area deduplication on the element nodes in the target DOM tree to obtain a deduplication DOM tree comprises: starting from the root element node of the target DOM tree, performing the following operations for each parent element node in the target DOM tree: comparing the parent element node with its child element nodes; discarding the parent element node from the target DOM tree when the rendering area of the parent element node is the same as the rendering area of the child element node; and discarding the child element node from the target DOM tree when the rendering area of the parent element node covers the rendering area of the child element node and only the child element node of the parent element node is of a text semantic type.
Wherein, extracting the visual features of each element node in the deduplication DOM tree comprises: performing a drill-down operation on each element node in the deduplication DOM tree to determine the visual feature corresponding to the element node; wherein the visual feature corresponding to the element node comprises the information block corresponding to the element node, or the information block set corresponding to the element node; the information block set corresponding to the element node comprises an information block or an information block set corresponding to each sub-element node of the element node; and each information block is constructed according to the attribute information of the element node corresponding to that information block.
Wherein the performing a drill-down operation for each element node in the deduplication DOM tree comprises: when the element node is a text node or a leaf node, constructing an information block corresponding to the element node according to the attribute information of the element node; when at least one sub-element node exists in the element node, constructing an information block set for the element node, and executing the following operations aiming at the element node: when the child element node of the element node is a text node, constructing an information block corresponding to the child element node according to the attribute information of the child element node, and adding the information block corresponding to the child element node into an information block set corresponding to the element node; when the child element node of the element node is of a text semantic type, or when the child element node of the element node is of a non-text semantic type and the rendering area of the child element node is not blocked, constructing an information block set for the child element node, executing the drill-down operation on the child element node, and adding the information block set corresponding to the child element node into the information block set corresponding to the element node until the object of the drill-down operation is a leaf node.
Wherein, after the constructing the information block corresponding to the element node, the method further comprises: marking the information block type for the information block corresponding to the element node according to the node type of the element node; wherein the information block type is used as a constraint condition in the content extraction condition; after the constructing the information block corresponding to the sub-element node, further comprising: and marking the information block type for the information block corresponding to the sub-element node according to the node type of the sub-element node.
Wherein, in the process of constructing the information block set for the element node and executing the drill-down operation for the element node, the method further comprises: determining a rendering coverage area according to the rendering area of each sub-element node of the element node; determining a longitudinal average distance deviation and a transverse average distance deviation according to the rendering area and the rendering coverage area of each sub-element node of the element nodes; when the longitudinal average distance deviation is larger than the transverse average distance deviation, determining the information block set corresponding to the element node as a column mode information block set, and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing column sorting on the information block set corresponding to the element node; when the longitudinal average distance deviation is smaller than or equal to the transverse average distance deviation, determining the information block set corresponding to the element node as a row mode information block set, and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing row sorting on the information block set corresponding to the element node; wherein the column ordering and the row ordering are used as constraints in the content extraction conditions.
The page content is element content in attribute information of element nodes; the content extraction conditions include: element constraints, matching constraints, and/or extraction transformation rules; the parsing and executing an interface template through which to extract page content of the page from the visual features, comprising: generating a state transition strategy by analyzing element constraint conditions, matching constraint conditions and/or extracting transformation rules in the interface template; wherein at least one state node is included in the state transition policy, each state node for matching an element content among a plurality of the visual features; executing the state transition policy to extract different element contents in the plurality of visual features respectively by the at least one state node in the state transition policy.
Wherein, a plurality of branches are included in the state transition strategy, and each branch comprises at least one state node; wherein a plurality of the branches correspond to the same rendering range and different rendering styles; the executing the state transition policy includes: executing a plurality of said branches, respectively; while executing each of the branches, at least one state node in the branch is executed separately so as to extract different element contents in a plurality of the visual features through the at least one state node.
An embodiment of the present invention further provides a page content extraction device, including: an acquisition module, used for acquiring a target DOM tree corresponding to a page; a deduplication module, used for performing, starting from the root element node of the target DOM tree, rendering area deduplication on the element nodes in the target DOM tree to obtain a deduplication DOM tree; a first extraction module, used for extracting the visual features of each element node in the deduplication DOM tree; and a second extraction module, used for parsing and executing a preset interface template and extracting the page content of the page from the plurality of visual features through the interface template, wherein content extraction conditions for matching the page content are set in the interface template.
An embodiment of the present invention provides an apparatus, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the page content extraction method of any one of the above.
An embodiment of the present invention provides a computer-readable storage medium, where a page content extraction program is stored on the computer-readable storage medium, and when being executed by a processor, the page content extraction program implements the steps of any one of the page content extraction methods described above.
The embodiment of the invention has the following beneficial effects:
the embodiment extracts the visual features of the element nodes in the DOM tree corresponding to the page, and extracts the page content from the visual features through the interface template. The visual structure of a page has a certain continuity and stability: pages are not redesigned frequently, and even when a page is redesigned, its overall visual logical structure remains unchanged, so the rendering areas of the element nodes in the page do not change easily. This embodiment therefore uses the comparatively stable visual features to locate element nodes, which is easier to understand and less prone to positioning failure; and because the interface template extracts the page content declaratively, the extraction is convenient and easy to operate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a method of page content extraction according to an embodiment of the invention;
FIG. 2 is a flow chart of the steps of a drill-down operation according to one embodiment of the present invention;
FIG. 3 is a flowchart of the steps for determining the row-column mode of the child element nodes of an element node, according to one embodiment of the invention;
FIG. 4 is a schematic diagram of rendering coverage areas according to an embodiment of the present invention;
FIG. 5 is a flowchart of the steps of parsing and executing an interface template, according to one embodiment of the invention;
FIG. 6 is a schematic diagram of content extraction conditions according to an embodiment of the invention;
FIG. 7 is a structural diagram of a page content extracting apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.
According to an embodiment of the invention, a page content extraction method is provided. Fig. 1 is a flowchart illustrating a page content extracting method according to an embodiment of the present invention.
Step S101, a target DOM tree corresponding to the page is obtained.
The DOM is a collection of nodes or pieces of information organized in a hierarchy. Specific information can be queried in the hierarchy. Because the DOM is hierarchical-based information, the DOM is considered a tree-based or object-based structure. The HTML DOM defines standard methods of accessing and manipulating HTML documents. The HTML DOM can render an HTML document as a tree structure (node tree).
The target DOM tree is a DOM tree or a portion of a DOM tree corresponding to the page.
The page whose content is to be extracted is parsed into a DOM tree; a target element node is determined in the DOM tree; and a target DOM tree is acquired from the DOM tree, wherein the root element node of the target DOM tree is the target element node.
Acquiring only the target DOM tree whose root node is the target element node excludes useless content from the page, which reduces the complexity of page extraction, reduces redundant information, and facilitates the subsequent extraction. For example, if a search page contains a hot search list and the content of that list is to be extracted, the local DOM tree corresponding to the hot search list can be obtained from the DOM tree corresponding to the search page, and the page content is then extracted based on that local DOM tree.
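The hot-search example can be illustrated with a minimal sketch. It assumes well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it (real HTML usually needs a tolerant HTML parser), and it assumes the target element node is identified by an `id` attribute; the text does not say how the target element node is determined, so `target_subtree`, the `id` values and the sample page are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page with a navigation bar and a hot search list.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="hot-search">
    <ul><li>first topic</li><li>second topic</li></ul>
  </div>
</body></html>
"""


def target_subtree(page_source: str, target_id: str) -> ET.Element:
    """Parse the page and return the subtree rooted at the target element node."""
    root = ET.fromstring(page_source.strip())
    # ElementTree's XPath subset supports attribute predicates such as [@id='x'].
    target = root.find(f".//*[@id='{target_id}']")
    if target is None:
        raise LookupError(f"no element with id={target_id!r}")
    return target


subtree = target_subtree(PAGE, "hot-search")
# Only the hot search list remains; the navigation bar is outside the subtree.
items = [li.text for li in subtree.iter("li")]
```

Everything downstream then works on `subtree` only, which is the point of the step: content outside the region of interest never enters deduplication or feature extraction.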
Step S102, starting from the root element node of the target DOM tree, performing rendering area deduplication on the element nodes in the target DOM tree to obtain a deduplication DOM tree.
Rendering area deduplication removes, from the target DOM tree, element nodes whose rendering areas overlap or are covered and which are not used for page rendering.
Specifically, starting from the root element node of the target DOM tree, the following operations are performed for each parent element node in the target DOM tree to obtain the deduplication DOM tree: comparing the parent element node with its child element nodes; discarding the parent element node from the target DOM tree when the rendering area of the parent element node is the same as the rendering area of the child element node; and discarding the child element node from the target DOM tree when the rendering area of the parent element node covers the rendering area of the child element node and only the child element node of the parent element node is of a text semantic type.
For example, the parent element node and the child element node are two nested DIV tags, the rendering area of the parent element node is larger than that of the child element node, and the child element node is of a text semantic type; the DIV tag of the parent element node can then be discarded and the DIV tag of the child element node retained.
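The identical-area rule of step S102 can be sketched as a recursive pass over the tree. This is a minimal sketch under stated assumptions, not the patented implementation: the `Node` class and `dedupe` function are hypothetical, rendering areas are plain `(top, left, bottom, right)` tuples, and "discarding the parent" is assumed to mean that its surviving children take its place. Only the rule for identical rendering areas is covered; the rule for a covering parent with a text-semantic child is left out of the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (top, left, bottom, right)


@dataclass
class Node:
    tag: str
    rect: Rect
    children: List["Node"] = field(default_factory=list)


def dedupe(node: Node) -> List[Node]:
    """Return the node(s) that replace `node` after deduplication."""
    kept_children: List[Node] = []
    for child in node.children:
        kept_children.extend(dedupe(child))
    # A parent whose rendering area equals that of one of its children adds
    # nothing visually, so it is dropped and its surviving children are promoted.
    if any(child.rect == node.rect for child in node.children):
        return kept_children
    return [Node(node.tag, node.rect, kept_children)]


# Two wrapper DIVs with the same rendering area around one heading.
heading = Node("h1", (0, 0, 20, 200))
inner = Node("div", (0, 0, 100, 200), [heading])
outer = Node("div", (0, 0, 100, 200), [inner])
result = dedupe(outer)
```

Here `result` holds a single DIV (the inner one) that still contains the heading; the outer wrapper has been removed because it occupied exactly the same rendering area.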
Step S103, extracting the visual features of each element node in the deduplication DOM tree.
The visual features refer to visual structural features in the rendering area where the element nodes are located.
Specifically, a drill-down operation is performed for each element node in the deduplication DOM tree to determine a visual feature corresponding to the element node.
The drill-down operation changes the dimension level and thereby the granularity of the analysis. It moves from summarized data down into detailed data; here, that means iteratively analyzing from a parent element node to its direct child element nodes, and from indirect child element nodes on down to the leaf nodes. A direct child element node is a child element node directly connected to the parent element node. An indirect child element node is a child element node of a child element node.
The visual features corresponding to the element nodes comprise: the information block corresponding to the element node, or the information block set corresponding to the element node; the information block set corresponding to the element node comprises: an information block or set of information blocks corresponding to each sub-element node of the element node.
The information block is constructed according to the attribute information of the element node corresponding to the information block. For example: the information Block is a Block object constructed according to the attribute information of the element node.
The information block set is a set constructed from information blocks and a next-level information block set. The information Block set is a Stack object, and the Stack object comprises a Block object and a Stack object.
Step S104, analyzing and executing a preset interface template, and extracting the page content of the page from the plurality of visual features through the interface template; and setting a content extraction condition for matching the page content in the interface template.
The page content refers to element content in attribute information of the element node. For example: the element content is text content.
The interface template is used for extracting the required page content from the extracted multiple visual features.
The interface template includes: call address, return value type and parameter. Parameters of the interface template include, but are not limited to: content extraction conditions. The parameters of the interface template can be set according to requirements.
The content extraction conditions include: element constraints, matching constraints, and/or extraction transformation rules. The element constraints are used to match the element nodes. The matching constraints are used to define the matching means. The extraction transformation rules are used to define the extraction manner.
This embodiment extracts the visual features of the element nodes in the DOM tree corresponding to the page, and extracts the page content from the visual features through the interface template. The visual structure of a page has a certain continuity and stability: pages are not redesigned frequently, and even when a page is redesigned, its overall visual logical structure remains unchanged (for example, the basic presentation logic of a search engine's result list is always a 3-layer structure of title + abstract + source, supplemented by various other elements), so the rendering areas of the element nodes in the page do not change easily.
Compared with locating element nodes through their attribute features, this embodiment does not need to consider the complexity of the DOM tree: locating can be done on the basis of a limited number of visual features, which greatly reduces the complexity of extracting page content, improves applicability, and keeps the extraction maintainable even when the page layout is highly complex.
Compared with extracting page content imperatively, this embodiment requires no imperative code to be maintained, and avoids repeatedly debugging such code whenever node attributes change.
The process of performing a drill-down operation for each element node in the target DOM tree is described further below.
And when the element node is a text node or a leaf node, constructing an information block corresponding to the element node according to the attribute information of the element node.
And when at least one sub-element node exists in the element node, constructing an information block set for the element node, and executing a drill-down operation aiming at the element node. Performing a drill-down operation for the element node, comprising:
and when the sub-element node of the element node is a text node, constructing an information block corresponding to the sub-element node according to the attribute information of the sub-element node, and adding the information block corresponding to the sub-element node into an information block set corresponding to the element node.
When the child element node of the element node is of a text semantic type, or when the child element node of the element node is of a non-text semantic type and the rendering area of the child element node is not blocked, constructing an information block set for the child element node, continuing to execute the drill-down operation on the child element node until the object of the drill-down operation is a leaf node, and adding the information block set corresponding to the child element node into the information block set corresponding to the element node.
Specifically, the element node that needs to perform the drill-down operation is taken as the target element node, and the steps of the drill-down operation are shown in fig. 2.
Step S201, acquiring attribute information of the target element node and attribute information of all child element nodes of the target element node.
The attribute information includes, but is not limited to: rendering Region (Region), element content, font size, color, and additional attributes.
The Region records a Rect attribute (rectangle attribute), which is a rectangle area (top, left, bottom, right) formed by the top left coordinate and the bottom right coordinate of the rendering area.
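A small sketch of the Rect attribute may make the later visibility and covering checks easier to follow. The class and method names (`Rect`, `is_visible`, `covers`) are hypothetical, and the sketch assumes screen coordinates in which values grow downward and to the right, so `top <= bottom` and `left <= right`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Rectangle (top, left, bottom, right) of a rendering area."""
    top: float
    left: float
    bottom: float
    right: float

    @property
    def width(self) -> float:
        return max(0.0, self.right - self.left)

    @property
    def height(self) -> float:
        return max(0.0, self.bottom - self.top)

    def is_visible(self) -> bool:
        # Assumption: a rendering area is visible when it has a positive size.
        return self.width > 0 and self.height > 0

    def covers(self, other: "Rect") -> bool:
        # True when `other` lies entirely inside this rectangle.
        return (self.top <= other.top and self.left <= other.left
                and self.bottom >= other.bottom and self.right >= other.right)


parent = Rect(top=0, left=0, bottom=100, right=200)
child = Rect(top=10, left=10, bottom=40, right=190)
collapsed = Rect(top=50, left=0, bottom=50, right=200)  # zero height
```

With these values, `parent.covers(child)` holds, `child.covers(parent)` does not, and `collapsed.is_visible()` is false because the rectangle has no height.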
Step S202, identifying whether the node type of the target element node is a text node; if so, go to step S213; if not, step S203 is executed.
When the target element node is a text node and contains valid text content, the target element node is used for showing the text content in the page, and attribute information of the target element node can be returned for use in subsequent page content extraction.
Step S203, identifying whether the element node is a leaf node, namely, does not contain a child node; if so, go to step S213; if not, step S204 is performed.
Step S204, constructing a target Stack object for the target element node.
In step S205, each child element node of the target element node is sequentially queried.
Step S206, identifying whether the node type of the current child element node is a text node; if yes, go to step S210; if not, step S207 is performed.
When the child element node is a text node and contains valid text content, the child element node is used for showing the text content in the page, and the attribute information of the child element node can be returned for use in subsequent page content extraction.
Step S207, identifying whether the rendering area of the current child element node is visible (size > 0); if yes, go to step S208; if not, the current child element node is ignored, and the process jumps to step S205.
Step S208, identifying whether the rendering type of the current sub-element node is a text semantic type; if yes, go to step S211; if not, step S209 is performed.
HTML tags corresponding to text semantic types include, but are not limited to: H1-H6, P, SPAN and I.
Step S209, determining whether the rendering area of the current child element node is shielded by the rendering area of the target element node, if so, jumping to step S212; if not, step S211 is performed.
Step S210, returning the attribute information of the current child element node, constructing a Block object, and adding the Block object into the target Stack object.
Step S211, marking the rendering area of the current child element node as an effective rendering area, taking the current child element node as a target element node, jumping to step S201, executing a drill-down operation, after the drill-down operation for the current child element node is completed, adding the obtained Block object and/or Stack object to the target Stack object, and jumping to step S212.
Step S213, building a Block object according to the attribute information of the target element node.
After the Block object or the Stack object of the target element node has been constructed, it is taken as the visual feature of the target element node. The Stack object of the target element node includes the Block objects or Stack objects of its direct child element nodes and the Block objects or Stack objects of its indirect child element nodes. The Stack object of the target element node is thus a vision-based hierarchy. The Block object or the Stack object of the target element node exists in the layout file, and therefore page content can be extracted from the layout file.
Further, if the Stack object of a child element node contains only one Block object, that Block object itself is added to the Stack object of the target element node.
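The drill-down steps can be condensed into a short recursive sketch. It is a simplified reading of steps S201 to S213 rather than the patented implementation: the `Node`, `Block` and `Stack` classes are hypothetical, a text node is marked by the made-up tag `"#text"`, and the visibility and occlusion checks of S207 and S209 are assumed to have been computed already and stored as booleans on each node.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Tags listed in the text as text semantic types (lower-cased here).
TEXT_SEMANTIC_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "span", "i"}


@dataclass
class Node:
    tag: str                      # "#text" marks a text node (hypothetical convention)
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    visible: bool = True          # outcome of the S207 check, assumed precomputed
    occluded: bool = False        # outcome of the S209 check, assumed precomputed


@dataclass
class Block:
    tag: str
    content: str


@dataclass
class Stack:
    items: List[Union[Block, "Stack"]] = field(default_factory=list)


def drill_down(node: Node) -> Union[Block, Stack]:
    # S202 / S203 -> S213: text nodes and leaf nodes become a Block.
    if node.tag == "#text" or not node.children:
        return Block(node.tag, node.text)
    stack = Stack()                                   # S204
    for child in node.children:                       # S205
        if child.tag == "#text":                      # S206 -> S210
            stack.items.append(Block(child.tag, child.text))
            continue
        if not child.visible:                         # S207: ignore invisible children
            continue
        # S208 / S209: recurse into text-semantic children and into
        # non-text-semantic children that are not occluded.
        if child.tag in TEXT_SEMANTIC_TAGS or not child.occluded:
            result = drill_down(child)                # S211
            # A Stack holding a single Block is replaced by that Block.
            if (isinstance(result, Stack) and len(result.items) == 1
                    and isinstance(result.items[0], Block)):
                result = result.items[0]
            stack.items.append(result)
    return stack


page = Node("div", children=[
    Node("h1", children=[Node("#text", "Title")]),
    Node("div", visible=False, children=[Node("#text", "hidden")]),
    Node("div", occluded=True, children=[Node("#text", "covered")]),
    Node("p", children=[Node("#text", "a"), Node("i", children=[Node("#text", "b")])]),
])
features = drill_down(page)
```

For this sample tree, `features` is a Stack with two items: the Block for "Title" (its single-item Stack was flattened) and a nested Stack holding the Blocks for "a" and "b". The invisible and occluded DIVs contribute nothing.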
In order to make the subsequent content extraction more accurate, the step shown in fig. 3 may be performed during the process of constructing the information block set for the element node and performing the drill-down operation on the element node, so as to determine the row-column mode of the child element nodes of the element node.
Step S301, determining rendering coverage area according to rendering area of each sub-element node of the element node.
The rendering coverage area is the rectangle enclosed by the extension lines of the outermost top, bottom, left and right edges among the rendering areas of the child element nodes of the element node.
Step S302, according to the rendering area and the rendering coverage area of each sub-element node of the element node, determining the longitudinal average distance deviation and the transverse average distance deviation.
FIG. 4 is a schematic diagram of a rendering coverage area according to an embodiment of the present invention. The element node includes child element nodes r1, r2, r3 and r4. The average distance deviation of the Rects of all Regions at Top (the longitudinal average distance deviation) is calculated as the average of d11, d12 and d13, and the average distance deviation of the Rects of all Regions at Left (the transverse average distance deviation) is calculated as the average of d22, d23 and d24.
Step S303, judging whether the longitudinal average distance deviation is larger than the transverse average distance deviation; if yes, go to step S304; if not, step S305 is performed.
Step S304, determining the information block set corresponding to the element node as a Column mode information block set (Column), and after adding the information block set corresponding to the sub-element node to the information block set corresponding to the element node, performing Column sorting on the information block set corresponding to the element node.
When the longitudinal average distance deviation is larger than the transverse average distance deviation, the sub-element nodes of the element node are generally arranged longitudinally.
Step S305, determining the information block set corresponding to the element node as a Row mode information block set (Row), and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing Row sorting on the information block set corresponding to the element node; wherein the column ordering and the row ordering are used as constraints in the content extraction conditions.
When the longitudinal average distance deviation is less than or equal to the transverse average distance deviation, the sub-element nodes of the element node are generally arranged transversely.
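Steps S301 to S305 can be illustrated with a minimal Python sketch. The names Rect, coverage_area and layout_mode are hypothetical, and the sketch assumes that each deviation is the offset of a sub-element's top (or left) edge from the corresponding edge of the rendering coverage area, averaged over all sub-elements; Fig. 4 averages three distances for four sub-elements, so the exact set of distances that is averaged is an assumption here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rect:
    """Rendering area of a sub-element node, in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def coverage_area(rects):
    """Rectangle bounded by the outermost edges of all sub-element rects."""
    return Rect(
        left=min(r.left for r in rects),
        top=min(r.top for r in rects),
        right=max(r.right for r in rects),
        bottom=max(r.bottom for r in rects),
    )


def layout_mode(rects):
    """Return "Column" or "Row" using the comparison of step S303."""
    area = coverage_area(rects)
    # Assumption: a deviation is a sub-element's offset from the coverage edge.
    longitudinal = mean(r.top - area.top for r in rects)
    transverse = mean(r.left - area.left for r in rects)
    return "Column" if longitudinal > transverse else "Row"


stacked = [Rect(0, 0, 100, 20), Rect(0, 30, 100, 50), Rect(0, 60, 100, 80)]
side_by_side = [Rect(0, 0, 30, 20), Rect(40, 0, 70, 20), Rect(80, 0, 110, 20)]
print(layout_mode(stacked))       # Column
print(layout_mode(side_by_side))  # Row
```

In the stacked case only the top offsets vary, so the longitudinal deviation dominates; in the side-by-side case only the left offsets vary, so the set falls into Row mode.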
In addition, after the information block corresponding to the element node is constructed, the information block type may be marked for the information block corresponding to the element node according to the node type of the element node. After the information block corresponding to the sub-element node is constructed, marking the information block type for the information block corresponding to the sub-element node according to the node type of the sub-element node; wherein the information block type is also used as a constraint in the content extraction condition.
Further, node types include, but are not limited to: INPUT nodes (input nodes), IMG nodes (picture nodes), and text nodes. If the element node is an INPUT node, the Block object corresponding to the element node is marked as the INPUT type. If the element node is an IMG node, the Block object corresponding to the element node is marked as the Image type. If the element node is a text node, the Block object corresponding to the element node is marked as the Content type.
The steps for parsing and executing the interface template are further described below. FIG. 5 is a flowchart illustrating steps for parsing and executing an interface template according to an embodiment of the invention.
Step S501, generating a state transition policy by parsing the element constraints, matching constraints and/or extraction transformation rules in the interface template; wherein the state transition policy includes at least one state node, and each state node is used for matching one element content among the plurality of visual features.
The interface template includes, but is not limited to, parameters. The parameters include, but are not limited to, content extraction conditions. The content extraction conditions include element constraints, matching constraints and/or extraction transformation rules.
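The type marking described above can be sketched as a simple lookup. This is an illustrative sketch only: BlockType and block_type_for are hypothetical names, "#text" is the standard DOM node name for a text node, and leaving other node types unmarked is an assumption, since the text does not say how they are handled.

```python
from enum import Enum
from typing import Optional


class BlockType(Enum):
    INPUT = "INPUT"
    IMAGE = "Image"
    CONTENT = "Content"


def block_type_for(node_name: str) -> Optional[BlockType]:
    """Map a DOM node name to the information block type named in the text."""
    mapping = {
        "INPUT": BlockType.INPUT,
        "IMG": BlockType.IMAGE,
        "#text": BlockType.CONTENT,
    }
    # Assumption: node types outside the three listed ones stay unmarked.
    return mapping.get(node_name)


print(block_type_for("IMG"))  # BlockType.IMAGE
```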
The matching constraint and/or the extraction transformation rule may constitute at least one matching tag by means of the element constraint. The state transition policy is used to determine the execution order of the matching tags.
Step S502, executing the state transition policy so as to extract different element contents in the plurality of visual features through the at least one state node in the state transition policy, respectively.
In the process of executing the state transition strategy, each state node matches the element content in the information Block (Block object) according to the matching tag, and when the element content can match the matching tag, the element content is extracted.
The element constraints, the matching constraints and the extraction transformation rules in the content extraction conditions are further described below.
The element constraints are used to match the element nodes. The element constraints include a type constraint, a text constraint, a font constraint, and a color constraint.
The type constraints include: text, link, and img. A type constraint represents a presentation type; it is written immediately to the right of "<", and "|" is used to connect multiple type constraints.
The text constraints specify keywords that the element content to be extracted must include. A string constant or a regular expression is used to constrain the keywords displayed in the element content, where a string constant is enclosed in quotation marks and a regular expression constant is likewise enclosed in delimiters.
The font constraints include font filtering conditions for the element content to be extracted, for example fontsize (font size) and fontweight (font weight).
The color constraints specify the color of the element content to be extracted.
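The four element constraints can be pictured as a predicate over an information block. The following Python sketch is illustrative only: the Block and ElementConstraint classes and their field names are hypothetical, and treating a string constant as a substring test is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Block:
    block_type: str                      # "text", "link" or "img"
    content: str = ""
    fontsize: Optional[int] = None
    fontweight: Optional[str] = None
    color: Optional[str] = None


@dataclass
class ElementConstraint:
    types: frozenset = frozenset()       # alternatives joined by "|"
    text: Union[str, re.Pattern, None] = None
    fontsize: Optional[int] = None
    fontweight: Optional[str] = None
    color: Optional[str] = None

    def matches(self, block: Block) -> bool:
        """Return True only if every constraint that is set holds for the block."""
        if self.types and block.block_type not in self.types:
            return False
        # Assumption: a string constant must appear somewhere in the content.
        if isinstance(self.text, str) and self.text not in block.content:
            return False
        if isinstance(self.text, re.Pattern) and not self.text.search(block.content):
            return False
        for name in ("fontsize", "fontweight", "color"):
            wanted = getattr(self, name)
            if wanted is not None and getattr(block, name) != wanted:
                return False
        return True


price = ElementConstraint(types=frozenset({"text", "link"}),
                          text=re.compile(r"\d+"), fontsize=14)
print(price.matches(Block("text", "Price: 42", fontsize=14)))  # True
print(price.matches(Block("img", "42", fontsize=14)))          # False
```

Leaving a field unset means that constraint is not checked, which mirrors the way a matching tag only lists the constraints it needs.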
The matching constraints are used to define the matching means. The matching constraint conditions include, but are not limited to, optional matching, multi-element matching, and arbitrary matching.
Optional matching refers to matching in which the element constraint is not mandatory. An optional match may be expressed using the symbol "?".
Multiple element matching includes, but is not limited to: one-dimensional matching and two-dimensional matching. One-dimensional matching refers to multiple item matching in a default direction (row or column pattern). Two-dimensional matching means that matching in the orthogonal direction is required in addition to matching in the default direction.
Arbitrary matching may be represented using the symbol "…". A "…" on a line by itself means that any number of elements are matched longitudinally. A "…" located between two elements in the transverse direction means that any number of elements are matched between the two elements.
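Optional matching and arbitrary matching in the default direction can be sketched as a small recursive matcher. This is not the matcher of the embodiment: ANY, match_sequence and the (constraint, optional) pairs are hypothetical, blocks are reduced to plain strings, and letting "…" absorb zero or more elements is an assumption.

```python
from typing import Callable, List, Sequence

ANY = object()  # hypothetical sentinel standing in for the "…" symbol


def match_sequence(pattern: Sequence, blocks: List[str]) -> bool:
    """Return True if the blocks, read in the default direction, fit the pattern.

    Each pattern item is either ANY or a (constraint, optional) pair, where
    constraint is a predicate and optional models the "?" symbol.
    """
    if not pattern:
        return not blocks
    head, rest = pattern[0], pattern[1:]
    if head is ANY:
        # "…" absorbs zero or more blocks before the rest of the pattern resumes.
        return any(match_sequence(rest, blocks[i:]) for i in range(len(blocks) + 1))
    constraint, optional = head
    if blocks and constraint(blocks[0]) and match_sequence(rest, blocks[1:]):
        return True
    # "?" lets the pattern continue without consuming a block.
    return optional and match_sequence(rest, blocks)


def is_kind(kind: str) -> Callable[[str], bool]:
    return lambda block: block == kind


pattern = [(is_kind("title"), False), (is_kind("date"), True), ANY, (is_kind("link"), False)]
print(match_sequence(pattern, ["title", "date", "ad", "ad", "link"]))  # True
print(match_sequence(pattern, ["title", "link"]))                      # True
print(match_sequence(pattern, ["date", "link"]))                       # False
```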
The extraction transformation rules are used to define the extraction manner. The extraction transformation rules include, but are not limited to: standard extraction rules, specified attribute extraction rules, and specified attribute and transformation extraction rules.
A standard extraction rule assigns the extracted element content to a specified variable name. The standard extraction rule may be expressed as {{variable name}}.
A specified attribute extraction rule extracts specified attribute information of the element node and assigns the attribute information to a preset variable name. The specified attribute extraction rule may be expressed as {{.attribute name variable name}}.
A specified attribute and transformation extraction rule extracts specified attribute information of the element node, passes the attribute information as a parameter into a preset transformation method, which modifies the format of the attribute information, and assigns the reformatted attribute information to the preset variable name. The specified attribute and transformation extraction rule may be expressed as {{[.attribute name] variable name: transformation method}}.
In the content extraction conditions, the constraints are enclosed in the paired symbols < >, and each state node comprises at least one matching tag. Each matching tag is composed of an element constraint, a matching constraint and/or an extraction transformation rule. Matching tags arranged in one row indicate that the element contents are displayed in the same row space; a line break between matching tags indicates that the actual element contents are also displayed on separate lines.
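The three extraction rules differ only in where the value comes from and whether it is reformatted. A minimal Python sketch, assuming a hypothetical extract function and an information block represented as a plain dictionary with a "content" key:

```python
from typing import Callable, Dict, Optional


def extract(block: Dict[str, str], variable: str,
            attribute: Optional[str] = None,
            transform: Optional[Callable[[str], str]] = None) -> Dict[str, str]:
    """Apply one of the three extraction rules to an information block."""
    # Standard rule: take the element content; otherwise take the named attribute.
    value = block["content"] if attribute is None else block[attribute]
    # Attribute-and-transformation rule: reformat the value before assigning it.
    if transform is not None:
        value = transform(value)
    return {variable: value}


block = {"content": "Read more", "href": " /news/42 "}
print(extract(block, "title"))                   # {'title': 'Read more'}
print(extract(block, "url", attribute="href"))   # {'url': ' /news/42 '}
print(extract(block, "url", "href", str.strip))  # {'url': '/news/42'}
```

Here str.strip stands in for a preset transformation method; any function from a string to a string would fit the same slot.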
The state transition policy is the execution order of the matching tags. Further, the state transition policy is an execution order of the plurality of state nodes.
Since there may be multiple rendering styles within the same rendering range in a page, the state transition policy may include multiple branches, each branch including at least one state node; wherein the multiple branches correspond to the same rendering range and to different rendering styles.
For example: in a result page of a search engine, an advertisement, a map or text may be displayed in a region of one result, and it is apparent that rendering styles of the advertisement, the map and the text are different.
The plurality of branches are executed respectively; when each branch is executed, the at least one state node in the branch is executed so as to extract different element contents in the plurality of visual features.
For example, fig. 6 is a schematic diagram of content extraction conditions according to an embodiment of the present invention. The state transition policy can be generated by parsing the content extraction conditions. The state transition policy is:
# Branch 1
constraint 1-1-1, constraint 1-1-2;
constraint 1-2-1, constraint 1-2-2;
constraint 1-3-1, constraint 1-3-2;
# Branch 2
constraint 2-1-1;
constraint 2-2-1;
constraint 2-3-1.
In fig. 6, # branch 1 is # regular and # branch 2 is # top. Taking # regular as an example, # regular includes at least three rows; matching tags in the same row indicate that the element contents to be extracted are displayed in the same row space, and matching tags in different rows indicate that the element contents to be extracted are displayed in different row spaces. Taking the first row of # regular as an example, the row includes two < >, so the row includes two state nodes, namely constraint 1-1-1 and constraint 1-1-2.
When the state transition policy is executed, the branches are executed sequentially in their order; when each branch is executed, its state nodes are executed sequentially in their order. Further, after a state node has been executed, the ID of each matched information block (Block object) is recorded; when a subsequent state node is executed, if the ID of an information block that it matches has already been recorded, that information block is discarded.
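The execution order and the ID bookkeeping described above can be sketched as follows. The sketch is illustrative only: execute_policy is a hypothetical name, a state node is reduced to a predicate over a block, a block is a dictionary with "id" and "content" keys, and extracting at most one block per state node is an assumption.

```python
from typing import Callable, Dict, List

Block = Dict[str, str]               # hypothetical block: {"id": ..., "content": ...}
StateNode = Callable[[Block], bool]  # a state node reduced to a predicate


def execute_policy(branches: List[List[StateNode]], blocks: List[Block]) -> List[str]:
    """Run branches in order, and state nodes in order within each branch."""
    recorded_ids = set()
    extracted = []
    for branch in branches:
        for state_node in branch:
            for block in blocks:
                if not state_node(block):
                    continue
                if block["id"] in recorded_ids:
                    continue  # matched by an earlier state node: discard
                recorded_ids.add(block["id"])
                extracted.append(block["content"])
                break  # assumption: one block per state node
    return extracted


blocks = [{"id": "b1", "content": "Title"}, {"id": "b2", "content": "Summary"}]
matches_any = lambda block: True
starts_with_s = lambda block: block["content"].startswith("S")
print(execute_policy([[matches_any, matches_any]], blocks))  # ['Title', 'Summary']
```

Recording IDs is what keeps the second state node from extracting "Title" again even though its predicate accepts that block.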
The embodiment of the invention describes the page by its visual structure, which conforms to the basic principle and intuition by which humans recognize objects and allows the page content to be modeled more simply. Because the visual structure is not prone to change, the mental burden on users and maintainers is significantly reduced compared with the traditional approach. The embodiment of the invention models the page through the visual structure and extracts the page content (element content) from the visual structure through the interface template, which is convenient, fast and easy to understand, does not require maintaining a large amount of matching code, and is less prone to positioning failures.
The embodiment of the invention also provides a device for extracting the page content. Fig. 7 is a block diagram of a page content extracting apparatus according to an embodiment of the present invention.
The page content extraction device includes: an obtaining module 701, a deduplication module 702, a first extraction module 703 and a second extraction module 704.
The obtaining module 701 is configured to obtain a target document object model DOM tree corresponding to a page.
The deduplication module 702 is configured to execute, starting from a root element node of the target DOM tree, rendering region deduplication processing for the element nodes in the target DOM tree, to obtain a deduplication DOM tree.
The first extraction module 703 is configured to extract a visual feature of each element node in the deduplication DOM tree.
The second extraction module 704 is configured to parse and execute a preset interface template, and to extract the page content of the page from the plurality of visual features through the interface template; wherein a content extraction condition for matching the page content is set in the interface template.
The functions of the apparatus according to the embodiment of the present invention have been described in the above method embodiments, so that reference may be made to the related descriptions in the foregoing embodiments for details which are not described in the embodiment of the present invention, and further details are not described herein.
The page content extraction device comprises a processor and a memory. The obtaining module 701, the deduplication module 702, the first extraction module 703, the second extraction module 704, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided; by adjusting kernel parameters, the visual features of the element nodes in the DOM tree corresponding to the page are extracted, and the page elements are extracted from the visual features through the interface template.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the page content extraction method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the page content extraction method is executed when the program runs.
An embodiment of the present invention provides a device. Fig. 8 is a structural diagram of the device according to an embodiment of the present invention. The device 80 includes at least one processor 801, at least one memory 802 coupled to the processor 801, and a bus 803; the processor 801 and the memory 802 communicate with each other through the bus 803; the processor 801 is configured to call program instructions in the memory 802 to perform the page content extraction method described above. The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: acquiring a DOM tree corresponding to a page; starting from the root element node of the target DOM tree, executing rendering region deduplication processing for the element nodes in the target DOM tree to obtain a deduplication DOM tree; extracting the visual features of each element node in the deduplication DOM tree; parsing and executing a preset interface template, and extracting the page content of the page from the visual features through the interface template; wherein a content extraction condition for matching the page content is set in the interface template.
Before extracting the visual features of the element nodes of the display class in the target DOM tree, the method further comprises the following steps: parsing the page into a DOM tree; determining a target element node in the DOM tree; acquiring a target DOM tree in the DOM tree; and the root element node of the target DOM tree is the target element node.
Wherein, the executing rendering region deduplication processing for the element nodes in the target DOM tree to obtain a deduplication DOM tree comprises: starting from the root element node of the target DOM tree, executing the following operations for each parent element node in the target DOM tree to obtain a deduplication DOM tree: comparing the parent element node and the child element nodes of the parent element node; discarding the parent element node in the target DOM tree when the rendering area of the parent element node is the same as the rendering area of the child element node; and when the rendering area of the parent element node covers the rendering area of the child element node, and only the child element node of the parent element node is of a text semantic type, discarding the child element node in the target DOM tree.
Wherein, in the deduplication DOM tree, extracting visual features of element nodes comprises: performing a drill-down operation on each element node in the deduplication DOM tree so as to determine a visual feature corresponding to the element node; wherein the visual features corresponding to the element node comprise: the information block corresponding to the element node, or the information block set corresponding to the element node; the information block set corresponding to the element node comprises: an information block or an information block set corresponding to each sub-element node of the element node; and the information block is constructed according to attribute information of the element node corresponding to the information block.
Wherein the performing a drill-down operation for each element node in the deduplication DOM tree comprises: when the element node is a text node or a leaf node, constructing an information block corresponding to the element node according to the attribute information of the element node; when at least one sub-element node exists in the element node, constructing an information block set for the element node, and executing the following operations aiming at the element node: when the child element node of the element node is a text node, constructing an information block corresponding to the child element node according to the attribute information of the child element node, and adding the information block corresponding to the child element node into an information block set corresponding to the element node; when the child element node of the element node is of a text semantic type, or when the child element node of the element node is of a non-text semantic type and the rendering area of the child element node is not blocked, constructing an information block set for the child element node, executing the drill-down operation on the child element node, and adding the information block set corresponding to the child element node into the information block set corresponding to the element node until the object of the drill-down operation is a leaf node.
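The drill-down operation is naturally recursive, which a short Python sketch can make concrete. Everything named here (Node, Block, BlockSet, drill_down and the text_semantic and occluded flags) is hypothetical, and skipping a blocked sub-element node that is not of a text semantic type is an assumption, since the text only states when the drill-down continues.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    tag: str                               # "#text" marks a text node
    attrs: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    text_semantic: bool = False            # node is of a text semantic type
    occluded: bool = False                 # rendering area is blocked


@dataclass
class Block:
    attrs: dict


@dataclass
class BlockSet:
    items: List[Union[Block, "BlockSet"]] = field(default_factory=list)


def drill_down(node: Node) -> Union[Block, BlockSet]:
    """Build the visual feature (block or block set) for one element node."""
    if node.tag == "#text" or not node.children:
        return Block(node.attrs)           # text node or leaf node
    block_set = BlockSet()
    for child in node.children:
        if child.tag == "#text":
            block_set.items.append(Block(child.attrs))
        elif child.text_semantic or not child.occluded:
            block_set.items.append(drill_down(child))
        # Assumption: blocked, non-text-semantic children are skipped.
    return block_set


page = Node("div", children=[
    Node("#text", {"content": "Hello"}),
    Node("p", children=[Node("#text", {"content": "World"})]),
    Node("div", {"content": "hidden"}, occluded=True),
    Node("img", {"src": "a.png"}),
])
result = drill_down(page)
print(len(result.items))  # 3
```

The recursion stops on its own once the node being drilled into is a leaf, which corresponds to the termination condition stated above.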
Wherein, after the constructing the information block corresponding to the element node, the method further comprises: marking the information block type for the information block corresponding to the element node according to the node type of the element node; wherein the information block type is used as a constraint condition in the content extraction condition; after the constructing the information block corresponding to the sub-element node, further comprising: and marking the information block type for the information block corresponding to the sub-element node according to the node type of the sub-element node.
Wherein, in the process of constructing the information block set for the element node and executing the drill-down operation for the element node, the method further comprises: determining a rendering coverage area according to the rendering area of each sub-element node of the element node; determining a longitudinal average distance deviation and a transverse average distance deviation according to the rendering area and the rendering coverage area of each sub-element node of the element nodes; when the longitudinal average distance deviation is larger than the transverse average distance deviation, determining the information block set corresponding to the element node as a column mode information block set, and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing column sorting on the information block set corresponding to the element node; when the longitudinal average distance deviation is smaller than or equal to the transverse average distance deviation, determining the information block set corresponding to the element node as a row mode information block set, and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing row sorting on the information block set corresponding to the element node; wherein the column ordering and the row ordering are used as constraints in the content extraction conditions.
The page content is element content in attribute information of element nodes; the content extraction conditions include: element constraints, matching constraints, and/or extraction transformation rules; the parsing and executing an interface template through which to extract page content of the page from the visual features, comprising: generating a state transition strategy by analyzing element constraint conditions, matching constraint conditions and/or extracting transformation rules in the interface template; wherein at least one state node is included in the state transition policy, each state node for matching an element content among a plurality of the visual features; executing the state transition policy to extract different element contents in the plurality of visual features respectively by the at least one state node in the state transition policy.
Wherein, a plurality of branches are included in the state transition strategy, and each branch comprises at least one state node; wherein a plurality of the branches correspond to the same rendering range and different rendering styles; the executing the state transition policy includes: executing a plurality of said branches, respectively; upon execution of each of the branches, at least one state node in the branch is executed, respectively, to extract different element content in the plurality of visual features by the at least one state node.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for extracting page content is characterized by comprising the following steps:
acquiring a DOM tree corresponding to a page;
starting from the root element node of the target DOM tree, executing rendering area duplicate removal processing aiming at the element node in the target DOM tree to obtain a duplicate removal DOM tree;
extracting the visual characteristics of each element node in the de-duplication DOM tree;
analyzing and executing a preset interface template, and extracting the page content of the page from the visual features through the interface template; and setting content extraction conditions for matching the page content in the interface template.
2. The method according to claim 1, wherein before extracting visual features of element nodes of a presentation class in the target DOM tree, further comprising:
analyzing the page into a DOM tree;
determining a target element node in the DOM tree;
acquiring a target DOM tree in the DOM tree; and the root element node of the target DOM tree is the target element node.
3. The method of claim 1, wherein performing rendering region deduplication processing on the element nodes in the target DOM tree, resulting in a deduplicated DOM tree, comprises:
starting from the root element node of the target DOM tree, executing the following operations for each parent element node in the target DOM tree to obtain a deduplication DOM tree:
comparing the parent element node and child element nodes of the parent element node;
discarding the parent element node in the target DOM tree when the rendering area of the parent element node is the same as the rendering area of the child element node;
and when the rendering area of the parent element node covers the rendering area of the child element node, and only the child element node of the parent element node is of a text semantic type, discarding the child element node in the target DOM tree.
4. The method according to claim 1, wherein said extracting visual features of element nodes in said deduplication DOM tree comprises:
performing a drill-down operation on each element node in the de-duplication DOM tree so as to determine a visual feature corresponding to the element node;
wherein the visual features corresponding to the element node comprise: the information block corresponding to the element node, or the information block set corresponding to the element node; the information block set corresponding to the element node comprises: an information block or an information block set corresponding to each sub-element node of the element node; and the information block is constructed according to attribute information of the element node corresponding to the information block.
5. The method of claim 4, wherein performing a drill-down operation for each element node in the de-duplication DOM tree comprises:
when the element node is a text node or a leaf node, constructing an information block corresponding to the element node according to the attribute information of the element node;
when at least one sub-element node exists in the element node, constructing an information block set for the element node, and executing the following operations aiming at the element node:
when the child element node of the element node is a text node, constructing an information block corresponding to the child element node according to the attribute information of the child element node, and adding the information block corresponding to the child element node into an information block set corresponding to the element node;
when the child element node of the element node is of a text semantic type, or when the child element node of the element node is of a non-text semantic type and the rendering area of the child element node is not blocked, constructing an information block set for the child element node, executing the drill-down operation on the child element node, and adding the information block set corresponding to the child element node into the information block set corresponding to the element node until the object of the drill-down operation is a leaf node.
6. The method of claim 5,
after the constructing the information block corresponding to the element node, further comprising:
marking the information block type for the information block corresponding to the element node according to the node type of the element node; wherein the information block type is used as a constraint condition in the content extraction condition;
after the constructing the information block corresponding to the sub-element node, further comprising:
and marking the information block type for the information block corresponding to the sub-element node according to the node type of the sub-element node.
7. The method of claim 5, wherein during the constructing of the set of information blocks for the element node and the performing of the drill-down operation for the element node, further comprising:
determining a rendering coverage area according to the rendering area of each sub-element node of the element node;
determining a longitudinal average distance deviation and a transverse average distance deviation according to the rendering area and the rendering coverage area of each sub-element node of the element nodes;
when the longitudinal average distance deviation is larger than the transverse average distance deviation, determining the information block set corresponding to the element node as a column mode information block set, and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing column sorting on the information block set corresponding to the element node;
when the longitudinal average distance deviation is smaller than or equal to the transverse average distance deviation, determining the information block set corresponding to the element node as a row mode information block set, and after adding the information block set corresponding to the sub-element node into the information block set corresponding to the element node, performing row sorting on the information block set corresponding to the element node; wherein the column ordering and the row ordering are used as constraints in the content extraction conditions.
8. The method of claim 1, wherein:
the page content is element content in attribute information of the element node;
the content extraction conditions include: element constraints, matching constraints, and/or extraction transformation rules;
the parsing and executing of the interface template, and the extracting of the page content of the page from the visual features through the interface template, comprise:
generating a state transition policy by parsing the element constraints, matching constraints, and/or extraction transformation rules in the interface template; wherein the state transition policy includes at least one state node, and each state node is used to match one element content among a plurality of the visual features;
executing the state transition policy, so as to extract different element contents from the plurality of visual features through the at least one state node in the state transition policy.
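As a non-limiting sketch of how such a state transition policy might be built and executed, the following assumes that the interface template is a list of dictionaries, that an element constraint is a tag name, that a matching constraint is a regular expression, and that an extraction transformation rule is a plain callable. None of these formats comes from the claims; `StateNode`, `build_policy`, and `execute_policy` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class StateNode:
    name: str                  # key under which the extracted content is stored
    tag: str                   # assumed element constraint: required tag name
    pattern: str               # assumed matching constraint: regular expression
    transform: Callable = str  # assumed extraction transformation rule


def build_policy(template):
    """Parse the (hypothetical) template into an ordered list of state nodes."""
    return [
        StateNode(t["name"], t["tag"], t["pattern"], t.get("transform", str))
        for t in template
    ]


def execute_policy(policy, features):
    """Walk the features in order; each state node consumes one matching element content."""
    result, pos = {}, 0
    for state in policy:
        for i in range(pos, len(features)):
            feat = features[i]
            if feat["tag"] != state.tag:
                continue
            m = re.fullmatch(state.pattern, feat["content"])
            if m:
                # Use the first capture group if the pattern defines one.
                raw = m.group(1) if m.re.groups else m.group(0)
                result[state.name] = state.transform(raw)
                pos = i + 1  # transition: the next state starts after this feature
                break
    return result


features = [
    {"tag": "h1", "content": "Blue Widget"},
    {"tag": "span", "content": "Price: 19.90"},
    {"tag": "span", "content": "In stock"},
]
template = [
    {"name": "title", "tag": "h1", "pattern": r".+"},
    {"name": "price", "tag": "span", "pattern": r"Price: (\d+\.\d+)", "transform": float},
]
print(execute_policy(build_policy(template), features))
```

In this sketch a state node that finds no match leaves the position unchanged, so later state nodes can still match; an actual implementation could equally treat that as a failure of the whole policy.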
9. The method of claim 8, wherein:
the state transition policy includes a plurality of branches, each of the branches including at least one state node; wherein the plurality of branches correspond to the same rendering range and to different rendering styles;
the executing the state transition policy includes:
executing each of the plurality of branches;
when executing each branch, executing the at least one state node in that branch, so as to extract different element contents from the plurality of visual features through the at least one state node.
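The branch structure can be illustrated with a small sketch. It assumes that all visual features passed in already belong to one rendering range, that a rendering style is a simple string label on each feature, and that each state node is a `(name, pattern)` pair whose regular expression contains one capture group. These representations, and the names `run_branch` and `run_policy`, are hypothetical.

```python
import re


def run_branch(branch, features):
    """Execute one branch: only features with the branch's style are considered."""
    candidates = [f for f in features if f["style"] == branch["style"]]
    extracted, pos = {}, 0
    for name, pattern in branch["states"]:
        for i in range(pos, len(candidates)):
            m = re.fullmatch(pattern, candidates[i]["content"])
            if m:
                extracted[name] = m.group(1)
                pos = i + 1  # the next state node continues after this feature
                break
    return extracted


def run_policy(branches, features):
    """Execute every branch over the same features and merge the results."""
    merged = {}
    for branch in branches:
        merged.update(run_branch(branch, features))
    return merged


features = [
    {"style": "bold", "content": "Blue Widget"},
    {"style": "normal", "content": "Price: 19.90"},
    {"style": "normal", "content": "Stock: 7"},
]
branches = [
    {"style": "bold", "states": [("title", r"(.+)")]},
    {"style": "normal", "states": [("price", r"Price: (.+)"), ("stock", r"Stock: (\d+)")]},
]
print(run_policy(branches, features))
```

Each branch runs independently over the same input, so a branch for one rendering style never consumes content intended for another.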
10. A page content extraction apparatus, comprising:
an acquisition module, configured to acquire a DOM tree corresponding to a page;
a deduplication module, configured to perform rendering-area deduplication on the element nodes in the target DOM tree, starting from the root element node of the target DOM tree, to obtain a deduplicated DOM tree;
a first extraction module, configured to extract the visual features of each element node in the deduplicated DOM tree;
a second extraction module, configured to parse and execute a preset interface template and to extract the page content of the page from the visual features through the interface template; wherein content extraction conditions for matching the page content are set in the interface template.
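The module chain of the apparatus can be pictured with the following sketch, which is an illustration under stated assumptions rather than the claimed implementation. In particular, the deduplication rule is an assumption: a node with exactly one child that occupies the same rendering area is treated as a redundant wrapper and replaced by that child. `Node`, `dedupe`, `visual_features`, and `extract` are hypothetical names, and the template is reduced to a mapping from field name to tag name.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    rect: tuple  # assumed rendering area: (x, y, width, height)
    text: str = ""
    children: list = field(default_factory=list)


def dedupe(node):
    """Assumed rule: drop a wrapper whose only child has the same rendering area."""
    children = [dedupe(c) for c in node.children]
    if len(children) == 1 and children[0].rect == node.rect:
        return children[0]
    return Node(node.tag, node.rect, node.text, children)


def visual_features(node):
    """Flatten the deduplicated tree into feature dictionaries, in pre-order."""
    feats = [{"tag": node.tag, "rect": node.rect, "content": node.text}]
    for child in node.children:
        feats.extend(visual_features(child))
    return feats


def extract(root, template):
    """Pipeline: deduplicate, extract features, then match by tag (simplified template)."""
    feats = visual_features(dedupe(root))
    return {
        name: next((f["content"] for f in feats if f["tag"] == tag and f["content"]), None)
        for name, tag in template.items()
    }


inner = Node("div", (0, 0, 100, 50), children=[
    Node("h1", (0, 0, 100, 20), "Title"),
    Node("p", (0, 20, 100, 30), "Body"),
])
root = Node("div", (0, 0, 100, 50), children=[inner])  # redundant wrapper
print(extract(root, {"title": "h1", "body": "p"}))
```

Here the outer `div` wraps a single inner `div` with an identical rectangle, so only one of them survives deduplication and the feature list contains three entries instead of four.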
CN202011488109.6A 2020-12-16 2020-12-16 Page content extraction method and device Pending CN114637505A (en)

Priority Applications (1)

Application Number | Publication Number | Priority Date | Filing Date | Title
CN202011488109.6A | CN114637505A (en) | 2020-12-16 | 2020-12-16 | Page content extraction method and device

Publications (1)

Publication Number Publication Date
CN114637505A (en) 2022-06-17

Family

ID=81945487

Family Applications (1)

Application Number | Status | Publication Number | Priority Date | Filing Date | Title
CN202011488109.6A | Pending | CN114637505A (en) | 2020-12-16 | 2020-12-16 | Page content extraction method and device

Country Status (1)

Country Link
CN (1) CN114637505A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115830200A (en) * 2022-11-07 2023-03-21 北京力控元通科技有限公司 Three-dimensional model generation method, three-dimensional graph rendering method, device and equipment
CN115830200B (en) * 2022-11-07 2023-05-12 北京力控元通科技有限公司 Three-dimensional model generation method, three-dimensional graph rendering method, device and equipment

Similar Documents

Publication Publication Date Title
US9418315B1 (en) Systems, methods, and computer readable media for extracting data from portable document format (PDF) files
US7673235B2 (en) Method and apparatus for utilizing an object model to manage document parts for use in an electronic document
US10360294B2 (en) Methods and systems for efficient and accurate text extraction from unstructured documents
US7386558B2 (en) Methods and systems for filtering an Extensible Application Markup Language (XAML) file to facilitate indexing of the logical content contained therein
CN107423391B (en) Information extraction method of webpage structured data
CN113609820B (en) Method, device and equipment for generating word file based on extensible markup language file
CN109033282B (en) Webpage text extraction method and device based on extraction template
CN103699591A (en) Page body extraction method based on sample page
US20100185684A1 (en) High precision multi entity extraction
CN111737623A (en) Webpage information extraction method and related equipment
US20130124684A1 (en) Visual separator detection in web pages using code analysis
CN109271598B (en) Method, device and storage medium for extracting news webpage content
CN112527291A (en) Webpage generation method and device, electronic equipment and storage medium
CN108874934B (en) Page text extraction method and device
CN107590288B (en) Method and device for extracting webpage image-text blocks
CN114021042A (en) Webpage content extraction method and device, computer equipment and storage medium
CN108694192B (en) Webpage type judging method and device
CN114637505A (en) Page content extraction method and device
CN112328246A (en) Page component generation method and device, computer equipment and storage medium
US20080015843A1 (en) Linguistic Image Label Incorporating Decision Relevant Perceptual, Semantic, and Relationships Data
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN113343140B (en) Method for automatically extracting webpage text content based on neo4j graphic database
Xu et al. A new webpage classification model based on visual information using gestalt laws of grouping
JP2004303097A (en) Partial document extraction program and partial document extraction method of structured document
CN110990671B (en) Page type discrimination device and method and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination