CN112667874A

CN112667874A - Webpage data extraction method and device, electronic equipment and storage medium

Info

Publication number: CN112667874A
Application number: CN202011541079.0A
Authority: CN
Inventors: 王大伟; 周威
Original assignee: OneConnect Financial Technology Co Ltd Shanghai
Current assignee: OneConnect Smart Technology Co Ltd; OneConnect Financial Technology Co Ltd Shanghai
Priority date: 2020-12-23
Filing date: 2020-12-23
Publication date: 2021-04-16
Also published as: WO2022134820A1

Abstract

The invention relates to the technical field of terminals, and provides a method and a device for extracting data of a webpage, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring HTML codes of a webpage to be extracted and analyzing the HTML codes into a first node DOM tree; obtaining all unordered list labels; traversing all list tags of each unordered list tag to obtain a traversal result, and selecting the DOM tree of the list tag with the most child nodes as a second node DOM tree; matching the third node DOM tree of each unselected list tag with the second node DOM tree to generate a fourth node DOM tree; generating a fifth node DOM tree according to the second node DOM tree and the fourth node DOM tree; and generating a new webpage to be extracted according to the DOM tree of the fifth node to extract the webpage characteristic data. According to the method and the device, after the DOM trees of all the unordered list tags are kept consistent, new webpages to be extracted are regenerated for extracting the webpage characteristic data, and the accuracy rate of data extraction is improved.

Description

Webpage data extraction method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of terminals, in particular to a method and a device for extracting data of a webpage, electronic equipment and a storage medium.

Background

With the rapid development of the internet, public web pages hold a large amount of valuable information. To acquire this information, the prior art extracts it from web pages by writing web collection programs. However, when list-type data is collected, the data displayed by each list item is often incomplete and the positions of the data fields differ from item to item, so data misalignment easily occurs after the extracted data is converted into a two-dimensional table.

Disclosure of Invention

In view of the above, it is necessary to provide a method and an apparatus for extracting data from a web page, an electronic device, and a storage medium, in which a new web page to be extracted is generated again to extract feature data of the web page after the DOM trees corresponding to each unordered list tag are kept consistent, so as to improve the accuracy of data extraction.

The first aspect of the present invention provides a data extraction method for a web page, where the method includes:

acquiring an HTML code in a source code of a webpage to be extracted, and analyzing the HTML code into a first node DOM tree;

analyzing the DOM tree of the first node to obtain all unordered list tags;

traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree;

matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;

generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;

and generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags, and extracting webpage characteristic data of the new webpage to be extracted.

Optionally, the matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result includes:

matching the first tag of the root node of the second node DOM tree with the second tag of the root node of the third node DOM tree of each unselected list tag;

when the first label is consistent with the second label, judging whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;

when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching the third tags of all the child nodes of the next level of the root node of the second node DOM tree with the fourth tags of all the child nodes of the same level of the third node DOM tree;

and when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes of the same level of the third node DOM tree, repeating the process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.

Optionally, the generating a fourth node DOM tree of each unselected list tag according to the matching result includes:

when the third labels of all the child nodes of the next level of the root node of the second node DOM tree are inconsistent with the fourth label of any child node of the same level of the third node DOM tree, identifying a left neighbor node and a right neighbor node of the fourth label;

when a left neighbor node is identified but a right neighbor node is not identified, inserting the fourth tag into the rightmost side of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or

when the left neighbor node is not identified but the right neighbor node is identified, inserting the fourth tag into the leftmost side of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or

when a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.

Optionally, the method further includes:

and when the root node of the DOM tree of the second node is a leaf node but the root node of the DOM tree of the third node is not the leaf node, taking the DOM tree of the third node as a DOM tree of a fourth node of each unselected list tag.

Optionally, the method further includes:

when the root node of the second node DOM tree is not a leaf node, but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;

and inserting the corresponding labels of all the child nodes into the positions corresponding to the DOM trees of the third nodes to obtain a new DOM tree of the third nodes, and taking the new DOM tree of the third nodes as a DOM tree of the fourth nodes of each unselected list label.

Optionally, the method further includes:

and when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is a fourth node DOM tree of each unselected list tag.

Optionally, the generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees includes:

and corresponding the second node DOM tree and all the fourth node DOM trees to corresponding positions in each unordered list tag to obtain a fifth node DOM tree of each unordered list tag.

A second aspect of the present invention provides an apparatus for extracting data from a web page, the apparatus comprising:

the acquisition module is used for acquiring HTML codes in source codes of the webpage to be extracted and analyzing the HTML codes into a first node DOM tree;

the analysis module is used for analyzing the DOM tree of the first node to obtain all unordered list tags;

the traversal module is used for traversing all the list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree;

the matching module is used for matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;

the generating module is used for generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all the fourth node DOM trees;

and the extraction module is used for generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags and extracting webpage characteristic data of the new webpage to be extracted.

A third aspect of the present invention provides an electronic device, which includes a processor and a memory, wherein the processor is configured to implement the data extraction method for the web page when executing the computer program stored in the memory.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data extraction method for a web page.

In summary, according to the data extraction method, the data extraction device, the electronic device, and the storage medium of the web page of the present invention, on one hand, by selecting the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree, that is, the seed node DOM tree, the comprehensiveness of the data in each list tag is ensured because the seed node DOM tree contains the most child nodes; on the other hand, the consistency of the DOM tree corresponding to each unordered list tag is ensured by matching the DOM tree of the third node of each unselected list tag in the traversal result with the DOM tree of the second node and generating the DOM tree of the fourth node of each unselected list tag according to the matching result, so that a new webpage to be extracted is generated again according to the new DOM tree, the phenomenon of missing fields cannot occur when webpage feature data is extracted on the new webpage to be extracted, the phenomenon of data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved; and finally, in the process of matching the third node DOM tree and the second node DOM tree of each unselected list tag in the traversal result, the specific inserting position is accurate by identifying the left neighbor node and the right neighbor node corresponding to the inconsistent nodes, and the consistency of the fourth node DOM tree of each list tag is ensured.

Drawings

Fig. 1 is a flowchart of a data extraction method for a web page according to an embodiment of the present invention.

Fig. 2 is a structural diagram of a data extraction device for web pages according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.

The following detailed description will further illustrate the invention in conjunction with the above-described figures.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Example one

In this embodiment, the data extraction method for the web page may be applied to an electronic device. For an electronic device that needs to perform data extraction for a web page, the data extraction function provided by the method of the present invention may be directly integrated on the electronic device, or may run on the electronic device in the form of a Software Development Kit (SDK).

As shown in fig. 1, the data extraction method for the web page specifically includes the following steps, and the order of the steps in the flowchart may be changed and some may be omitted according to different requirements.

S11, obtaining HTML codes in the source codes of the web pages to be extracted, and analyzing the HTML codes into a first node DOM tree.

In this embodiment, a link of a webpage to be extracted is received, a source code is downloaded according to the link, JavaScript and CSS codes are deleted from the source code, HTML codes are retained, and an HTML parser is used to parse the HTML code corresponding to the webpage to be extracted into a first node DOM tree according to a tag hierarchical relationship.
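Step S11 can be sketched with Python's standard `html.parser` module. This is only an illustration under stated assumptions: the `Node` class, the `DomBuilder` class and the `parse_html` function are hypothetical names that do not appear in the original text, the download step is omitted, and JavaScript and CSS are dropped by skipping `script` and `style` elements while the tree is built.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements without an end tag
SKIPPED_TAGS = {"script", "style"}  # JavaScript and CSS are discarded


class Node:
    """Hypothetical minimal DOM node: a tag, a parent, children and text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects that follows the tag hierarchy."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.skip_depth = 0  # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text += data.strip()


def parse_html(html):
    """Parse HTML source into a first node DOM tree (sketch)."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

A production implementation would more likely rely on a full HTML parser; the hand-written builder above only shows how the tag hierarchy maps onto a tree.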

S12, analyzing the first node DOM tree to obtain all unordered list labels.

In this embodiment, the unordered list tag refers to a UL tag. After the first node DOM tree is obtained, the first node DOM tree is parsed to obtain the unordered list tags of the webpage to be extracted, where the first node DOM tree may include a plurality of unordered list tags.
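Collecting the unordered list tags in step S12 might look like the following sketch. It assumes, for brevity, that the page is well-formed markup that `xml.etree.ElementTree` can parse; `find_unordered_lists` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def find_unordered_lists(root):
    """Return every <ul> element in document order, nested lists included."""
    return [el for el in root.iter() if el.tag.lower() == "ul"]


page = ET.fromstring(
    "<html><body>"
    "<ul><li>a</li></ul>"
    "<div><ul><li>b</li><li>c</li></ul></div>"
    "</body></html>"
)
lists = find_unordered_lists(page)
```

Each element returned here is then processed independently in the following steps.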

S13, traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting the DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree.

In this embodiment, the list tag (li tag) refers to a sub-tag at the next level below the unordered list tag, and each unordered list tag may include a plurality of list tags. Each list tag corresponding to each unordered list tag is traversed, and the DOM tree corresponding to the list tag with the most child nodes is selected from the traversal result and used as the second node DOM tree.

In this embodiment, the DOM tree corresponding to the list tag having the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree. Since the seed node DOM tree has the most child nodes, the comprehensiveness of the data in each list tag is ensured.
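The seed selection of step S13 can be illustrated as follows. The text does not say whether "most child nodes" counts direct children or all descendants; this sketch assumes all descendants, and `count_descendants` and `select_seed` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def count_descendants(element):
    """Number of nodes below element (the element itself is not counted)."""
    return sum(1 for _ in element.iter()) - 1


def select_seed(ul):
    """Return (seed, others): the <li> with the most nodes and the remaining <li>s."""
    items = [child for child in ul if child.tag == "li"]
    if not items:
        return None, []
    seed = max(items, key=count_descendants)  # the first item wins a tie
    others = [li for li in items if li is not seed]
    return seed, others


ul = ET.fromstring(
    "<ul>"
    "<li><a/><b/></li>"
    "<li><a/><b/><c><d/></c></li>"
    "<li><a/></li>"
    "</ul>"
)
seed, others = select_seed(ul)
```

In this sample the second item becomes the seed (the second node DOM tree), and the other two items supply the third node DOM trees.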

S14, matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result.

In this embodiment, in order to ensure the comprehensiveness of the data in the DOM tree of each list tag, the third node DOM tree of each unselected list tag is matched with the second node DOM tree of the list tag having the largest number of child nodes, and the third node DOM tree is updated, so that each fourth node DOM tree is consistent with the second node DOM tree, the phenomenon of data misalignment after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved.

Preferably, the matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and the generating a fourth node DOM tree of each unselected list tag according to the matching result includes:

In this embodiment, to ensure that each fourth node DOM tree is consistent with the second node DOM tree, the first tag of the root node of the second node DOM tree is first matched with the second tag of the root node of the third node DOM tree of each unselected list tag. If the first tag is consistent with the second tag, the tags of the root nodes of the second node DOM tree and the third node DOM tree are determined to be consistent, and it is then determined whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes. Here, a leaf node is an end node, that is, a node without child nodes.

When the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, it is further determined whether the third tags of all the child nodes at the next level below the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes at the same level of the third node DOM tree. If they are consistent, the second node DOM tree is determined to be consistent with the third node DOM tree at that level, and the process is repeated level by level until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.

If they are inconsistent, the second node DOM tree is determined to be inconsistent with the third node DOM tree, the second node DOM tree and the third node DOM tree are traversed to find the inconsistent nodes, and the second node DOM tree or the third node DOM tree is updated according to the tags corresponding to the inconsistent nodes.
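The level-by-level comparison described above can be sketched as a recursive check. The function name `trees_match` is hypothetical, and only tag names are compared, as in the text; attributes and text content are ignored.

```python
import xml.etree.ElementTree as ET


def trees_match(seed, other):
    """Return True when both trees carry the same tags at every level."""
    if seed.tag != other.tag:  # first tag versus second tag (root nodes)
        return False
    seed_children, other_children = list(seed), list(other)
    if not seed_children and not other_children:  # both roots are leaf nodes
        return True
    # Third tags versus fourth tags: the next level must agree tag by tag.
    if [c.tag for c in seed_children] != [c.tag for c in other_children]:
        return False
    # Repeat the process one level down until leaf nodes are reached.
    return all(trees_match(s, o) for s, o in zip(seed_children, other_children))


seed = ET.fromstring("<li><a><b/></a><span/></li>")
same = ET.fromstring("<li><a><b/></a><span/></li>")
missing_span = ET.fromstring("<li><a><b/></a></li>")
```

Here `trees_match(seed, same)` holds, while `trees_match(seed, missing_span)` does not, which is the situation in which one of the trees has to be updated.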

Specifically, when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are inconsistent with the fourth tag of any child node of the same level of the third node DOM tree, it is determined that the third node DOM tree needs to be updated.

Specifically, the generating a DOM tree of a fourth node of each unselected list tag according to the matching result includes:

identifying a left neighbor node and a right neighbor node of the fourth label;

In this embodiment, in the update process of the DOM tree of the third node, the left neighbor node and the right neighbor node corresponding to the inconsistent node need to be identified, so that the specific insertion position is accurate, and the consistency of the DOM tree of the fourth node of each list tag is ensured.
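One possible reading of the neighbor-based insertion is sketched below for a single level of the tree. It assumes that a tag present in the second node DOM tree but absent from the third node DOM tree is inserted into the third node DOM tree as an empty placeholder, positioned relative to the nearest already-matched left and right neighbors. The name `align_children` and the empty-placeholder convention are assumptions, not details taken from the text.

```python
import xml.etree.ElementTree as ET


def align_children(seed, target):
    """Insert empty placeholders into target for child tags only seed has."""
    pos = 0  # index in target just after the last matched or inserted child
    for seed_child in seed:
        children = list(target)
        match = next(
            (i for i in range(pos, len(children)) if children[i].tag == seed_child.tag),
            None,
        )
        if match is not None:  # the tag already exists: move past it
            pos = match + 1
            continue
        left = children[pos - 1] if pos > 0 else None
        right = children[pos] if pos < len(children) else None
        placeholder = ET.Element(seed_child.tag)
        if left is not None and right is None:
            target.append(placeholder)  # only a left neighbor: rightmost position
        elif left is None and right is not None:
            target.insert(0, placeholder)  # only a right neighbor: leftmost position
        else:
            target.insert(pos, placeholder)  # between both neighbors (or empty target)
        pos += 1
    return target


seed = ET.fromstring("<li><img/><a/><span/><em/></li>")
third = ET.fromstring("<li><a/><span/></li>")
fourth = align_children(seed, third)
```

Children of the third node DOM tree that appear in a different order than in the seed are left where they are, so this sketch fills in missing fields but does not reorder existing ones.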

In some other embodiments, in the process of matching the child nodes of the list tag, the second node DOM tree is updated according to the child nodes corresponding to other unselected list tags to obtain a new second node DOM tree, and specifically, when a third tag of any child node at a next level of a root node of the second node DOM tree is inconsistent with fourth tags of all child nodes at the same level of the third node DOM tree, it is determined that the second node DOM tree needs to be updated.

Specifically, the updating process of the DOM tree of the second node includes:

identifying a left neighbor node and a right neighbor node of the third label;

when a left neighbor node is identified but a right neighbor node is not identified, inserting the third tag into the rightmost side of the left neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or

when the left neighbor node is not identified but the right neighbor node is identified, inserting the third tag into the leftmost side of the right neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or

when a left neighbor node and a right neighbor node are identified, inserting the third tag between the left neighbor node and the right neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes.

In this embodiment, when S14 or S15 is executed, the second node DOM tree is updated to the new second node DOM tree, so that missing fields are avoided in the process of extracting the webpage feature data, and the comprehensiveness of the data in each list tag is further improved.

Further, the method further comprises:

and when the first tag is inconsistent with the second tag, taking the third node DOM tree as a fourth node DOM tree of each unselected list tag.

In this embodiment, the fourth node DOM tree corresponding to each list tag can be quickly determined according to different determination criteria, so that the diversity of determining the fourth node DOM trees is improved.

S15, generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all the fourth node DOM trees.

In this embodiment, each unordered list corresponds to one node DOM tree, and the fifth node DOM tree of each unordered list tag is obtained by corresponding the second node DOM tree and all the fourth node DOM trees to the position corresponding to each unordered list, so that consistency of the webpage data to be extracted before and after matching is ensured.

In this embodiment, the list tag corresponding to the second node DOM tree and the position of that list tag in each unordered list tag are identified; the list tag corresponding to each fourth node DOM tree and the position of each such list tag in each unordered list tag are identified; and the second node DOM tree and all the fourth node DOM trees are then placed at the corresponding positions in each unordered list tag to obtain the fifth node DOM tree of each unordered list tag.
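Putting the aligned trees back at their original positions might be sketched as follows. The helper `rebuild_list` and the dictionary that maps a child index inside the unordered list to its aligned tree are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def rebuild_list(ul, aligned_by_index):
    """Replace the list item at each recorded index with its aligned tree."""
    for index, tree in aligned_by_index.items():
        ul[index] = tree  # item assignment keeps the original position
    return ul


ul = ET.fromstring(
    "<ul>"
    "<li><a>1</a></li>"
    "<li><a>2</a><b>x</b></li>"
    "</ul>"
)
# Hypothetical fourth node DOM tree for the first item; the seed stays as it is.
fourth = ET.fromstring("<li><a>1</a><b/></li>")
fifth = rebuild_list(ul, {0: fourth})
```

Because each tree is written back at the index it came from, the order of the list items in the page is the same before and after matching.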

S16, generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list labels, and extracting webpage characteristic data of the new webpage to be extracted.

In this embodiment, the new webpage to be extracted is obtained by parsing the first DOM tree corresponding to the webpage to be extracted and matching the DOM tree of each list tag corresponding to each unordered list tag.

Further, the method further comprises:

and after extracting the webpage feature data of the new webpage to be extracted, converting the extracted webpage feature data into a two-dimensional table.
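Once every list item has the same structure, the conversion to a two-dimensional table reduces to reading the items row by row. The sketch below assumes that each direct child of a list item is one field; `list_to_table` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def list_to_table(ul):
    """Return one row per <li> and one column per direct child of the <li>."""
    rows = []
    for li in ul:
        if li.tag != "li":
            continue
        rows.append(["".join(child.itertext()).strip() for child in li])
    return rows


aligned = ET.fromstring(
    "<ul>"
    "<li><a>Alice</a><span>30</span><em>NY</em></li>"
    "<li><a>Bob</a><span></span><em>LA</em></li>"
    "</ul>"
)
table = list_to_table(aligned)
```

The empty `span` in the second item stands for a field that was missing from the original page. It yields an empty cell, so the value in the last column stays in its own column instead of shifting left.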

In this embodiment, since the DOM tree of the third node of each unselected list tag in the traversal result is matched with the DOM tree of the second node, and the DOM tree of the fourth node of each unselected list tag is generated according to the matching result, the consistency of the DOM tree corresponding to each unordered list tag is ensured, so that a new webpage to be extracted is regenerated according to the new DOM tree, the phenomenon of missing fields cannot occur in extraction of webpage feature data on the new webpage to be extracted, the phenomenon of data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved.

In summary, in the data extraction method for the web page according to the embodiment, the HTML code in the source code of the web page to be extracted is acquired, and the HTML code is analyzed into the first node DOM tree; analyzing the DOM tree of the first node to obtain all unordered list tags; traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree; matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result; generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags, and extracting webpage characteristic data of the new webpage to be extracted.

In this embodiment, on one hand, by selecting the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree, that is, the seed node DOM tree, the comprehensiveness of the data in each list tag is ensured because the seed node DOM tree contains the most child nodes; on the other hand, the consistency of the DOM tree corresponding to each unordered list tag is ensured by matching the DOM tree of the third node of each unselected list tag in the traversal result with the DOM tree of the second node and generating the DOM tree of the fourth node of each unselected list tag according to the matching result, so that a new webpage to be extracted is generated again according to the new DOM tree, the phenomenon of missing fields cannot occur when webpage feature data is extracted on the new webpage to be extracted, the phenomenon of data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved; and finally, in the process of matching the third node DOM tree and the second node DOM tree of each unselected list tag in the traversal result, the specific inserting position is accurate by identifying the left neighbor node and the right neighbor node corresponding to the inconsistent nodes, and the consistency of the fourth node DOM tree of each list tag is ensured.

Example two

In some embodiments, the data extraction device 20 of the web page may include a plurality of functional modules composed of program code segments. The program codes of the various program segments in the data extraction device 20 of the web page may be stored in the memory of the electronic device and executed by the at least one processor to perform data extraction of the web page (see detailed description of fig. 1).

In this embodiment, the data extraction device 20 of the web page may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 201, a parsing module 202, a traversal module 203, a matching module 204, a determination module 205, a generation module 206 and an extraction module 207. A module referred to herein is a series of computer program segments that are stored in the memory, can be executed by at least one processor, and perform a fixed function. The functions of the modules are described in detail below.

The obtaining module 201 is configured to obtain an HTML code in a source code of a webpage to be extracted, and parse the HTML code into a first node DOM tree.

And the analyzing module 202 is configured to analyze the first node DOM tree to obtain all unordered list tags.

And the traversing module 203 is configured to traverse all the list tags corresponding to each unordered list tag to obtain a traversal result, and select, from the traversal result, a DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree.

And the matching module 204 is configured to match the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generate a fourth node DOM tree of each unselected list tag according to the matching result.

Preferably, the matching module 204 matches the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result includes:

When the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, it is further determined whether the third tags of all the child nodes at the next level below the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes at the same level of the third node DOM tree. If they are consistent, the second node DOM tree is determined to be consistent with the third node DOM tree at that level, and the process is repeated level by level until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes. If they are inconsistent, the second node DOM tree is determined to be inconsistent with the third node DOM tree, the second node DOM tree and the third node DOM tree are traversed to find the inconsistent nodes, and the second node DOM tree or the third node DOM tree is updated according to the tags corresponding to the inconsistent nodes.

Specifically, the step of the matching module 204 generating a fourth node DOM tree of each unselected list tag according to the matching result includes:

identifying a left neighbor node and a right neighbor node of the fourth label;

Specifically, the updating process of the DOM tree of the second node includes:

identifying a left neighbor node and a right neighbor node of the third label;

In this embodiment, the second node DOM tree is updated to the new second node DOM tree, so that missing fields are avoided in the process of extracting the webpage feature data, and the comprehensiveness of the data in each list tag is further improved.

And further, when the root node of the second node DOM tree is a leaf node but the root node of the third node DOM tree is not a leaf node, taking the third node DOM tree as a fourth node DOM tree of each unselected list tag.

Further, when the root node of the second node DOM tree is not a leaf node, but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree; and inserting the corresponding labels of all the child nodes into the positions corresponding to the DOM trees of the third nodes to obtain a new DOM tree of the third nodes, and taking the new DOM tree of the third nodes as a DOM tree of the fourth nodes of each unselected list label.

Further, the determining module 205 is configured to determine that the third node DOM tree is a fourth node DOM tree of each unselected list tag when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node.

And further, when the first tag is inconsistent with the second tag, taking the third node DOM tree as a fourth node DOM tree of each unselected list tag.
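The four cases above (both roots are leaf nodes, only one of them is a leaf node, or the root tags differ) can be combined into one routine that derives a fourth node DOM tree from a third node DOM tree. This is a sketch under assumptions: `build_fourth_tree` and `fill_missing` are hypothetical names, missing tags are added as empty elements, and the same rules are applied recursively at every level, whereas the text states them only for the root nodes.

```python
import copy
import xml.etree.ElementTree as ET


def fill_missing(seed, target):
    """Recursively add empty elements to target for child tags only seed has."""
    pos = 0  # index in target just after the last matched or inserted child
    for seed_child in seed:
        children = list(target)
        match = next(
            (i for i in range(pos, len(children)) if children[i].tag == seed_child.tag),
            None,
        )
        if match is None:
            child = ET.Element(seed_child.tag)
            target.insert(pos, child)
            pos += 1
        else:
            child = children[match]
            pos = match + 1
        fill_missing(seed_child, child)  # continue one level down


def build_fourth_tree(seed, third):
    """Return a new tree derived from third; third itself is not modified."""
    fourth = copy.deepcopy(third)
    if seed.tag != fourth.tag:
        return fourth  # root tags differ: the third tree is used unchanged
    if len(seed) == 0:
        return fourth  # seed root is a leaf node: nothing can be missing
    # Seed root is not a leaf node. If the third root is a leaf node, all seed
    # child tags are inserted; otherwise the children are aligned level by level.
    fill_missing(seed, fourth)
    return fourth


seed = ET.fromstring("<li><a><b/></a><span/><em/></li>")
third = ET.fromstring("<li><a/><em/></li>")
fourth = build_fourth_tree(seed, third)
```

Working on a deep copy keeps the original third node DOM tree available, which can be convenient when the result has to be compared with the page before matching.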

And a generating module 206, configured to generate a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees.

Optionally, the generating module 206 generates a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees, including:

And the extraction module 207 is configured to generate a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and extract webpage feature data of the new webpage to be extracted.

Further, after extracting the web page feature data of the new web page to be extracted, converting the extracted web page feature data into a two-dimensional table.

In summary, the data extraction device for a web page according to this embodiment obtains an HTML code in a source code of a web page to be extracted, and parses the HTML code into a first node DOM tree; analyzing the DOM tree of the first node to obtain all unordered list tags; traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree; matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result; generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags, and extracting webpage characteristic data of the new webpage to be extracted.

EXAMPLE III

Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.

It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not constitute a limitation of the embodiment of the present invention; the configuration may be a bus-type configuration or a star-type configuration, and the electronic device 3 may include more or fewer hardware or software components than those shown, or a different arrangement of components.

In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.

It should be noted that the electronic device 3 is only an example; other existing or future electronic products that can be adapted to the present invention should also be included in the scope of protection of the present invention, and are incorporated herein by reference.

In some embodiments, the memory 31 is used for storing program codes and various data, such as the data extraction device 20 of the web page installed in the electronic device 3, and realizes high-speed and automatic access to programs or data during the operation of the electronic device 3. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.

In some embodiments, the at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is the control unit of the electronic device 3; it connects the various components of the electronic device 3 by using various interfaces and lines, and executes the various functions of the electronic device 3 and processes its data by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31.

In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.

Although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component. Optionally, the power supply may be logically connected to the at least one processor 32 through a power management device, so that charging, discharging, and power consumption are managed through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and other such components. The electronic device 3 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor to execute parts of the methods according to the embodiments of the present invention.

In a further embodiment, in conjunction with fig. 2, the at least one processor 32 may execute the operating system of the electronic device 3 and various installed application programs (such as the data extraction device 20 of the web page), program codes, and the like, for example, the modules described above.

The memory 31 has program code stored therein, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules illustrated in fig. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to implement the functions of the modules for the purpose of data extraction of web pages.

In one embodiment of the invention, the memory 31 stores a plurality of instructions that are executed by the at least one processor 32 to implement the functionality of data extraction for web pages.

Specifically, for the implementation of the above instructions by the at least one processor 32, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, and details are not repeated here.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division of the modules is only one logical functional division, and other divisions may be adopted in actual implementation.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the present invention may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A data extraction method of a webpage is characterized by comprising the following steps:

analyzing the DOM tree of the first node to obtain all unordered list tags;

2. The method for extracting data from a web page according to claim 1, wherein the matching the DOM tree of the third node of each unselected list tag in the traversal result with the DOM tree of the second node, and the generating the DOM tree of the fourth node of each unselected list tag according to the matching result comprises:

3. The method for extracting data from web pages according to claim 1, wherein the generating a DOM tree of a fourth node for each unselected list tag according to the matching result comprises:

4. The method for extracting data from a web page according to claim 2, wherein the method further comprises:

5. The method for extracting data from a web page according to claim 2, wherein the method further comprises:

6. The method for extracting data from a web page according to claim 2, wherein the method further comprises:

7. The method for extracting data from a web page according to any one of claims 1 to 6, wherein said generating a fifth-node DOM tree for each unordered list tag from said second-node DOM tree and all fourth-node DOM trees comprises:

8. An apparatus for extracting data from a web page, the apparatus comprising:

9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement the data extraction method of the web page according to any one of claims 1 to 7 when executing the computer program stored in the memory.

10. A computer-readable storage medium, on which a computer program is stored, the computer program, when being executed by a processor, implementing a data extraction method for a web page according to any one of claims 1 to 7.