CN112667874A - Webpage data extraction method and device, electronic equipment and storage medium
- Publication number
- CN112667874A (application CN202011541079.0A)
- Authority
- CN
- China
- Prior art keywords
- node
- dom tree
- tag
- list
- list tag
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/154—Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to the technical field of terminals, and provides a method and a device for extracting data of a webpage, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring HTML codes of a webpage to be extracted and analyzing the HTML codes into a first node DOM tree; obtaining all unordered list labels; traversing all list tags of each unordered list tag to obtain a traversal result, and selecting the DOM tree of the list tag with the most child nodes as a second node DOM tree; matching the third node DOM tree of each unselected list tag with the second node DOM tree to generate a fourth node DOM tree; generating a fifth node DOM tree according to the second node DOM tree and the fourth node DOM tree; and generating a new webpage to be extracted according to the DOM tree of the fifth node to extract the webpage characteristic data. According to the method and the device, after the DOM trees of all the unordered list tags are kept consistent, new webpages to be extracted are regenerated for extracting the webpage characteristic data, and the accuracy rate of data extraction is improved.
Description
Technical Field
The invention relates to the technical field of terminals, in particular to a method and a device for extracting data of a webpage, electronic equipment and a storage medium.
Background
With the rapid development of the internet, public webpages hold a large amount of valuable information. In the prior art, this information is extracted from the webpage by writing a web collection program. However, when list-type data is collected, individual list items often display incomplete data and place data fields at different positions, so the extracted data is easily misaligned once it is converted into a two-dimensional table.
Disclosure of Invention
In view of the above, it is necessary to provide a method and an apparatus for extracting data from a web page, an electronic device, and a storage medium, in which a new web page to be extracted is generated again to extract feature data of the web page after the DOM trees corresponding to each unordered list tag are kept consistent, so as to improve the accuracy of data extraction.
The first aspect of the present invention provides a data extraction method for a web page, where the method includes:
acquiring an HTML code in a source code of a webpage to be extracted, and analyzing the HTML code into a first node DOM tree;
analyzing the DOM tree of the first node to obtain all unordered list tags;
traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree;
matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;
generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
and generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags, and extracting webpage characteristic data of the new webpage to be extracted.
Optionally, the matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result includes:
matching the first tag of the root node of the second node DOM tree with the second tag of the root node of the third node DOM tree of each unselected list tag;
when the first tag is consistent with the second tag, judging whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching the third tags of all the child nodes of the next level of the root node of the second node DOM tree with the fourth tags of all the child nodes of the same level of the third node DOM tree;
and when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes of the same level of the third node DOM tree, repeating the process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
Optionally, the generating a fourth node DOM tree of each unselected list tag according to the matching result includes:
when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are inconsistent with the fourth tag of any child node of the same level of the third node DOM tree, identifying a left neighbor node and a right neighbor node of the fourth tag;
when a left neighbor node is identified but a right neighbor node is not identified, inserting the fourth tag immediately to the right of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when the left neighbor node is not identified but the right neighbor node is identified, inserting the fourth tag immediately to the left of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
Optionally, the method further includes:
and when the root node of the DOM tree of the second node is a leaf node but the root node of the DOM tree of the third node is not the leaf node, taking the DOM tree of the third node as a DOM tree of a fourth node of each unselected list tag.
Optionally, the method further includes:
when the root node of the second node DOM tree is not a leaf node, but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
and inserting the corresponding labels of all the child nodes into the positions corresponding to the DOM trees of the third nodes to obtain a new DOM tree of the third nodes, and taking the new DOM tree of the third nodes as a DOM tree of the fourth nodes of each unselected list label.
Optionally, the method further includes:
and when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is a fourth node DOM tree of each unselected list tag.
Optionally, the generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees includes:
and placing the second node DOM tree and all the fourth node DOM trees at their corresponding positions in each unordered list tag to obtain a fifth node DOM tree of each unordered list tag.
A second aspect of the present invention provides an apparatus for extracting data from a web page, the apparatus comprising:
the acquisition module is used for acquiring HTML codes in source codes of the webpage to be extracted and analyzing the HTML codes into a first node DOM tree;
the analysis module is used for analyzing the DOM tree of the first node to obtain all unordered list tags;
the traversal module is used for traversing all the list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree;
the matching module is used for matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;
the generating module is used for generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all the fourth node DOM trees;
and the extraction module is used for generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags and extracting webpage characteristic data of the new webpage to be extracted.
A third aspect of the present invention provides an electronic device, which includes a processor and a memory, wherein the processor is configured to implement the data extraction method for the web page when executing the computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data extraction method for a web page.
In summary, according to the data extraction method, the data extraction device, the electronic device, and the storage medium of the web page of the present invention, on one hand, by selecting the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree, that is, the seed node DOM tree, the comprehensiveness of the data in each list tag is ensured because the seed node DOM tree contains the most child nodes; on the other hand, the consistency of the DOM tree corresponding to each unordered list tag is ensured by matching the DOM tree of the third node of each unselected list tag in the traversal result with the DOM tree of the second node and generating the DOM tree of the fourth node of each unselected list tag according to the matching result, so that a new webpage to be extracted is generated again according to the new DOM tree, the phenomenon of missing fields cannot occur when webpage feature data is extracted on the new webpage to be extracted, the phenomenon of data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved; and finally, in the process of matching the third node DOM tree and the second node DOM tree of each unselected list tag in the traversal result, the specific inserting position is accurate by identifying the left neighbor node and the right neighbor node corresponding to the inconsistent nodes, and the consistency of the fourth node DOM tree of each list tag is ensured.
Drawings
Fig. 1 is a flowchart of a data extraction method for a web page according to an embodiment of the present invention.
Fig. 2 is a structural diagram of a data extraction device for web pages according to a second embodiment of the present invention.
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
The following detailed description will further illustrate the invention in conjunction with the above-described figures.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of a data extraction method for a web page according to an embodiment of the present invention.
In this embodiment, the data extraction method for the web page may be applied to an electronic device. For an electronic device that needs to extract webpage data, the data extraction function provided by the method of the present invention may be integrated directly on the electronic device, or may run on the electronic device in the form of a Software Development Kit (SDK).
As shown in fig. 1, the data extraction method for the web page specifically includes the following steps, and the order of the steps in the flowchart may be changed and some may be omitted according to different requirements.
S11, obtaining HTML codes in the source codes of the web pages to be extracted, and analyzing the HTML codes into a first node DOM tree.
In this embodiment, a link of a webpage to be extracted is received, a source code is downloaded according to the link, JavaScript and CSS codes are deleted from the source code, HTML codes are retained, and an HTML parser is used to parse the HTML code corresponding to the webpage to be extracted into a first node DOM tree according to a tag hierarchical relationship.
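Step S11 can be illustrated with a minimal sketch built on Python's standard `html.parser` module. The patent does not name a parser; the `Node` class, the `TreeBuilder` class, the `parse_html` function and the `VOID_TAGS` set below are hypothetical names introduced only for illustration, and downloading the source code from the link is omitted.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    """Hypothetical DOM node: a tag name, child nodes and direct text."""
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


# Elements that never have a closing tag (partial list, an assumption).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# JavaScript and CSS are dropped, as described in S11.
SKIPPED_TAGS = {"script", "style"}


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.stack[-1].text += data.strip()


def parse_html(html):
    """Parse HTML text into a Node tree (the 'first node DOM tree')."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `parse_html(source)` returns a synthetic `document` root whose descendants follow the tag hierarchy of the page, with script and style content left out.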
S12, analyzing the first node DOM tree to obtain all unordered list labels.
In this embodiment, the unordered list tag refers to a UL tag. After the first node DOM tree is obtained, it is parsed to obtain the unordered list tags of the webpage to be extracted; the first node DOM tree may include a plurality of unordered list tags.
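As a rough sketch of S12, the unordered list tags can be collected by walking the tree in document order. The `Node` class and the `find_unordered_lists` function are hypothetical names used only for this illustration; nested lists are returned as well, which is an assumption the patent does not address.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node with a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def find_unordered_lists(root):
    """Return every node tagged 'ul' in document order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == "ul":
            found.append(node)
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))
    return found
```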
S13, traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting the DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree.
In this embodiment, the list tag (li tag) refers to a sub-tag at the next level below the unordered list tag, and each unordered list tag may include a plurality of list tags. Each list tag corresponding to each unordered list tag is traversed, and the DOM tree corresponding to the list tag with the most child nodes is selected from the traversal result as the second node DOM tree.
In this embodiment, the DOM tree corresponding to the list tag having the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree. Because the seed node DOM tree has the most child nodes, the comprehensiveness of the data in each list tag is ensured.
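The seed selection in S13 might look like the sketch below. The text does not say whether "most child nodes" counts direct children or all descendants; this sketch assumes all descendants. `Node`, `count_descendants` and `select_seed` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node with a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def count_descendants(node):
    """Count every node below `node` (assumed meaning of 'child nodes')."""
    return sum(1 + count_descendants(child) for child in node.children)


def select_seed(ul):
    """Split the li children of a ul into (seed, unselected items).

    The seed is the li with the most descendants; ties go to the first one.
    """
    items = [child for child in ul.children if child.tag == "li"]
    if not items:
        return None, []
    seed = max(items, key=count_descendants)
    others = [item for item in items if item is not seed]
    return seed, others
```

The returned `others` list corresponds to the unselected list tags whose third node DOM trees are matched against the seed in S14.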
S14, matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result.
In this embodiment, in order to ensure the comprehensiveness of the data in the DOM tree of each list tag, the third node DOM tree of each unselected list tag is matched with the second node DOM tree of the list tag having the largest number of child nodes, and the third node DOM tree is updated, so that each fourth node DOM tree is consistent with the second node DOM tree, the phenomenon of data misalignment after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved.
Preferably, the matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and the generating a fourth node DOM tree of each unselected list tag according to the matching result includes:
matching the first tag of the root node of the second node DOM tree with the second tag of the root node of the third node DOM tree of each unselected list tag;
when the first tag is consistent with the second tag, judging whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching the third tags of all the child nodes of the next level of the root node of the second node DOM tree with the fourth tags of all the child nodes of the same level of the third node DOM tree;
and when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes of the same level of the third node DOM tree, repeating the process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
In this embodiment, in order to ensure consistency between the second node DOM tree and each fourth node DOM tree, the first tag of the root node of the second node DOM tree is first matched with the second tag of the root node of the third node DOM tree of each unselected list tag. If the first tag is consistent with the second tag, it is determined that the tags of the root nodes of the second node DOM tree and the third node DOM tree are consistent, and it is then judged whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes. Specifically, a root node being a leaf node means that it is a terminal node, that is, it has no child nodes.
When the judgment result is that neither the root node of the second node DOM tree nor the root node of the third node DOM tree is a leaf node, it is further judged whether the third tags of all the child nodes at the next level of the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes at the same level of the third node DOM tree. If they are consistent, the second node DOM tree and the third node DOM tree are determined to be consistent at that level, and the process is repeated level by level until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
If they are inconsistent, it is determined that the second node DOM tree is inconsistent with the third node DOM tree; the second node DOM tree and the third node DOM tree are traversed to find the inconsistent nodes, and the second node DOM tree or the third node DOM tree is updated according to the tags corresponding to the inconsistent nodes.
Specifically, when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are inconsistent with the fourth tag of any child node of the same level of the third node DOM tree, it is determined that the third node DOM tree needs to be updated.
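The level-by-level consistency check described above can be sketched as a recursive comparison of tags. This is an illustrative reading, not the patented implementation: `Node` and `trees_consistent` are hypothetical names, and two levels are treated as consistent only when their child tags agree in both number and order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node with a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def trees_consistent(seed, other):
    """Return True when both trees carry the same tags at every level."""
    if seed.tag != other.tag:
        return False
    if len(seed.children) != len(other.children):
        return False
    # Two leaf nodes with equal tags end the recursion with True.
    return all(
        trees_consistent(seed_child, other_child)
        for seed_child, other_child in zip(seed.children, other.children)
    )
```

A `False` result corresponds to the case in which the third node DOM tree needs to be updated.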
Specifically, the generating a DOM tree of a fourth node of each unselected list tag according to the matching result includes:
identifying a left neighbor node and a right neighbor node of the fourth label;
when a left neighbor node is identified but a right neighbor node is not identified, inserting the fourth tag immediately to the right of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when the left neighbor node is not identified but the right neighbor node is identified, inserting the fourth tag immediately to the left of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
In this embodiment, in the update process of the DOM tree of the third node, the left neighbor node and the right neighbor node corresponding to the inconsistent node need to be identified, so that the specific insertion position is accurate, and the consistency of the DOM tree of the fourth node of each list tag is ensured.
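One possible reading of the neighbor-based insertion is sketched below for a single level of the tree. It assumes that sibling tags are unique within a list item (repeated sibling tags would need a real sequence alignment) and that a missing field is represented by an empty placeholder node; neither assumption comes from the patent. `Node` and `insert_missing_children` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node with a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def insert_missing_children(seed, target):
    """Insert empty placeholders into `target` for seed child tags it lacks.

    The position comes from the nearest left neighbor in the seed that
    already exists in the target, otherwise from the nearest right neighbor.
    """
    for index, seed_child in enumerate(seed.children):
        tags = [child.tag for child in target.children]
        if seed_child.tag in tags:
            continue
        left = next(
            (c.tag for c in reversed(seed.children[:index]) if c.tag in tags),
            None,
        )
        right = next(
            (c.tag for c in seed.children[index + 1:] if c.tag in tags),
            None,
        )
        placeholder = Node(seed_child.tag)
        if left is not None:
            # Covers both "left only" and "left and right" cases.
            target.children.insert(tags.index(left) + 1, placeholder)
        elif right is not None:
            target.children.insert(tags.index(right), placeholder)
        else:
            target.children.append(placeholder)
    return target
```

Because placeholders are inserted from left to right, every later missing tag finds its left neighbor already in place, so the target ends up with the same child order as the seed.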
In some other embodiments, in the process of matching the child nodes of the list tag, the second node DOM tree is updated according to the child nodes corresponding to other unselected list tags to obtain a new second node DOM tree. Specifically, when a third tag of any child node at the next level of the root node of the second node DOM tree is inconsistent with the fourth tags of all child nodes at the same level of the third node DOM tree, it is determined that the second node DOM tree needs to be updated.
Specifically, the updating process of the DOM tree of the second node includes:
identifying a left neighbor node and a right neighbor node of the third label;
when a left neighbor node is identified but a right neighbor node is not identified, inserting the third tag immediately to the right of the left neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or
when the left neighbor node is not identified but the right neighbor node is identified, inserting the third tag immediately to the left of the right neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or
when a left neighbor node and a right neighbor node are identified, inserting the third tag between the left neighbor node and the right neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes.
In this embodiment, when S14 or S15 is executed, the second node DOM tree is replaced by the new second node DOM tree, so that a field missing phenomenon in the process of extracting the feature data of the web page is avoided, and the comprehensiveness of the data in each list tag is further improved.
Further, the method further comprises:
and when the root node of the DOM tree of the second node is a leaf node but the root node of the DOM tree of the third node is not the leaf node, taking the DOM tree of the third node as a DOM tree of a fourth node of each unselected list tag.
Further, the method further comprises:
when the root node of the second node DOM tree is not a leaf node, but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
and inserting the corresponding labels of all the child nodes into the positions corresponding to the DOM trees of the third nodes to obtain a new DOM tree of the third nodes, and taking the new DOM tree of the third nodes as a DOM tree of the fourth nodes of each unselected list label.
Further, the method further comprises:
and when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is a fourth node DOM tree of each unselected list tag.
Further, the method further comprises:
and when the first tag is inconsistent with the second tag, taking the third node DOM tree as a fourth node DOM tree of each unselected list tag.
In this embodiment, the fourth node DOM tree corresponding to each list tag can be quickly determined according to different determination criteria, so that the diversity of determining the fourth node DOM trees is improved.
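The case analysis above can be summarized as a small decision function. It only classifies which branch applies and leaves the actual tree update aside; `Node`, `decide_action` and the returned action names are hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node with a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def decide_action(seed_root, target_root):
    """Classify how the fourth node DOM tree is obtained for one list tag."""
    if seed_root.tag != target_root.tag:
        # First tag and second tag differ: keep the third tree as is.
        return "keep_target"
    seed_is_leaf = not seed_root.children
    target_is_leaf = not target_root.children
    if seed_is_leaf:
        # Seed root is a leaf, whether or not the target root is one.
        return "keep_target"
    if target_is_leaf:
        # Seed root has children but the target root has none.
        return "copy_seed_child_tags"
    # Neither root is a leaf: compare the next level of child tags.
    return "match_children"
```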
S15, generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all the fourth node DOM trees.
In this embodiment, each unordered list corresponds to one node DOM tree. The fifth node DOM tree of each unordered list tag is obtained by placing the second node DOM tree and all the fourth node DOM trees at the positions they occupy in that unordered list, so that the webpage data to be extracted stays consistent before and after matching.
Optionally, the generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees includes:
and placing the second node DOM tree and all the fourth node DOM trees at their corresponding positions in each unordered list tag to obtain a fifth node DOM tree of each unordered list tag.
In this embodiment, the list tag corresponding to the second node DOM tree and its position in each unordered list tag are identified; the list tag corresponding to each fourth node DOM tree and its position in each unordered list tag are identified; the second node DOM tree and all the fourth node DOM trees are then placed at the corresponding positions in each unordered list tag to obtain the fifth node DOM tree of each unordered list tag.
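A minimal sketch of this reassembly step, assuming the aligned trees are tracked by their original position inside the unordered list: `Node`, `rebuild_list` and the `replacements` mapping are hypothetical names, not terms from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node with a tag name and child nodes."""
    tag: str
    children: list = field(default_factory=list)


def rebuild_list(ul, replacements):
    """Build the 'fifth node DOM tree' for one unordered list.

    `replacements` maps a child's original index in `ul` to its aligned
    tree; children without an entry (such as the seed) are kept unchanged.
    """
    children = [
        replacements.get(index, child)
        for index, child in enumerate(ul.children)
    ]
    return Node(ul.tag, children)
```

Keeping the original index as the key preserves the order of the list items, so the rebuilt list differs from the source list only in the inserted tags.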
S16, generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list labels, and extracting webpage characteristic data of the new webpage to be extracted.
In this embodiment, the new webpage to be extracted is obtained by parsing the first node DOM tree corresponding to the webpage to be extracted and matching the DOM tree of each list tag corresponding to each unordered list tag.
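Generating the new webpage amounts to serializing the updated tree back into HTML. The sketch below is a simplified illustration that ignores attributes and void elements; `Node` and `to_html` are hypothetical names.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """Hypothetical DOM node: a tag name, child nodes and direct text."""
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


def to_html(node):
    """Serialize a Node tree to an HTML string (attributes are ignored)."""
    inner = escape(node.text) + "".join(to_html(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"
```

A placeholder inserted during matching is written out as an empty element, for example `<em></em>`, so every list item in the new page exposes the same sequence of tags.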
Further, the method further comprises:
and after extracting the webpage feature data of the new webpage to be extracted, converting the extracted webpage feature data into a two-dimensional table.
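The conversion into a two-dimensional table can be sketched as follows, assuming each leaf node of a list item holds one field and that aligned list items share the same leaf order. `Node`, `leaf_texts` and `to_table` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node: a tag name, child nodes and direct text."""
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


def leaf_texts(node):
    """Collect the text of every leaf node in document order."""
    if not node.children:
        return [node.text]
    row = []
    for child in node.children:
        row.extend(leaf_texts(child))
    return row


def to_table(list_items):
    """Turn aligned list items into rows of a two-dimensional table."""
    return [leaf_texts(item) for item in list_items]
```

Because placeholders contribute an empty string, a list item that lacked a field still produces a row of the same length, which is what keeps the columns from shifting.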
In this embodiment, since the DOM tree of the third node of each unselected list tag in the traversal result is matched with the DOM tree of the second node, and the DOM tree of the fourth node of each unselected list tag is generated according to the matching result, the consistency of the DOM tree corresponding to each unordered list tag is ensured, so that a new webpage to be extracted is regenerated according to the new DOM tree, the phenomenon of missing fields cannot occur in extraction of webpage feature data on the new webpage to be extracted, the phenomenon of data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved.
In summary, in the data extraction method for the web page according to the embodiment, the HTML code in the source code of the web page to be extracted is acquired, and the HTML code is analyzed into the first node DOM tree; analyzing the DOM tree of the first node to obtain all unordered list tags; traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree; matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result; generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags, and extracting webpage characteristic data of the new webpage to be extracted.
In this embodiment, on one hand, the DOM tree corresponding to the list tag with the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree; because the seed node DOM tree contains the most child nodes, the comprehensiveness of the data in each list tag is ensured. On the other hand, the DOM tree of the third node of each unselected list tag in the traversal result is matched with the DOM tree of the second node, and the DOM tree of the fourth node of each unselected list tag is generated according to the matching result, which ensures the consistency of the DOM tree corresponding to each unordered list tag. A new webpage to be extracted is then regenerated according to the new DOM tree, so that no fields are missing when webpage feature data is extracted from the new webpage to be extracted, data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved. Finally, in the process of matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, identifying the left neighbor node and the right neighbor node corresponding to the inconsistent nodes makes the specific insertion position accurate and ensures the consistency of the fourth node DOM tree of each list tag.
Example Two
Fig. 2 is a structural diagram of a data extraction device for web pages according to a second embodiment of the present invention.
In some embodiments, the data extraction device 20 of the web page may include a plurality of functional modules composed of program code segments. The program codes of the various program segments in the data extraction device 20 of the web page may be stored in the memory of the electronic device and executed by the at least one processor to perform data extraction of the web page (see detailed description of fig. 1).
In this embodiment, the data extraction device 20 of the web page may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an acquisition module 201, a parsing module 202, a traversal module 203, a matching module 204, a determination module 205, a generation module 206 and an extraction module 207. A module as referred to herein is a series of computer program segments that is stored in the memory, can be executed by at least one processor, and performs a fixed function. The functions of the modules are described in detail below.
The acquisition module 201 is configured to acquire the HTML code in the source code of a webpage to be extracted, and parse the HTML code into a first node DOM tree.
In this embodiment, a link of a webpage to be extracted is received, a source code is downloaded according to the link, JavaScript and CSS codes are deleted from the source code, HTML codes are retained, and an HTML parser is used to parse the HTML code corresponding to the webpage to be extracted into a first node DOM tree according to a tag hierarchical relationship.
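The acquisition step above can be sketched as follows. This is a minimal illustration that uses only Python's standard `html.parser`, not the parser of the embodiment. The names `Node`, `TreeBuilder` and `parse_html` are hypothetical, downloading the source code from the link is omitted, and JavaScript and CSS are dropped by skipping `script` and `style` elements while the tree is built.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML, ignoring script and style elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend: later tags become children

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text += data.strip()


def parse_html(html):
    """Return the root of the first node DOM tree for the given HTML text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

The tree keeps only tag names, text and the parent/child hierarchy, which is all the later matching steps rely on.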
The parsing module 202 is configured to parse the first node DOM tree to obtain all unordered list tags.
In this embodiment, the unordered list tag refers to a ul tag. After the first node DOM tree is obtained, it is parsed to obtain the unordered list tags of the webpage to be extracted; the first node DOM tree may include a plurality of unordered list tags.
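Collecting the unordered list tags can be sketched as a pre-order walk over the tree. The `Node` dataclass and the function name `find_unordered_lists` are assumptions made for this illustration; the embodiment does not specify a traversal order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)


def find_unordered_lists(root):
    """Return every ul node under root, in document (pre-order) order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == "ul":
            found.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return found
```

Nested unordered lists are returned as separate entries, so each one can be processed independently.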
The traversal module 203 is configured to traverse all the list tags corresponding to each unordered list tag to obtain a traversal result, and select, from the traversal result, the DOM tree corresponding to the list tag with the most child nodes as a second node DOM tree.
In this embodiment, the list tag (li tag) refers to a sub-tag at the next level below the unordered list tag. Each unordered list tag may include a plurality of list tags. Each list tag corresponding to each unordered list tag is traversed, and the DOM tree corresponding to the list tag with the most child nodes is selected from the traversal result and used as the second node DOM tree.
In this embodiment, the DOM tree corresponding to the list tag having the most child nodes is selected as the second node DOM tree, that is, the seed node DOM tree. Because the seed node DOM tree has the most child nodes, the comprehensiveness of the data in each list tag is ensured.
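Seed selection can be sketched as below. The text does not say whether "most child nodes" counts only direct children or all descendants; this sketch assumes all descendants, and the names `count_descendants` and `select_seed` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)


def count_descendants(node):
    """Number of nodes below node, at any depth (an assumed metric)."""
    return sum(1 + count_descendants(child) for child in node.children)


def select_seed(ul):
    """Split the li children of ul into (seed, unselected list tags).

    The seed is the li with the most descendants; on a tie the first
    such li is chosen. Returns (None, []) when ul has no li children.
    """
    items = [child for child in ul.children if child.tag == "li"]
    if not items:
        return None, []
    seed = max(items, key=count_descendants)
    return seed, [li for li in items if li is not seed]
```

The seed corresponds to the second node DOM tree, and each remaining `li` corresponds to a third node DOM tree.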
The matching module 204 is configured to match the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generate a fourth node DOM tree of each unselected list tag according to the matching result.
In this embodiment, in order to ensure the comprehensiveness of the data in the DOM tree of each list tag, the third node DOM tree of each unselected list tag is matched with the second node DOM tree of the list tag having the largest number of child nodes, and the third node DOM tree is updated, so that each fourth node DOM tree is consistent with the second node DOM tree, the phenomenon of data misalignment after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved.
Preferably, the matching module 204 matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result, includes:
matching the first tag of the root node of the second node DOM tree with the second tag of the root node of the third node DOM tree of each unselected list tag;
when the first tag is consistent with the second tag, judging whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching the third tags of all the child nodes of the next level of the root node of the second node DOM tree with the fourth tags of all the child nodes of the same level of the third node DOM tree;
and when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes of the same level of the third node DOM tree, repeating the process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
In this embodiment, in order to ensure consistency between the second node DOM tree and each fourth node DOM tree, the first tag of the root node of the second node DOM tree is first matched with the second tag of the root node of the third node DOM tree of each unselected list tag. If the first tag is consistent with the second tag, the tags of the root nodes of the two trees are determined to be consistent, and it is then judged whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes. A leaf node is an end node, that is, a node with no child nodes.
When the judgment result shows that neither the root node of the second node DOM tree nor the root node of the third node DOM tree is a leaf node, it is further judged whether the third tags of all child nodes at the next level of the root node of the second node DOM tree are consistent with the fourth tags of all child nodes at the same level of the third node DOM tree. If so, the two trees are determined to be consistent at this level, and the process is repeated until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes. If not, the second node DOM tree is determined to be inconsistent with the third node DOM tree, both trees are traversed to find the inconsistent nodes, and the second node DOM tree or the third node DOM tree is updated according to the tags corresponding to the inconsistent nodes.
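The level-by-level tag comparison described above can be sketched as a recursive check. The function `tags_consistent` is a hypothetical name, and the sketch only reports whether two trees agree; it treats a leaf compared with a non-leaf as inconsistent, whereas the embodiment handles those cases separately.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)


def tags_consistent(seed, third):
    """True when both trees carry the same tags at every level."""
    # Step 1: compare the tags of the two root nodes.
    if seed.tag != third.tag:
        return False
    # Step 2: two leaves with equal tags are consistent.
    if not seed.children and not third.children:
        return True
    # Step 3: compare the tags of all children at the next level.
    if [c.tag for c in seed.children] != [c.tag for c in third.children]:
        return False
    # Step 4: repeat the process one level down, until leaves are reached.
    return all(tags_consistent(s, t)
               for s, t in zip(seed.children, third.children))
```

A `False` result corresponds to the branch in which the inconsistent nodes must be located and one of the trees updated.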
Specifically, when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are inconsistent with the fourth tag of any child node of the same level of the third node DOM tree, it is determined that the third node DOM tree needs to be updated.
Specifically, the step of the matching module 204 generating a fourth node DOM tree of each unselected list tag according to the matching result includes:
identifying a left neighbor node and a right neighbor node of the fourth tag;
when a left neighbor node is identified but a right neighbor node is not identified, inserting the fourth tag into the rightmost side of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when the left neighbor node is not identified but the right neighbor node is identified, inserting the fourth tag into the leftmost side of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
In this embodiment, in the update process of the DOM tree of the third node, the left neighbor node and the right neighbor node corresponding to the inconsistent node need to be identified, so that the specific insertion position is accurate, and the consistency of the DOM tree of the fourth node of each list tag is ensured.
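One possible reading of the neighbor-based insertion is sketched below. It assumes that the tag to insert is a seed child tag missing from the third node DOM tree, that an empty placeholder node is inserted for it, and that sibling tags are unique within a level. The function name `align_children` and all three assumptions are illustrative rather than taken from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)


def align_children(seed, third):
    """Insert a placeholder into third for each seed child tag it lacks."""
    seed_tags = [c.tag for c in seed.children]
    for i, tag in enumerate(seed_tags):
        present = [c.tag for c in third.children]
        if tag in present:
            continue
        # Nearest seed siblings on each side that also exist in third.
        left = next((t for t in reversed(seed_tags[:i]) if t in present), None)
        right = next((t for t in seed_tags[i + 1:] if t in present), None)
        if left is not None:
            # Left neighbor found: insert immediately to its right. When a
            # right neighbor also exists, this lands between the two.
            position = present.index(left) + 1
        elif right is not None:
            # Only a right neighbor found: insert immediately to its left.
            position = present.index(right)
        else:
            position = len(third.children)
        third.children.insert(position, Node(tag))
    return third
```

For example, with seed children `a, span, em` and third-tree children `a, em`, the placeholder for `span` is inserted between `a` and `em`, so both trees expose the same columns in the same order.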
In some other embodiments, in the process of matching the child nodes of the list tag, the second node DOM tree is updated according to the child nodes corresponding to other unselected list tags to obtain a new second node DOM tree, and specifically, when a third tag of any child node at a next level of a root node of the second node DOM tree is inconsistent with fourth tags of all child nodes at the same level of the third node DOM tree, it is determined that the second node DOM tree needs to be updated.
Specifically, the updating process of the DOM tree of the second node includes:
identifying a left neighbor node and a right neighbor node of the third tag;
when a left neighbor node is identified but a right neighbor node is not identified, inserting the third tag into the rightmost side of the left neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or
when the left neighbor node is not identified but the right neighbor node is identified, inserting the third tag into the leftmost side of the right neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes; or
when a left neighbor node and a right neighbor node are identified, inserting the third tag between the left neighbor node and the right neighbor node to obtain a new second node DOM tree, and taking the new second node DOM tree as the DOM tree corresponding to the list tag with the most child nodes.
In this embodiment, the second node DOM tree is updated to the new second node DOM tree, so that missing fields are avoided when the webpage feature data is extracted, and the comprehensiveness of the data in each list tag is further improved.
Further, when the root node of the second node DOM tree is a leaf node but the root node of the third node DOM tree is not a leaf node, the third node DOM tree is taken as the fourth node DOM tree of each unselected list tag.
Further, when the root node of the second node DOM tree is not a leaf node but the root node of the third node DOM tree is a leaf node, all child nodes of the root node of the second node DOM tree are traversed, and the tags corresponding to all the child nodes are inserted into the corresponding positions of the third node DOM tree to obtain a new third node DOM tree, which is taken as the fourth node DOM tree of each unselected list tag.
Further, the determining module 205 is configured to determine that the third node DOM tree is the fourth node DOM tree of each unselected list tag when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node.
Further, when the first tag is inconsistent with the second tag, the third node DOM tree is taken as the fourth node DOM tree of each unselected list tag.
In this embodiment, the fourth node DOM tree corresponding to each list tag can be quickly determined according to different determination criteria, so that the diversity of determining the fourth node DOM trees is improved.
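The judgment criteria listed above can be summarized as a small decision function. The function name `classify_case` and the returned labels are hypothetical; the sketch only names which branch applies and does not perform the update itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)


def classify_case(seed, third):
    """Name the branch that decides the fourth node DOM tree."""
    # Root tags differ: the third node DOM tree is used unchanged.
    if seed.tag != third.tag:
        return "keep_third"
    seed_is_leaf = not seed.children
    third_is_leaf = not third.children
    # Seed root is a leaf (third is a leaf or not): nothing to add.
    if seed_is_leaf:
        return "keep_third"
    # Seed root has children but third root is a leaf: copy the seed's
    # child tags into the third node DOM tree.
    if third_is_leaf:
        return "insert_seed_children"
    # Both roots have children: compare the next level tag by tag.
    return "match_children"
```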
The generation module 206 is configured to generate a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees.
In this embodiment, each unordered list corresponds to one node DOM tree, and the fifth node DOM tree of each unordered list tag is obtained by corresponding the second node DOM tree and all the fourth node DOM trees to the position corresponding to each unordered list, so that consistency of the webpage data to be extracted before and after matching is ensured.
Optionally, the generation module 206 generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees includes:
placing the second node DOM tree and all the fourth node DOM trees at their corresponding positions in each unordered list tag to obtain the fifth node DOM tree of each unordered list tag.
In this embodiment, the list tag corresponding to the second node DOM tree and its position in the unordered list tag are identified; the list tag corresponding to each fourth node DOM tree and its position in the unordered list tag are identified; and the second node DOM tree and all the fourth node DOM trees are then placed at those positions to obtain the fifth node DOM tree of each unordered list tag.
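Reassembling an unordered list from the positioned trees can be sketched as follows. It assumes each tree is paired with the index its list tag originally had inside the unordered list tag; the function name `assemble_unordered_list` is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)


def assemble_unordered_list(positioned_trees):
    """Build the fifth node DOM tree from (position, li_tree) pairs.

    The pairs cover the seed (second node DOM tree) and every fourth
    node DOM tree; position is the original index of the list tag.
    """
    ordered = sorted(positioned_trees, key=lambda pair: pair[0])
    return Node("ul", [tree for _, tree in ordered])
```

Sorting by the recorded position restores the original order of the list items regardless of the order in which the trees were matched.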
The extraction module 207 is configured to generate a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and extract webpage feature data of the new webpage to be extracted.
In this embodiment, the new webpage to be extracted is obtained by parsing the first node DOM tree corresponding to the webpage to be extracted and matching the DOM tree of each list tag corresponding to each unordered list tag.
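Generating the new webpage from a fifth node DOM tree amounts to serializing the tree back to HTML. A minimal sketch, assuming nodes carry only a tag, optional text and children (attributes are ignored), with the hypothetical function name `to_html`:

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    children: list = field(default_factory=list)
    text: str = ""


def to_html(node):
    """Serialize a node and its subtree to an HTML string."""
    inner = escape(node.text) + "".join(to_html(c) for c in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"
```

Placeholder nodes inserted during matching serialize as empty elements, which keeps every list item structurally identical in the regenerated page.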
Further, after the web page feature data of the new web page to be extracted is extracted, the extracted web page feature data is converted into a two-dimensional table.
In this embodiment, since the DOM tree of the third node of each unselected list tag in the traversal result is matched with the DOM tree of the second node, and the DOM tree of the fourth node of each unselected list tag is generated according to the matching result, the consistency of the DOM tree corresponding to each unordered list tag is ensured, so that a new webpage to be extracted is regenerated according to the new DOM tree, the phenomenon of missing fields cannot occur in extraction of webpage feature data on the new webpage to be extracted, the phenomenon of data dislocation after the extracted webpage feature data is converted into a two-dimensional table is avoided, and the accuracy of data extraction is improved.
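The conversion into a two-dimensional table can be sketched as below. It assumes each list tag becomes one row and each direct child of the list tag becomes one column, with the column names taken from the child tags of the first row; the function name `to_table` is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simplified DOM node (hypothetical helper type)."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def to_table(ul):
    """Return (header, rows) for the li children of an aligned ul."""
    items = [li for li in ul.children if li.tag == "li"]
    header = [cell.tag for cell in items[0].children] if items else []
    rows = [[cell.text for cell in li.children] for li in items]
    return header, rows
```

Because every list item has the same children after matching, a missing field shows up as an empty cell in its own column instead of shifting the remaining values to the left.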
In summary, the data extraction device for a web page according to this embodiment acquires the HTML code in the source code of the webpage to be extracted and parses the HTML code into a first node DOM tree; parses the first node DOM tree to obtain all unordered list tags; traverses all list tags corresponding to each unordered list tag to obtain a traversal result, and selects from the traversal result the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree; matches the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generates a fourth node DOM tree of each unselected list tag according to the matching result; generates a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees; and generates a new webpage to be extracted according to the fifth node DOM trees of all the unordered list tags, and extracts webpage feature data from the new webpage to be extracted.
This embodiment has three advantages. First, selecting the DOM tree corresponding to the list tag with the most child nodes as the second node DOM tree, that is, the seed node DOM tree, ensures the comprehensiveness of the data in each list tag, because the seed node DOM tree contains the most child nodes. Second, matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating the fourth node DOM tree of each unselected list tag according to the matching result, ensures the consistency of the DOM trees corresponding to each unordered list tag. The new webpage to be extracted is generated from these new DOM trees, so no fields are missing when webpage feature data is extracted from it, the extracted webpage feature data is not misaligned after it is converted into a two-dimensional table, and the accuracy of data extraction is improved. Third, in the process of matching the third node DOM tree of each unselected list tag with the second node DOM tree, identifying the left neighbor node and the right neighbor node of each inconsistent node makes the insertion position precise and ensures the consistency of the fourth node DOM tree of each list tag.
Example Three
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 3 does not constitute a limitation of the embodiment of the present invention, and may be a bus-type configuration or a star-type configuration, and the electronic device 3 may include more or less other hardware or software than those shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the electronic device 3 is only an example, and other existing or future electronic products, such as those that can be adapted to the present invention, should also be included in the scope of the present invention, and are included herein by reference.
In some embodiments, the memory 31 is used for storing program codes and various data, such as the data extraction device 20 of the web page installed in the electronic device 3, and realizes high-speed and automatic access to programs or data during the operation of the electronic device 3. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.
In some embodiments, the at least one processor 32 may be composed of integrated circuits, for example a single packaged integrated circuit, or a plurality of packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is the control unit of the electronic device 3: it connects the components of the electronic device 3 through various interfaces and lines, and executes the functions of the electronic device 3 and processes its data by running or executing the programs or modules stored in the memory 31 and calling the data stored in the memory 31.
In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the electronic device 3 may further include a power supply (such as a battery) for supplying power to each component, and optionally, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In a further embodiment, in conjunction with fig. 2, the at least one processor 32 may execute the operating system of the electronic device 3 and various installed application programs (such as the data extraction device 20 of the web page), program codes, and the like, for example, the above modules.
The memory 31 has program code stored therein, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules illustrated in fig. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to implement the functions of the modules for the purpose of data extraction of web pages.
In one embodiment of the invention, the memory 31 stores a plurality of instructions that are executed by the at least one processor 32 to implement the functionality of data extraction for web pages.
Specifically, for the way in which the at least one processor 32 implements the above instructions, refer to the description of the relevant steps in the embodiment corresponding to fig. 1; details are not repeated here.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the present invention may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A data extraction method of a webpage is characterized by comprising the following steps:
acquiring an HTML code in a source code of a webpage to be extracted, and analyzing the HTML code into a first node DOM tree;
analyzing the DOM tree of the first node to obtain all unordered list tags;
traversing all list tags corresponding to each unordered list tag to obtain a traversal result, and selecting a DOM tree corresponding to the list tag with the most child nodes from the traversal result as a second node DOM tree;
matching the third node DOM tree of each unselected list tag in the traversal result with the second node DOM tree, and generating a fourth node DOM tree of each unselected list tag according to the matching result;
generating a fifth node DOM tree of each unordered list tag according to the second node DOM tree and all fourth node DOM trees;
and generating a new webpage to be extracted according to the DOM trees of the fifth nodes of all the unordered list tags, and extracting webpage characteristic data of the new webpage to be extracted.
2. The method for extracting data from a web page according to claim 1, wherein the matching the DOM tree of the third node of each unselected list tag in the traversal result with the DOM tree of the second node, and the generating the DOM tree of the fourth node of each unselected list tag according to the matching result comprises:
matching the first tag of the root node of the second node DOM tree with the second tag of the root node of the third node DOM tree of each unselected list tag;
when the first tag is consistent with the second tag, judging whether the root node of the second node DOM tree and the root node of the third node DOM tree are leaf nodes;
when the root node of the second node DOM tree is not a leaf node and the root node of the third node DOM tree is not a leaf node, matching the third tags of all the child nodes of the next level of the root node of the second node DOM tree with the fourth tags of all the child nodes of the same level of the third node DOM tree;
and when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are consistent with the fourth tags of all the child nodes of the same level of the third node DOM tree, repeating the process until the child nodes of the second node DOM tree and the child nodes of the third node DOM tree are leaf nodes.
3. The method for extracting data from web pages according to claim 1, wherein the generating a DOM tree of a fourth node for each unselected list tag according to the matching result comprises:
when the third tags of all the child nodes of the next level of the root node of the second node DOM tree are inconsistent with the fourth tag of any child node of the same level of the third node DOM tree, identifying a left neighbor node and a right neighbor node of the fourth tag;
when a left neighbor node is identified but a right neighbor node is not identified, inserting the fourth tag into the rightmost side of the left neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when the left neighbor node is not identified but the right neighbor node is identified, inserting the fourth tag into the leftmost side of the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag; or
when a left neighbor node and a right neighbor node are identified, inserting the fourth tag between the left neighbor node and the right neighbor node to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
4. The method for extracting data from a web page according to claim 2, wherein the method further comprises:
and when the root node of the DOM tree of the second node is a leaf node but the root node of the DOM tree of the third node is not the leaf node, taking the DOM tree of the third node as a DOM tree of a fourth node of each unselected list tag.
5. The method for extracting data from a web page according to claim 2, wherein the method further comprises:
when the root node of the second node DOM tree is not a leaf node, but the root node of the third node DOM tree is a leaf node, traversing all child nodes of the root node of the second node DOM tree;
and inserting the tags corresponding to all the child nodes into the corresponding positions of the third node DOM tree to obtain a new third node DOM tree, and taking the new third node DOM tree as the fourth node DOM tree of each unselected list tag.
6. The method for extracting data from a web page according to claim 2, wherein the method further comprises:
and when the root node of the second node DOM tree is a leaf node and the root node of the third node DOM tree is a leaf node, determining that the third node DOM tree is a fourth node DOM tree of each unselected list tag.
7. The method for extracting data from a web page according to any one of claims 1 to 6, wherein said generating a fifth-node DOM tree for each unordered list tag from said second-node DOM tree and all fourth-node DOM trees comprises:
and corresponding the second node DOM tree and all the fourth node DOM trees to corresponding positions in each unordered list tag to obtain a fifth node DOM tree of each unordered list tag.
8. An apparatus for extracting data from a web page, the apparatus comprising:
an acquisition module, configured to acquire the HTML code in the source code of a web page to be extracted and parse the HTML code into a first node DOM tree;
an analysis module, configured to analyze the first node DOM tree to obtain all unordered list tags;
a traversal module, configured to traverse all list tags corresponding to each unordered list tag to obtain a traversal result, and select from the traversal result the DOM tree of the list tag with the most child nodes as a second node DOM tree;
a matching module, configured to match the third node DOM tree of each unselected list tag in the traversal result against the second node DOM tree, and generate a fourth node DOM tree for each unselected list tag according to the matching result;
a generating module, configured to generate a fifth node DOM tree for each unordered list tag according to the second node DOM tree and all the fourth node DOM trees;
and an extraction module, configured to generate a new web page to be extracted according to the fifth node DOM trees of all the unordered list tags, and extract web page feature data from the new web page to be extracted.
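The first three modules (acquisition, analysis, traversal) can be sketched with Python's standard `html.parser`. This is only an illustration under stated assumptions: `Node`, `TreeBuilder`, `parse_html`, `find_all` and `select_second_tree` are hypothetical names, "most child nodes" is read as the largest number of direct children, and ties go to the first list tag encountered.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name, child nodes, and text."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


class TreeBuilder(HTMLParser):
    """Acquisition step: build a simple node tree from HTML source."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if any.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def parse_html(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node: Node, tag: str) -> List[Node]:
    """Analysis step: collect every node with the given tag, depth first."""
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


def select_second_tree(ul: Node) -> Node:
    """Traversal step: pick the list tag with the most direct children."""
    items = [child for child in ul.children if child.tag == "li"]
    return max(items, key=lambda li: len(li.children))


page = "<ul><li>a</li><li><a>x</a><span>y</span></li><li><a>z</a></li></ul>"
first_tree = parse_html(page)
unordered_lists = find_all(first_tree, "ul")
second_tree = select_second_tree(unordered_lists[0])
```

Here `second_tree` is the middle list tag, since it is the only one with two direct children. An unordered list without any list tags would make `max` raise `ValueError`, so a fuller version would need to skip such lists.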
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement the data extraction method of the web page according to any one of claims 1 to 7 when executing the computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, the computer program, when being executed by a processor, implementing a data extraction method for a web page according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011541079.0A CN112667874A (en) | 2020-12-23 | 2020-12-23 | Webpage data extraction method and device, electronic equipment and storage medium |
PCT/CN2021/125865 WO2022134820A1 (en) | 2020-12-23 | 2021-10-22 | Webpage data extraction method and apparatus, electronic device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011541079.0A CN112667874A (en) | 2020-12-23 | 2020-12-23 | Webpage data extraction method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112667874A (en) | 2021-04-16 |
Family
ID=75409158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011541079.0A Pending CN112667874A (en) | 2020-12-23 | 2020-12-23 | Webpage data extraction method and device, electronic equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112667874A (en) |
WO (1) | WO2022134820A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022134820A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Webpage data extraction method and apparatus, electronic device, and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727461A (en) * | 2008-10-13 | 2010-06-09 | 中国科学院计算技术研究所 | Method for extracting content of web page |
CN107943929A (en) * | 2017-11-22 | 2018-04-20 | 福州大学 | The automatic generating method of wrapper being abstracted based on dom tree |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102184189B (en) * | 2011-04-18 | 2012-11-28 | 北京理工大学 | Webpage core block determining method based on DOM (Document Object Model) node text density |
CN106372232B (en) * | 2016-09-09 | 2020-01-10 | 北京百度网讯科技有限公司 | Information mining method and device based on artificial intelligence |
CN109582886B (en) * | 2018-11-02 | 2022-05-10 | 北京字节跳动网络技术有限公司 | Page content extraction method, template generation method and device, medium and equipment |
CN109726376B (en) * | 2018-12-21 | 2023-01-20 | 上海众源网络有限公司 | Standard template generation method and device and electronic equipment |
CN112667874A (en) * | 2020-12-23 | 2021-04-16 | 深圳壹账通智能科技有限公司 | Webpage data extraction method and device, electronic equipment and storage medium |
- 2020-12-23: CN202011541079.0A, published as CN112667874A (active, Pending)
- 2021-10-22: PCT/CN2021/125865, published as WO2022134820A1 (active, Application Filing)
Also Published As
Publication number | Publication date |
---|---|
WO2022134820A1 (en) | 2022-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106293675B (en) | | System static resource loading method and device |
CN106662986A (en) | | Optimized browser rendering process |
CN112506910A (en) | | Multi-source data acquisition method and device, electronic equipment and storage medium |
CN114020256A (en) | | Front-end page generation method, device and equipment and readable storage medium |
CN111625748A (en) | | Website navigation bar information extraction method and device, electronic equipment and storage medium |
CN115048111B (en) | | Code generation method, device, equipment and medium based on metadata |
CN113886204A (en) | | User behavior data collection method and device, electronic equipment and readable storage medium |
CN114707474A (en) | | Report generation method and device, electronic equipment and computer readable storage medium |
CN113283216A (en) | | Webpage content display method, device, equipment and storage medium |
CN115408399A (en) | | Blood relationship analysis method, device, equipment and storage medium based on SQL script |
CN112667878A (en) | | Webpage text content extraction method and device, electronic equipment and storage medium |
CN103488675A (en) | | Automatic precise extraction device for multi-webpage news comment contents |
CN115640578A (en) | | Vulnerability reachability analysis method, device, equipment and medium for application program |
CN113268695A (en) | | Data embedding point processing method and device and related equipment |
CN112667874A (en) | | Webpage data extraction method and device, electronic equipment and storage medium |
CN112667208A (en) | | Translation error recognition method and device, computer equipment and readable storage medium |
CN115454382A (en) | | Demand processing method and device, electronic equipment and storage medium |
US11544179B2 (en) | | Source traceability-based impact analysis |
CN111625749B (en) | | Method, device, equipment and medium for extracting website detail page information of participant company |
CN112905470A (en) | | Interface calling method and device, computer equipment and medium |
CN113139145A (en) | | Page generation method and device, electronic equipment and readable storage medium |
CN104252355B (en) | | The method and apparatus of different information between a kind of acquisition Net procedure sets |
CN111597108B (en) | | Form attribute testing method and device and computer readable storage medium |
CN114760365B (en) | | Data extraction method and device and electronic equipment |
CN117608719A (en) | | Page element searching method and device, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40049917; Country of ref document: HK |