CN112099778B - Data acquisition method based on xpath, electronic equipment and storage medium - Google Patents

Publication number
CN112099778B
Authority
CN
China
Prior art keywords
node
grouping
detail page
xpath
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011265720.2A
Other languages
Chinese (zh)
Other versions
CN112099778A (en)
Inventor
宋岩强
李青龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Smart Starlight Information Technology Co ltd
Original Assignee
Beijing Smart Starlight Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Smart Starlight Information Technology Co ltd
Priority to CN202011265720.2A
Publication of CN112099778A
Application granted
Publication of CN112099778B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a data acquisition method, a system, electronic equipment and a storage medium based on xpath, wherein the method merges xpath paths of click elements of list pages, extracts a common path and obtains a list page merging path; obtaining a list with the same path as the merging path according to the list page merging path, and generating a list page link set; entering a detail page corresponding to one link, and determining an acquisition mode according to a page acquisition object of the detail page, wherein the acquisition mode comprises a grouping acquisition mode and a non-grouping acquisition mode; extracting text information in a text extraction mode according to an xpath path of the detail page click element to obtain text information corresponding to the detail page click element; and automatically generating an acquisition code according to the click element xpath path of the detail page and the corresponding text information, and respectively entering each link in the list page link set for data acquisition, so that the data acquisition of all links in the list page is completed, the acquisition code does not need to be manually written, and the automatic data acquisition is realized.

Description

Data acquisition method based on xpath, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of data acquisition, in particular to a data acquisition method and system based on xpath, electronic equipment and a storage medium.
Background
Generally, web page data acquisition depends on technical staff. According to the acquisition requirements of a client, they analyze the structure of the web page, locate the elements to be acquired by using Firebug or a browser's built-in xpath extraction tool or by analyzing the DOM structure, and then write a corresponding acquisition program on top of a given acquisition framework. Crawler code cannot be generated and invoked automatically.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data acquisition method, system, electronic device and storage medium based on xpath, so as to solve the problem in the prior art that a browser crawler code cannot be automatically programmed and invoked.
Therefore, the embodiment of the invention provides the following technical scheme:
according to a first aspect, an embodiment of the present invention provides an xpath-based data acquisition method, including: receiving at least two list page click events in a list page to obtain the xpath paths of the list page click elements corresponding to the list page click events; merging the xpath paths of the list page click elements to obtain a list page merging path, wherein the list page merging path is the xpath path obtained by descending from the parent node to the first leaf node whose element name is the same but whose position differs, and removing the position information of that leaf node; searching the list page for a list with the same path as the list page merging path, and generating a list page link set; entering any link in the list page link set, opening the corresponding detail page, and determining an acquisition mode according to the page acquisition object of the detail page, wherein the acquisition mode is preset and comprises grouping acquisition and non-grouping acquisition; if the acquisition mode is the non-grouping mode, the detail page directly presents the detailed content, and an xpath path of a detail page click element is received; extracting the text information corresponding to the detail page click element in a text extraction mode according to the xpath path of the detail page click element; and carrying out data acquisition on each link in the list page link set according to the xpath path of the detail page click element and the corresponding text information.
Optionally, the method further comprises: if the acquisition mode is a grouping mode, the detail page comprises a plurality of detail page grouping nodes, and an xpath path of a detail page grouping node clicking element is received; according to the xpath path of the detail page grouping node clicking elements, obtaining the xpath paths of all elements similar to the detail page grouping node clicking elements in the detail page grouping nodes, and generating a detail page grouping node set; grouping and merging xpath paths of click elements in any detail page grouping node in a detail page grouping node set to obtain a detail page grouping node xpath path, wherein the detail page grouping node xpath path comprises a grouping mark for positioning the detail page grouping node, and the grouping mark is positioned behind the position of a father node element of the click element in a group and is adjacent to the father node element; entering any node in a detail page grouping node set, opening a corresponding detail page grouping node, and receiving an xpath path of a detail page clicking element; extracting text information corresponding to the detail page click element in a text extraction mode according to an xpath path of the detail page click element; respectively carrying out data grouping collection on each detail page grouping node in the detail page grouping node set according to the xpath path of the detail page clicking element and the corresponding text information; and respectively acquiring data of each link in the list page link set according to the xpath path of the detail page grouping node clicking element.
Optionally, the step of merging the xpath paths of the list page click elements includes: comparing, one by one, the node types of the nodes at each level of the xpath paths of the list page click elements; if the node type comparison results are inconsistent, terminating the merging; if the node type comparison results are consistent, comparing the node names of the nodes at each level one by one; if the node name comparison results are consistent, terminating the merging; and if the node name comparison results are inconsistent, searching downwards from the parent node for the first leaf node with the same element name and a different position, and removing the position information of that leaf node to obtain the list page merging path.
Optionally, extracting the text information of the click element includes: if the element name of the click element is INPUT and the type of the INPUT is button, reset or submit, the text information is the value corresponding to the click element; if the element name of the click element is LABEL, the text information is the textContent corresponding to the click element; if the element name of the click element is SELECT, the text information is the name corresponding to the click element; if the element name of the click element is not INPUT, LABEL or SELECT, judging whether the innerText corresponding to the click element is empty; if the innerText corresponding to the click element is not empty, the text information is the innerText corresponding to the click element; and if the innerText corresponding to the click element is empty, the text information is the innerText of the parent node of the click element.
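The text extraction rule above can be sketched in Python as follows. This is only an illustrative sketch: the `Element` class is a hypothetical stand-in for a browser DOM element (its fields mirror the DOM properties named in the text), and the handling of an INPUT whose type is not button, reset or submit is an assumption, since the text does not specify that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element (not a real browser API)."""
    tag: str                      # element name, e.g. "INPUT"
    type: str = ""                # "type" attribute of an INPUT element
    value: str = ""               # DOM value
    name: str = ""                # DOM name
    text_content: str = ""        # DOM textContent
    inner_text: str = ""          # DOM innerText
    parent: Optional["Element"] = None


def extract_text(el: Element) -> str:
    """Return the text information of a clicked element."""
    tag = el.tag.upper()
    if tag == "INPUT" and el.type.lower() in ("button", "reset", "submit"):
        return el.value
    if tag == "LABEL":
        return el.text_content
    if tag == "SELECT":
        return el.name
    # Remaining elements: use innerText, falling back to the parent's innerText.
    # Assumption: other INPUT types also take this branch.
    if el.inner_text:
        return el.inner_text
    return el.parent.inner_text if el.parent is not None else ""
```

For instance, a clicked `SPAN` with an empty innerText inside a `DIV` whose innerText is "Price: 42" would yield "Price: 42" under this sketch.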
Optionally, the step of obtaining the xpath paths of all elements similar to the detail page grouping node click element in the detail page grouping node according to the xpath path of the detail page grouping node click element includes: S901: receiving a detail page grouping node click element, and obtaining a current node according to the detail page grouping node click element; S902: acquiring the parent node of the current node, and traversing the child nodes under the parent node in sequence; S903: judging whether the node type of the child node is an element; S904: if the node type of the child node is an element, judging whether the node name of the child node is the same as the node name of the current node; S905: if the node names are the same, accumulating a count of the position information to obtain an accumulated count value; S906: if the accumulated count value is greater than 0, the xpath path of the child node uses the position information as preset; S907: taking the parent node of the current node as the new current node, and repeating steps S902-S906 until the new current node is a body node or a node with an id attribute, so as to obtain the xpath paths of all elements similar to the detail page grouping node click element.
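The upward traversal in steps S901-S907 can be illustrated with a short Python sketch that builds a positional xpath path for a single element. It is a sketch under stated assumptions, not the patented implementation: the page is assumed to be well-formed markup parsed with the standard `xml.etree.ElementTree` module, the function name `build_xpath` is hypothetical, and "accumulated count value greater than 0" is interpreted here as "at least one other sibling has the same node name".

```python
import xml.etree.ElementTree as ET


def build_xpath(root: ET.Element, target: ET.Element) -> str:
    """Climb from target towards the root, emitting one xpath step per level."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        node_id = node.get("id")
        if node_id:
            # Stop condition: a node with an id attribute anchors the path.
            steps.append(f"{node.tag}[@id='{node_id}']")
            break
        parent = parent_of.get(node)
        if node.tag == "body" or parent is None:
            # Stop condition: the body node (or the document root).
            steps.append(node.tag)
            break
        # S903/S904: keep only element children with the same node name.
        same_name = [c for c in parent
                     if isinstance(c.tag, str) and c.tag == node.tag]
        if len(same_name) > 1:
            # S905/S906: other same-named siblings exist, so add the position.
            position = next(i for i, c in enumerate(same_name, start=1)
                            if c is node)
            steps.append(f"{node.tag}[{position}]")
        else:
            steps.append(node.tag)
        node = parent  # S907: the parent becomes the new current node
    return "//" + "/".join(reversed(steps))
```

Applied to the second `h3` of a list such as `<ul id='UI1'><li><a><h3/></a></li><li><a><h3/></a></li></ul>`, the sketch climbs `h3`, `a`, `li` and stops at the `ul` because it carries an id attribute.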
Optionally, the step of grouping and merging the xpath paths of the click elements in any detail page grouping node in the detail page grouping node set includes: comparing, one by one, the node names of the nodes at each level of the xpath paths of the detail page grouping node click elements; if the node name comparison results are consistent, retaining the xpath path of the detail page grouping node click element; if the node name comparison results are inconsistent, acquiring element attribute information of the detail page grouping node click element, wherein the element attribute information comprises position information, dynamic identification (ID) information and grouping information; if the element attribute information is position information, replacing the position information in the xpath path of the detail page grouping node click element with a grouping mark; if the element attribute information is dynamic ID information, removing the dynamic ID information from the xpath path of the detail page grouping node click element; if the element attribute information is grouping information, judging whether the grouping information is consistent; if the grouping information is consistent, retaining the grouping information in the xpath path of the detail page grouping node click element; and if the grouping information is inconsistent, removing the grouping information from the xpath path of the detail page grouping node click element.
Optionally, the method further comprises: acquiring data of detail page elements in a local self-contained browser acquisition mode; or, data acquisition of the detail page elements is carried out in a headless acquisition mode; or, data acquisition of the detail page elements is carried out in a cloud acquisition mode.
Optionally, the method further comprises: receiving an injected CSS style; and performing preset display of the xpath path matching result according to the CSS style; wherein the preset display comprises displaying a hover operation used for searching an element in a first color and displaying a click operation used for selecting an element in a second color, the first color being different from the second color.
According to a second aspect, an embodiment of the present invention provides an xpath-based data acquisition system, including: a first processing module, used for receiving at least two list page click events in a list page and obtaining the xpath paths of the list page click elements corresponding to the list page click events; a second processing module, used for merging the xpath paths of the list page click elements to obtain a list page merging path, wherein the list page merging path is the xpath path obtained by descending from the parent node to the first leaf node whose element name is the same but whose position differs, and removing the position information of that leaf node; a third processing module, used for searching the list page for a list with the same path as the list page merging path and generating a list page link set; a fourth processing module, used for entering any link in the list page link set, opening the corresponding detail page and determining an acquisition mode according to the page acquisition object of the detail page, wherein the acquisition mode is preset and comprises grouping acquisition and non-grouping acquisition; a fifth processing module, used for receiving an xpath path of a detail page click element if the acquisition mode is the non-grouping mode, in which case the detail page directly presents the detailed content; a sixth processing module, used for extracting the text information corresponding to the detail page click element in a text extraction mode according to the xpath path of the detail page click element; and a seventh processing module, used for carrying out data acquisition on each link in the list page link set according to the xpath path of the detail page click element and the corresponding text information.
Optionally, the system further comprises: an eighth processing module, configured to, if the acquisition mode is the grouping mode, determine that the detail page includes a plurality of detail page grouping nodes, and receive an xpath path of a detail page grouping node click element; a ninth processing module, used for obtaining, according to the xpath path of the detail page grouping node click element, the xpath paths of all elements in the detail page grouping nodes that are similar to the detail page grouping node click element, and generating a detail page grouping node set; a tenth processing module, configured to group and merge the xpath paths of the click elements in any detail page grouping node in the detail page grouping node set to obtain a detail page grouping node xpath path, where the detail page grouping node xpath path includes a grouping mark for locating the detail page grouping node, and the grouping mark is placed immediately after the position of the parent node element of the click element in the group; an eleventh processing module, used for entering any detail page grouping node in the detail page grouping node set, opening the corresponding grouping node and receiving an xpath path of a detail page click element; a twelfth processing module, used for extracting the text information corresponding to the detail page click element in a text extraction mode according to the xpath path of the detail page click element; a thirteenth processing module, configured to perform grouped data collection on each detail page grouping node in the detail page grouping node set according to the xpath path of the detail page click element and the corresponding text information; and a fourteenth processing module, used for carrying out data acquisition on each link in the list page link set according to the xpath path of the detail page grouping node click element.
Optionally, the second processing module includes: the first processing unit is used for comparing the node types of all levels of nodes of the xpath path of the click elements of the list page one by one; the second processing unit is used for terminating the combination if the node type comparison results are inconsistent; the third processing unit is used for comparing the node names of the nodes at all levels one by one if the node type comparison results are consistent; a fourth processing unit, configured to terminate merging if the node name comparison results are consistent; and the fifth processing unit is used for searching leaf nodes with the same first element name and different positions downwards from the father node if the node name comparison results are inconsistent, and removing the position information of the leaf nodes to obtain a list page merging path.
Optionally, the system comprises: a sixth processing unit, used for determining that the text information is the value corresponding to the click element if the element name of the click element is INPUT and the type of the INPUT is button, reset or submit; a seventh processing unit, configured to determine that the text information is the textContent corresponding to the click element if the element name of the click element is LABEL; an eighth processing unit, configured to determine that the text information is the name corresponding to the click element if the element name of the click element is SELECT; a ninth processing unit, configured to determine whether the innerText corresponding to the click element is empty if the element name of the click element is not INPUT, LABEL or SELECT; a tenth processing unit, used for determining that the text information is the innerText corresponding to the click element if that innerText is not empty; and an eleventh processing unit, used for determining that the text information is the innerText of the parent node of the click element if the innerText corresponding to the click element is empty.
Optionally, the ninth processing module includes: the twelfth processing unit is used for receiving the detail page grouping node clicking element and obtaining the current node according to the detail page grouping node clicking element; the thirteenth processing unit is used for acquiring a father node of the current node and sequentially traversing child nodes under the father node; a fourteenth processing unit, configured to determine whether the node type of the child node is an element; a fifteenth processing unit, configured to determine whether a node name of a child node is the same as a node name of a current node if the node type of the child node is an element; a sixteenth processing unit, configured to perform cumulative counting on the position information if the node names are the same, to obtain a cumulative count value; a seventeenth processing unit, configured to use the location information according to a predetermined xpath path of the child node if the accumulated count value is greater than 0; and the eighteenth processing unit is used for taking a father node of the current node as a new current node, recursing until the new current node is a body node or a node with an id attribute, and obtaining an xpath path of all elements similar to the detail page grouping node clicking element.
Optionally, the tenth processing module includes: a nineteenth processing unit, configured to compare node names of nodes at each level of an xpath path of the detail page grouping node click element one by one; a twentieth processing unit, configured to, if the node name comparison results are consistent, reserve an xpath path of the detail page grouping node click element; a twenty-first processing unit, configured to, if the node name comparison result is inconsistent, obtain element attribute information of a click element of a detail page grouping node, where the element attribute information includes location information, dynamic identification ID information, and grouping information; a twenty-second processing unit, configured to replace, if the element attribute information is location information, location information of an xpath path where the detail page grouping node clicks an element with a grouping flag; a twenty-third processing unit, configured to remove the dynamic ID information of the xpath path where the detail page grouping node clicks the element if the element attribute information is the dynamic ID information; a twenty-fourth processing unit, configured to determine whether the grouping information is consistent if the element attribute information is the grouping information; a twenty-fifth processing unit, configured to, if the grouping information is consistent, retain grouping information of an xpath path of the detail page grouping node click element; and the twenty-sixth processing unit is used for removing the grouping information of the xpath path of the clicking element of the detail page grouping node if the grouping information is inconsistent.
Optionally, the system further comprises: a fifteenth processing module, used for acquiring data of the detail page elements in a local self-contained browser acquisition mode; or a sixteenth processing module, used for acquiring data of the detail page elements in a headless acquisition mode; or a seventeenth processing module, configured to acquire data of the detail page elements in a cloud acquisition mode.
Optionally, the system further comprises: an eighteenth processing module, used for receiving an injected CSS style and performing preset display of the xpath path matching result according to the CSS style; and a nineteenth processing module, configured to display, in the preset display, a hover operation used for searching an element in a first color and a click operation used for selecting an element in a second color, where the first color is different from the second color.
According to a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the xpath-based data acquisition method as described in any of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause a computer to execute the xpath-based data acquisition method described in any one of the first aspects.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides a data acquisition method, a data acquisition system, electronic equipment and a storage medium based on xpath, wherein the method comprises the following steps: receiving at least two list page click events in a list page to obtain the xpath paths of the list page click elements corresponding to the list page click events; merging the xpath paths of the list page click elements to obtain a list page merging path, wherein the list page merging path is the xpath path obtained by descending from the parent node to the first leaf node whose element name is the same but whose position differs, and removing the position information of that leaf node; searching the list page for a list with the same path as the list page merging path, and generating a list page link set; entering any link in the list page link set, opening the corresponding detail page, and determining an acquisition mode according to the page acquisition object of the detail page, wherein the acquisition mode is preset and comprises grouping acquisition and non-grouping acquisition; if the acquisition mode is the non-grouping mode, the detail page directly presents the detailed content, and an xpath path of a detail page click element is received; extracting the text information corresponding to the detail page click element in a text extraction mode according to the xpath path of the detail page click element; and carrying out data acquisition on each link in the list page link set according to the xpath path of the detail page click element and the corresponding text information.
Through the above steps, the xpath paths of the list page click elements are merged and their common path is extracted to obtain a list page merging path; a list with the same path as the merging path is found in the list page according to the list page merging path, and a list page link set is generated; the detail page corresponding to one link is entered, and whether data needs to be collected in groups is determined according to the page acquisition object of the detail page; if grouping collection is not needed, the detail page contains no detail page grouping node and directly presents the detailed content, and the acquisition mode is determined to be the non-grouping acquisition mode; text information is extracted in a text extraction mode according to the xpath path of the detail page click element, giving the text information corresponding to the detail page click element; an acquisition code is automatically generated according to the xpath path of the detail page click element and the corresponding text information, the acquisition code enters each link in the list page link set in turn, and the specific data corresponding to the text information is acquired in the detail page corresponding to that link, so that data acquisition of all links in the list page is completed. No crawler acquisition code needs to be written manually, acquisition is automated, the back-and-forth communication between user and developer and the tedious coding work are reduced, and the difficulty and cost of browser data acquisition are lowered.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart illustrating a specific example of an xpath-based data collection method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a specific example of grouping and merging in an xpath-based data collection method according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating another exemplary xpath-based data collection method according to an embodiment of the present invention;
FIG. 10 is a block diagram of a specific example of an xpath-based data collection system in accordance with an embodiment of the present invention;
FIG. 11 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For web page data that is simple in structure but large in quantity, a developer needs to communicate with the client many times during development, and the coding work is tedious, so the difficulty and cost of browser data collection are high. By analyzing the source code of Firebug and of the open-source xpath tools of browsers, the inventors found that the priority order of the xpath selector is fixed as id, position, class, and that the xpath path carries no grouping structure information, that is, the grouping structure cannot be used effectively. In this embodiment, an xpath extraction tool, or the tool built into the browser, is modified to provide a visual, customizable xpath-based method that carries grouping information, which provides the basis for automatically generating crawler code.
Based on this, the embodiment of the present invention provides an xpath-based data acquisition method, as shown in fig. 1, the method may include steps S1-S7.
Step S1: receiving at least two list page click events in the list page to obtain the xpath paths of the list page click elements corresponding to the list page click events.
As an exemplary embodiment, a list page usually includes a plurality of links, and after the JS monitoring code is injected into the browser, event monitoring can be performed on the operation action of the user, and xpath of an element generating the event is extracted from the event. In this embodiment, there are two list page click events, that is, the user performs a click operation on two elements in the list page, so that the two list page click events are received, and the xpath of the click element is extracted from the events to obtain the xpath path of the two list page click elements corresponding to the two click events. Of course, in other embodiments, the number of the list page click events may be 3, or even more, and the number may be set as needed.
Step S2: merging the xpath paths of the clicked elements of the list pages to obtain a list page merging path, wherein the list page merging path is an xpath path obtained by descending from a father node to a leaf node with the same first element name and different positions and removing the position information of the leaf node.
As an exemplary embodiment, the items of one list in a browser page are usually very similar, and, following the way crawler programs are written, the xpath paths of different links in the same list page are similar, so their common path can be extracted from the different links. The xpath paths of the at least two list page click elements obtained in step S1 are merged to obtain the common path of the list page click elements, and this common path is taken as the list page merging path. Specifically, the list page merging path is the xpath path obtained by descending from the parent node to the first leaf node that has the same element name but different positions, and removing the position information of that leaf node.
For example, the list page contains several titles. Clicking the title in the first row extracts the xpath path //ul[@id='UI1']/li[1]/a/h3; clicking the title in the second row extracts the xpath path //ul[@id='UI1']/li[2]/a/h3. To merge the two paths, their nodes are compared in turn from left to right. The first node with the same element name but different positions is li: its position information is [1] in the first path and [2] in the second. Removing the position information of this node gives the merged xpath path //ul[@id='UI1']/li/a/h3.
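The merge in this example can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names `split_steps` and `merge_list_xpaths` are hypothetical, and it assumes simple paths whose predicates contain no "/" character.

```python
import re
from typing import List, Optional

# Trailing positional predicate of a step, such as the "[2]" in "li[2]".
POSITION = re.compile(r"\[\d+\]$")


def split_steps(xpath: str) -> List[str]:
    """Split a simple xpath such as //ul[@id='UI1']/li[1]/a into its steps."""
    return xpath.lstrip("/").split("/")


def step_name(step: str) -> str:
    """Return the element name of a step, without predicates."""
    return step.split("[", 1)[0]


def merge_list_xpaths(first: str, second: str) -> Optional[str]:
    """Merge two clicked-element paths; return None when they cannot be merged."""
    a, b = split_steps(first), split_steps(second)
    if len(a) != len(b):
        return None
    merged = []
    stripped = False  # only the first differing node may lose its position
    for sa, sb in zip(a, b):
        if sa == sb:
            merged.append(sa)
            continue
        same_name = step_name(sa) == step_name(sb)
        both_positioned = POSITION.search(sa) and POSITION.search(sb)
        same_without_position = POSITION.sub("", sa) == POSITION.sub("", sb)
        if not stripped and same_name and both_positioned and same_without_position:
            merged.append(POSITION.sub("", sa))  # remove the position information
            stripped = True
        else:
            return None  # element names differ, so there is no common path
    return "//" + "/".join(merged)


print(merge_list_xpaths("//ul[@id='UI1']/li[1]/a/h3",
                        "//ul[@id='UI1']/li[2]/a/h3"))
```

With the two paths of the example, the sketch yields //ul[@id='UI1']/li/a/h3; paths whose element names differ at some level are reported as not mergeable.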
Step S3: searching the list page for lists whose path is the same as the list page merging path, and generating a list page link set.
As an exemplary embodiment, all lists in the list page are subjected to traversal comparison according to the list page merging path, lists having the same path as the list page merging path are found, and the lists having the same path are stored, so as to obtain a list page link set.
In this embodiment, all lists in the list page are traversed and the xpath path of each list is compared with the merging path one by one; a list whose xpath path is the same as the merging path is stored in the list page link set. Specifically, the position information of the leaf node is ignored during the comparison, that is, position information is not compared and fuzzy matching is performed.
For example, the xpath paths of some three lists in the list page are respectively as follows.
//ul[@id='UI1']/li[10]/a/h3
//ul[@id='UI1']/li[25]/a/h3
//ul[@id='UI1']/div[1]/a/h3
After the search, the first and second lists have the same path as the merging path, so both are stored in the list page link set. In the third list the node at that level is div rather than li, so its path differs from the merging path and it is not stored in the list page link set.
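The fuzzy matching of step S3 can be sketched as below. The helper names `matches_merge_path` and `build_link_set` are hypothetical, and the sketch assumes that position information is ignored only at those levels where the merging path itself carries none.

```python
import re
from typing import List

# Trailing positional predicate of a step, such as the "[10]" in "li[10]".
TRAILING_POSITION = re.compile(r"\[\d+\]$")


def matches_merge_path(candidate: str, merge_path: str) -> bool:
    """Compare a candidate path with the merging path, ignoring positions
    at the levels where the merging path has had its position removed."""
    c_steps = candidate.lstrip("/").split("/")
    m_steps = merge_path.lstrip("/").split("/")
    if len(c_steps) != len(m_steps):
        return False
    for c_step, m_step in zip(c_steps, m_steps):
        if c_step == m_step:
            continue
        # Fuzzy match: drop the candidate's position before comparing.
        if TRAILING_POSITION.sub("", c_step) != m_step:
            return False
    return True


def build_link_set(candidates: List[str], merge_path: str) -> List[str]:
    """Keep every candidate path that has the same path as the merging path."""
    return [path for path in candidates if matches_merge_path(path, merge_path)]


candidates = [
    "//ul[@id='UI1']/li[10]/a/h3",
    "//ul[@id='UI1']/li[25]/a/h3",
    "//ul[@id='UI1']/div[1]/a/h3",
]
print(build_link_set(candidates, "//ul[@id='UI1']/li/a/h3"))
```

Applied to the three paths above, only the two li paths are kept.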
Step S4: entering any link in the list page link set, opening the corresponding detail page, and determining an acquisition mode according to the page acquisition object of the detail page, wherein the acquisition mode is preset and comprises grouping acquisition and non-grouping acquisition.
As an exemplary embodiment, the list page link set includes a plurality of similar links; one of the links is entered, the corresponding detail page is opened, and the acquisition mode is determined according to the page acquisition object of the detail page. In the usual case, the page opened from the list page directly displays the detailed page content, and the page acquisition object is that detailed content. However, many detail pages now consist of detail page grouping nodes (called secondary links in this embodiment), and the detailed content is only shown on the page opened from a detail page grouping node; in that case the page acquisition object is the detail page grouping nodes in the detail page. Specifically, a machine learning method trained on a large number of samples can be used to judge whether the page acquisition object in a detail page is detailed page content or detail page grouping nodes.
And selecting an acquisition mode according to the page acquisition object of the detail page. The acquisition mode is preset and comprises two acquisition modes of grouping acquisition and non-grouping acquisition. The grouping collection mode is suitable for pages which are detail page grouping nodes in the detail pages, and the non-grouping collection mode is suitable for pages which are page detailed contents in the detail pages.
For example, the recall list page of the "home page of a car" website includes a number of links, and the detail page opened after entering the fourth link ("lexler recalling 9838 cars with air bag hidden danger") shows detailed page content. The page acquisition object of this detail page is therefore the detailed page content.
As another example, the hotspot tracking list page of the same website includes several lists, only three of which are captured in this embodiment. Entering the second list, "super-broadcast: the pure electric vehicle frequently catches fire, who should take the blame", opens a detail page that shows the detailed content.
Step S5: if the acquisition mode is the non-grouping mode, the detail page is a page of detailed content, and the xpath path of the detail page click element is received.
As an exemplary embodiment, if the acquisition mode is the non-grouping mode, the detail page has no detail page grouping nodes and shows detailed content. An element in the detail page is clicked to obtain the xpath path of the detail page click element.
Step S6: extracting the text information corresponding to the detail page click element in a text extraction manner according to the xpath path of the detail page click element.
As an exemplary embodiment, after obtaining the xpath path of the click element of the detail page, the location of the click element of the detail page can be accurately located, and text information is extracted from the click element of the detail page in a text extraction manner to obtain text information corresponding to the click element of the detail page.
In this embodiment, the specific text extraction manner may be: if the nodeName of the element is INPUT and the type is button, reset or submit, the value is obtained; if the nodeName of the element is LABEL, the textContent is obtained; if the nodeName of the element is SELECT, the name is obtained. This embodiment is only illustrative and not limiting. In other embodiments, other text extraction manners may also be included.
For example, after the author "snow lotus" is clicked in the detail page "airbag hidden danger lexler recalls 9838 cars", the xpath path of the clicked element is acquired automatically to obtain an accurate acquisition position. The text information at the acquisition position is then extracted in the text extraction manner: the nodeName of the click element is determined to be LABEL according to the xpath path, so the text information is the textContent, and the acquired data is "snow lotus".
Step S7: acquiring data from each link in the list page link set according to the xpath path of the detail page click element and the corresponding text information.
Detail pages that look structurally similar have similar DOM structures and the same xpath paths, so once the xpath path of the click element in one detail page has been acquired, the corresponding elements in the other detail pages can be located with the same path. In general, different detail pages with the same structure are coded in the same way and differ only in their specific content, so the xpath paths of elements at the same position in different detail pages are the same. For example, the position of the "author" element in a detail page rendered in the browser is fixed and corresponds to one xpath path; in other detail pages with the same structure the position of the "author" element does not change and only the author name itself is replaced, so the xpath path is the same.
As an exemplary embodiment, each link in the list page link set is entered separately, i.e., each link is opened one by one; finding a click element xpath path of the detail page in the detail page corresponding to the link, determining an acquisition position, and acquiring data according to the text information corresponding to the acquisition position to obtain the acquisition data corresponding to the acquisition position. The collected data is the specific data in the detail page, so that the data collection of all the links in the list page is completed.
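Step S7 can be illustrated with the following minimal sketch, which applies one recorded element path to several detail pages. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; the page keys, the markup and the author name in the second page are invented sample data, and `collect_field` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def collect_field(pages: Dict[str, str], element_path: str) -> Dict[str, Optional[str]]:
    """Apply the same element path to every detail page and collect its text."""
    results = {}
    for link, markup in pages.items():
        root = ET.fromstring(markup)
        node = root.find(element_path)
        # Joining itertext() approximates the textContent of the element.
        results[link] = "".join(node.itertext()).strip() if node is not None else None
    return results


# Two sample detail pages with the same structure but different content.
pages = {
    "detail/1": "<html><body><div id='meta'><label>snow lotus</label></div></body></html>",
    "detail/2": "<html><body><div id='meta'><label>another author</label></div></body></html>",
}

# Path recorded once from the click in the first detail page (sample value).
author_path = ".//div[@id='meta']/label"
print(collect_field(pages, author_path))
```

Because both pages share the same structure, the single recorded path locates the author in each of them. A real collector would need a full HTML parser and XPath engine rather than this subset.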
Through the above steps, the xpath paths of the list page click elements are merged and their common path is extracted to obtain the list page merging path. Lists with the same path as the merging path are found in the list page to generate the list page link set. The detail page corresponding to one link is entered, and whether data needs to be collected in groups is determined according to the page acquisition object of the detail page. If grouping collection is not needed, the detail page contains no detail page grouping nodes and shows detailed page content, so the acquisition mode is the non-grouping mode. Text information corresponding to the detail page click element is extracted in a text extraction manner according to the xpath path of that element. Each link in the list page link set is then entered, and the specific data in the corresponding detail page is collected according to the xpath path of the detail page click element and the corresponding text information, which completes data collection for all links in the list page. No crawler collection code needs to be written manually and results are obtained automatically, which reduces the communication between user and developer and the complex coding work, and lowers the difficulty and cost of browser data collection.
As an exemplary embodiment, as shown in FIG. 2, the method further includes steps S8-S13.
Step S8: if the acquisition mode is the grouping mode, the detail page comprises a plurality of detail page grouping nodes, and the xpath path of the detail page grouping node click element is received.
As an exemplary embodiment, the grouping mode indicates that the detail page does not show detailed content but consists of a plurality of detail page grouping nodes, and the xpath path of the click element of a detail page grouping node is received.
For example, after entering a mobile phone list (a link in a list page) in the JD (Jingdong) app, the opened page includes a plurality of detail page grouping nodes (each mobile phone corresponds to one detail page grouping node), and the detailed information of a mobile phone can be seen only after entering its detail page grouping node. Since conventional data collection does not carry grouping information, if a page of this type is not collected in groups, it cannot be determined which elements belong to the same detail page grouping node. With the grouping collection mode, the elements within each group can be obtained accurately, so that data collection is more accurate. For example, after grouping collection it is known that "5.0-4.6 inches 16GB 3GB", "¥899.00" and "360,000+ reviews" belong to one group (the same detail page grouping node), and that "5.5-5.1 inches 32GB 3GB", "¥1099.00" and "560,000+ reviews" belong to another group. If such a page is not collected in groups, the collected data consist of all prices, such as "¥899.00", "¥1099.00", "¥3599.00" and "¥1099.00", and all review counts, such as "360,000+", "560,000+", "190,000+" and "320,000+"; it is then impossible to tell which values belong to the same detail page grouping node, so the correspondence between price and review count cannot be determined.
Step S9: according to the xpath path of the detail page grouping node click element, obtaining the xpath paths of all elements in the detail page grouping nodes that are similar to the detail page grouping node click element, and generating a detail page grouping node set.
As an exemplary embodiment, the xpath paths of all elements similar to the detail page grouping node click element are obtained from the xpath paths of the detail page grouping node click element, and the xpath paths carry the grouped position information, so that the xpath paths are grouped and merged in the following process.
Step S10: grouping and merging the xpath paths of the click elements in any detail page grouping node in the detail page grouping node set to obtain the xpath path of the detail page grouping node, wherein the xpath path of the detail page grouping node comprises a grouping mark for locating the detail page grouping node, and the grouping mark is placed immediately after the parent node element of the click elements in the group.
As an exemplary embodiment, the xpath paths of the click elements in the detail page grouping nodes are grouped and merged, and the grouping information is determined. In this embodiment, the grouping mark is set to [n], as shown in fig. 8, where the [n] in //n1[@id='id1']/n2[n] is the grouping mark. Of course, in other embodiments, other marks conforming to the xpath syntax may be used; this embodiment is only illustrative and not limiting.
For ease of understanding, two of the detail page grouping nodes in the detail page are taken as an example: the elements "¥3949.00" and "270,000+ reviews" in the first detail page grouping node are clicked, and the xpath paths of the two clicked elements are grouped and merged to obtain the grouping information.
Step S11: entering any detail page grouping node in the detail page grouping node set, opening the page corresponding to the grouping node, and receiving the xpath path of the detail page click element.
As an exemplary embodiment, any detail page grouping node is entered and the page corresponding to the grouping node is opened; the opened page shows detailed content, and an element of this page is clicked to obtain the xpath path of the detail page click element. This is similar to step S5 and is not described in detail here. For example, the first detail page grouping node is entered, and the detail page is obtained after the grouping node is opened.
Step S12: extracting the text information corresponding to the detail page click element in a text extraction manner according to the xpath path of the detail page click element.
As an exemplary embodiment, after obtaining the xpath path of the click element of the detail page, the position of the click element in the page can be accurately located, and text information is extracted from the click element of the detail page in a text extraction manner to obtain text information corresponding to the click element of the detail page. Similar to step S6, the detailed description is omitted here.
In this embodiment, specifically, if the xpath path of the click element on the detail page is //div[@id='J_goodsList']/ul[contains(@class,'gl-warp')]/li[n]/div[contains(@class,'gl-i-wrap')]/div[2]/strong/i, the information collected at the collection position corresponding to the xpath path is "3949.00"; if the xpath path of the click element on the detail page is //div[@id='J_goodsList']/ul[contains(@class,'gl-warp')]/li[n]/div[contains(@class,'gl-i-wrap')]/div[4]/strong, the information collected at the collection position corresponding to the xpath path is "2999.0027 ten thousand + bar evaluation"; if the xpath path of the click element on the detail page is //div[@id='J_goodsList']/ul[contains(@class,'gl-warp')]/li[n]/div[contains(@class,'gl-i-wrap')]/div[3]/a/em, the information collected at the collection position corresponding to the xpath path is "Huawei tablet MatePad Pro 10.8 inch Kirin 990 video, entertainment, game, office, study, full screen computer 8GB +256 GB".
Step S13: performing grouped data collection on each detail page grouping node in the detail page grouping node set according to the xpath path of the detail page click element and the corresponding text information.
As an exemplary embodiment, since the xpath path of the element in the detail page grouping node carries the grouping mark, the xpath path of the detail page click element also carries the grouping mark correspondingly, so that which text information is in the same group can be accurately identified, and the data acquisition result is more accurate. The process of performing data grouping collection on each detail page grouping node in the detail page grouping node set is specifically similar to step S7, and is not described herein again. Specifically, the data is also collected in groups by entering each detail page grouping node in the detail page grouping node set.
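How a grouping mark keeps values of the same group together can be sketched as follows. The sketch substitutes n = 1, 2, ... for the [n] mark in every field path and gathers the values found for each n into one record. It uses the standard `xml.etree.ElementTree` (a subset of XPath only); the markup, the field names and the function `collect_grouped` are hypothetical sample material, not taken from the embodiment.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

GROUP_MARK = "[n]"  # grouping mark used in this embodiment


def collect_grouped(root: ET.Element, field_paths: Dict[str, str]) -> List[Dict[str, str]]:
    """Substitute n = 1, 2, ... for the grouping mark and collect one record per group."""
    records = []
    n = 1
    while True:
        record = {}
        for field, path in field_paths.items():
            node = root.find(path.replace(GROUP_MARK, "[%d]" % n))
            if node is not None:
                record[field] = "".join(node.itertext()).strip()
        if not record:  # no element found for this n: all groups are processed
            break
        records.append(record)
        n += 1
    return records


# Sample page with two grouping nodes (one li per product).
markup = (
    "<html><body><ul id='goods'>"
    "<li><div><strong>899.00</strong></div><div><a>360,000+</a></div></li>"
    "<li><div><strong>1099.00</strong></div><div><a>560,000+</a></div></li>"
    "</ul></body></html>"
)
field_paths = {
    "price": ".//ul[@id='goods']/li[n]/div[1]/strong",
    "reviews": ".//ul[@id='goods']/li[n]/div[2]/a",
}
print(collect_grouped(ET.fromstring(markup), field_paths))
```

Each record pairs a price with the review count from the same li, which is the correspondence that is lost when the page is collected without grouping.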
Step S14: acquiring data from each link in the list page link set according to the xpath path of the detail page grouping node click element.
As an exemplary embodiment, data is collected automatically in each list page link according to the xpath path of the detail page grouping node click element.
The step realizes the grouped data acquisition of the page with the detail page grouped nodes through the grouped acquisition mode, and improves the accuracy of data acquisition.
As an exemplary embodiment, the step of merging the xpath paths of the click elements of the list pages in the step S2 includes steps S201 to S206 as shown in fig. 3.
Step S201: comparing, one by one, the node types of the nodes at each level of the xpath paths of the list page click elements.
As an exemplary embodiment, the node types of the nodes at each level in the xpath paths are compared separately. If the node types of the nodes at every level are the same, the node type comparison results are consistent; if the node types differ at one or more levels, the node type comparison results are inconsistent.
Specifically, the node types include a root node, an element node, a text node, an attribute node, and the like, which are only schematically illustrated in the present embodiment and are not limited thereto.
Step S202: if the node type comparison results are not consistent, terminating the merging.
Step S203: if the node type comparison results are consistent, comparing the node names of the nodes at each level one by one.
As an exemplary embodiment, if the node type comparison results are consistent, it is further determined whether the node names of the nodes at each level are consistent. If the node name comparison result is consistent, executing step S204; if the node name comparison results are not consistent, step S205 is performed.
Specifically, the node names may include span, div, meta, link, title, etc., which are only schematically illustrated in this embodiment and are not limited thereto.
Step S204: if the node name comparison results are consistent, terminating the merging.
Step S205: if the node name comparison results are not consistent, searching downwards from the parent node for the first leaf node that has the same element name but different positions, and removing the position information of that leaf node to obtain the list page merging path. With the position information removed, the list page merging path can subsequently be used to find the lists that have the same path as the merging path; these lists form the list page link set, so the lists in the list page do not need to be clicked one by one to obtain their xpath paths.
In the steps, the list page link set is generated through the list page merging path, the lists in the link set all have the same path, and the automatic identification of the lists with the same path is realized.
As an exemplary embodiment, the step of extracting the text information corresponding to the click element in step S12 or step S6 includes steps S1111-S1116 as shown in fig. 4.
Step S1111: if the element name (nodeName) of the click element is INPUT and the type of the INPUT is button, reset or submit, the text information is the value corresponding to the click element. Specifically, for example, in HTML 4.01 the types of the INPUT element are text, password, checkbox, radio, submit, reset, file, hidden, image and button; when the nodeName of an element is INPUT and its type is button, reset or submit, the value is obtained.
Step S1112: if the element name of the click element is LABEL, the text information is the textContent corresponding to the click element. Specifically, if the nodeName of the element is LABEL, the textContent is acquired.
Step S1113: if the element name of the click element is SELECT, the text information is the name corresponding to the click element. Specifically, if the nodeName of the element is SELECT, the name is acquired.
Step S1114: if the element name of the click element is not INPUT, LABEL or SELECT, judging whether the innerText corresponding to the click element is empty. If it is not empty, step S1115 is performed; if it is empty, step S1116 is performed.
Step S1115: if the innerText corresponding to the click element is not empty, the text information is the innerText corresponding to the click element. Specifically, if the text value corresponding to the click element is not null, the innerText is obtained.
As an exemplary embodiment, if the text value (innerText) corresponding to the click element is not null, the click element contains text, and the innerText corresponding to the click element is obtained.
Step S1116: if the innerText corresponding to the click element is empty, the text information is the innerText of the parent node of the click element.
As an exemplary embodiment, if the innerText corresponding to the click element is null, the click element contains no text and the innerText of the parent node needs to be checked, so the innerText of the parent node is used as the innerText corresponding to the click element.
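The decision order of steps S1111-S1116 can be sketched as follows. The `Node` class is a hypothetical stand-in for a DOM element that exposes only the properties named in the text; how an INPUT element of any other type is handled is not specified above, so the sketch simply lets it fall through to the innerText branch, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element (not a real browser API)."""
    node_name: str
    type: str = ""
    value: str = ""
    name: str = ""
    text_content: str = ""
    inner_text: str = ""
    parent: Optional["Node"] = None


def extract_text(node: Node) -> str:
    tag = node.node_name.upper()
    # S1111: INPUT of type button, reset or submit -> value
    if tag == "INPUT" and node.type.lower() in {"button", "reset", "submit"}:
        return node.value
    # S1112: LABEL -> textContent
    if tag == "LABEL":
        return node.text_content
    # S1113: SELECT -> name
    if tag == "SELECT":
        return node.name
    # S1114/S1115: any other element -> its own innerText when not empty
    # (assumption: INPUT elements of other types also end up here)
    if node.inner_text:
        return node.inner_text
    # S1116: otherwise the innerText of the parent node
    return node.parent.inner_text if node.parent is not None else ""


author = Node(node_name="LABEL", text_content="snow lotus")
print(extract_text(author))
```

For the LABEL element of the earlier example, the sketch returns its textContent, "snow lotus".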
As an exemplary embodiment, the step S9 is a step of obtaining an xpath path of all elements similar to the click element of the detail page grouping node in the detail page grouping node according to the xpath path of the click element of the detail page grouping node, as shown in fig. 5, including steps S901 to S907.
Step S901: receiving the detail page grouping node click element, and obtaining the current node from the detail page grouping node click element.
As an exemplary embodiment, clicking operation is performed on elements in the detail page grouping node, an xpath path of the clicked elements of the detail page grouping node is obtained, and a current node of the clicked elements is obtained through the xpath path.
Step S902: acquiring the parent node of the current node, and traversing the child nodes under the parent node in sequence.
As an exemplary embodiment, the parent node of the current node is traced upward according to the xpath path, and all child nodes under the parent node are traversed, that is, the sibling nodes of the current node are searched.
Step S903: judging whether the node type of the child node is an element. If the node type of the child node is an element, step S904 is performed.
As an exemplary embodiment, the node type of the Dom element may include an element node, a property node, a text node, a document node, a comment node, and the like. In this embodiment, other nodes besides the element node are auxiliary. Specifically, the node type may be obtained by the node name. In this embodiment, html5 is used to name nodes, and node names included in different node types can refer to the specification of html5, so that whether the node types are element nodes can be known according to the node names in xpath.
Step S904: if the node type of the child node is an element, judging whether the node name of the child node is the same as the node name of the current node. If the node name of the child node is the same as the node name of the current node, step S905 is performed.
As an exemplary embodiment, when the node type of the child node is an element, it is further determined whether the node name of the child node is consistent with the node name of the current node.
Step S905: if the node names are the same, counting the position information cumulatively to obtain an accumulated count value.
As an exemplary embodiment, if the node names are the same, the child node has the same parent node as the current node, that is, the child node and the current node are siblings, and the position information is counted cumulatively.
Step S906: if the accumulated count value is greater than 0, the xpath path of the child node uses the location information according to the preset setting.
As an exemplary embodiment, the priority order of the xpath selector in general browser extraction tools is fixed as id, position, class. Since the data needs to be grouped, the xpath selector is set to use the position information preferentially, for subsequent grouping and merging.
Step S907: taking the parent node of the current node as the new current node, and repeating steps S902-S906 until the new current node is a body node or a node with an id attribute, so as to obtain the xpath paths of all elements similar to the detail page grouping node click element.
As an exemplary embodiment, since the body node or the node with the id attribute is uniquely determined, recursion thereto, the xpath path of all elements of the detail page grouping node is uniquely represented, facilitating subsequent grouping merger.
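The upward walk of steps S901-S907 can be sketched as follows with the standard `xml.etree.ElementTree`. The function `positional_xpath` is a hypothetical name, and the sketch interprets "accumulated count value greater than 0" as "the node has same-name siblings", which is an assumption. ElementTree yields only element children, so the node type check of step S903 is implicit here.

```python
import xml.etree.ElementTree as ET


def positional_xpath(root: ET.Element, target: ET.Element) -> str:
    """Build a position-based xpath for target, walking up to body or an id node."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while True:
        parent = parents.get(node)
        if parent is None:  # reached the document root
            steps.append(node.tag)
            break
        # S902-S905: count the siblings that share the current node's name.
        same_name = [child for child in parent if child.tag == node.tag]
        if len(same_name) > 1:
            # S906: position information is preferred over id and class.
            steps.append("%s[%d]" % (node.tag, same_name.index(node) + 1))
        else:
            steps.append(node.tag)
        # S907: continue with the parent until body or a node with an id.
        node = parent
        if node.tag == "body":
            steps.append("body")
            break
        if node.get("id"):
            steps.append("%s[@id='%s']" % (node.tag, node.get("id")))
            break
    return "//" + "/".join(reversed(steps))


markup = (
    "<html><body><div id='J_goodsList'><ul>"
    "<li><div><strong>899.00</strong></div></li>"
    "<li><div><strong>1099.00</strong></div></li>"
    "</ul></div></body></html>"
)
root = ET.fromstring(markup)
clicked = root.findall(".//strong")[1]  # simulate a click on the second price
print(positional_xpath(root, clicked))
```

For the simulated click, the resulting path keeps the position li[2], which is the information later replaced by the grouping mark.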
As an exemplary embodiment, the step S10 of grouping and merging xpath paths of click elements in any one detail page grouping node in the detail page grouping node set includes steps S1001-S1008 as shown in fig. 6.
Step S1001: comparing, one by one, the node names of the nodes at each level of the xpath paths of the detail page grouping node click elements. If the comparison results are consistent, step S1002 is executed; if the comparison results are not consistent, step S1003 is executed.
Step S1002: if the node name comparison results are consistent, retaining the xpath path of the detail page grouping node click element.
As an exemplary embodiment, the node name comparison results are consistent, and no merging is required.
Step S1003: if the node name comparison results are inconsistent, acquiring the element attribute information of the detail page grouping node click elements, wherein the element attribute information comprises position information, dynamic identification (ID) information and grouping information.
As an exemplary embodiment, if the node name comparison result is inconsistent, the element attribute information is further checked.
Step S1004: if the element attribute information is position information, replacing the position information in the xpath path of the detail page grouping node click element with the grouping mark.
Step S1005: if the element attribute information is dynamic identification (ID) information, removing the dynamic ID information from the xpath path of the detail page grouping node click element.
Step S1006: if the element attribute information is grouping information, judging whether the grouping information is consistent. If the grouping information is consistent, step S1007 is executed; if the grouping information is not consistent, step S1008 is executed.
Step S1007: if the grouping information is consistent, retaining the grouping information of the xpath path of the detail page grouping node click element. Consistent grouping information indicates that the elements belong to the same group, so the grouping information is retained directly without further processing.
Step S1008: if the grouping information is inconsistent, removing the grouping information from the xpath path of the detail page grouping node click element. Inconsistent grouping information indicates that the elements do not belong to the same group, so the grouping information needs to be removed.
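One possible reading of steps S1001-S1008 is sketched below: two paths are compared level by level, differing positions are replaced by the grouping mark [n], differing id values are treated as dynamic IDs and removed, and a grouping mark present in only one path is removed. The function names and the treatment of "inconsistent grouping information" are assumptions made for illustration; paths are assumed to contain no "/" inside predicates.

```python
import re
from typing import Optional

GROUP_MARK = "[n]"
POSITION = re.compile(r"\[\d+\]")            # numeric position such as [2]
ID_PREDICATE = re.compile(r"\[@id='[^']*'\]")  # id predicate such as [@id='x']


def merge_step(a: str, b: str) -> Optional[str]:
    """Merge one level of two paths; None means the levels cannot be merged."""
    if a == b:
        return a  # S1002 / S1007: identical steps, including marks, are retained
    if a.split("[", 1)[0] != b.split("[", 1)[0]:
        return None  # different element names
    # S1004: steps differ only in position -> replace position with the mark
    if POSITION.search(a) and POSITION.search(b):
        if POSITION.sub(GROUP_MARK, a) == POSITION.sub(GROUP_MARK, b):
            return POSITION.sub(GROUP_MARK, a)
    # S1005: steps differ only in id value -> treat as dynamic ID and remove it
    if ID_PREDICATE.search(a) and ID_PREDICATE.search(b):
        if ID_PREDICATE.sub("", a) == ID_PREDICATE.sub("", b):
            return ID_PREDICATE.sub("", a)
    # S1008 (assumed reading): the mark is present in only one step -> remove it
    if a.replace(GROUP_MARK, "") == b.replace(GROUP_MARK, ""):
        return a.replace(GROUP_MARK, "")
    return None


def merge_group_xpaths(first: str, second: str) -> Optional[str]:
    a = first.lstrip("/").split("/")
    b = second.lstrip("/").split("/")
    if len(a) != len(b):
        return None
    merged = [merge_step(sa, sb) for sa, sb in zip(a, b)]
    if any(step is None for step in merged):
        return None
    return "//" + "/".join(merged)


print(merge_group_xpaths(
    "//div[@id='J_goodsList']/ul/li[1]/div/div[2]/strong/i",
    "//div[@id='J_goodsList']/ul/li[2]/div/div[2]/strong/i",
))
```

For the two price paths, only the li level differs in position, so the merged path carries the grouping mark at li[n] and keeps div[2] unchanged.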
As an exemplary embodiment, the method further includes step S15.
Step S15: acquiring data of the detail page elements in a local built-in browser acquisition mode. Specifically, in this mode data is collected through a local browser, such as webview or Google Chrome.
As an exemplary embodiment, the method further includes step S16.
Step S16: acquiring data of the detail page elements in a headless acquisition mode. Specifically, headless collection is data collection using a headless browser.
As an exemplary embodiment, the method further includes step S17.
Step S17: acquiring data of the detail page elements in a cloud acquisition mode.
Browser containerization places the browser (Chrome, Firefox and the like) in a Docker container and starts and runs it there. Specifically, the cloud acquisition mode may collect data using Chrome with Docker.
The steps S15-S17 can be selected according to requirements in the actual data acquisition process, so that the flexibility of data acquisition is improved.
As an exemplary embodiment, as shown in FIG. 7, the method further includes steps S18-S19.
Step S18: receiving the injected CSS style.
In this embodiment, a custom extended rendering CSS style is injected into the browser, which improves the display of the path matching result.
Step S19: performing preset display on the matching result of the xpath path matching according to the CSS style; the preset display comprises that the hovering operation of the searching element is displayed in a first color, the clicking operation of the selecting element is displayed in a second color, and the first color is different from the second color.
In this embodiment, the CSS style includes hovering and clicking, where hovering refers to moving a mouse over an element; clicking refers to pressing the left button of the mouse on an element. Hovering for exploratory finding elements and clicking for explicit selection of element objects to be captured.
Specifically, when hovering operation is performed, the element border and the background color of the corresponding element are displayed as a first color; when clicking operation is carried out, the element border and the background color of the corresponding element are displayed as a second color; the display effect is more striking. Of course, in other embodiments, only the element border or background color is displayed as the corresponding color when hovering or clicking; the present embodiment is only illustrative, and not limited thereto.
Specifically, the first color may be blue, and the second color may be red; of course, in other embodiments, the color can be set as desired. The more striking the color setting, the better the visualization.
The above steps provide a visual display of the xpath path matching result and a human-computer interaction effect.
This is described in detail below with a specific example, as shown in fig. 9.
1-1) browser
Monitoring code and CSS styles are injected into a webview container to implement a simple browser function.
Injected monitoring code: used to monitor events triggered by the user's operations, to extract a custom xpath, to apply the CSS style that highlights the border of matched elements, and to automatically extract/merge element positioning information. The xpath can be customized as needed to change the selector priority or to use a full path or a relative path, and grouping information can be recorded.
2) Main processing
The main processing is driven by WebSocket messages and is responsible for communication between the user interaction interface and the browser; based on the element path information from 1-1) and the user's operation information, it automatically generates and invokes crawler code, pushes collection tasks to the 3) task management module, and retrieves the collection results.
The core function is as follows:
extracting xpath
JS monitoring code is injected into the browser to listen for events triggered by the user's operations, obtain the xpath of the element that generated the event, and inject the custom extended rendering CSS style.
There are two selection modes: hovering and clicking. Hovering means moving the mouse over an element and holding it still; clicking means pressing the left mouse button on an element. Hovering is used for exploratory searching, and clicking is used to explicitly select the element objects to be captured.
For example, when the title in the first row is clicked, the extracted xpath is //ul[@id='UI1']/li[1]/a/h3, and the background of the clicked element turns red. When the mouse hovers over an element, a custom CSS style is added to the element that triggered the hover event so that its background turns blue. The state of the hovered/selected element is thus displayed in real time, giving a flexible visual interaction effect.
When the title in the second row (//ul[@id='UI1']/li[2]/a/h3) is clicked, its xpath is automatically merged with the xpath of the first-row title into //ul[@id='UI1']/li/a/h3. The elements matched by the merged xpath are displayed visually in real time, which is equivalent to suggesting which other similar elements result from merging the two elements.
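The merge in this example can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, and it assumes both paths start with `//`, have the same number of steps, and contain no `/` inside predicates. Steps that differ only in a numeric position predicate are merged by dropping that predicate.

```python
import re
from typing import List, Optional

# Matches a purely numeric position predicate such as [1] or [2].
POSITION_RE = re.compile(r"\[\d+\]")


def split_steps(xpath: str) -> List[str]:
    """Split a path like //ul[@id='UI1']/li[1]/a/h3 into its steps.

    Assumption: no '/' occurs inside a predicate.
    """
    return [step for step in xpath.split("/") if step]


def strip_position(step: str) -> str:
    """Remove numeric position predicates from one step."""
    return POSITION_RE.sub("", step)


def merge_xpaths(first: str, second: str) -> Optional[str]:
    """Merge two xpaths level by level; return None if they cannot be merged."""
    steps_a, steps_b = split_steps(first), split_steps(second)
    if len(steps_a) != len(steps_b):
        return None
    merged = []
    for step_a, step_b in zip(steps_a, steps_b):
        if step_a == step_b:
            merged.append(step_a)  # identical step: keep as-is
        elif strip_position(step_a) == strip_position(step_b):
            merged.append(strip_position(step_a))  # same name, different position
        else:
            return None  # different node names: stop merging
    return "//" + "/".join(merged)


print(merge_xpaths("//ul[@id='UI1']/li[1]/a/h3", "//ul[@id='UI1']/li[2]/a/h3"))
```

With the two title paths from the example, the sketch yields the merged path without the position on `li`.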
The xpath of the clicked element and its text information are obtained so that the information to be collected can be customized.
For example, when the author "zhang xue lian" is clicked, the user interaction interface automatically obtains the xpath of the clicked element and the text information of the corresponding element.
Automatic grouping association
For example, from two (or more) elements within a group:
//div[contains(@class,'second2016_wrap') and contains(@class,'guoji_second_wrap')]/div[3]/div[4]/div[2]/div[1]/ul[contains(@class,'idx_cm_list') and contains(@class,'idx_cm_list_h')]/li/a
and
//div[contains(@class,'second2016_wrap') and contains(@class,'guoji_second_wrap')]/div[3]/div[3]/div[2]/ul/li[1]/a
the grouping information is extracted as:
//div[contains(@class,'second2016_wrap') and contains(@class,'guoji_second_wrap')]/div[3]
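One way to read this example is that the grouping information is the longest run of leading steps shared by the clicked elements' xpaths. The Python sketch below illustrates that reading only; `split_steps` and `common_group_path` are hypothetical names, and the text does not state that the method is implemented this way.

```python
from typing import List


def split_steps(xpath: str) -> List[str]:
    """Split an xpath into steps, ignoring any '/' inside [...] predicates."""
    steps, depth, current = [], 0, ""
    for ch in xpath:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            if current:
                steps.append(current)
            current = ""
        else:
            current += ch
    if current:
        steps.append(current)
    return steps


def common_group_path(xpaths: List[str]) -> str:
    """Return the longest shared leading steps of all given xpaths."""
    step_lists = [split_steps(x) for x in xpaths]
    prefix = []
    for level in zip(*step_lists):
        if all(step == level[0] for step in level):
            prefix.append(level[0])
        else:
            break  # first level where the paths diverge
    return "//" + "/".join(prefix) if prefix else ""


wrap = "div[contains(@class,'second2016_wrap') and contains(@class,'guoji_second_wrap')]"
first = ("//" + wrap + "/div[3]/div[4]/div[2]/div[1]"
         "/ul[contains(@class,'idx_cm_list') and contains(@class,'idx_cm_list_h')]/li/a")
second = "//" + wrap + "/div[3]/div[3]/div[2]/ul/li[1]/a"

print(common_group_path([first, second]))
```

For the two paths above, the shared prefix ends at the first `div[3]`, which matches the grouping information given in the example.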
In this embodiment, code and styles are injected into the browser so that the acquisition object can be defined visually as needed, and code is automatically generated and executed to obtain the result. This reduces the back-and-forth communication between users and developers and the tedious coding work, lowering the difficulty and cost of browser data acquisition.
In addition, in the embodiment, a clustered containerized browser (third party) is adopted, so that a plurality of browsers can be started simultaneously and a plurality of tasks can be collected concurrently.
When a program fully takes over operation of the computer's browser, a person must not disturb the running control program and therefore cannot use the mouse or keyboard. The invention therefore uses the "headless" mode of modern browsers, in which the browser is not actually displayed and runs only in the machine's memory, which increases practicality.
The present embodiment further provides an xpath-based data acquisition system for implementing the foregoing embodiments and preferred embodiments; descriptions already given are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
The present embodiment further provides an xpath-based data acquisition system, as shown in fig. 10, including: a first processing module 1, a second processing module 2, a third processing module 3, a fourth processing module 4, a fifth processing module 5, a sixth processing module 6 and a seventh processing module 7.
The first processing module 1 is configured to receive at least two list page click events in a list page and obtain the xpath paths of the list page click elements corresponding to the list page click events; for details, refer to step S1.
The second processing module 2 is configured to merge the xpath paths of the list page click elements to obtain a list page merging path, where the list page merging path is the xpath path obtained by searching downward from the parent node for the first leaf node whose element name is the same but whose position differs, and removing the position information of that leaf node; for details, refer to step S2.
The third processing module 3 is configured to search the list page for a list whose path is the same as the list page merging path, and generate a list page link set; for details, refer to step S3.
The fourth processing module 4 is configured to enter any link in the list page link set, open the corresponding detail page, and determine an acquisition mode according to the page acquisition object of the detail page, where the acquisition mode is preset and comprises grouping acquisition and non-grouping acquisition; for details, refer to step S4.
The fifth processing module 5 is configured to, if the acquisition mode is the non-grouping mode, in which case the detail page is a single detail page, receive an xpath path of a detail page click element; for details, refer to step S5.
The sixth processing module 6 is configured to extract, according to the xpath path of the detail page click element, the text information corresponding to the detail page click element in a text extraction manner; for details, refer to step S6.
The seventh processing module 7 is configured to perform data acquisition on each link in the list page link set according to the xpath path of the detail page click element and the corresponding text information; for details, refer to step S7.
As an exemplary embodiment, the system further comprises: an eighth processing module, configured to, if the acquisition mode is the grouping mode, determine that the detail page includes a plurality of detail page grouping nodes and receive an xpath path of a detail page grouping node click element (for details, refer to step S8); a ninth processing module, configured to obtain, according to the xpath path of the detail page grouping node click element, the xpath paths of all elements in the detail page grouping nodes that are similar to the detail page grouping node click element, and generate a detail page grouping node set (refer to step S9); a tenth processing module, configured to group and merge the xpath paths of the click elements in any detail page grouping node in the detail page grouping node set to obtain a detail page grouping node xpath path, where the detail page grouping node xpath path includes a grouping flag used for locating the detail page grouping node, and the grouping flag is located immediately after the position of the parent node element of the click element within the group (refer to step S10); an eleventh processing module, configured to enter any detail page grouping node in the detail page grouping node set, open the corresponding grouping node, and receive an xpath path of a detail page click element (refer to step S11); a twelfth processing module, configured to extract, according to the xpath path of the detail page click element, the text information corresponding to the detail page click element in a text extraction manner (refer to step S12); a thirteenth processing module, configured to perform grouped data collection on each detail page grouping node in the detail page grouping node set according to the xpath path of the detail page click element and the corresponding text information (refer to step S13); and a fourteenth processing module, configured to perform data acquisition on each link in the list page link set according to the xpath path of the detail page grouping node click element (refer to step S14).
As an exemplary embodiment, the second processing module includes: the first processing unit is configured to compare node types of nodes at each level of an xpath path of a list page click element one by one, and refer to the detailed content in step S201; a second processing unit, configured to terminate merging if the node type comparison results are inconsistent, where the detailed content refers to that in step S202; a third processing unit, configured to compare node names of nodes at each level one by one if the node type comparison results are consistent, and refer to step S203 for details; a fourth processing unit, configured to terminate merging if the node name comparison results are consistent, and refer to step S204 for detailed content; a fifth processing unit, configured to, if the node name comparison results are inconsistent, find a leaf node with the same first element name and a different position from the parent node, remove the position information of the leaf node, and obtain a list page merge path, where the detailed content refers to step S205.
As an exemplary embodiment, the system includes: a sixth processing unit, configured to, if the element name of the click element is INPUT and the type of the INPUT is button, reset, or submit, determine that the text information is the value corresponding to the click element (for details, refer to step S1111); a seventh processing unit, configured to, if the element name of the click element is TABLE, determine that the text information is the textContent corresponding to the click element (refer to step S1112); an eighth processing unit, configured to, if the element name of the click element is SELECT, determine that the text information is the name corresponding to the click element (refer to step S1113); a ninth processing unit, configured to, if the element name of the click element is not INPUT, TABLE, or SELECT, determine whether the innerText corresponding to the click element is empty (refer to step S1114); a tenth processing unit, configured to, if the innerText corresponding to the click element is not empty, determine that the text information is the innerText corresponding to the click element (refer to step S1115); and an eleventh processing unit, configured to, if the innerText corresponding to the click element is empty, determine that the text information is the innerText of the parent node of the click element (refer to step S1116).
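The text extraction rules of steps S1111-S1116 can be sketched in Python as follows. This is an illustration only: `Element` is a hypothetical stand-in for a browser DOM element, the second element name is taken to be TABLE (an assumption), an INPUT of any other type is assumed to fall through to the innerText rule, and an element without a parent is assumed to yield an empty string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element; fields mirror DOM properties."""
    tag_name: str
    input_type: str = ""
    value: str = ""
    name: str = ""
    text_content: str = ""
    inner_text: str = ""
    parent: Optional["Element"] = None


def extract_text(el: Element) -> str:
    """Pick the text information for a clicked element."""
    tag = el.tag_name.upper()
    if tag == "INPUT" and el.input_type.lower() in {"button", "reset", "submit"}:
        return el.value            # S1111: button-like inputs use their value
    if tag == "TABLE":
        return el.text_content     # S1112: tables use textContent
    if tag == "SELECT":
        return el.name             # S1113: selects use their name
    if el.inner_text:
        return el.inner_text       # S1114/S1115: non-empty innerText
    # S1116: fall back to the parent's innerText (empty string if no parent,
    # which is an assumption of this sketch)
    return el.parent.inner_text if el.parent is not None else ""


author_cell = Element("td", inner_text="zhang xue lian")
empty_icon = Element("i", parent=author_cell)
print(extract_text(empty_icon))
```

Here the empty `i` element has no innerText of its own, so the sketch returns the text of its parent.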
As an exemplary embodiment, the ninth processing module includes: a twelfth processing unit, configured to receive the detail page grouping node click element and obtain a current node according to it (for details, refer to step S901); a thirteenth processing unit, configured to obtain the parent node of the current node and sequentially traverse the child nodes under the parent node (refer to step S902); a fourteenth processing unit, configured to determine whether the node type of a child node is an element (refer to step S903); a fifteenth processing unit, configured to, if the node type of the child node is an element, determine whether the node name of the child node is the same as the node name of the current node (refer to step S904); a sixteenth processing unit, configured to, if the node names are the same, perform cumulative counting on the position information to obtain a cumulative count value (refer to step S905); a seventeenth processing unit, configured to, if the cumulative count value is greater than 0, make the xpath path of the child node use the position information, as preset (refer to step S906); and an eighteenth processing unit, configured to take the parent node of the current node as a new current node and recurse until the new current node is a body node or a node with an id attribute, so as to obtain the xpath paths of all elements similar to the detail page grouping node click element (refer to step S907).
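The upward walk of steps S901-S907 resembles the common technique of building an xpath from a node by counting same-named siblings. The Python sketch below illustrates that technique with the standard library's `xml.etree.ElementTree`; it is not taken from the patent. The function names are hypothetical, ElementTree children are always elements (so the node type check of S903 is implicit), and the sketch interprets the count rule as "add a position only when other same-named siblings exist".

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parent_map(root: ET.Element) -> Dict[ET.Element, ET.Element]:
    """ElementTree has no parent pointers, so build a child -> parent map."""
    return {child: parent for parent in root.iter() for child in parent}


def build_xpath(node: ET.Element, parents: Dict[ET.Element, ET.Element]) -> str:
    """Walk up from node until a node with an id, the body, or the root."""
    steps: List[str] = []
    current = node
    while current is not None:
        node_id = current.get("id")
        if node_id:
            steps.append(f"{current.tag}[@id='{node_id}']")  # stop at id node
            break
        parent = parents.get(current)
        if current.tag == "body" or parent is None:
            steps.append(current.tag)  # stop at body (or at the root)
            break
        # Count siblings that share the current node's name.
        same_name = [child for child in parent if child.tag == current.tag]
        if len(same_name) > 1:
            index = 1 + next(i for i, c in enumerate(same_name) if c is current)
            steps.append(f"{current.tag}[{index}]")
        else:
            steps.append(current.tag)
        current = parent
    return "//" + "/".join(reversed(steps))


page = ET.fromstring(
    "<html><body><ul id='UI1'>"
    "<li><a><h3>first</h3></a></li>"
    "<li><a><h3>second</h3></a></li>"
    "</ul></body></html>"
)
parents = parent_map(page)
titles = page.findall(".//h3")
print(build_xpath(titles[1], parents))
```

For the second title in this small page, the walk stops at the `ul` carrying the id, giving a path of the same shape as the one in the example of fig. 9.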
As an exemplary embodiment, the tenth processing module includes: a nineteenth processing unit, configured to compare, one by one, the node names of the nodes at each level of the xpath paths of the detail page grouping node click elements (for details, refer to step S1001); a twentieth processing unit, configured to, if the node name comparison results are consistent, retain the xpath path of the detail page grouping node click element (refer to step S1002); a twenty-first processing unit, configured to, if the node name comparison results are inconsistent, obtain element attribute information of the detail page grouping node click element, where the element attribute information includes position information, dynamic identification (ID) information, and grouping information (refer to step S1003); a twenty-second processing unit, configured to, if the element attribute information is position information, replace the position information in the xpath path of the detail page grouping node click element with a grouping flag (refer to step S1004); a twenty-third processing unit, configured to, if the element attribute information is dynamic ID information, remove the dynamic ID information from the xpath path of the detail page grouping node click element (refer to step S1005); a twenty-fourth processing unit, configured to, if the element attribute information is grouping information, determine whether the grouping information is consistent (refer to step S1006); a twenty-fifth processing unit, configured to, if the grouping information is consistent, retain the grouping information in the xpath path of the detail page grouping node click element (refer to step S1007); and a twenty-sixth processing unit, configured to, if the grouping information is inconsistent, remove the grouping information from the xpath path of the detail page grouping node click element (refer to step S1008).
As an exemplary embodiment, the system further comprises: a fifteenth processing module, configured to perform data acquisition on the detail page elements using the local built-in browser acquisition mode (for details, refer to step S15); or a sixteenth processing module, configured to perform data acquisition on the detail page elements in a headless acquisition mode (refer to step S16); or a seventeenth processing module, configured to perform data acquisition on the detail page elements in a cloud acquisition mode (refer to step S17).
As an exemplary embodiment, the system further comprises: an eighteenth processing module, configured to receive the injected CSS style (for details, refer to step S18); and a nineteenth processing module, configured to display the result of the xpath path matching in a preset manner according to the CSS style, where the preset display comprises displaying a hovering operation, used to search for an element, in a first color, and a clicking operation, used to select an element, in a second color, the first color being different from the second color (refer to step S19).
The xpath-based data acquisition system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, as shown in fig. 11, where the electronic device includes one or more processors 141 and a memory 142, and one processor 141 is taken as an example in fig. 11.
The electronic device may further include: an input device 143 and an output device 144.
The processor 141, the memory 142, the input device 143, and the output device 144 may be connected by a bus or other means, and the bus connection is exemplified in fig. 11.
Processor 141 may be a Central Processing Unit (CPU). The Processor 141 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 142, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the xpath-based data collection method in the embodiments of the present application. The processor 141 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 142, that is, implements the xpath-based data acquisition method of the above method embodiment.
The memory 142 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 142 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 142 optionally includes memory located remotely from processor 141, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 143 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 144 may include a display device such as a display screen.
One or more modules are stored in memory 142 and, when executed by the one or more processors 141, perform the methods illustrated in fig. 1-7.
It will be understood by those skilled in the art that all or part of the processes of the method according to the above embodiments may be implemented by instructing relevant hardware through a computer program, and the executed program may be stored in a computer-readable storage medium, and when executed, may include the processes of the above embodiments of the xpath-based data acquisition method. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (7)

1. An xpath-based data acquisition method is characterized by comprising the following steps:
receiving at least two list page click events in list pages to obtain an xpath path of a list page click element corresponding to the list page click events;
merging the xpath paths of the list page click elements to obtain a list page merging path, wherein the list page merging path is the xpath path obtained by searching downward from the parent node for the first leaf node whose element name is the same but whose position differs, and removing the position information of that leaf node;
searching a list with the same path as the merging path of the list page in the list page, and generating a list page link set;
entering any link in the list page link set, opening a corresponding detail page, and determining an acquisition mode according to a page acquisition object of the detail page, wherein the acquisition mode is preset, and the acquisition mode is a grouping mode, so that the detail page comprises a plurality of detail page grouping nodes, and an xpath path of a detail page grouping node clicking element is received;
according to the xpath path of the detail page grouping node clicking elements, obtaining the xpath paths of all elements similar to the detail page grouping node clicking elements in the detail page grouping nodes, and generating a detail page grouping node set;
grouping and merging xpath paths of click elements in any detail page grouping node in a detail page grouping node set to obtain a detail page grouping node xpath path, wherein the detail page grouping node xpath path comprises a grouping mark for positioning the detail page grouping node, and the grouping mark is positioned behind the position of a father node element of the click element in a group and is adjacent to the father node element;
entering any detail page grouping node in the detail page grouping node set, opening the corresponding grouping node, and receiving an xpath path of a detail page clicking element;
extracting text information corresponding to the detail page click element in a text extraction mode according to an xpath path of the detail page click element;
respectively carrying out data grouping collection on each detail page grouping node in the detail page grouping node set according to the xpath path of the detail page clicking element and the corresponding text information;
and respectively acquiring data of each link in the list page link set according to the xpath path of the detail page grouping node clicking element.
2. The xpath-based data acquisition method according to claim 1, wherein the step of obtaining the xpath paths of all elements similar to the click element of the detail page grouping node in the detail page grouping node according to the xpath path of the click element of the detail page grouping node comprises:
S901: receiving a detail page grouping node clicking element, and obtaining a current node according to the detail page grouping node clicking element;
S902: acquiring the parent node of the current node, and traversing the child nodes under the parent node in sequence;
S903: judging whether the node type of a child node is an element;
S904: if the node type of the child node is an element, judging whether the node name of the child node is the same as the node name of the current node;
S905: if the node names are the same, accumulating and counting the position information to obtain an accumulated count value;
S906: if the accumulated count value is greater than 0, using the position information in the xpath path of the child node, as preset;
S907: taking the parent node of the current node as a new current node, and repeating steps S902-S906 until the new current node is a body node or a node with an id attribute, so as to obtain the xpath paths of all elements similar to the detail page grouping node clicking element.
3. The xpath-based data acquisition method according to claim 1, wherein the step of grouping and merging the xpath paths of the click elements in any one of the detail page grouping nodes in the detail page grouping node set comprises:
comparing node names of nodes at each level of an xpath path of the detail page grouping node clicking elements one by one;
if the node name comparison results are consistent, reserving an xpath path of the detail page grouping node clicking element;
if the node name comparison results are inconsistent, acquiring element attribute information of a click element of a detail page grouping node, wherein the element attribute information comprises position information, dynamic Identification (ID) information and grouping information;
if the element attribute information is position information, replacing the position information of the xpath path of the clicked element of the detail page grouping node with a grouping mark;
if the element attribute information is dynamic identification ID information, removing the dynamic identification ID information of the xpath path of the clicked element of the detail page grouping node;
if the element attribute information is grouping information, judging whether the grouping information is consistent;
if the grouping information is consistent, retaining the grouping information of the xpath path of the detail page grouping node clicking element;
and if the grouping information is inconsistent, removing the grouping information of the xpath path of the clicking element of the detail page grouping node.
4. The xpath-based data collection method of claim 1, further comprising:
acquiring data of detail page elements in a local self-contained browser acquisition mode;
or, data acquisition of the detail page elements is carried out in a headless acquisition mode;
or, data acquisition of the detail page elements is carried out in a cloud acquisition mode.
5. An xpath-based data acquisition method as claimed in any one of claims 1-3, further comprising:
receiving an injected CSS pattern;
performing preset display on the matching result of the xpath path matching according to the CSS style; the preset display comprises that the hovering operation of the searching element is displayed in a first color, the clicking operation of the selecting element is displayed in a second color, and the first color is different from the second color.
6. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the xpath-based data acquisition method of any of claims 1-5.
7. A computer-readable storage medium storing computer instructions for causing a computer to perform the xpath-based data collection method of any one of claims 1-5.
CN202011265720.2A 2020-11-13 2020-11-13 Data acquisition method based on xpath, electronic equipment and storage medium Active CN112099778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011265720.2A CN112099778B (en) 2020-11-13 2020-11-13 Data acquisition method based on xpath, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011265720.2A CN112099778B (en) 2020-11-13 2020-11-13 Data acquisition method based on xpath, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112099778A CN112099778A (en) 2020-12-18
CN112099778B true CN112099778B (en) 2021-02-02

Family

ID=73785217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011265720.2A Active CN112099778B (en) 2020-11-13 2020-11-13 Data acquisition method based on xpath, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112099778B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114153530A (en) * 2022-02-08 2022-03-08 广州庚亿信息科技有限公司 Element data information capturing method and device, storage medium and intelligent terminal
CN115017430A (en) * 2022-06-27 2022-09-06 京东科技控股股份有限公司 List page determination method and device, electronic equipment and storage medium
CN115328366B (en) * 2022-08-11 2023-09-19 北京智慧星光信息技术有限公司 Tens of millions of tree node searching and displaying method and system based on full path calculation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106610994A (en) * 2015-10-23 2017-05-03 北京国双科技有限公司 Method and device for counting click paths
CN107092670A (en) * 2017-04-11 2017-08-25 武汉大学 A kind of visual network crawler system and analysis method based on embedded browser
CN107633019A (en) * 2017-08-24 2018-01-26 阿里巴巴集团控股有限公司 A kind of page events acquisition method and device
CN111459365A (en) * 2020-04-03 2020-07-28 南方电网科学研究院有限责任公司 Method for managing user-defined consultation help application

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10110687B2 (en) * 2006-10-06 2018-10-23 International Business Machines Corporation Session based web usage reporter
CN104598462B (en) * 2013-10-30 2018-08-07 深圳市国信互联科技有限公司 Extract the method and device of structural data
CN107943838B (en) * 2017-10-30 2021-09-07 北京大数元科技发展有限公司 Method and system for automatically acquiring xpath generated crawler script

Also Published As

Publication number Publication date
CN112099778A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112099778B (en) Data acquisition method based on xpath, electronic equipment and storage medium
CN102831121B (en) Method and system for extracting webpage information
KR101402556B1 (en) Information processing apparatus, information processing method, and computer readable recording medium storing program
CN108920434A (en) General web page subject content extraction method and system
CN110515896B (en) Model resource management method, model file manufacturing method, device and system
CN106844458B (en) Method, computing device and storage medium for displaying online behavior track of user
CN107423322A (en) The display methods and device of the label nesting level of Webpage
CN106844635A (en) The edit methods and device of the element in webpage
CN106874339B (en) Display method of directed cyclic graph and application thereof
CN110738033B (en) Report template generation method, device and storage medium
CN102385621A (en) Method and system for implementing document index based on input method interface
CN103136358A (en) Method for automatically extracting BBS (bulletin board system) data
CN108804469A (en) Web page identification method and electronic equipment
CN114511353A (en) Data analysis method and device
CN108804472A (en) Webpage content extraction method, device and server
CN101446896A (en) MIB file editor
CN109657114A (en) A method of extracting webpage semi-structured data
CN113360603B (en) Contract similarity and compliance detection method and device
CN106372232A (en) Method and device for mining information based on artificial intelligence
CN112069305B (en) Data screening method and device and electronic equipment
Al-Msie'deen Tag clouds for object-oriented source code visualization
CN112765159A (en) Report generation method, system, computer equipment and storage medium
CN109656650B (en) New hand guiding manufacturing method for mixed language integration system
CN102314453B (en) The screening technique of quality version and system
CN110489686A (en) Data analysis method, device and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant