CN111949903B - Webpage data acquisition method, device, equipment and readable storage medium - Google Patents

Info

Publication number: CN111949903B
Authority: CN (China)
Prior art keywords: dom, target, node, webpage, determining
Legal status: Active (as listed by Google Patents; an assumption, not a legal conclusion)
Application number: CN202010886714.2A
Other languages: Chinese (zh)
Other versions: CN111949903A (en)
Inventors: 田玉秋, 范渊
Current Assignee: DBAPPSecurity Co Ltd (as listed by Google Patents; may be inaccurate)
Original Assignee: DBAPPSecurity Co Ltd
Application filed by: DBAPPSecurity Co Ltd
Priority: CN202010886714.2A
Publications: CN111949903A (application), CN111949903B (grant)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a web page data acquisition method, device, equipment, and computer-readable storage medium. The method comprises: collecting the DOM events corresponding to each DOM node of a target web page; determining a target DOM node among the DOM nodes according to marking information corresponding to the target web page; triggering the DOM event of the target DOM node to obtain web page data, and judging whether a front-end route is generated; and, if a front-end route is generated, jumping back to the target web page and updating the marking information. Because the method jumps back to the target web page when a front-end route is generated, updates the marking information, and determines the target DOM node according to that information, it avoids the dead loop that arises when, after a page jump re-renders the DOM tree, it can no longer be determined which DOM nodes have already had their DOM events triggered. Web page data of the target web page can therefore continue to be collected, which improves the completeness of web page data collection.

Description

Webpage data acquisition method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of web page data collection technologies, and in particular, to a web page data collection method, a web page data collection device, a web page data collection apparatus, and a computer readable storage medium.
Background
With the rapid development of internet technology, the internet has become a carrier of a large amount of information, from which much useful data can be extracted. The collected data can be used for network security detection, so the accuracy of that detection depends directly on how much data is collected: incomplete collection of web page data means that web page security vulnerabilities are not fully discovered, which leaves serious security risks. With the development of website technology, the separation of front end and back end has become a standard practice in internet project development. This approach uses front-end routing to reduce page refreshes, so when the related art collects web page data it easily triggers a front-end route, which causes a page jump and leaves the data of the current web page incompletely collected.
Therefore, how to solve the problem of incomplete web page data collection in the related art is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the foregoing, an object of the present application is to provide a web page data collecting method, a web page data collecting device, a web page data collecting apparatus and a computer readable storage medium, which solve the problem of incomplete web page data collection in the related art.
In order to solve the above technical problems, the present application provides a method for collecting web page data, including:
collecting DOM events corresponding to each DOM node of the target webpage respectively;
determining a target DOM node in the DOM nodes according to the marking information corresponding to the target webpage;
triggering the DOM event of the target DOM node to obtain webpage data, and judging whether a front-end route is generated or not;
if the front-end route is generated, jumping back to the target webpage and updating the marking information.
Optionally, before the triggering the DOM event of the target DOM node to obtain web page data, the method further comprises:
generating a first backup corresponding to the DOM tree;
correspondingly, after the triggering the DOM event of the target DOM node to obtain web page data, the method further comprises:
generating a second backup corresponding to the DOM tree;
judging whether the first backup and the second backup are the same or not;
if the first backup and the second backup are different, determining a newly added DOM node according to the first backup and the second backup, and updating the DOM tree according to the newly added DOM node.
Optionally, before the collecting the DOM events corresponding to the DOM nodes of the target web page, the method further comprises:
acquiring webpage information of a target webpage;
and performing simulated loading and rendering processing according to the webpage information to obtain the target webpage and target webpage data.
Optionally, the determining whether front-end routing is generated includes:
judging whether the URL anchor point value has changed;
if the URL anchor point value has changed, determining that the front-end route is generated;
if the URL anchor point value has not changed, determining that the front-end route is not generated.
Optionally, the determining whether front-end routing is generated includes:
judging whether the browser history stack is pushed or popped;
if the push or the pop occurs, determining that the front-end route is generated;
if no push or pop occurs, determining that the front-end route is not generated.
Optionally, the updating the marking information includes:
marking the target DOM node in the DOM tree.
Optionally, the determining the target DOM node in the DOM nodes according to the marking information corresponding to the target web page includes:
judging whether historical information exists or not;
if the history information exists, determining the target DOM node according to the history information and a preset sequence;
if the history information does not exist, traversing the DOM tree according to the preset sequence, and determining the latest marked DOM node in the DOM tree;
and determining the backup node of the latest marked DOM node, that is, the node that follows it in the preset sequence, as the target DOM node.
The application also provides a webpage data acquisition device, which comprises:
the collecting module is used for collecting DOM events corresponding to each DOM node of the target webpage respectively;
the target node determining module is used for determining a target DOM node in the DOM nodes according to the marking information corresponding to the target webpage;
the triggering module is used for triggering the DOM event of the target DOM node to obtain webpage data and judging whether a front-end route is generated or not;
and the jump module is used for jumping back to the target webpage and updating the marking information if the front-end route is generated.
The application also provides web page data acquisition equipment, which comprises a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the above-mentioned method for collecting web page data.
The application also provides a computer readable storage medium for storing a computer program, wherein the computer program realizes the webpage data acquisition method when being executed by a processor.
According to the web page data acquisition method provided by the application, the DOM events corresponding to each DOM node of the target web page are collected; a target DOM node is determined among the DOM nodes according to the marking information corresponding to the target web page; the DOM event of the target DOM node is triggered to obtain web page data, and it is judged whether a front-end route is generated; if a front-end route is generated, the method jumps back to the target web page and updates the marking information.
Thus, after collecting the DOM events corresponding to the DOM nodes, the method determines the target DOM node among the DOM nodes according to the marking information corresponding to the target web page. The marking information of the target web page indicates whether each DOM node is a front-end routing node. A page jump re-renders the DOM tree, so the target DOM node must be determined according to the marking information; otherwise it cannot be determined which DOM nodes have already had their DOM events triggered, the DOM events of all DOM nodes must be triggered again, and the process falls into a dead loop. After the target DOM node is determined, its DOM event is triggered, web page data is collected, and it is judged whether a front-end route is generated. If a front-end route is generated, the marking information is updated so that the target DOM node can be re-determined after the jump, and the method jumps back to the target web page so that collection of its web page data can continue. Jumping back to the target web page and updating the marking information whenever a front-end route is generated avoids the dead loop described above and thereby solves the problem of incomplete web page data collection in the related art.
In addition, the application also provides a web page data acquisition device, web page data acquisition equipment, and a computer-readable storage medium, which have the same beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flowchart of a web page data collection method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a specific web page data collection method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a web page data acquisition device according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of web page data acquisition equipment according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to FIG. 1, FIG. 1 is a flowchart of a web page data collection method according to an embodiment of the present application. The method comprises the following steps:
S101: Collect the DOM events corresponding to each DOM node of the target web page.
DOM stands for Document Object Model, a World Wide Web Consortium (W3C) standard and a common way of representing and processing HTML (Hyper Text Markup Language) or XML (Extensible Markup Language) documents. The target web page includes a plurality of DOM nodes, which together constitute a DOM tree. Each DOM node may have corresponding DOM events; the specific type and number of DOM events corresponding to each DOM node are not limited and may be, for example, DOM0 events, DOM2 events, inline events, and the like. A DOM event may specifically be a mouse click event, a mouse move event, an event on an a tag, or the like. It should be noted that some DOM nodes may have no DOM events. After the target web page is loaded and rendered, its DOM tree can be determined, and from it all DOM nodes of the target web page.
This embodiment does not limit the specific method of collecting DOM events. Different types of DOM events may be collected in different ways, and the collection method for each type may be set according to actual conditions. For example, an inline event writes the handler triggered by the event directly into the HTML, so keywords can be preset and used to perform keyword extraction on the DOM tree to obtain the corresponding inline events, for example <a onclick="alert('test')">. A DOM0 event is the traditional way of specifying an event handler in JavaScript, namely assigning a function to an event handler attribute; it is widely supported by browsers and simple to use. For example:
let btn = document.getElementById("test");
// DOM0: assign the handler function directly to the element's onclick property.
btn.onclick = function () {
  console.log("test");
};
The DOM2 event works by registering a listening event in the addEventListener manner, for example:
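let btn = document.getElementById("test");
// DOM2: register a listener with addEventListener; the third argument (true)
// registers it for the capture phase.
btn.addEventListener("click", function () {
  console.log("test");
}, true);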
For DOM0 events and DOM2 events, a monitoring or interception program can be set up to detect them; each DOM0 or DOM2 event is recorded when it is triggered, which yields the corresponding DOM events.
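The collection ideas above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the collection program of the application: the helper names extractInlineEvents and instrumentAddEventListener are hypothetical, the regular expression is a simple keyword match that may over-match attributes such as "once", and the demo uses a stand-in object instead of a real browser prototype (in a browser, EventTarget.prototype would be the natural object to wrap).

```javascript
// Hypothetical helper: find inline handlers (on* attributes) in an HTML string.
function extractInlineEvents(html) {
  const found = [];
  const pattern = /\s(on[a-z]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    const handler = match[2] !== undefined ? match[2] : match[3];
    found.push({ event: match[1].toLowerCase(), handler });
  }
  return found;
}

// Hypothetical helper: wrap addEventListener on an object so that every
// DOM2 registration is recorded before being passed to the original method.
function instrumentAddEventListener(target, log) {
  const original = target.addEventListener;
  target.addEventListener = function (type, listener, options) {
    log.push({ node: this, type });
    return original.call(this, type, listener, options);
  };
  // Return a function that restores the original method.
  return function restore() {
    target.addEventListener = original;
  };
}

// Demo without a browser: a stand-in prototype that stores listeners.
const fakeNodeProto = {
  addEventListener(type, listener) {
    this.listeners = this.listeners || {};
    this.listeners[type] = listener;
  },
};
const registrations = [];
instrumentAddEventListener(fakeNodeProto, registrations);

const btn = Object.create(fakeNodeProto);
btn.id = "test";
btn.addEventListener("click", function () { console.log("test"); }, true);

console.log(extractInlineEvents('<a onclick="alert(\'test\')">link</a>'));
console.log(registrations.map((r) => `${r.node.id}:${r.type}`));
```

Wrapping the registration call records which nodes have listeners without changing page behaviour, because the original method is still invoked.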
In one embodiment, the target web page has not been rendered in advance, so it needs to be rendered before step S101. Specifically, this may include:
Step 11: Acquire the web page information of the target web page.
Step 12: Perform simulated loading and rendering according to the web page information to obtain the target web page and target web page data.
The web page information identifies the target web page and may specifically be the URL corresponding to the target web page. A URL (Uniform Resource Locator), also called a web address, is the address of a standard resource on the internet. The web page information may also take other forms, as long as it can be used to load and render the web page. It may be entered manually by a user or sent by another device or terminal. After the web page information is obtained, simulated loading and rendering can be performed according to it to obtain the target web page. This embodiment does not limit the specific manner of simulated loading and rendering, and reference may be made to the related art, for example using PhantomJS, Puppeteer, or other tools. The target web page data is obtained during simulated loading and rendering, at the same time as the target web page itself. It is part of the web page data, but only a small part; most of the web page data must be obtained by triggering the DOM events on the DOM nodes. Collecting the target web page data as well makes the web page data more comprehensive.
S102: Determine a target DOM node among the DOM nodes according to the marking information corresponding to the target web page.
The marking information indicates whether a DOM node of the target web page is a front-end routing node, that is, whether triggering the DOM event of that node generates a front-end route. Triggering an event on a DOM node may generate a front-end route, and the web page jumps once the front-end route is generated. Because the DOM tree of the target web page is refreshed when the page jumps, even after jumping back to the original page it is no longer known which DOM nodes were triggered before the refresh. In that case the DOM events on every DOM node can only be triggered again, so the DOM event that generates the front-end route is triggered again, the page jumps again, and the process falls into a dead loop. To avoid this problem, the application uses the marking information to record whether a DOM node is a front-end routing node. When the target DOM node is determined, a node that is not a front-end routing node can be selected according to the marking information, which prevents DOM events that generate front-end routes from being triggered repeatedly. This embodiment does not limit the specific form of the marking information; it may be, for example, in text form, or it may be the DOM tree of the target web page. Nor does this embodiment limit the conditions under which determining the target DOM node stops. In one embodiment, it may stop when the detection of the DOM tree reaches the maximum recursion depth, whose specific value may be set according to the actual situation. In another embodiment, it may stop after all DOM nodes in the DOM tree have been triggered.
In one possible implementation, a DOM tree may be utilized as the markup information. The step S102 may include:
Step 21: Judge whether history information exists.
Step 22: If the history information exists, determine the target DOM node according to the history information and the preset sequence.
Step 23: If the history information does not exist, traverse the DOM tree according to the preset sequence and determine the latest marked DOM node in the DOM tree.
Step 24: Determine the backup node of the latest marked DOM node, that is, the node that follows it in the preset sequence, as the target DOM node.
It should be noted that the history information records the last triggered DOM node (triggering the DOM event corresponding to a DOM node is referred to as triggering that DOM node). When the target DOM node is determined, it is first judged whether history information exists. If it does, no page jump has occurred, so the target DOM node can be determined from the history information. The preset sequence is the triggering order of the DOM nodes, that is, the DOM events of the DOM nodes are triggered one after another in the preset sequence. Thus, when history information exists, the next DOM node to be triggered, namely the target DOM node, can be determined according to the preset sequence.
If no history information exists, a page jump has occurred and the DOM tree has been refreshed, so the DOM tree is traversed according to the preset sequence. Because DOM nodes are triggered in the preset sequence, the latest marked DOM node is the last marked DOM node encountered when traversing the DOM tree in that sequence. After the latest marked DOM node is determined, its backup node, that is, the node that follows it in the preset sequence, is determined. In other words, if the last triggered target DOM node was a front-end routing node, so that a page jump occurred before the current determination, the target DOM node is the node that follows the latest marked DOM node in the preset sequence.
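Steps 21 to 24 can be illustrated with a minimal sketch. It rests on several assumptions that the text does not fix: the marked DOM tree is modelled as plain objects with id, marked, and children fields; the preset sequence is taken to be a pre-order traversal; and the history information is simply the id of the last triggered node (null after a page jump). The function names presetOrder and nextTarget are hypothetical.

```javascript
// Flatten the tree into the preset trigger order (pre-order, as an assumption).
function presetOrder(root) {
  const order = [];
  (function visit(node) {
    order.push(node);
    (node.children || []).forEach(visit);
  })(root);
  return order;
}

// Pick the next node to trigger.
// history: id of the last triggered node, or null if a page jump cleared it.
function nextTarget(root, history) {
  const order = presetOrder(root);
  let index = -1;
  if (history !== null) {
    // Step 22: continue after the node recorded in the history information.
    index = order.findIndex((n) => n.id === history);
  } else {
    // Step 23: find the latest marked node in the preset order.
    order.forEach((n, i) => {
      if (n.marked) index = i;
    });
  }
  // Step 24: the node that follows in the preset order, or null at the end.
  return order[index + 1] || null;
}

const tree = {
  id: "body",
  marked: false,
  children: [
    { id: "nav", marked: true, children: [{ id: "link", marked: true, children: [] }] },
    { id: "btn", marked: false, children: [] },
  ],
};

console.log(nextTarget(tree, "nav").id); // prints "link"
console.log(nextTarget(tree, null).id);  // prints "btn"
```

With no history and no marked node, the sketch returns the first node in the order, which corresponds to the start of a collection run.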
S103: Trigger the DOM event of the target DOM node to obtain web page data, and judge whether a front-end route is generated.
After the target DOM node is determined, its DOM event is triggered and web page data is obtained at the same time. The way a DOM event is triggered depends on the DOM event itself, and the specific triggering manner is not limited. While web page data is collected, it is also necessary to judge whether a front-end route is generated. Front-end routing is implemented in two ways: through the hash and through the history stack (i.e., the browser history stack). If a front-end route is generated, the process proceeds to step S104; if not, it proceeds to step S105.
In one embodiment, the step of determining whether a front-end route has occurred comprises:
Step 31: Judge whether the URL anchor value has changed.
Step 32: If it has changed, determine that a front-end route is generated.
Step 33: If it has not changed, determine that no front-end route is generated.
The hash property is a readable and writable string that holds the anchor part of the URL, and it is typically changed by a # address in an href on the current page. A change in the hash (i.e., a change in the URL anchor value) does not cause the page to reload. When a browser accesses a web page whose URL has an anchor value, the page is positioned at the element whose id (or name) is the same as the anchor value. The hash value can be read and set through window.location.hash. To detect whether a front-end route is generated, it can be judged whether the URL anchor value has changed: if so, a front-end route is generated; if not, no front-end route is generated.
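The anchor comparison in Steps 31 to 33 might look like the sketch below. It assumes a location-like object with an href field (in a browser this would be window.location) and a hypothetical helper name, triggersHashRoute; the example.com URLs are placeholders.

```javascript
// Hypothetical helper: run a trigger and report whether the URL anchor changed.
function triggersHashRoute(location, trigger) {
  const before = new URL(location.href).hash; // anchor value before triggering
  trigger();                                   // trigger the DOM event
  const after = new URL(location.href).hash;  // anchor value after triggering
  return before !== after;
}

// Stand-in for window.location, so the sketch runs outside a browser.
const fakeLocation = { href: "https://example.com/app#/home" };

// A handler that changes only the anchor: counts as a front-end route.
const routed = triggersHashRoute(fakeLocation, () => {
  fakeLocation.href = "https://example.com/app#/detail";
});

// A handler that leaves the URL alone: no front-end route.
const notRouted = triggersHashRoute(fakeLocation, () => {});

console.log(routed, notRouted); // prints: true false
```

In a real page the same check could also be driven by the hashchange event rather than a before-and-after comparison; which is preferable depends on how the trigger is issued.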
In another embodiment, the step of determining whether a front-end route has occurred includes:
Step 41: Judge whether a push or pop has occurred on the browser history stack.
Step 42: If a push or pop has occurred, determine that a front-end route is generated.
Step 43: If no push or pop has occurred, determine that no front-end route is generated.
The browser history stack is exposed through the History API introduced with HTML5, which includes the pushState and replaceState methods. These methods change the URL without sending a request. An onpopstate event is generated when the active history entry changes, for example on back or forward navigation; in standard browsers, calling pushState or replaceState does not by itself fire this event. Therefore, to detect a front-end route, it can be judged whether the browser history stack has been pushed or popped. Specifically, a pop can be detected through the onpopstate event, and a push can be detected by intercepting calls to pushState and replaceState. If a push or pop is detected, a front-end route is determined to be generated; otherwise, it is determined that no front-end route is generated.
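One way to observe history-stack changes, sketched here as an assumption rather than as the application's own detector, is to wrap pushState and replaceState and to record pops reported by a popstate listener. In standard browsers pushState and replaceState do not themselves dispatch popstate, which is why the sketch intercepts the calls. The helper name watchHistory and the stand-in history object are hypothetical.

```javascript
// Hypothetical helper: record pushes, replaces, and pops on a history-like object.
function watchHistory(history) {
  const events = [];
  for (const method of ["pushState", "replaceState"]) {
    const original = history[method];
    history[method] = function (state, title, url) {
      events.push({ kind: method, url });
      return original.call(this, state, title, url);
    };
  }
  return {
    events,
    // In a browser: window.addEventListener("popstate", watcher.onPop)
    onPop() {
      events.push({ kind: "popstate" });
    },
    // True once any push, replace, or pop has been observed.
    routed() {
      return events.length > 0;
    },
  };
}

// Stand-in for window.history, so the sketch runs outside a browser.
const fakeHistory = {
  stack: ["/"],
  pushState(state, title, url) {
    this.stack.push(url);
  },
  replaceState(state, title, url) {
    this.stack[this.stack.length - 1] = url;
  },
};

const watcher = watchHistory(fakeHistory);
const routedBefore = watcher.routed();
fakeHistory.pushState({}, "", "/detail"); // what a router would do on a click
const routedAfter = watcher.routed();

console.log(routedBefore, routedAfter); // prints: false true
```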
Further, in one embodiment, an asynchronous loading request may change the DOM tree, so that new DOM nodes appear after a DOM node is triggered. Specifically, before triggering the DOM event of the target DOM node to obtain web page data, the method may further include:
Step 51: Generate a first backup corresponding to the DOM tree.
This embodiment does not limit the specific form of the first backup, which may be, for example, a snapshot. The manner in which the first backup is generated depends on its specific form and is likewise not limited.
Correspondingly, after triggering the DOM event of the target DOM node to obtain web page data, the method further includes:
Step 52: Generate a second backup corresponding to the DOM tree.
Step 53: Judge whether the first backup and the second backup are the same.
Step 54: If they are different, determine the newly added DOM nodes according to the first backup and the second backup, and update the DOM tree according to the newly added DOM nodes.
After the target DOM node is triggered, a second backup of the DOM tree is generated and compared with the first backup to determine whether the DOM tree changed as a result of the trigger. If the two backups differ, the difference between them is determined; this difference consists of the newly added DOM nodes, and the DOM tree is updated accordingly so that the newly added DOM nodes can be triggered later and the corresponding web page data collected.
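The backup comparison in Steps 51 to 54 can be sketched as a diff of two snapshots. The snapshot form used here, a list of slash-separated node paths built from an id field, is an assumption chosen for brevity (the text leaves the backup form open), and the names snapshot and addedNodes are hypothetical.

```javascript
// Hypothetical snapshot: list every node as a path such as "body/list/item-1".
function snapshot(node, prefix = "") {
  const path = prefix ? `${prefix}/${node.id}` : node.id;
  const paths = [path];
  for (const child of node.children || []) {
    paths.push(...snapshot(child, path));
  }
  return paths;
}

// Nodes present in the second backup but absent from the first.
function addedNodes(first, second) {
  const seen = new Set(first);
  return second.filter((p) => !seen.has(p));
}

const page = { id: "body", children: [{ id: "list", children: [] }] };

const firstBackup = snapshot(page); // Step 51: before triggering
// Simulate an asynchronous request that inserts a node after the trigger.
page.children[0].children.push({ id: "item-1", children: [] });
const secondBackup = snapshot(page); // Step 52: after triggering

const added = addedNodes(firstBackup, secondBackup); // Steps 53 and 54
console.log(added); // prints: [ 'body/list/item-1' ]
```

Path-based snapshots only detect additions reliably when sibling ids are unique; a real implementation would need a more robust node identity.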
S104: Jump back to the target web page and update the marking information.
After a front-end route is generated, the method jumps back to the target web page in order to continue collecting its web page data, and updates the marking information so that the target DOM node can be determined according to the latest marking information. This embodiment does not limit the specific way the marking information is updated, which depends on its form. For example, when the marking information is a DOM tree, the target DOM node can be marked in the DOM tree to complete the update.
S105: Perform a preset operation.
This embodiment does not limit the operation performed when no front-end route is generated, that is, the specific content of the preset operation. For example, step S102 may be performed again to determine a new target DOM node and continue collecting web page data. Other operations are also possible, including performing no operation at all.
With the web page data acquisition method provided in this embodiment of the application, after the DOM events corresponding to each DOM node are collected, the target DOM node is determined among the DOM nodes according to the marking information corresponding to the target web page. The marking information of the target web page indicates whether each DOM node is a front-end routing node. A page jump re-renders the DOM tree, so the target DOM node must be determined according to the marking information; otherwise it cannot be determined which DOM nodes have already had their DOM events triggered, the DOM events of all DOM nodes must be triggered again, and the process falls into a dead loop. After the target DOM node is determined, its DOM event is triggered, web page data is collected, and it is judged whether a front-end route is generated. If a front-end route is generated, the marking information is updated so that the target DOM node can be re-determined after the jump, and the method jumps back to the target web page so that collection of its web page data can continue. Jumping back to the target web page and updating the marking information whenever a front-end route is generated avoids the dead loop described above and thereby solves the problem of incomplete web page data collection in the related art.
Based on the above embodiments, this embodiment describes a specific web page data collection method. Referring to FIG. 2, FIG. 2 is a flowchart of a specific web page data collection method according to an embodiment of the present application. After the start, the page, namely the target web page, is loaded, and the DOM events are collected after loading. Once the DOM events are collected, the method interacts with the DOM tree, that is, it determines the target DOM node and triggers the corresponding DOM event. After the interaction, it is judged whether a front-end route has occurred. If so, event interaction is performed again, that is, the method jumps back to the target web page and updates the marking information. If no front-end route has occurred, it is judged whether the interaction is complete, that is, whether the DOM events of all DOM nodes have been triggered. If not, interaction continues and the front-end route check is repeated. If so, a new DOM structure is obtained, that is, the DOM tree is updated so that any newly added DOM nodes are added to it. After the update, it is judged whether the maximum recursion depth has been reached; if so, the process ends. If not, it is judged whether an untriggered DOM node exists: if not, the process ends, and if so, the method interacts with the DOM tree again.
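The overall flow can be illustrated with a small simulation. It is only a sketch under simplifying assumptions: the page is a flat list of nodes already in the preset order, each node carries a routes flag saying whether triggering it generates a front-end route, the marking information is a set of indices kept outside the page so that it survives re-rendering, and maxSteps stands in for the stop condition. The function name crawl and all field names are hypothetical.

```javascript
// Simulated collection loop: trigger nodes in order; on a front-end route,
// mark the node, jump back, and resume after the latest marked node.
function crawl(nodes, maxSteps = 100) {
  const marked = new Set(); // marking information, survives re-rendering
  const collected = [];
  let history = null;       // index of the last triggered node; lost on a jump
  let jumps = 0;

  for (let step = 0; step < maxSteps; step++) {
    // Determine the target: after the history entry if present,
    // otherwise after the latest marked node.
    let last = history;
    if (last === null) {
      last = marked.size > 0 ? Math.max(...marked) : -1;
    }
    const target = last + 1;
    if (target >= nodes.length) break; // every node has been triggered

    collected.push(nodes[target].data); // trigger the event, collect data

    if (nodes[target].routes) {
      marked.add(target); // update the marking information
      history = null;     // jump back: the DOM tree is re-rendered
      jumps++;
    } else {
      history = target;
    }
  }
  return { collected, jumps };
}

const result = crawl([
  { data: "home-data", routes: false },
  { data: "detail-link", routes: true },
  { data: "form-data", routes: false },
]);

console.log(result.collected); // prints: [ 'home-data', 'detail-link', 'form-data' ]
console.log(result.jumps);     // prints: 1
```

Without the marked set, losing the history on a jump would restart the loop at the first node and trigger the routing node again, which is the dead loop the method is designed to avoid.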
The following describes the webpage data collecting device provided in the embodiment of the present application, and the webpage data collecting device described below and the webpage data collecting method described above may be referred to correspondingly.
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a web page data acquisition device according to an embodiment of the present application, including:
the collection module 110 is configured to collect DOM events corresponding to DOM nodes of the target web page respectively;
the target node determining module 120 is configured to determine a target DOM node in the DOM nodes according to the marking information corresponding to the target web page;
the triggering module 130 is configured to trigger a DOM event of the target DOM node to obtain web page data, and determine whether a front-end route is generated;
and a jump module 140, configured to jump back to the target web page and update the label information if the front-end route is generated.
Optionally, the device further comprises:
the first backup module is used for generating a first backup corresponding to the DOM tree;
correspondingly, the device further comprises:
the second backup module is used for generating a second backup corresponding to the DOM tree;
the identity judging module is used for judging whether the first backup and the second backup are the same;
and the updating module is used for determining a newly added DOM node according to the first backup and the second backup if the first backup and the second backup are different, and updating the DOM tree according to the newly added DOM node.
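The comparison of the first backup and the second backup can be illustrated with a short sketch. The representation is an assumption made only for this example: the DOM tree is modelled as nested dictionaries, a backup is the set of node paths in that tree, and the names `snapshot` and `newly_added` are hypothetical.

```python
from typing import List, Set


def snapshot(tree: dict, prefix: str = "") -> Set[str]:
    """Return a backup of the tree as a set of node paths."""
    paths: Set[str] = set()
    for node_id, children in tree.items():
        path = f"{prefix}/{node_id}"
        paths.add(path)
        paths |= snapshot(children, path)
    return paths


def newly_added(first_backup: Set[str], second_backup: Set[str]) -> List[str]:
    """Return the nodes present only in the second backup."""
    if first_backup == second_backup:   # the two backups are the same
        return []
    return sorted(second_backup - first_backup)


# Usage: take a backup before and after a DOM event adds a node.
tree = {"body": {"nav": {}, "main": {}}}
first = snapshot(tree)
tree["body"]["main"]["dialog"] = {}     # simulated effect of a DOM event
second = snapshot(tree)
added = newly_added(first, second)      # paths of the newly added nodes
```

A set of paths is used here only because it makes the difference easy to compute; any backup form that allows the two states of the DOM tree to be compared would serve the same purpose.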
Optionally, the device further comprises:
the webpage information acquisition module is used for acquiring webpage information of the target webpage;
and the loading and rendering module is used for carrying out simulated loading and rendering processing according to the webpage information to obtain the target webpage and the target webpage data.
Optionally, the triggering module 130 includes:
the first judging unit is used for judging whether the URL anchor point value is changed or not;
a first determining unit configured to determine that the front-end route is generated if a change occurs;
and the second determining unit is used for determining that the front-end route is not generated if no change occurs.
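The judgment on the URL anchor value can be sketched as below, assuming that the URL before and after the DOM event is available as a string. The function name `anchor_changed` and the example URLs are hypothetical; `urlsplit` is the standard-library parser, whose `fragment` field holds the part of the URL after `#`.

```python
from urllib.parse import urlsplit


def anchor_changed(url_before: str, url_after: str) -> bool:
    """Return True if the URL anchor (fragment) value differs."""
    return urlsplit(url_before).fragment != urlsplit(url_after).fragment


# A changed anchor value is treated as a generated front-end route.
route_generated = anchor_changed(
    "https://example.test/app#/home",
    "https://example.test/app#/detail/1",
)
```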
Optionally, the triggering module 130 includes:
the second judging unit is used for judging whether the browser history stack is pushed or popped;
a third determining unit, configured to determine that a front-end route is generated if a push or pop occurs;
and the fourth determining unit is used for determining that the front-end route is not generated if the push or the pop does not occur.
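The judgment on the browser history stack can be illustrated with an in-memory model. This is only a sketch under assumptions: the class `ObservedHistory` is hypothetical and does not drive a real browser, where the same idea would correspond to observing calls to `history.pushState` and `popstate` events.

```python
class ObservedHistory:
    """Hypothetical history stack that records push and pop operations."""

    def __init__(self, initial_url: str) -> None:
        self._stack = [initial_url]
        self.route_generated = False

    def push(self, url: str) -> None:
        self._stack.append(url)
        self.route_generated = True      # a push occurred

    def pop(self) -> str:
        if len(self._stack) > 1:         # keep the initial entry
            self._stack.pop()
            self.route_generated = True  # a pop occurred
        return self._stack[-1]

    def reset(self) -> None:
        # Clear the flag before the next DOM event is triggered.
        self.route_generated = False
```

The flag is reset before each DOM event is triggered and read afterwards, so that a push or pop is attributed to the event that caused it.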
Optionally, the jump module 140 includes:
and the node marking unit is used for marking the target DOM node in the DOM tree.
Optionally, the target node determining module 120 includes:
a history information judging unit for judging whether history information exists;
the first determining unit is used for determining the target DOM node according to the history information and a preset sequence if the history information exists;
the second determining unit is used for traversing the DOM tree according to a preset sequence if the history information does not exist, and determining the latest marked DOM node in the DOM tree;
and the third determining unit is used for determining the backup node of the latest marked DOM node as the target DOM node.
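The determination of the target DOM node can be sketched as follows. The sketch rests on assumptions that are not stated in the present application: the preset sequence is given as a list of node identifiers, the history information is taken to be the identifier of the most recently triggered DOM node, the latest marked DOM node is taken to be the marked node that appears last in the preset sequence, and the backup node is taken to be the next node in that sequence. The name `next_target` is hypothetical.

```python
from typing import List, Optional, Set


def next_target(order: List[str], marked: Set[str],
                history: Optional[str]) -> Optional[str]:
    """Pick the next target DOM node, or None if none remains."""
    if history is not None:
        # History information exists: continue after the recorded node.
        idx = order.index(history)
    else:
        # No history information: traverse in the preset sequence and
        # find the latest marked DOM node.
        positions = [i for i, node in enumerate(order) if node in marked]
        if not positions:
            return order[0] if order else None
        idx = positions[-1]
    # The backup node is the node triggered after it in the sequence.
    return order[idx + 1] if idx + 1 < len(order) else None
```

Because the result depends only on the preset sequence, the marking information and the history information, the same target can be re-determined after the DOM tree has been re-rendered by a jump.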
The following describes a web page data acquisition device provided in an embodiment of the present application, where the web page data acquisition device described below and the web page data acquisition method described above may be referred to correspondingly.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a web page data collection device according to an embodiment of the present application. Wherein the web page data acquisition device 400 may include a processor 401 and a memory 402, and may further include one or more of a multimedia component 403, an information input/information output (I/O) interface 404, and a communication component 405.
The processor 401 is used for controlling the overall operation of the web page data acquisition device 400 to complete all or part of the steps in the web page data acquisition method. The memory 402 is used to store various types of data to support the operation of the web page data acquisition device 400; such data may include, for example, instructions for any application or method operating on the web page data acquisition device 400, as well as application-related data. The memory 402 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as one or more of static random access memory (Static Random Access Memory, SRAM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), programmable read-only memory (Programmable Read-Only Memory, PROM), read-only memory (Read-Only Memory, ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The multimedia component 403 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may be further stored in the memory 402 or transmitted through the communication component 405. The audio component further comprises at least one speaker for outputting audio signals. The I/O interface 404 provides an interface between the processor 401 and other interface modules, which may be a keyboard, a mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 405 is used for wired or wireless communication between the web page data acquisition device 400 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (Near Field Communication, NFC for short), 2G, 3G or 4G, or a combination of one or more of them, and the communication component 405 may accordingly comprise a Wi-Fi part, a Bluetooth part, and an NFC part.
The web page data acquisition device 400 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), digital signal processors (Digital Signal Processor, abbreviated as DSP), digital signal processing devices (Digital Signal Processing Device, abbreviated as DSPD), programmable logic devices (Programmable Logic Device, abbreviated as PLD), field programmable gate arrays (Field Programmable Gate Array, abbreviated as FPGA), controllers, microcontrollers, microprocessors, or other electronic components for performing the web page data acquisition methods described in the above embodiments.
The following describes a computer readable storage medium provided in an embodiment of the present application, where the computer readable storage medium described below and the method for collecting web page data described above may be referred to correspondingly.
The application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the above-described web page data acquisition method.
The computer readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or various other media capable of storing program code.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation should not be considered to be beyond the scope of this application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "include", "comprise", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Specific examples are used herein to describe the principles and embodiments of the present application, and the above description of the embodiments is provided only to assist in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may, in accordance with the idea of the present application, make changes to the specific embodiments and to the scope of application. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (9)

1. A web page data acquisition method, characterized by comprising:
collecting DOM events corresponding to each DOM node of the target webpage respectively;
determining a target DOM node in the DOM nodes according to the marking information corresponding to the target webpage;
triggering the DOM event of the target DOM node to obtain webpage data, and judging whether a front-end route is generated or not;
if the front-end route is generated, jumping back to the target webpage and updating the marking information;
the determining the target DOM node in the DOM nodes according to the marking information corresponding to the target web page includes:
judging whether historical information exists or not;
if the history information exists, determining the target DOM node according to the history information and a preset sequence;
if the history information does not exist, traversing the DOM tree according to the preset sequence, and determining the latest marked DOM node in the DOM tree;
determining a backup node of the latest marked DOM node as the target DOM node;
wherein the backup node of the DOM node represents other DOM nodes in the preset sequence that are triggered after the DOM node.
2. The method for collecting web page data according to claim 1, further comprising, before the triggering the DOM event of the target DOM node to obtain web page data:
generating a first backup corresponding to the DOM tree;
correspondingly, after the triggering the DOM event of the target DOM node to obtain web page data, the method further comprises:
generating a second backup corresponding to the DOM tree;
judging whether the first backup and the second backup are the same or not;
if the first backup and the second backup are different, determining a newly added DOM node according to the first backup and the second backup, and updating the DOM tree according to the newly added DOM node.
3. The web page data collection method according to claim 1, further comprising, before the collecting DOM events corresponding to the DOM nodes of the target web page, respectively:
acquiring webpage information of a target webpage;
and performing simulated loading and rendering processing according to the webpage information to obtain the target webpage and target webpage data.
4. The method for collecting web page data according to claim 1, wherein the determining whether a front-end route is generated comprises:
judging whether the URL anchor point value is changed or not;
if a change occurs, determining that the front-end route is generated;
if no change occurs, determining that the front-end route is not generated.
5. The method for collecting web page data according to claim 1, wherein the determining whether a front-end route is generated comprises:
judging whether the browser history stack is pushed or popped;
if the push or the pop occurs, determining that the front-end route is generated;
if no push or pop occurs, determining that the front-end route is not generated.
6. The method for collecting web page data according to claim 1, wherein the updating the tag information includes:
marking the target DOM node in the DOM tree.
7. A web page data acquisition device, comprising:
the collecting module is used for collecting DOM events corresponding to each DOM node of the target webpage respectively;
the target node determining module is used for determining a target DOM node in the DOM nodes according to the marking information corresponding to the target webpage;
the triggering module is used for triggering the DOM event of the target DOM node to obtain webpage data and judging whether a front-end route is generated or not;
the jump module is used for jumping back to the target webpage and updating the marking information if the front-end route is generated;
wherein the target node determining module comprises:
a history information judging unit for judging whether history information exists;
the first determining unit is used for determining the target DOM node according to the history information and a preset sequence if the history information exists;
the second determining unit is used for traversing the DOM tree according to the preset sequence if the history information does not exist, and determining the latest marked DOM node in the DOM tree;
a third determining unit, configured to determine a backup node of the latest marked DOM node as the target DOM node;
wherein the backup node of the DOM node represents other DOM nodes in the preset sequence that are triggered after the DOM node.
8. A web page data acquisition device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the web page data acquisition method as claimed in any one of claims 1 to 6.
9. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the method of collecting web page data according to any one of claims 1 to 6.
CN202010886714.2A 2020-08-28 2020-08-28 Webpage data acquisition method, device, equipment and readable storage medium Active CN111949903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010886714.2A CN111949903B (en) 2020-08-28 2020-08-28 Webpage data acquisition method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010886714.2A CN111949903B (en) 2020-08-28 2020-08-28 Webpage data acquisition method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111949903A CN111949903A (en) 2020-11-17
CN111949903B true CN111949903B (en) 2024-03-08

Family

ID=73367024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010886714.2A Active CN111949903B (en) 2020-08-28 2020-08-28 Webpage data acquisition method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111949903B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541537B (en) * 2023-06-06 2023-11-03 简单汇信息科技(广州)有限公司 Knowledge graph-based enterprise trade information visual display method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951451A (en) * 2017-02-22 2017-07-14 北京麒麟合盛网络技术有限公司 A kind of webpage content extracting method, device and computing device
CN107729385A (en) * 2017-09-19 2018-02-23 杭州安恒信息技术有限公司 A kind of method for gathering dynamic web page partial data content
CN108846116A (en) * 2018-06-26 2018-11-20 北京京东金融科技控股有限公司 Page Impression collecting method, system, electronic equipment and storage medium
CN109697130A (en) * 2017-10-23 2019-04-30 北京金山云网络技术有限公司 Front and back end separation method, device, equipment and the storage medium of web system
CN111523074A (en) * 2020-04-26 2020-08-11 成都思维世纪科技有限责任公司 Acquisition system for dynamic page sensitive data of front-end rendering website

Also Published As

Publication number Publication date
CN111949903A (en) 2020-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant