CN115087969A - Information crawling method and device, electronic equipment and storage medium - Google Patents

Publication number
CN115087969A
Authority
CN
China
Prior art keywords
target
information
positioning path
positioning
network resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080096823.2A
Other languages
Chinese (zh)
Inventor
郭子亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd and Shenzhen Huantai Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

An information crawling method, an information crawling apparatus, an electronic device, and a storage medium are provided. The method comprises: opening a target Uniform Resource Locator (URL) network resource in a browser and entering a target page corresponding to the target URL network resource (101); locating the Document Object Model (DOM) element in which target value information resides in the target page to obtain a target DOM element (102); acquiring positioning path information of the target DOM element (103); loading the target URL network resource into a new tab window (104); and extracting the target value information in the new tab window according to the positioning path information, and uniformly storing the target value information (105). In this way, a stand-alone crawler based on a pure browser environment can be implemented in a short time. For developers, the development cycle is short and the operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs, so that low-cost, convenient, and efficient lightweight information crawling can be achieved.

Description

Information crawling method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computers, and in particular, to an information crawling method, apparatus, electronic device, and storage medium.
Background
Crawler technology is a process for automatically collecting, analyzing, and storing a large amount of value information on a network. In terms of system architecture, existing crawler systems are mainly divided into stand-alone and distributed systems. These crawler systems are mainly based on popular Python and Java crawler frameworks (such as the Scrapy framework and the Nutch framework) to analyze and crawl target value information.
The interfaces of existing crawler frameworks are complex and heavyweight. For a small-scale or temporary crawling task, they mainly have the following defects. First, the development cycle is long and the maintenance cost is high. For example, when a crawling task is implemented on an existing stand-alone or distributed crawler framework, the developer must consider not only how to write the Python or Java code but also how to configure and manage the server and the corresponding database. For temporary crawling needs, the development cycle of existing crawler frameworks is therefore too long, and the learning and maintenance costs are too high. Second, information loaded through asynchronous JavaScript and XML (AJAX) and value information dynamically generated by JavaScript code are difficult to crawl. Existing crawler frameworks have difficulty locating such target value information and must be combined with an application automated testing framework that simulates real web browsing in order to extract it. A browser testing framework and a corresponding browser driver therefore need to be installed, which adds cost and overhead. Third, existing crawler frameworks are easily restricted by anti-crawler mechanisms and login verification; for example, their browser identifier is too simple and is easily detected by an anti-crawler mechanism.
Disclosure of Invention
The embodiments of the application provide an information crawling method and a related product, which can achieve low-cost, convenient, and efficient lightweight information crawling.
In a first aspect, an information crawling method in an embodiment of the present application includes:
opening a target Uniform Resource Locator (URL) network resource in a browser, and entering a target page corresponding to the target URL network resource;
positioning a Document Object Model (DOM) element where the target value information is located in the target page to obtain a target DOM element;
acquiring positioning path information of the target DOM element;
loading the target URL network resource to a new tab window;
and extracting the target value information in the new tab window according to the positioning path information, and uniformly storing the target value information.
In a second aspect, an embodiment of the present application provides an information crawling apparatus, including: an opening unit, a positioning unit, an acquisition unit, a loading unit, an extraction unit and a storage unit, wherein,
the opening unit is used for opening a target Uniform Resource Locator (URL) network resource in a browser and entering a target page corresponding to the target URL network resource;
the positioning unit is used for positioning a Document Object Model (DOM) element where the target value information is located in the target page to obtain a target DOM element;
the acquisition unit is used for acquiring the positioning path information of the target DOM element;
the loading unit is used for loading the target URL network resource to a new tab window;
the extraction unit is used for extracting the target value information in the new tab window according to the positioning path information;
and the storage unit is used for uniformly storing the target value information.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing the steps in the first aspect of the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program enables a computer to perform some or all of the steps described in the first aspect of the embodiment of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1A is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application;
Fig. 1B is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application;
Fig. 1C is a schematic flowchart of an information crawling method disclosed in an embodiment of the present application;
Fig. 2 is a schematic flowchart of another information crawling method disclosed in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application;
Fig. 4A is a schematic structural diagram of an information crawling apparatus disclosed in an embodiment of the present application;
Fig. 4B is a schematic diagram of a modified structure of the information crawling apparatus shown in Fig. 4A, disclosed in an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The electronic devices involved in the embodiments of the present application may include various handheld devices, vehicle-mounted devices, wearable devices (such as smartwatches and wireless headsets), computing devices or other processing devices connected to a wireless modem, as well as various forms of User Equipment (UE), Mobile Stations (MSs), terminal devices, and so on, all of which have wireless communication functions. For convenience of description, the above-mentioned devices are collectively referred to as electronic devices. The electronic device may also be a server.
In order to better understand the technical solutions described in the present application, the following explains the technical terms related to the embodiments of the present application:
Python, an object-oriented, cross-platform computer programming language.
Java, an object-oriented, cross-platform computer programming language.
JavaScript, an object-oriented Web programming language.
Scrapy, an open-source Web crawler framework written in the Python language.
Nutch, an open-source search engine written in the Java language.
User-Agent, an HTTP header field used to identify the browser, the operating system it runs on, the encryption level, and the browser rendering engine.
Cookie, a hypertext transfer protocol (HTTP) header field used to identify a legitimate user identity.
The following describes embodiments of the present application in detail.
Referring to Fig. 1A, Fig. 1A is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application. The electronic device 100 may include a control circuit, which may include a storage and processing circuit 110. The storage in the storage and processing circuitry 110 may be a memory, such as a hard drive memory, a non-volatile memory (e.g., flash memory or other electrically programmable read-only memory used to form a solid state drive), a volatile memory (e.g., static or dynamic random access memory), and the like, and the embodiments of the present application are not limited thereto. Processing circuitry in the storage and processing circuitry 110 may be used to control the operation of the electronic device 100. The processing circuitry may be implemented based on one or more microprocessors, microcontrollers, baseband processors, power management units, audio codec chips, application specific integrated circuits, display driver integrated circuits, and the like.
The storage and processing circuitry 110 may be used to run software in the electronic device 100, such as an internet browsing application, a Voice Over Internet Protocol (VOIP) telephone call application, an email application, a media playing application, operating system functions, and so forth. Such software may be used to perform control operations such as, for example, camera-based image capture, ambient light measurement based on an ambient light sensor, proximity sensor measurement based on a proximity sensor, information display functionality based on status indicators such as status indicator lights of light emitting diodes, touch event detection based on a touch sensor, functionality associated with displaying information on multiple (e.g., layered) displays, operations associated with performing wireless communication functions, operations associated with collecting and generating audio signals, control operations associated with collecting and processing button press event data, and other functions in the electronic device 100, and the like, without limitation of embodiments of the present application.
The electronic device 100 may also include input-output circuitry 150. The input-output circuit 150 may be used to enable the electronic device 100 to input and output data, i.e., to allow the electronic device 100 to receive data from an external device and also to allow the electronic device 100 to output data from the electronic device 100 to the external device. The input-output circuit 150 may further include a sensor 170. The sensors 170 may include ambient light sensors, proximity sensors based on light and capacitance, touch sensors (e.g., based on optical touch sensors and/or capacitive touch sensors, where the touch sensors may be part of a touch display screen or used independently as a touch sensor structure), acceleration sensors, gravity sensors, and other sensors, among others.
Input-output circuitry 150 may also include one or more displays, such as display 130. Display 130 may include one or a combination of liquid crystal displays, organic light emitting diode displays, electronic ink displays, plasma displays, displays using other display technologies. Display 130 may include an array of touch sensors (i.e., display 130 may be a touch display screen). The touch sensor may be a capacitive touch sensor formed by a transparent touch sensor electrode (e.g., an Indium Tin Oxide (ITO) electrode) array, or may be a touch sensor formed using other touch technologies, such as acoustic wave touch, pressure sensitive touch, resistive touch, optical touch, and the like, and embodiments of the present application are not limited thereto.
The audio component 140 may be used to provide audio input and output functionality for the electronic device 100. The audio components 140 in the electronic device 100 may include a speaker, a microphone, a buzzer, a tone generator, and other components for generating and detecting sound.
The communication circuit 120 may be used to provide the electronic device 100 with the capability to communicate with external devices. The communication circuit 120 may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuitry in communication circuitry 120 may include radio-frequency transceiver circuitry, power amplifier circuitry, low noise amplifiers, switches, filters, and antennas. For example, the wireless communication circuitry in communication circuitry 120 may include circuitry to support Near Field Communication (NFC) by transmitting and receiving near field coupled electromagnetic signals. For example, the communication circuit 120 may include a near field communication antenna and a near field communication transceiver. The communications circuitry 120 may also include a cellular telephone transceiver and antenna, a wireless local area network transceiver circuitry and antenna, and so forth.
The electronic device 100 may further include a battery, power management circuitry, and other input-output units 160. The input-output unit 160 may include buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light emitting diodes and other status indicators, and the like.
A user may input commands through input-output circuitry 150 to control operation of electronic device 100, and may use output data of input-output circuitry 150 to enable receipt of status information and other outputs from electronic device 100.
In the related art, crawler frameworks include stand-alone crawler frameworks and distributed crawler frameworks. Scrapy is a stand-alone crawler framework implemented in the Python language and mainly comprises a Scrapy engine, a task scheduler, a downloader, a crawler, and a pipeline. The Scrapy engine is responsible for sending crawling commands to the modules and coordinating communication and data transfer among them. The task scheduler performs unified scheduling and queue management on the Uniform Resource Locator (URL) network resources sent by the Scrapy engine. The downloader is responsible for sending URL requests to URL network resources and obtaining the URL responses. The crawler parses the response content, extracts the required value information, and passes it to the pipeline for unified analysis, filtering, and storage.
Nutch is a distributed search engine and crawler framework implemented in the Java language. It mainly relies on a distributed infrastructure to achieve distributed crawling and data storage of massive amounts of information, and mainly comprises a generator, a task scheduler, a downloader, a parser, and a storage module. The generator mainly queries target value information from a database, and the task scheduler dynamically dispatches search tasks to a distributed system infrastructure cluster to complete the search for the target value information and build an index. The downloader and the parser are responsible for building URL network requests and extracting the information fields in the URL network responses. Finally, the storage module completes centralized storage of the target value information.
Existing crawler frameworks handle crawling tasks over massive amounts of information well. For lightweight information crawling tasks, however, they have a long development cycle and a high maintenance cost, have difficulty crawling AJAX information and value information dynamically generated by JavaScript code, and are easily restricted by anti-crawler mechanisms.
Based on this, referring to fig. 1B, fig. 1B provides a schematic structural diagram of another electronic device, where the electronic device includes an information crawling framework for implementing the information crawling method according to the embodiment of the present application, where the information crawling framework may include a browser 100, a console 110 of the browser, a network resource loader 120, a network resource parser 130, and a storage 140, where,
the browser 100 is configured to open a target Uniform Resource Locator (URL) network resource and enter a target page corresponding to the target URL network resource;
the console 110 is configured to open a new tab window and load the target URL network resource in the new tab window;
the network resource parser 130 is configured to locate the Document Object Model (DOM) element where the target value information in the target page is located, to obtain a target DOM element, and to acquire positioning path information of the target DOM element;
the network resource loader 120 is configured to load the target value information through the target URL network resource;
the network resource parser 130 is further configured to extract the target value information in the new tab window according to the positioning path information;
the memory 140 is configured to store the target value information in a unified manner.
With this information crawling framework, there is no need to install Java or Python runtime environments or the dependency packages of an application automated testing framework, and no need to configure any distributed system infrastructure server or database. The target value information is located using only the functions of the browser, so a stand-alone crawler based on a pure browser environment can be implemented in a short time. For Web developers, the development cycle is short and the operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs. Because most existing operating systems ship with a built-in browser, the scheme is cross-platform and can meet the needs of penetration tests, security tests, and other temporary or targeted crawling projects on different platforms. In addition, the scheme starts normal browsing behavior based on a real browser, so it is highly resistant to anti-crawler mechanisms.
Referring to fig. 1C, fig. 1C is a schematic flow chart of an information crawling method according to an embodiment of the present disclosure, where the information crawling method described in this embodiment is applied to the electronic device shown in fig. 1A or fig. 1B, and the information crawling method includes:
101. Opening a target Uniform Resource Locator (URL) network resource in the browser, and entering a target page corresponding to the target URL network resource.
The target URL network resource can be opened in the browser, wherein the target uniform resource locator URL is used for identifying the position and the access mode of the network resource.
The target page is a browser page corresponding to the target URL network resource, and the target URL network resource can be opened through the browser to enter the target page.
Optionally, in the step 101, when the target uniform resource locator URL network resource is opened in the browser, the method may further include the following steps:
if the target URL network resource needs a login account, acquiring login account information corresponding to the URL network resource;
and verifying the login account information, and if the login account information is successfully verified, executing the operation of accessing the target page corresponding to the target URL network resource.
Some network resources require a user account login. For a target URL network resource that requires login, the login account information corresponding to the URL network resource may be acquired. Specifically, the login account information may be obtained from user input; for example, the electronic device may receive a user name, a password, and a verification code entered by the user through the browser. Optionally, when the account is logged in for the first time, the login account information may be recorded and saved, so that in later crawling sessions the saved login account information can be called directly to log in, without the user entering it again.
Therefore, the embodiment of the application starts normal browsing behavior based on a real browser and carries normal user account information, so anti-crawler techniques that rely on login restrictions have difficulty blocking it, and the crawler's resistance to anti-crawler mechanisms is improved.
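The optional reuse of saved login account information can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the application's implementation: `createLoginCache`, `askUser`, and the `example.com` URLs are hypothetical names, and the cache lives only in memory. One-time verification codes would normally still need fresh input.

```javascript
// Hypothetical cache of login account information, keyed by site origin.
// `askUser` is an assumed callback that obtains the information from the user.
function createLoginCache(askUser) {
  const saved = new Map();
  return function getLoginInfo(targetUrl) {
    // URL is a standard global in browsers and in Node.js.
    const origin = new URL(targetUrl).origin;
    if (!saved.has(origin)) {
      // First login for this site: ask the user once and remember the answer.
      saved.set(origin, askUser(origin));
    }
    return saved.get(origin);
  };
}

// Usage with placeholder data: the second call reuses the saved information.
const getLoginInfo = createLoginCache((origin) => ({ origin, userName: "demo-user" }));
getLoginInfo("https://example.com/list");
getLoginInfo("https://example.com/detail?id=1");
```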
102. Positioning the Document Object Model (DOM) element where the target value information is located in the target page to obtain the target DOM element.
In the embodiment of the application, the DOM element where the target value information is located in the target page can be positioned through the browser, so that the target DOM element is obtained. The target value information can therefore be located using only the browser, without installing an application automated testing framework to simulate real web browsing, which saves cost and keeps the operation simple.
Optionally, in step 102, the positioning the document object model DOM element where the target value information in the target page is located to obtain the target DOM element may include the following steps:
positioning the target DOM element in which the target value information is located through the page element inspection function of the browser.
In specific implementation, the browser has a page element inspection function, and the electronic device can locate the target DOM element where the target value information is located based on this function, so that an accurate positioning result for the target DOM element can be obtained.
103. Acquiring the positioning path information of the target DOM element.
The positioning path information may include a Cascading Style Sheets (CSS) selector or an XML Path Language (XPath) path.
In specific implementation, the electronic device can locate the DOM node where the target information is located and obtain the CSS selector or XPath path of that node's element. Because the positioning path information is obtained through the browser, the positioning path information corresponding to the target value information can be determined in a short time, which improves information crawling efficiency.
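To make the idea of a positioning path concrete, the sketch below derives a CSS selector for an element by walking up its ancestors. The name `cssPathOf` is hypothetical and not taken from the application; the function uses only the standard DOM properties `tagName`, `id`, `parentElement`, and `children`. A browser's own "copy selector" action may produce a different but equivalent selector.

```javascript
// Hypothetical helper: build a CSS selector leading from the document root
// (or from the nearest ancestor that has an id) down to the given element.
function cssPathOf(element) {
  const parts = [];
  let node = element;
  while (node && node.tagName) {
    const tag = node.tagName.toLowerCase();
    if (node.id) {
      // Assumption: the id is unique and a plain identifier; ids containing
      // special characters would need escaping (for example with CSS.escape).
      parts.unshift(`#${node.id}`);
      break;
    }
    const parent = node.parentElement;
    if (!parent) {
      parts.unshift(tag);
      break;
    }
    // Position among siblings with the same tag, 1-based as :nth-of-type expects.
    const sameTag = Array.from(parent.children).filter(
      (child) => child.tagName === node.tagName
    );
    const index = sameTag.indexOf(node) + 1;
    parts.unshift(sameTag.length > 1 ? `${tag}:nth-of-type(${index})` : tag);
    node = parent;
  }
  return parts.join(" > ");
}
```

In a browser console this could be called on the element currently selected in the inspector, assuming the console exposes it (many do, as `$0`).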
Optionally, in step 103, the obtaining of the location path information of the target DOM element may include the following steps:
21. Positioning the DOM node where the target value information is located;
22. Acquiring positioning path information corresponding to the positioning path of the first node element under the DOM node.
A DOM node is a node for each component of the XML document: the whole document is a document node, and each XML tag is an element node.
In specific implementation, the electronic device may first locate the DOM node where the target value information is located, and then obtain the first node element under that DOM node together with the CSS selector or XPath path of the first node element, so that accurate positioning path information can be obtained.
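Steps 21 and 22 could look roughly like the following for the XPath case. This is a sketch under assumptions: `xpathOfFirstChild` is a hypothetical name, "first node element" is read as the first element child (`firstElementChild`), and lower-case tag names are used as they would be for an HTML page.

```javascript
// Hypothetical helper: absolute XPath of the first element under `container`,
// where `container` is the DOM node holding the target value information.
function xpathOfFirstChild(container) {
  const first = container.firstElementChild;
  if (!first) return null; // the node has no element children
  const steps = [];
  for (let node = first; node && node.tagName; node = node.parentElement) {
    const tag = node.tagName.toLowerCase();
    const parent = node.parentElement;
    const sameTag = parent
      ? Array.from(parent.children).filter((c) => c.tagName === node.tagName)
      : [node];
    // XPath positions are 1-based; add one only when siblings share the tag.
    steps.unshift(
      sameTag.length > 1 ? `${tag}[${sameTag.indexOf(node) + 1}]` : tag
    );
  }
  return "/" + steps.join("/");
}
```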
Optionally, after the step 103, the following steps may be further included:
31. Verifying, through a console of the browser, whether the positioning path corresponding to the positioning path information is valid;
32. If yes, executing the operation of loading the target URL network resource to a new tab window;
33. If not, adjusting the positioning path information.
In the embodiment of the application, the console of the browser can be opened to verify whether the CSS selector or XPath path is valid. If it is valid, the method continues by loading the target URL network resource to a new tab window and then extracting the target value information; if it is invalid, the CSS selector or XPath path can be adjusted.
Optionally, in the step 31, verifying, by the console, whether the positioning path corresponding to the positioning path information is valid may include the following steps:
inputting the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully located, determining that the positioning path corresponding to the positioning path information is valid.
If the target DOM element cannot be successfully located, the CSS selector or XPath path is invalid.
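A console check of this kind might be sketched as follows. The function name `isLocatorValid` and the `{ type, path }` shape of the positioning path information are assumptions made for illustration; `querySelector` and `evaluate` are standard `document` methods.

```javascript
// Hypothetical check: does the positioning path locate at least one element?
// `doc` is expected to behave like the browser's `document` object.
function isLocatorValid(doc, locator) {
  try {
    if (locator.type === "css") {
      return doc.querySelector(locator.path) !== null;
    }
    if (locator.type === "xpath") {
      // 9 is the standard value of XPathResult.FIRST_ORDERED_NODE_TYPE.
      const result = doc.evaluate(locator.path, doc, null, 9, null);
      return result.singleNodeValue !== null;
    }
  } catch (err) {
    // A syntactically invalid selector or XPath throws; treat it as invalid.
    return false;
  }
  return false; // unknown locator type
}
```

In the console of the target page, a call such as `isLocatorValid(document, { type: "css", path: "#price > span" })` would report whether the selector finds an element; the selector shown is a placeholder.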
Optionally, in the step 33, adjusting the positioning path information may include the following steps:
acquiring positioning path information corresponding to a positioning path of a second node element under the DOM node, wherein the second node element and the first node element respectively correspond to different child nodes under the DOM node;
and adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
The electronic device can acquire the CSS selector or XPath path of the second node element under the DOM node to obtain the adjusted CSS selector or XPath path, and can then input the adjusted CSS selector or XPath path into the console to determine whether it is valid. Adjusting the positioning path information in this way ensures that positioning path information corresponding to the target value information can be found.
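The adjust-and-recheck loop described above can be sketched as a simple fallback over candidate paths. `choosePositioningPath` and its arguments are hypothetical names; the validity check is passed in as a function so that any verification method can be used.

```javascript
// Hypothetical helper: try the positioning path of each candidate node element
// in order (first, second, ...) and return the first one that is valid.
function choosePositioningPath(candidatePaths, isValid) {
  for (const path of candidatePaths) {
    if (isValid(path)) {
      return path;
    }
  }
  return null; // no candidate located the target DOM element
}

// Usage with placeholder selectors and a stand-in validity check.
const chosen = choosePositioningPath(
  ["div.list > span:nth-of-type(1)", "div.list > a:nth-of-type(1)"],
  (path) => path.includes(" > a")
);
```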
104. Loading the target URL network resource to a new tab window.
The electronic device can open a new tab window through the console of the browser and then load the target URL network resource into the new tab window, so that the target value information can be extracted from the new tab window.
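One way step 104 might be carried out from the console is with the standard `window.open` call, as in the sketch below. `openInNewTab` and `isTabReady` are hypothetical helper names, and the sketch assumes the new tab is same-origin with the page whose console is used; otherwise its `location` and `document` cannot be read.

```javascript
// Hypothetical helper: open the target URL in a new tab and return its window.
// `win` is expected to behave like the browser's `window` object.
function openInNewTab(win, targetUrl) {
  const tab = win.open(targetUrl, "_blank");
  if (!tab) {
    // window.open returns null when the browser refuses to open the tab.
    throw new Error("The browser did not open a new tab (pop-up blocker?)");
  }
  return tab;
}

// Hypothetical readiness check: the tab has left the initial blank page and
// its document has finished loading.
function isTabReady(tab) {
  try {
    return (
      tab.location.href !== "about:blank" &&
      tab.document.readyState === "complete"
    );
  } catch (err) {
    // Reading a cross-origin tab throws; report it as not ready.
    return false;
  }
}
```

In the console, `const tab = openInNewTab(window, location.href)` would open the current page again; the caller could then poll `isTabReady(tab)` before extracting anything.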
105. Extracting the target value information in the new tab window according to the positioning path information, and uniformly storing the target value information.
The target value information may include AJAX information and value information generated by JavaScript code. In a specific implementation, the electronic device can extract the target value information in the new tab window according to the positioning path information and then store it in the memory. In this way, AJAX information and value information generated by JavaScript code can be crawled using only the browser: no Java or Python runtime environment or automated application testing framework dependency package needs to be installed, and no distributed system infrastructure server or database needs to be configured. Because the target value information is located using only the browser's own functions, it can be crawled in a short time, the development cycle is short, the operation is simple, and the development threshold and maintenance cost are reduced.
In addition, because browsers are available on most existing operating systems, the method works across platforms and can meet the needs of penetration testing, security testing, and other ad hoc or targeted crawling projects on different platforms.
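The source does not specify a storage format beyond storing the target value information uniformly in memory. As one possible illustration, the sketch below collects extracted records into a single CSV string; the CSV format, the function names `escapeCsvField` and `toCsv`, and the record fields are all assumptions made for this example.

```javascript
// Quotes a field when it contains a comma, quote, or line break.
function escapeCsvField(value) {
  const text = String(value);
  return /[",\n\r]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Serializes `records` (plain objects) into CSV with the given column order.
// Missing fields are written as empty strings.
function toCsv(records, columns) {
  const header = columns.map(escapeCsvField).join(",");
  const lines = records.map((record) =>
    columns.map((column) => escapeCsvField(record[column] ?? "")).join(",")
  );
  return [header, ...lines].join("\n");
}
```

Keeping every extracted record in one structure with fixed columns is one reading of "uniformly storing"; a JSON array would serve equally well.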
Optionally, in the step 105, extracting the target value information in the new tab window according to the positioning path information may include the following steps:
51. injecting crawler code into the new tab window through the console;
52. executing the crawler code and extracting the target value information according to the positioning path information.
The crawler code may be JavaScript code; JavaScript is an object-oriented Web programming language.
In a specific implementation, crawler code can be injected into the new tab window through the console and executed there, and the target value information is extracted according to the CSS selector or XPath path. In this way, value information dynamically generated by JavaScript code can be extracted, which gives better dynamic information crawling capability.
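Steps 51 and 52 might look like the following in practice. This is a sketch under stated assumptions: `injectCrawler` and `extractTexts` are hypothetical names, injection is done by appending a script element (which only works when the new tab is same-origin with the console's page and the page's content security policy allows inline scripts), and extraction is shown for CSS selectors only.

```javascript
// Injects JavaScript source text into the document of `tab` (a window handle)
// by appending a script element to its head. Returns the created element.
function injectCrawler(tab, source) {
  const script = tab.document.createElement("script");
  script.textContent = source;
  tab.document.head.appendChild(script);
  return script;
}

// Collects the trimmed text content of every element in `doc` that matches
// the CSS selector `selector`.
function extractTexts(doc, selector) {
  return Array.from(doc.querySelectorAll(selector), (el) => el.textContent.trim());
}
```

For example, `extractTexts(tab.document, "ul.items > li")` would return the text of each list item, where `ul.items > li` is a made-up selector.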
Optionally, the crawler code is a JavaScript code, and the extracting the target value information according to the positioning path information in step 52 may include the following steps:
analyzing and rendering the target value information through the browser;
and downloading the target value information according to the positioning path information, wherein the target value information comprises AJAX information and value information generated by the JavaScript code, and AJAX refers to Asynchronous JavaScript and XML (Extensible Markup Language).
While the JavaScript code is executing, the browser parses and renders the target value information, which can then be downloaded according to the positioning path information. Because parsing and rendering are done by the browser itself, there is no need to install a Java or Python runtime environment or an automated application testing framework dependency package, or to configure any distributed system infrastructure server or database.
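Because AJAX content appears only after the browser has fetched and rendered it, crawler code typically has to wait for the target element before reading it. The source does not describe how this waiting is done; the polling approach, the function name `waitForElement`, and the default timings below are assumptions for illustration.

```javascript
// Resolves with the first element matching `selector` once it exists in `doc`,
// or rejects after `timeoutMs` milliseconds. Polls every `intervalMs`.
function waitForElement(doc, selector, timeoutMs = 5000, intervalMs = 100) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      const element = doc.querySelector(selector);
      if (element) {
        resolve(element);
        return;
      }
      if (Date.now() >= deadline) {
        reject(new Error("Timed out waiting for " + selector));
        return;
      }
      setTimeout(check, intervalMs);
    };
    check();
  });
}
```

A call such as `waitForElement(document, "#result").then((el) => console.log(el.textContent))` reads the value only after rendering has produced it.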
It can be seen that, in the information crawling method described in the embodiments of the present application, a target Uniform Resource Locator (URL) network resource is opened in a browser and the target page corresponding to the target URL network resource is entered; the Document Object Model (DOM) element where the target value information is located in the target page is positioned to obtain a target DOM element; positioning path information of the target DOM element is acquired; the target URL network resource is loaded into a new tab window; and the target value information is extracted in the new tab window according to the positioning path information and uniformly stored. As a result, no Java or Python runtime environment or automated application testing framework dependency package needs to be installed, and no distributed system infrastructure server or database needs to be configured; the target value information is located using only the browser's own functions, and a single-machine crawler based on a pure browser environment can be implemented in a short time.
In accordance with the above, referring to fig. 2, fig. 2 is a schematic flow chart of another information crawling method provided in the embodiment of the present application, and the information crawling method described in the embodiment is applied to the electronic device shown in fig. 1A or fig. 1B, and the method may include the following steps:
opening a target Uniform Resource Locator (URL) network resource in a browser, and judging whether the target URL network resource requires a login account;
if the target URL network resource requires a login account, acquiring login account information corresponding to the URL network resource; verifying the login account information, and if the verification succeeds, entering a target page corresponding to the target URL network resource; and positioning, through a page element inspection function of the browser, a target DOM element where the target value information is located;
if the target URL network resource does not require a login account, directly positioning, through the page element inspection function of the browser, the target DOM element where the target value information is located;
acquiring positioning path information of the target DOM element;
verifying, through a console of the browser, whether the positioning path corresponding to the positioning path information is valid;
if yes, loading the target URL network resource into a new tab window; if not, adjusting the positioning path information;
opening the new tab window through the console, and then loading the target URL network resource into the new tab window;
injecting crawler code into the new tab window through the console;
executing the crawler code and extracting the target value information according to the positioning path information;
and uniformly storing the target value information.
The detailed description of the above steps may refer to the information crawling method shown in fig. 1C, and is not described herein again.
In the information crawling method described in the embodiments of the present application, the target URL network resource is opened in the browser, and whether the target URL network resource requires a login account is judged. If a login account is required, login account information corresponding to the URL network resource is acquired and verified; if the verification succeeds, the target page corresponding to the target URL network resource is entered, and the target DOM element where the target value information is located is positioned through the page element inspection function of the browser. If no login account is required, the target DOM element where the target value information is located is positioned directly through the page element inspection function of the browser. Positioning path information of the target DOM element is then acquired, and whether the corresponding positioning path is valid is verified through the console of the browser; if yes, the target URL network resource is loaded into a new tab window, and if not, the positioning path information is adjusted. The new tab window is opened through the console and the target URL network resource is loaded into it; crawler code is injected into the new tab window through the console and executed; the target value information is extracted according to the positioning path information and uniformly stored. In this way, lightweight information can be crawled conveniently and efficiently at low cost. In addition, because the crawl is launched as normal browsing behavior in a real browser and carries normal user account information, it is difficult for existing anti-crawler techniques based on login restrictions to block it, which improves the ability to get past anti-crawler measures.
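The control flow of the method of fig. 2 can be summarized in a short sketch. Every member of `steps` below is a hypothetical callback standing in for one stage of the method (login check, login verification, path validation, tab opening, extraction, storage); none of these names come from the source, and the error handling is an assumption.

```javascript
// Runs the crawl flow once. `candidatePaths` lists the positioning path of the
// first node element followed by fallback paths; `steps` supplies the stages.
function crawl(targetUrl, candidatePaths, steps) {
  // Login branch: only verify account information when the resource needs it.
  if (steps.needsLogin(targetUrl) && !steps.verifyLogin(targetUrl)) {
    throw new Error("Login verification failed");
  }
  // Verify the positioning path, adjusting to the next candidate if invalid.
  const path = candidatePaths.find((candidate) => steps.isPathValid(candidate));
  if (path === undefined) {
    throw new Error("No valid positioning path");
  }
  // Open the new tab window, extract by the validated path, then store.
  const tab = steps.openTab(targetUrl);
  const values = steps.extract(tab, path);
  steps.store(values);
  return values;
}
```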
The following describes devices for implementing the above information crawling method, specifically:
In accordance with the above, please refer to fig. 3. Fig. 3 shows an electronic device provided in an embodiment of the present application, which includes: a processor and a memory; and one or more programs stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of:
opening a target Uniform Resource Locator (URL) network resource in a browser, and entering a target page corresponding to the target URL network resource;
positioning a Document Object Model (DOM) element where target value information is located in the target page to obtain a target DOM element;
acquiring positioning path information of the target DOM element;
loading the target URL network resource into a new tab window;
and extracting the target value information in the new tab window according to the positioning path information, and uniformly storing the target value information.
It can be seen that, in the electronic device described in the embodiments of the present application, a target Uniform Resource Locator (URL) network resource is opened in a browser and the target page corresponding to the target URL network resource is entered; the Document Object Model (DOM) element where the target value information is located in the target page is positioned to obtain a target DOM element; positioning path information of the target DOM element is acquired; the target URL network resource is loaded into a new tab window; and the target value information is extracted in the new tab window according to the positioning path information and uniformly stored. As a result, no Java or Python runtime environment or automated application testing framework dependency package needs to be installed, and no distributed system infrastructure server or database needs to be configured; the target value information is located using only the browser's own functions, and a single-machine crawler based on a pure browser environment can be implemented in a short time.
In one possible example, in the aspect of locating a document object model DOM element in which target value information is located in the target page to obtain a target DOM element, the program includes instructions for:
positioning, through a page element inspection function of the browser, the target DOM element where the target value information is located.
In one possible example, in the obtaining of location path information for the target DOM element, the program includes instructions for:
positioning a DOM node where the target value information is located;
and acquiring positioning path information corresponding to the positioning path of the first node element under the DOM node.
In one possible example, after obtaining the location path information of the target DOM element, the program further includes instructions for:
verifying whether the positioning path corresponding to the positioning path information is valid or not through a console of the browser;
if yes, executing the operation of loading the target URL network resource into a new tab window;
if not, adjusting the positioning path information.
In one possible example, in the verifying whether the positioning path corresponding to the positioning path information is valid through the console, the program includes instructions for:
and inputting a positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully positioned, determining that the positioning path corresponding to the positioning path information is valid.
In one possible example, in said adjusting the positioning path information, the program comprises instructions for:
acquiring positioning path information corresponding to a positioning path of a second node element under the DOM node, wherein the second node element and the first node element respectively correspond to different child nodes under the DOM node;
and adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
In one possible example, prior to loading the target URL network resource into the new tab window, the program further includes instructions for:
opening the new tab window through the console;
In the aspect of extracting the target value information in the new tab window according to the positioning path information, the program includes instructions for performing the following steps:
injecting crawler code into the new tab window through the console;
and executing the crawler code and extracting the target value information according to the positioning path information.
In one possible example, the crawler code is JavaScript code, and in the aspect of extracting the target value information according to the positioning path information, the program includes instructions for performing the following steps:
analyzing and rendering the target value information through the browser;
and downloading target value information according to the positioning path information, wherein the target value information comprises AJAX information and value information generated by the JavaScript codes, and the AJAX information is asynchronous JavaScript and extensible markup language (XML) information.
In one possible example, the program further comprises instructions for performing the steps of:
if the target URL network resource needs a login account, acquiring login account information corresponding to the URL network resource;
and verifying the login account information, and if the verification is successful, executing the operation of accessing the target page corresponding to the target URL network resource.
Referring to fig. 4A, fig. 4A is a schematic structural diagram of an information crawling apparatus according to the present embodiment. The information crawling device is applied to the electronic equipment shown in FIG. 1A or FIG. 1B, and comprises: an opening unit 401, a positioning unit 402, an obtaining unit 403, a loading unit 404, an extracting unit 405, and a storing unit 406, wherein,
the opening unit 401 is configured to open a target Uniform Resource Locator (URL) network resource in a browser, and enter a target page corresponding to the target URL network resource;
the positioning unit 402 is configured to position a Document Object Model (DOM) element where target value information is located in the target page, to obtain a target DOM element;
the obtaining unit 403 is configured to obtain the positioning path information of the target DOM element;
the loading unit 404 is configured to load the target URL network resource into a new tab window;
the extracting unit 405 is configured to extract the target value information in the new tab window according to the positioning path information;
the storage unit 406 is configured to store the target value information in a unified manner.
It can be seen that the information crawling apparatus described in the embodiments of the present application is applied to an electronic device. A target Uniform Resource Locator (URL) network resource is opened in a browser and the target page corresponding to the target URL network resource is entered; the Document Object Model (DOM) element where the target value information is located in the target page is positioned to obtain a target DOM element; positioning path information of the target DOM element is acquired; the target URL network resource is loaded into a new tab window; and the target value information is extracted in the new tab window according to the positioning path information and uniformly stored. As a result, no Java or Python runtime environment or automated application testing framework dependency package needs to be installed, and no distributed system infrastructure server or database needs to be configured; the target value information is located using only the browser's own functions, and a single-machine crawler based on a pure browser environment can be implemented in a short time.
In a possible example, in the aspect of positioning a document object model DOM element where target value information in the target page is located to obtain a target DOM element, the positioning unit 402 is specifically configured to:
position, through a page element inspection function of the browser, the target DOM element where the target value information is located.
In one possible example, in terms of obtaining the location path information of the target DOM element, the obtaining unit 403 is specifically configured to:
positioning a DOM node where the target value information is located;
and acquiring positioning path information corresponding to the positioning path of the first node element under the DOM node.
In one possible example, fig. 4B shows a modified structure of the information crawling apparatus depicted in fig. 4A; compared with fig. 4A, the apparatus may further include a verification unit 407 and an adjusting unit 408, wherein,
the opening unit 401 is further configured to open a console of the browser;
the verification unit 407 is configured to verify, through the console, whether the positioning path corresponding to the positioning path information is valid;
if yes, the loading unit 404 executes the operation of loading the target URL network resource into a new tab window;
the adjusting unit 408 is configured to adjust the positioning path information if the positioning path corresponding to the positioning path information is invalid.
In one possible example, in the aspect that the verification of whether the positioning path corresponding to the positioning path information is valid is performed by the console, the verification unit 407 is specifically configured to:
and inputting a positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully positioned, determining that the positioning path corresponding to the positioning path information is valid.
In a possible example, in terms of the adjusting the positioning path information, the adjusting unit 408 is specifically configured to:
acquiring positioning path information corresponding to a positioning path of a second node element under the DOM node, wherein the second node element and the first node element respectively correspond to different child nodes under the DOM node;
and adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
In one possible example, before the loading unit 404 loads the target URL network resource into the new tab window,
the opening unit 401 is further configured to open the new tab window through the console;
in the aspect of extracting the target value information in the new tab window according to the positioning path information, the extracting unit 405 is specifically configured to:
inject crawler code into the new tab window through the console;
and execute the crawler code and extract the target value information according to the positioning path information.
In a possible example, the crawler code is a JavaScript code, and in the aspect of extracting the target value information according to the positioning path information, the extracting unit 405 is specifically configured to:
analyzing and rendering the target value information through the browser;
and downloading target value information according to the positioning path information, wherein the target value information comprises AJAX information and value information generated by the JavaScript codes, and the AJAX information is asynchronous JavaScript and extensible markup language (XML) information.
In a possible example, the obtaining unit 403 is further configured to obtain login account information corresponding to the URL network resource if the target URL network resource requires a login account;
and the opening unit 401 is further configured to verify the login account information and, if the verification succeeds, execute the operation of entering the target page corresponding to the target URL network resource.
It can be understood that the functions of each program module of the information crawling apparatus in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not described herein again.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute part or all of the steps of any one of the information crawling methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any one of the information crawling methods described in the above method embodiments.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated unit, if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (20)

  1. An information crawling method, characterized in that the method comprises:
    opening a target Uniform Resource Locator (URL) network resource in a browser, and entering a target page corresponding to the target URL network resource;
    positioning a Document Object Model (DOM) element where the target value information is located in the target page to obtain a target DOM element;
    acquiring positioning path information of the target DOM element;
    loading the target URL network resource into a new tab window;
    and extracting the target value information in the new tab window according to the positioning path information, and uniformly storing the target value information.
  2. The method according to claim 1, wherein the positioning the Document Object Model (DOM) element where the target value information is located in the target page to obtain the target DOM element comprises:
    positioning, through a page element inspection function of the browser, the target DOM element where the target value information is located.
  3. The method of claim 2, wherein obtaining the location path information of the target DOM element comprises:
    positioning a DOM node where the target value information is located;
    and acquiring positioning path information corresponding to the positioning path of the first node element under the DOM node.
  4. The method of claim 3, wherein after obtaining the location path information of the target DOM element, the method further comprises:
    verifying whether the positioning path corresponding to the positioning path information is valid or not through a console of the browser;
    if yes, executing the operation of loading the target URL network resource into a new tab window;
    if not, adjusting the positioning path information.
  5. The method according to claim 4, wherein the verifying, by the console, whether the positioning path corresponding to the positioning path information is valid comprises:
    and inputting a positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully positioned, determining that the positioning path corresponding to the positioning path information is valid.
  6. The method of claim 4, wherein the adjusting the positioning path information comprises:
    acquiring positioning path information corresponding to a positioning path of a second node element under the DOM node, wherein the second node element and the first node element respectively correspond to different child nodes under the DOM node;
    and adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
  7. The method of any of claims 1-6, wherein prior to loading the target URL network resource into the new tab window, the method further comprises:
    opening the new tab window through the console;
    the extracting the target value information in the new tab window according to the positioning path information comprises:
    injecting crawler code into the new tab window through the console;
    and executing the crawler code and extracting the target value information according to the positioning path information.
  8. The method of claim 7, wherein the crawler code is JavaScript code, and wherein extracting the target value information according to the positioning path information comprises:
    analyzing and rendering the target value information through the browser;
    and downloading target value information according to the positioning path information, wherein the target value information comprises AJAX information and value information generated by the JavaScript codes, and the AJAX information is asynchronous JavaScript and extensible markup language (XML) information.
  9. The method of claim 1, further comprising:
    if the target URL network resource needs a login account, acquiring login account information corresponding to the URL network resource;
    and verifying the login account information, and if the verification is successful, executing the operation of accessing the target page corresponding to the target URL network resource.
  10. An information crawling apparatus, characterized in that the apparatus comprises an opening unit, a positioning unit, an obtaining unit, a loading unit, an extracting unit, and a storage unit, wherein,
    the opening unit is used for opening a target Uniform Resource Locator (URL) network resource in a browser and entering a target page corresponding to the target URL network resource;
    the positioning unit is used for positioning a Document Object Model (DOM) element where the target value information is located in the target page to obtain a target DOM element;
    the obtaining unit is used for acquiring the positioning path information of the target DOM element;
    the loading unit is used for loading the target URL network resource into a new tab window;
    the extracting unit is used for extracting the target value information in the new tab window according to the positioning path information;
    and the storage unit is used for uniformly storing the target value information.
  11. The apparatus according to claim 10, wherein in said positioning a document object model DOM element in which the target value information is located in the target page to obtain a target DOM element, said positioning unit is specifically configured to:
    position, through a page element inspection function of the browser, the target DOM element where the target value information is located.
  12. The apparatus according to claim 11, wherein, in obtaining the location path information of the target DOM element, the obtaining unit is specifically configured to:
    positioning a DOM node where the target value information is located;
    and acquiring positioning path information corresponding to the positioning path of the first node element under the DOM node.
  13. The apparatus according to any one of claims 10-12, further comprising a verification unit and an adjustment unit, wherein,
    the opening unit is also used for opening a console of the browser;
    the verification unit is used for verifying, through the console, whether the positioning path corresponding to the positioning path information is valid;
    the loading unit executes the operation of loading the target URL network resource to a new tab window if the positioning path corresponding to the positioning path information is valid;
    and the adjustment unit is used for adjusting the positioning path information if the positioning path corresponding to the positioning path information is invalid.
  14. The apparatus according to claim 13, wherein, in the aspect of verifying through the console whether the positioning path corresponding to the positioning path information is valid, the verification unit is specifically configured to:
    input the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully located, determine that the positioning path corresponding to the positioning path information is valid.
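The validity check described in claim 14 amounts to evaluating the path and confirming that it locates the target element. A minimal sketch, assuming `xml.etree.ElementTree` as a stand-in for evaluating the path in a browser console; `path_is_valid` and the sample markup are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

def path_is_valid(root: ET.Element, path: str, target: ET.Element) -> bool:
    """Return True if evaluating path from root locates the target element."""
    try:
        matches = root.findall(path)
    except SyntaxError:
        # ElementTree raises SyntaxError for paths it cannot evaluate,
        # for example absolute paths applied to an element.
        return False
    return any(match is target for match in matches)

root = ET.fromstring("<html><body><div><span>42</span></div></body></html>")
target = root.find("./body/div/span")

print(path_is_valid(root, "./body/div/span[1]", target))   # path reaches the target
print(path_is_valid(root, "./body/div/em[1]", target))     # no such element
print(path_is_valid(root, "/html/body/div/span", target))  # rejected path form
```

Treating an unevaluable path the same as a path that matches nothing keeps the caller's decision binary: proceed to loading, or adjust the path.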
  15. The apparatus according to any one of claims 10-14, wherein, in the aspect of adjusting the positioning path information, the adjustment unit is specifically configured to:
    acquire positioning path information corresponding to a positioning path of a second node element under the DOM node, wherein the second node element and the first node element correspond to different child nodes under the DOM node;
    and adjust the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
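The adjustment in claim 15 can be read as a fallback: when the path of the first node element does not locate anything, switch to the path of a second, different child node. The sketch below shows one way to express that, again with `xml.etree.ElementTree` standing in for the browser DOM; `locates`, `adjust_path`, and both example paths are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def locates(root: ET.Element, path: str) -> bool:
    """Return True if the path can be evaluated and matches at least one element."""
    try:
        return bool(root.findall(path))
    except SyntaxError:
        return False

def adjust_path(root: ET.Element, first_path: str, second_path: str) -> Optional[str]:
    """Keep the first child's path if it works, else fall back to the second child's."""
    if locates(root, first_path):
        return first_path
    if locates(root, second_path):
        return second_path
    return None  # neither candidate locates an element

root = ET.fromstring(
    "<html><body><div><span>42</span><em>USD</em></div></body></html>"
)
# The first candidate refers to a child that is absent from this page.
chosen = adjust_path(root, "./body/div/b[1]", "./body/div/em[1]")
print(chosen)
```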
  16. The apparatus according to any one of claims 10-15, wherein, before the loading unit loads the target URL network resource to the new tab window,
    the opening unit is also used for opening the new tab window through the console;
    in the aspect of extracting the target value information in the new tab window according to the positioning path information, the extraction unit is specifically configured to:
    inject crawler code into the new tab window through the console;
    and execute the crawler code and extract the target value information according to the positioning path information.
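The patent does not spell out what the injected crawler code looks like. As one possible illustration, the sketch below only builds the text of a small JavaScript snippet that reads the node at a given XPath through the standard `document.evaluate` DOM API; how the snippet is injected into the new tab window is left out. The function name `build_crawler_snippet` and the `price` id in the example path are hypothetical.

```python
import json

def build_crawler_snippet(xpath: str) -> str:
    """Return JavaScript source that reads the text of the node at xpath."""
    # json.dumps yields a properly escaped string literal usable in JavaScript.
    quoted = json.dumps(xpath)
    return (
        "(() => {"
        f" const r = document.evaluate({quoted}, document, null,"
        " XPathResult.FIRST_ORDERED_NODE_TYPE, null);"
        " const n = r.singleNodeValue;"
        " return n ? n.textContent.trim() : null;"
        " })()"
    )

# Hypothetical positioning path; the snippet text would be run in the page context.
snippet = build_crawler_snippet('//div[@id="price"]/span[1]')
print(snippet)
```

Quoting the path with `json.dumps` rather than plain string concatenation avoids breaking the snippet when the path itself contains quotation marks.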
  17. An electronic device, comprising a browser, a console of the browser, a network resource loader, a network resource parser, and a memory, wherein,
    the browser is used for opening a target Uniform Resource Locator (URL) network resource and entering a target page corresponding to the target URL network resource;
    the console is used for opening a new tab window and loading the target URL network resource in the new tab window;
    the network resource parser is used for positioning a Document Object Model (DOM) element where target value information is located in the target page to obtain a target DOM element, and for acquiring positioning path information of the target DOM element;
    the network resource loader is used for loading the target value information through the target URL network resource;
    the network resource parser is further used for extracting the target value information in the new tab window according to the positioning path information;
    and the memory is used for uniformly storing the target value information.
  18. An electronic device comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-9.
  19. A computer-readable storage medium, characterized in that it stores a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-9.
  20. A computer program product, characterized in that the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform the method according to any one of claims 1-9.
CN202080096823.2A 2020-05-14 2020-05-14 Information crawling method and device, electronic equipment and storage medium Pending CN115087969A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/090329 WO2021226954A1 (en) 2020-05-14 2020-05-14 Information crawling method and apparatus, and electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115087969A (en) 2022-09-20

Family

ID=78526260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080096823.2A Pending CN115087969A (en) 2020-05-14 2020-05-14 Information crawling method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115087969A (en)
WO (1) WO2021226954A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8452850B2 (en) * 2000-12-14 2013-05-28 International Business Machines Corporation Method, apparatus and computer program product to crawl a web site
CN102567530B (en) * 2011-12-31 2014-06-11 凤凰在线(北京)信息技术有限公司 Intelligent extraction system and intelligent extraction method for article type web pages
CN105354337A (en) * 2015-12-08 2016-02-24 北京奇虎科技有限公司 Web crawler implementation method and web crawler system
CN106484775A (en) * 2016-09-12 2017-03-08 北京量科邦信息技术有限公司 A kind of crawler capturing method and system based on selenium
CN107729385A (en) * 2017-09-19 2018-02-23 杭州安恒信息技术有限公司 A kind of method for gathering dynamic web page partial data content

Also Published As

Publication number Publication date
WO2021226954A1 (en) 2021-11-18

Similar Documents

Publication Publication Date Title
EP2374078B1 (en) Method for server-side logging of client browser state through markup language
CN107908952B (en) Method and device for identifying real machine and simulator and terminal
CN109040182B (en) Service access method and device, electronic equipment and storage medium
CN108536594B (en) Page testing method and device and storage equipment
CN107766358B (en) Page sharing method and related device
CN111078556B (en) Application testing method and device
CN108415804A (en) Obtain method, terminal device and the computer readable storage medium of information
CN108494762A (en) Web access method, device and computer readable storage medium, terminal
CN106021112A (en) Program testing system, method and device
CN106294159A (en) A kind of method controlling screenshotss and screenshotss control device
CN104965831B (en) A kind of network address error correction method, server, terminal and system
CN108763297B (en) Webpage resource processing method and device and mobile terminal
CN111563257A (en) Data detection method and device, computer readable medium and terminal equipment
CN110674444B (en) Method and terminal for downloading dynamic webpage
CN105740419A (en) Method and apparatus for acquiring dynamically loaded content in webpage
CN110198324B (en) Data monitoring method and device, browser and terminal
CN112307386A (en) Information monitoring method, system, electronic device and computer readable storage medium
CN111177612B (en) Page login authentication method and related device
CN110838929B (en) System error checking method and system error checking device
CN115087969A (en) Information crawling method and device, electronic equipment and storage medium
CN109145182B (en) Data acquisition method and device, computer equipment and system
CN105474576A (en) Method for processing http message and electronic device implementing the same
CN110445746A (en) Cookie acquisition methods, device and storage equipment
CN108874462B (en) Browser behavior acquisition method and device, storage medium and electronic equipment
CN114490307A (en) Unit testing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination