WO2021226954A1 - Information crawling method, apparatus, electronic device, and storage medium - Google Patents
Information crawling method, apparatus, electronic device, and storage medium
- Publication number
- WO2021226954A1 (application PCT/CN2020/090329)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- information
- network resource
- positioning path
- target value
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Definitions
- This application relates to the field of computers, and in particular to an information crawling method, device, electronic equipment, and storage medium.
- Crawler technology is a process that automatically collects, analyzes, and stores large amounts of valuable information from the network.
- Existing crawler systems are mainly divided into stand-alone and distributed systems in terms of system architecture. These crawler systems are mainly based on popular Python and Java crawler frameworks (such as Scrapy framework, Nutch framework) to realize the analysis and crawling of target value information.
- However, the interfaces of existing crawler frameworks are complex and heavyweight.
- The main defects are as follows. First, the development cycle is long and the maintenance cost is high. For example, when a crawling task is implemented with an existing stand-alone or distributed crawler framework, it is necessary to consider not only how to write the Python or Java code, but also how to configure and manage the server and the corresponding database. For temporary crawling needs, the development cycle of the existing crawler frameworks is therefore too long, and the learning and maintenance costs are too high. Second, it is difficult to crawl asynchronous JavaScript and XML (AJAX) information and value information dynamically generated by JavaScript code.
- the embodiments of the present application provide an information crawling method and related products, which can realize low-cost, convenient and efficient lightweight information crawling.
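- An information crawling method includes:
- opening a target uniform resource locator (URL) network resource in a browser, and entering the target page corresponding to the target URL network resource;
- locating the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element;
- obtaining location path information of the target DOM element;
- loading the target URL network resource into a new tab window;
- extracting the target value information in the new tab window according to the location path information, and uniformly storing the target value information.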
- an embodiment of the present application provides an information crawling device, the device includes: an opening unit, a positioning unit, an acquisition unit, a loading unit, an extraction unit, and a storage unit, wherein:
- the opening unit is configured to open the target uniform resource locator (URL) network resource in the browser, and enter the target page corresponding to the target URL network resource;
- the positioning unit is configured to locate the document object model (DOM) element in the target page where the target value information is located, to obtain the target DOM element;
- the loading unit is configured to load the target URL network resource into the new tab window;
- the extraction unit is configured to extract the target value information in the new tab window according to the location path information;
- the storage unit is configured to uniformly store the target value information.
- An embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for executing the steps in the first aspect of the embodiments of the present application.
- An embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute some or all of the steps described in the first aspect of the embodiments of the present application.
- the embodiments of the present application provide a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute as implemented in this application.
- the computer program product may be a software installation package.
- FIG. 1A is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- FIG. 1B is a schematic structural diagram of another electronic device provided by an embodiment of the present application.
- FIG. 1C is a schematic flowchart of an information crawling method disclosed in an embodiment of the present application.
- FIG. 2 is a schematic flowchart of another information crawling method disclosed in an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of another electronic device disclosed in an embodiment of the present application.
- FIG. 4A is a schematic structural diagram of an information crawling device disclosed in an embodiment of the present application.
- FIG. 4B is a modified structure of the information crawling device described in FIG. 4A disclosed in an embodiment of the present application.
- The electronic devices involved in the embodiments of this application may include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices (such as smart watches and wireless headsets), computing devices or other processing devices connected to wireless modems, and various forms of user equipment (UE), mobile stations (MS), terminal devices, and so on.
- the electronic device can also be a server.
- Python: an object-oriented, cross-platform computer programming language.
- Java: an object-oriented, cross-platform computer programming language.
- JavaScript: an object-oriented Web programming language.
- User-Agent: a hypertext transfer protocol (HTTP) header field used to identify the browser, the operating system it runs on, the encryption level, and the browser rendering engine.
- Cookie: an HTTP header field used to identify legitimate users.
- FIG. 1A is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
- the electronic device 100 may include a control circuit, and the control circuit may include a storage and processing circuit 110.
- The storage and processing circuit 110 may include memory, such as hard disk drive memory, non-volatile memory (such as flash memory or other electronically programmable read-only memory used to form a solid-state drive), and volatile memory (such as static or dynamic random access memory), which is not limited in the embodiments of the present application.
- the processing circuit in the storage and processing circuit 110 may be used to control the operation of the electronic device 100.
- the processing circuit can be implemented based on one or more microprocessors, microcontrollers, baseband processors, power management units, audio codec chips, application specific integrated circuits, display driver integrated circuits, and so on.
- The storage and processing circuit 110 can be used to run software in the electronic device 100, such as Internet browsing applications, voice over internet protocol (VOIP) phone call applications, email applications, media playback applications, and operating system functions. This software can be used to perform control operations such as camera-based image capture, ambient light measurement based on ambient light sensors, proximity measurement based on proximity sensors, information display based on status indicators such as light-emitting diodes, touch event detection based on touch sensors, functions associated with displaying information on multiple (for example, layered) displays, operations associated with performing wireless communication functions, operations associated with collecting and generating audio signals, control operations associated with collecting and processing button press event data, and other functions in the electronic device 100, which are not limited in the embodiments of the present application.
- the electronic device 100 may further include an input-output circuit 150.
- the input-output circuit 150 can be used to enable the electronic device 100 to input and output data, that is, allow the electronic device 100 to receive data from an external device and also allow the electronic device 100 to output data from the electronic device 100 to the external device.
- the input-output circuit 150 may further include a sensor 170.
- The sensor 170 may include an ambient light sensor, a proximity sensor based on light and capacitance, a touch sensor (for example, a light-based touch sensor and/or a capacitive touch sensor, where the touch sensor structure may be used independently), an acceleration sensor, a gravity sensor, and other sensors.
- the input-output circuit 150 may also include one or more displays, such as the display 130.
- the display 130 may include one or a combination of a liquid crystal display, an organic light emitting diode display, an electronic ink display, a plasma display, and a display using other display technologies.
- The display 130 may include a touch sensor array (i.e., the display 130 may be a touch display screen).
- The touch sensor can be a capacitive touch sensor formed by an array of transparent touch sensor electrodes (such as indium tin oxide (ITO) electrodes), or a touch sensor formed using other touch technologies, such as sonic touch, pressure-sensitive touch, resistive touch, and optical touch, which is not limited in the embodiments of the present application.
- the audio component 140 may be used to provide audio input and output functions for the electronic device 100.
- the audio component 140 in the electronic device 100 may include a speaker, a microphone, a buzzer, a tone generator, and other components for generating and detecting sounds.
- the communication circuit 120 may be used to provide the electronic device 100 with the ability to communicate with external devices.
- the communication circuit 120 may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals.
- the wireless communication circuit in the communication circuit 120 may include a radio frequency transceiver circuit, a power amplifier circuit, a low noise amplifier, a switch, a filter, and an antenna.
- the wireless communication circuit in the communication circuit 120 may include a circuit for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals.
- the communication circuit 120 may include a near field communication antenna and a near field communication transceiver.
- the communication circuit 120 may also include a cellular phone transceiver and antenna, a wireless local area network transceiver circuit and antenna, and so on.
- the electronic device 100 may further include a battery, a power management circuit, and other input-output units 160.
- the input-output unit 160 may include buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light emitting diodes, and other status indicators.
- the user can input commands through the input-output circuit 150 to control the operation of the electronic device 100, and can use the output data of the input-output circuit 150 to realize receiving status information and other outputs from the electronic device 100.
- crawler frameworks can include stand-alone crawler frameworks and distributed crawler frameworks.
- Scrapy is a stand-alone crawler framework based on the Python language, composed mainly of five modules: the Scrapy engine, the task scheduler, the downloader, the crawler, and the pipeline.
- the Scrapy engine is responsible for sending crawl commands to each module, as well as coordinating the communication and data transfer between the modules.
- the task scheduler performs unified scheduling and queue management on the uniform resource locator (URL) network resources sent by the Scrapy engine.
- the downloader is responsible for sending URL requests to URL network resources and obtaining URL responses.
- the crawler parses the response content and extracts the required value information, and transmits it to the pipeline for unified analysis, filtering and storage.
- Nutch is a distributed search engine and crawler framework based on Java language. It mainly relies on distributed infrastructure to realize distributed crawling and data storage of massive amounts of information. It is mainly composed of generator, task scheduler, downloader, parser, and memory module.
- the generator mainly queries the target value information from the database, and the task scheduler dynamically sends search tasks to the distributed system infrastructure cluster to complete the search and indexing of the target value information.
- the downloader and the parser are responsible for establishing the URL network request and extracting the information fields in the URL network response. Finally, the memory completes the centralized storage of the target value information.
- The above-mentioned existing crawler frameworks have good crawling capabilities for mass information crawling tasks.
- However, the existing crawler frameworks have long development cycles and high maintenance costs, have difficulty crawling AJAX information and value information dynamically generated by JavaScript code, and are easily restricted by anti-crawler mechanisms.
- FIG. 1B provides a schematic structural diagram of another electronic device.
- The information crawling framework may include a browser 100, a browser console 110, a network resource loader 120, a network resource parser 130, and a storage 140, wherein:
- the browser 100 is configured to open a target uniform resource locator (URL) network resource, and enter a target page corresponding to the target URL network resource;
- the console 110 is configured to open a new tab window, and load the URL network resource in the new tab window;
- the network resource parser 130 is configured to locate the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element, and to obtain location path information of the target DOM element;
- the network resource loader 120 is configured to load the target value information through the URL network resource;
- the network resource parser 130 is further configured to extract the target value information in the new tab window according to the positioning path information;
- the storage 140 is configured to uniformly store the target value information.
- the above information crawling framework does not need to install Java, Python operating environment and application automatic test framework dependency packages, and does not need to configure any distributed system infrastructure servers and databases. It only needs to realize the positioning of the target value information based on the browser's own functions.
- a single-machine crawler based on a pure browser environment can be realized in a relatively short period of time.
- the development cycle is short and the operation is simple, which effectively reduces the development threshold, configuration management and maintenance costs.
- This solution is cross-platform, and can serve penetration testing, security testing, and other temporary or targeted crawling projects on different platforms.
- This solution initiates normal browsing behavior from a real browser, and is therefore difficult for anti-crawler mechanisms to restrict.
- FIG. 1C is a schematic flowchart of an information crawling method provided by an embodiment of the present application.
- The information crawling method described in this embodiment is applied to the electronic device shown in FIG. 1A or FIG. 1B.
- The information crawling method includes:
- the target URL network resource can be opened in the browser, and the target uniform resource locator URL is used to identify the location and access method of the network resource.
- the target page is a browser page corresponding to the target URL network resource, and the target URL network resource can be opened through the browser to enter the target page.
- In step 101, when opening the target uniform resource locator (URL) network resource in the browser, the following steps may also be included:
- if the target URL network resource requires a login account, obtaining the login account information corresponding to the URL network resource;
- verifying the login account information, and if the verification succeeds, entering the target page corresponding to the target URL network resource.
- If the target URL network resource requires a login account, the login account information corresponding to the URL network resource can be obtained.
- The login account information can be obtained by receiving the login account information entered by the user.
- the electronic device may receive the user name, password, and verification code input by the user through the browser.
- the login account information can be recorded and saved, so that when subsequent information crawling is performed, the saved login account information can be directly called for account login, and the user does not need to repeatedly input the login account information.
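The login handling described above can be sketched roughly as follows. This is a minimal illustration in JavaScript, not the implementation disclosed in the application: the names `enterTargetPage`, `requiresLogin`, `savedAccounts`, `promptUser`, and `verify` are hypothetical, and the actual verification logic is assumed to be supplied by the caller.

```javascript
// Hypothetical sketch: decide whether the target page may be entered.
// `resource` is assumed to look like { url, requiresLogin }.
// `savedAccounts` is a Map from URL to previously verified account information.
function enterTargetPage(resource, savedAccounts, promptUser, verify) {
  if (!resource.requiresLogin) {
    // No login account required: enter the target page directly.
    return { entered: true, usedSaved: false };
  }
  // Reuse saved login account information so the user is not asked again.
  const saved = savedAccounts.get(resource.url);
  const account = saved !== undefined ? saved : promptUser(resource.url);
  if (!verify(account)) {
    return { entered: false, usedSaved: saved !== undefined };
  }
  // Record the verified account for subsequent crawling runs.
  savedAccounts.set(resource.url, account);
  return { entered: true, usedSaved: saved !== undefined };
}
```

On a second call for the same URL, the saved account is reused and `promptUser` is not invoked again.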
- The embodiment of the present application initiates normal browsing behavior from a real browser and carries normal user account information. Existing anti-crawler technologies based on login restrictions therefore have difficulty restricting it, which improves its ability to withstand anti-crawler measures.
- The document object model (DOM) element in the target page where the target value information is located can be located through the browser to obtain the target DOM element. In this way, the target value information can be located using only the browser, with no need to install an application automatic test framework to simulate real web browsing, which saves cost and is simple to operate.
- locating the document object model DOM element where the target value information in the target page is located to obtain the target DOM element may include the following steps:
- the browser has a page element review function
- the electronic device can locate the target DOM element where the target value information is located based on the page element review function of the browser, so that accurate target DOM element positioning results can be obtained.
- The location path information may include cascading style sheets (CSS) selectors or XML Path Language (XPath) paths.
- The electronic device can locate the DOM node where the target value information is located, obtain the CSS selector or XPath path of the node element, and thereby obtain the positioning path information through the browser. The positioning path information corresponding to the target value information can be obtained in a relatively short time, which improves the efficiency of information crawling.
- Obtaining the location path information of the target DOM element may include the following steps:
- In the DOM, every component of an XML document is a node: the entire document is a document node, and each XML tag is an element node.
- The electronic device can first locate the DOM node where the target value information is located, and then obtain the first node element under the DOM node to obtain the CSS selector or XPath path that locates the first node element. In this way, accurate positioning path information can be obtained.
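As an illustration of how a CSS selector for a node element might be derived, the sketch below walks from an element up to the document root. It is an assumption-laden example rather than the method of the application: the function name `cssPath` is hypothetical, ids are assumed to be unique and free of characters that need escaping, and only the standard `tagName`, `id`, `parentNode`, and `children` properties are used.

```javascript
// Hypothetical sketch: build a CSS selector for `el` by walking up the tree.
function cssPath(el) {
  const parts = [];
  for (let node = el; node && node.tagName; node = node.parentNode) {
    const tag = node.tagName.toLowerCase();
    if (node.id) {
      // Assumption: an id is unique, so the path can stop here.
      parts.unshift(`${tag}#${node.id}`);
      break;
    }
    const parent = node.parentNode;
    // Same-tag siblings decide whether an :nth-of-type index is needed.
    const siblings = parent && parent.children
      ? Array.from(parent.children).filter((c) => c.tagName === node.tagName)
      : [node];
    const index = siblings.indexOf(node) + 1;
    parts.unshift(siblings.length > 1 ? `${tag}:nth-of-type(${index})` : tag);
  }
  return parts.join(" > ");
}
```

Browser developer tools typically offer a comparable "copy selector" or "copy XPath" action in the element inspector, which may be the simpler route in practice.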
- For step 103, the following steps may be further included:
- The console of the browser can be opened to verify whether the CSS selector or XPath path is valid. If it is valid, the target URL network resource is loaded into the new tab window and the target value information is then extracted. If it is invalid, the CSS selector or XPath path can be adjusted.
- verifying whether the positioning path corresponding to the positioning path information is valid through the console may include the following steps:
- Input the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully located, it is determined that the positioning path corresponding to the positioning path information is valid.
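The validity check described above could look roughly like this when typed into a console. The sketch assumes a hypothetical locator shape `{ type, path }` and hypothetical function names; `querySelectorAll` and `evaluate` are the standard DOM methods, and the constant 7 is the numeric value of `XPathResult.ORDERED_NODE_SNAPSHOT_TYPE`.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM standard.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper: count the elements a locator matches in `doc`.
// `locator` is assumed to be { type: "css" | "xpath", path: string }.
function countMatches(doc, locator) {
  try {
    if (locator.type === "xpath") {
      const result = doc.evaluate(locator.path, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
      return result.snapshotLength;
    }
    return doc.querySelectorAll(locator.path).length;
  } catch (err) {
    // A malformed selector or expression throws; treat it as no match.
    return 0;
  }
}

// The positioning path is considered valid if it locates at least one element.
function isLocatorValid(doc, locator) {
  return countMatches(doc, locator) > 0;
}
```

In a browser console this might be called as `isLocatorValid(document, { type: "css", path: "ul#prices > li" })`. Chrome and Firefox consoles also provide the `$$(selector)` and `$x(xpath)` shortcuts for the same kind of quick check.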
- adjusting the positioning path information may include the following steps:
- The electronic device can obtain the CSS selector or XPath path of the second node element under the DOM node to obtain the adjusted CSS selector or XPath path, and can then input the adjusted CSS selector or XPath path into the console to confirm whether it is valid. In this way, adjusting the positioning path information ensures that positioning path information corresponding to the target value information is obtained.
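One way to picture the adjustment is as trying candidate selectors in order until one locates an element. The sketch below is only an illustration under that assumption: `firstValidSelector` and the candidate list are hypothetical, and only CSS selectors are handled.

```javascript
// Hypothetical sketch: return the first candidate CSS selector that matches
// at least one element in `doc`, or null if none does.
function firstValidSelector(doc, candidates) {
  for (const selector of candidates) {
    let matched = 0;
    try {
      matched = doc.querySelectorAll(selector).length;
    } catch (err) {
      matched = 0; // malformed selector: move on to the next candidate
    }
    if (matched > 0) {
      return selector;
    }
  }
  return null;
}
```

Here the first entry would correspond to the selector of the first node element and later entries to adjusted selectors, such as that of the second node element.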
- the electronic device can open a new tab window through the browser console, and then load the target URL network resource into the new tab window, thereby extracting the target value information in the new tab window.
- the target value information can include AJAX information and value information generated by JavaScript code.
- The electronic device can extract the target value information in the new tab window according to the positioning path information, and then store the target value information in the memory.
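The application does not specify a storage format, so the following is only one possible reading of "uniform storage": every extracted value is normalized into the same record shape and serialized as one JSON object per line. The names `toRecords` and `serialize`, the record fields, and the de-duplication step are all assumptions made for illustration.

```javascript
// Hypothetical sketch: normalize extracted values into uniform records.
function toRecords(sourceUrl, values, crawledAt = new Date().toISOString()) {
  const seen = new Set();
  const records = [];
  for (const raw of values) {
    const value = String(raw).trim();
    // Assumption: empty and repeated values are not worth storing.
    if (value === "" || seen.has(value)) continue;
    seen.add(value);
    records.push({ sourceUrl, value, crawledAt });
  }
  return records;
}

// Serialize records as newline-delimited JSON for storage or download.
function serialize(records) {
  return records.map((record) => JSON.stringify(record)).join("\n");
}
```

Keeping every record in one shape means values crawled from different pages can be stored and filtered together.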
- In this way, AJAX information and value information generated by JavaScript code can be crawled using only the browser, without installing Java or Python runtime environments or application automatic test framework dependency packages, and without configuring any distributed system infrastructure servers or databases. Locating the target value information relies only on the browser's own functions, so the target value information can be crawled in a relatively short time; the development cycle is short and the operation is simple, which lowers the development threshold and maintenance cost.
- This solution is also cross-platform, and can serve penetration testing, security testing, and other temporary or targeted crawling projects on different platforms.
- extracting the target value information in the new tab window according to the positioning path information may include the following steps:
- the crawler code may be JavaScript code
- JavaScript is an object-oriented Web programming language.
- The crawler code can be injected into the new tab window through the console and executed there, and the target value information can be extracted according to the CSS selector or XPath path. In this way, value information dynamically generated by JavaScript code can be extracted, achieving better dynamic information crawling capability.
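A minimal example of what such crawler code might look like is given below. It is a sketch under stated assumptions, not the code of the application: `extractValues` is a hypothetical name, the positioning path is assumed to be a CSS selector, and the value of interest is assumed to be each element's text content.

```javascript
// Hypothetical crawler snippet: collect the trimmed text of every element
// that the CSS selector locates in `doc`, skipping empty results.
function extractValues(doc, selector) {
  return Array.from(doc.querySelectorAll(selector), (el) => (el.textContent || "").trim())
    .filter((text) => text.length > 0);
}
```

From a console, one might open the tab with `const tab = window.open(url)` and, once the page has finished loading and rendering, call `extractValues(tab.document, selector)`. This assumes the new tab is same-origin with the page whose console is used; otherwise the browser blocks access to `tab.document`. Because the browser has already executed the page's scripts, content produced by AJAX calls or JavaScript is present in the DOM at that point.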
- In step 52, extracting the target value information according to the positioning path information may include the following steps:
- The target value information is downloaded according to the positioning path information. The target value information includes AJAX information and value information generated by the JavaScript code, where AJAX information is asynchronous JavaScript and extensible markup language (XML) information.
- While the JavaScript code executes, the target value information can be parsed and rendered by the browser and then downloaded according to the positioning path information. The browser can thus parse and render the target value information without installing Java or Python runtime environments or application automatic test framework dependency packages, and without configuring any distributed system infrastructure servers or databases.
- The information crawling method described in the embodiment of this application opens the target uniform resource locator (URL) network resource in the browser and enters the target page corresponding to the target URL network resource; locates the document object model (DOM) element in the target page where the target value information is located to obtain the target DOM element; obtains the location path information of the target DOM element; loads the target URL network resource into the new tab window; extracts the target value information in the new tab window according to the location path information; and uniformly stores the target value information.
- There is no need to install Java or Python runtime environments or application automatic test framework dependency packages, or to configure any distributed system infrastructure servers or databases. The target value information is located using only the browser's own functions, so a single-machine crawler based on a pure browser environment can be realized in a relatively short time.
- The development cycle is short and the operation is simple, which effectively reduces the development threshold and the configuration management and maintenance costs, thereby achieving low-cost, convenient, and efficient lightweight information crawling.
- FIG. 2 is a schematic flowchart of another information crawling method provided by an embodiment of the present application.
- The information crawling method described in this embodiment is applied to the electronic device shown in FIG. 1A or FIG. 1B.
- the method may include the following steps:
- The information crawling method described in the embodiment of this application opens the target URL network resource in a browser and determines whether the target URL network resource requires a login account. If it does, the login account information corresponding to the URL network resource is obtained and verified; if the verification succeeds, the target page corresponding to the target URL network resource is entered, and the target DOM element where the target value information is located is located through the browser's page element review function. If the target URL network resource does not require a login account, the target DOM element where the target value information is located can be located directly through the browser's page element review function. The location path information of the target DOM element is then obtained, and the browser console is used to verify whether the location path corresponding to the location path information is valid. If it is not valid, the location path information is adjusted. If it is valid, a new tab window is opened through the console, the target URL network resource is loaded into the new tab window, and crawler code is injected into the new tab window through the console; execute crawler
- the following is a device for implementing the above information crawling method, which is specifically as follows:
- FIG. 3 shows an electronic device provided by an embodiment of the present application, including a processor, a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the following steps:
- the target value information is extracted in the new tab window according to the positioning path information, and the target value information is uniformly stored.
- the electronic device described in the embodiment of this application by opening the target Uniform Resource Locator URL network resource in the browser, enters the target page corresponding to the target URL network resource; locates the document in the target page where the target value information is located Object model DOM element, get the target DOM element; get the location path information of the target DOM element; load the target URL network resource into the new tab window; extract the target value information in the new tab window according to the location path information, and unify the target value information
- Java, Python runtime environment and application automatic test framework dependency packages no need to configure any distributed system infrastructure server and database, only need to realize the positioning of the target value information based on the browser's own functions, which can be compared
- a single-machine crawler based on a pure browser environment can be realized in a short time.
- the development cycle is short and the operation is simple, which effectively reduces the development threshold, configuration management and maintenance costs, so as to achieve low-cost, convenient and efficient light Crawling of magnitude
- the program includes instructions for executing the following steps:
- the program includes instructions for executing the following steps:
- the program further includes instructions for executing the following steps:
- the program includes instructions for executing the following steps:
- Input the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully located, it is determined that the positioning path corresponding to the positioning path information is valid.
- the program includes instructions for executing the following steps:
- the program before loading the target URL network resource into the new tab window, the program further includes instructions for executing the following steps:
- the program includes instructions for executing the following steps:
- the crawler code is executed, and the target value information is extracted according to the positioning path information.
- the crawler code is JavaScript code
- the program includes instructions for executing the following steps:
- the target value information is downloaded according to the positioning path information, the target value information includes AJAX information and value information generated by the JavaScript code, and the AJAX information is asynchronous JavaScript and extensible markup language XML information.
- the program further includes instructions for executing the following steps:
- the target URL network resource requires a login account, obtain the login account information corresponding to the URL network resource;
- the login account information is verified, and if the verification is successful, the operation of entering the target page corresponding to the target URL network resource is performed.
- FIG. 4A is a schematic structural diagram of an information crawling device provided in this embodiment.
- the information crawling device is applied to the electronic equipment shown in FIG. 1A or FIG.
- the opening unit 401 is configured to open a target uniform resource locator URL network resource in a browser, and enter the target page corresponding to the target URL network resource;
- the positioning unit 402 is configured to locate the DOM element of the document object model in the target page where the target value information is located, to obtain the target DOM element;
- the acquiring unit 403 is configured to acquire the location path information of the target DOM element
- the loading unit 404 is configured to load the target URL network resource to the new tab window
- the extracting unit 405 is configured to extract the target value information in the new tab window according to the positioning path information
- the storage unit 406 is configured to uniformly store the target value information.
- the information crawling device described in the embodiment of this application is applied to electronic equipment.
- the target Uniform Resource Locator URL network resource By opening the target Uniform Resource Locator URL network resource in the browser, enter the target page corresponding to the target URL network resource; locate the target page In the document object model DOM element where the target value information is located, the target DOM element is obtained; the location path information of the target DOM element is obtained; the target URL network resource is loaded into the new tab window; the target value information is extracted in the new tab window according to the location path information, It also stores the target value information uniformly.
- Java, Python operating environment and application automatic test framework dependency packages no need to configure any distributed system infrastructure servers and databases, and only need to realize the target value based on the browser's own functions.
- Information positioning can realize a stand-alone crawler based on a pure browser environment in a relatively short period of time.
- the development cycle is short and the operation is simple, which effectively reduces the development threshold, configuration management and maintenance costs, which can be realized Low-cost, convenient and efficient lightweight information crawling.
- the positioning unit 402 is specifically configured to:
- the acquiring unit 403 is specifically configured to:
- FIG. 4B is a modified structure of the information crawling device described in FIG. 4A. Compared with FIG. 4A, it may further include a verification unit 407 and an adjustment unit 408, wherein,
- the opening unit 401 is also used to open the console of the browser;
- the verification unit 407 is configured to verify through the console whether the positioning path corresponding to the positioning path information is valid;
- the loading unit 404 executes the operation of loading the target URL network resource into the new tab window
- the adjusting unit 408 is configured to adjust the positioning path information if the positioning path corresponding to the positioning path information is invalid.
- the verification unit 407 is specifically configured to:
- Input the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be successfully located, it is determined that the positioning path corresponding to the positioning path information is valid.
- the adjusting unit 408 is specifically configured to:
- the loading unit loads the target URL network resource before the new tab window
- the opening unit 401 is further configured to open the new tab window through the console;
- the extracting unit 405 is specifically configured to:
- the crawler code is executed, and the target value information is extracted according to the positioning path information.
- the crawler code is JavaScript code
- the extracting unit 405 is specifically configured to:
- the target value information is downloaded according to the positioning path information, the target value information includes AJAX information and value information generated by the JavaScript code, and the AJAX information is asynchronous JavaScript and extensible markup language XML information.
- the obtaining unit 403 is further configured to obtain login account information corresponding to the URL network resource if the target URL network resource requires a login account;
- the opening unit is further configured to verify the login account information, and if the verification is successful, execute the operation of entering the target page corresponding to the target URL network resource.
- each program module of the information crawling device of this embodiment can be implemented according to the method in the above method embodiment.
- the functions of each program module of the information crawling device of this embodiment can be implemented according to the method in the above method embodiment.
- An embodiment of the present application also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute any information crawling method described in the above method embodiments. Part or all of the steps.
- the embodiments of the present application also provide a computer program product.
- the computer program product includes a non-transitory computer-readable storage medium storing a computer program.
- the computer program is operable to cause a computer to execute the method described in the foregoing method embodiment. Part or all of the steps of any kind of information crawling method.
- the disclosed device may be implemented in other ways.
- the device embodiments described above are merely illustrative.
- the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example, multiple units or components may be combined or may be Integrate into another system, or some features can be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or in the form of software program modules.
- the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer readable memory.
- the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory.
- a number of instructions are included to enable a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned memory includes: U disk, read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), mobile hard disk, magnetic disk, or optical disk and other media that can store program codes.
- the program can be stored in a computer-readable memory, and the memory can include: a flash disk , ROM, RAM, magnetic disk or CD, etc.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
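An information crawling method and apparatus, an electronic device, and a storage medium. The method includes: opening a target uniform resource locator (URL) network resource in a browser and entering the target page corresponding to the target URL network resource (101); locating the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element (102); acquiring positioning path information of the target DOM element (103); loading the target URL network resource into a new tab window (104); and extracting the target value information in the new tab window according to the positioning path information, and storing the target value information in a unified manner (105). In this way, a single-machine crawler based on a pure browser environment can be implemented in a relatively short time. For developers, the development cycle is short and operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs, enabling low-cost, convenient, and efficient lightweight information crawling.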
Description
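This application relates to the computer field, and in particular to an information crawling method and apparatus, an electronic device, and a storage medium.
Crawler technology is a process of automatically analyzing, collecting, parsing, and storing large amounts of valuable information on a network. In terms of system architecture, existing crawler systems fall into two main categories: single-machine and distributed. These systems are mainly built on popular Python and Java crawler frameworks (such as the Scrapy framework and the Nutch framework) to analyze and crawl target value information.
Existing crawler frameworks have complex interfaces and are too heavyweight. For small-scale or temporary crawling tasks, they have the following main drawbacks. First, the development cycle is long and the maintenance cost is high. For example, when a crawling task is implemented on an existing single-machine or distributed crawler framework, one must consider not only how to write the Python or Java code but also how to configure and manage the server and the corresponding database. For temporary crawling needs, the development cycle of existing frameworks is therefore too long, and the learning and maintenance costs are too high. Second, it is difficult to crawl asynchronous JavaScript and XML (AJAX) information and value information dynamically generated by JavaScript code. For value information loaded asynchronously through AJAX or generated dynamically by JavaScript code, existing crawler frameworks have difficulty locating the target value information and must be combined with an automated application testing framework that simulates real web browsing in order to extract it. A browser testing framework and the corresponding browser driver therefore have to be installed, which adds cost and overhead. Third, such crawlers are easily restricted by anti-crawler mechanisms and login verification. For example, the browser identifier used by existing crawler frameworks is too simple and is easily detected by anti-crawler mechanisms.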
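Summary
The embodiments of this application provide an information crawling method and related products, which enable low-cost, convenient, and efficient lightweight information crawling.
In a first aspect, an embodiment of this application provides an information crawling method, including:
opening a target uniform resource locator (URL) network resource in a browser, and entering the target page corresponding to the target URL network resource;
locating the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element;
acquiring positioning path information of the target DOM element;
loading the target URL network resource into a new tab window according to the positioning path information;
extracting the target value information in the new tab window, and storing the target value information in a unified manner.
In a second aspect, an embodiment of this application provides an information crawling apparatus, the apparatus including an opening unit, a positioning unit, an acquiring unit, a loading unit, an extracting unit, and a storage unit, wherein
the opening unit is configured to open a target uniform resource locator (URL) network resource in a browser and enter the target page corresponding to the target URL network resource;
the positioning unit is configured to locate the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element;
the acquiring unit is configured to acquire positioning path information of the target DOM element;
the loading unit is configured to load the target URL network resource into a new tab window according to the positioning path information;
the extracting unit is configured to extract the target value information in the new tab window;
the storage unit is configured to store the target value information in a unified manner.
In a third aspect, an embodiment of this application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps in the first aspect of the embodiments of this application.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps described in the first aspect of the embodiments of this application.
In a fifth aspect, an embodiment of this application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps described in the first aspect of the embodiments of this application. The computer program product may be a software installation package.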
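To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.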
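FIG. 1A is a schematic structural diagram of an electronic device provided by an embodiment of this application;
FIG. 1B is a schematic structural diagram of another electronic device provided by an embodiment of this application;
FIG. 1C is a schematic flowchart of an information crawling method disclosed in an embodiment of this application;
FIG. 2 is a schematic flowchart of another information crawling method disclosed in an embodiment of this application;
FIG. 3 is a schematic structural diagram of another electronic device disclosed in an embodiment of this application;
FIG. 4A is a schematic structural diagram of an information crawling apparatus disclosed in an embodiment of this application;
FIG. 4B is a variant structure of the information crawling apparatus described in FIG. 4A, disclosed in an embodiment of this application.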
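To help those skilled in the art better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in those embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.
A reference to an "embodiment" in this document means that a specific feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described here can be combined with other embodiments.
The electronic devices involved in the embodiments of this application may include various handheld devices with wireless communication functions, in-vehicle devices, wearable devices (smart watches, wireless earphones), computing devices or other processing devices connected to a wireless modem, as well as various forms of user equipment (UE), mobile stations (MS), terminal devices, and so on. For ease of description, the devices mentioned above are collectively referred to as electronic devices. An electronic device may also be a server.
To make the technical solutions described in this application easier to understand, the technical terms involved in the embodiments of this application are explained below:
Python: an object-oriented, cross-platform programming language.
Java: an object-oriented, cross-platform programming language.
JavaScript: an object-oriented programming language for the Web.
Scrapy: an open-source Web crawler framework written in Python.
Nutch: an open-source search engine written in Java.
User agent (user-agent): an HTTP header field that identifies the browser, the browser's operating system, the encryption level, and the browser rendering engine.
Cookie: a hypertext transfer protocol (HTTP) header field used to identify a legitimate user.
The embodiments of this application are described in detail below.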
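Referring to FIG. 1A, FIG. 1A is a schematic structural diagram of an electronic device disclosed in an embodiment of this application. The electronic device 100 may include a control circuit, and the control circuit may include a storage and processing circuit 110. The storage and processing circuit 110 may include memory, such as hard disk drive memory, non-volatile memory (for example, flash memory or other electronically programmable read-only memory used to form a solid-state drive), and volatile memory (for example, static or dynamic random access memory), which is not limited in the embodiments of this application. The processing circuit in the storage and processing circuit 110 may be used to control the operation of the electronic device 100. The processing circuit may be implemented based on one or more microprocessors, microcontrollers, baseband processors, power management units, audio codec chips, application-specific integrated circuits, display driver integrated circuits, and the like.
The storage and processing circuit 110 may be used to run software on the electronic device 100, such as an Internet browsing application, a voice over internet protocol (VoIP) phone call application, an email application, a media playback application, and operating system functions. This software may be used to perform control operations such as camera-based image capture, ambient light measurement based on an ambient light sensor, proximity measurement based on a proximity sensor, information display based on status indicators such as light-emitting diode status lights, touch event detection based on a touch sensor, functions associated with displaying information on multiple (for example, layered) displays, operations associated with performing wireless communication functions, operations associated with collecting and generating audio signals, control operations associated with collecting and processing button press event data, and other functions of the electronic device 100, which are not limited in the embodiments of this application.
The electronic device 100 may further include an input-output circuit 150. The input-output circuit 150 may be used to enable the electronic device 100 to input and output data, that is, to allow the electronic device 100 to receive data from an external device and to output data from the electronic device 100 to an external device. The input-output circuit 150 may further include a sensor 170. The sensor 170 may include an ambient light sensor, a proximity sensor based on light and capacitance, a touch sensor (for example, an optical touch sensor and/or a capacitive touch sensor, where the touch sensor may be part of a touch display screen or may be used independently as a touch sensor structure), an acceleration sensor, a gravity sensor, and other sensors.
The input-output circuit 150 may also include one or more displays, such as the display 130. The display 130 may include one or a combination of a liquid crystal display, an organic light-emitting diode display, an electronic ink display, a plasma display, and displays using other display technologies. The display 130 may include a touch sensor array (that is, the display 130 may be a touch display screen). The touch sensor may be a capacitive touch sensor formed by an array of transparent touch sensor electrodes (for example, indium tin oxide (ITO) electrodes), or a touch sensor formed using other touch technologies, such as acoustic touch, pressure-sensitive touch, resistive touch, or optical touch, which is not limited in the embodiments of this application.
The audio component 140 may be used to provide audio input and output functions for the electronic device 100. The audio component 140 in the electronic device 100 may include a speaker, a microphone, a buzzer, a tone generator, and other components for generating and detecting sound.
The communication circuit 120 may be used to provide the electronic device 100 with the ability to communicate with external devices. The communication circuit 120 may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuit in the communication circuit 120 may include a radio frequency transceiver circuit, a power amplifier circuit, a low-noise amplifier, a switch, a filter, and an antenna. For example, the wireless communication circuit in the communication circuit 120 may include a circuit that supports near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals. For example, the communication circuit 120 may include a near field communication antenna and a near field communication transceiver. The communication circuit 120 may also include a cellular phone transceiver and antenna, a wireless local area network transceiver circuit and antenna, and so on.
The electronic device 100 may further include a battery, a power management circuit, and other input-output units 160. The input-output unit 160 may include buttons, joysticks, click wheels, scroll wheels, touch pads, keypads, keyboards, cameras, light-emitting diodes, and other status indicators.
A user can enter commands through the input-output circuit 150 to control the operation of the electronic device 100, and can use the output data of the input-output circuit 150 to receive status information and other output from the electronic device 100.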
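In the related art, crawler frameworks include single-machine crawler frameworks and distributed crawler frameworks. Scrapy is a single-machine crawler framework implemented in Python. It consists mainly of five modules: the Scrapy engine, the task scheduler, the downloader, the crawler, and the pipeline. The Scrapy engine is responsible for sending crawl commands to each module and for coordinating communication and data transfer between the modules. The task scheduler performs unified scheduling and queue management of the uniform resource locator (URL) network resources sent by the Scrapy engine. The downloader is responsible for sending URL requests to the URL network resources and obtaining the URL responses. The crawler parses the response content, extracts the required value information, and passes it to the pipeline for unified analysis, filtering, and storage. Nutch is a distributed search engine and crawler framework implemented in Java, which relies mainly on a distributed infrastructure to achieve distributed crawling and storage of massive amounts of information. It consists mainly of generator, task scheduler, downloader, parser, and storage modules. The generator mainly queries the database for target value information, and the task scheduler dynamically dispatches search tasks to the distributed system infrastructure cluster to complete the searching and indexing of the target value information. The downloader and the parser are responsible for creating URL network requests and extracting the information fields in the URL network responses. Finally, the storage module completes the centralized storage of the target value information.
The existing crawler frameworks described above have good crawling capability for tasks involving massive amounts of information. For lightweight information crawling tasks, however, they have a long development cycle and high maintenance costs, have difficulty crawling AJAX information and value information dynamically generated by JavaScript code, and are easily restricted by anti-crawler mechanisms.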
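On this basis, referring to FIG. 1B, FIG. 1B provides a schematic structural diagram of another electronic device. The electronic device includes an information crawling framework for implementing the information crawling method involved in the embodiments of this application. The information crawling framework may include a browser 100, a console 110 of the browser, a network resource loader 120, a network resource parser 130, and a memory 140, wherein
the browser 100 is configured to open a target uniform resource locator (URL) network resource and enter the target page corresponding to the target URL network resource;
the console 110 is configured to open a new tab window and load the URL network resource in the new tab window;
the network resource parser 130 is configured to locate the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element, and to acquire positioning path information of the target DOM element;
the network resource loader 120 is configured to load the target value information through the URL network resource;
the network resource parser 130 is further configured to extract the target value information in the new tab window according to the positioning path information;
the memory 140 is configured to store the target value information in a unified manner.
With the information crawling framework described above, there is no need to install a Java or Python runtime environment or the dependency packages of an automated application testing framework, and no need to configure any distributed system infrastructure server or database. The target value information is located using only the browser's own functions, so a single-machine crawler based on a pure browser environment can be implemented in a relatively short time. For Web developers, the development cycle is short and operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs. Because most existing operating systems ship with a browser application, this solution is cross-platform and can handle penetration testing, security testing, and other temporary or targeted crawling projects on different platforms. In addition, this solution starts normal browsing behavior in a real browser and is therefore highly resistant to anti-crawler measures.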
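Referring to FIG. 1C, FIG. 1C is a schematic flowchart of an information crawling method provided by an embodiment of this application. The information crawling method described in this embodiment is applied to the electronic device shown in FIG. 1A or FIG. 1B, and includes the following steps:
101. Open a target uniform resource locator (URL) network resource in a browser, and enter the target page corresponding to the target URL network resource.
The target URL network resource can be opened in the browser. The target uniform resource locator (URL) identifies the location of the network resource and how it is accessed.
The target page is the browser page corresponding to the target URL network resource. The target page is entered by opening the target URL network resource in the browser.
Optionally, in step 101, opening the target URL network resource in the browser may further include the following steps:
if the target URL network resource requires account login, acquiring the login account information corresponding to the URL network resource;
verifying the login account information, and if the verification succeeds, performing the operation of entering the target page corresponding to the target URL network resource.
Some network resources require the user to log in with an account. For a target URL network resource that requires login, the login account information corresponding to the URL network resource can be acquired. Specifically, at login time the login account information can be obtained from user input. For example, the electronic device may receive the user name, password, and verification code that the user enters through the browser. Optionally, when the account is logged in for the first time, the login account information may be recorded and saved, so that later crawls can log in directly with the saved information without the user having to enter it again.
It can be seen that the embodiments of this application start normal browsing behavior in a real browser and carry normal user account information. Existing anti-crawler techniques that rely on login restrictions therefore have difficulty blocking the crawl, which improves resistance to anti-crawler measures.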
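102. Locate the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element.
In the embodiments of this application, the browser can be used to locate the DOM element in the target page where the target value information is located, to obtain the target DOM element. The target value information can thus be located with the browser alone, without installing an automated application testing framework to simulate real web browsing, which saves cost and keeps operation simple.
Optionally, in step 102, locating the DOM element in the target page where the target value information is located, to obtain the target DOM element, may include the following step:
locating the target DOM element where the target value information is located through the page element inspection function of the browser.
In a specific implementation, the browser has a page element inspection function, and the electronic device can use it to locate the target DOM element where the target value information is located. In this way, an accurate positioning result for the target DOM element can be obtained.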
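103. Acquire the positioning path information of the target DOM element.
The positioning path information may include a cascading style sheets (CSS) selector or an XML Path Language (XPath) path.
In a specific implementation, the electronic device can locate the DOM node where the target information is located and acquire the CSS selector or XPath path of that node element. Acquiring the positioning path information through the browser makes it possible to find the positioning path information corresponding to the target value information in a relatively short time, which improves crawling efficiency.
Optionally, in step 103, acquiring the positioning path information of the target DOM element may include the following steps:
21. Locate the DOM node where the target value information is located;
22. Acquire the positioning path information corresponding to the positioning path of a first node element under the DOM node.
Regarding DOM nodes: every component of an XML document is a node. The whole document is a document node, and each XML tag is an element node.
In a specific implementation, the electronic device can first locate the DOM node where the target value information is located, then acquire the first node element under the DOM node and obtain the CSS selector or XPath path that locates this first node element. In this way, accurate positioning path information can be obtained.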
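As an illustration of step 103, the following is a minimal sketch of how a CSS selector path could be derived for a located element. The function name `cssPathOf` is hypothetical and the id-based shortcut is an assumption of this sketch; the source does not specify how the selector is produced, and in practice browser developer tools can usually copy a selector or XPath directly.

```javascript
// Hypothetical helper (not from the source): build a CSS selector path for an
// element by walking up its ancestors. It only needs tagName, id,
// parentElement and children, so it also works on simple stub objects.
function cssPathOf(element) {
  const parts = [];
  let node = element;
  while (node && node.tagName) {
    const tag = node.tagName.toLowerCase();
    if (node.id) {
      // Assumption: ids are unique and contain no characters needing escaping.
      parts.unshift(`${tag}#${node.id}`);
      break;
    }
    const parent = node.parentElement;
    if (!parent) {
      // Reached the root element.
      parts.unshift(tag);
      break;
    }
    // :nth-child is 1-based and counts all element children of the parent.
    const index = Array.from(parent.children).indexOf(node) + 1;
    parts.unshift(`${tag}:nth-child(${index})`);
    node = parent;
  }
  return parts.join(" > ");
}
```

In a browser console this could be called on the element currently selected in the element inspector, and the returned string used as the positioning path in the later steps.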
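Optionally, after step 103, the following steps may also be included:
31. Verify, through the console of the browser, whether the positioning path corresponding to the positioning path information is valid;
32. If so, perform the operation of loading the target URL network resource into a new tab window;
33. If not, adjust the positioning path information.
In the embodiments of this application, the console of the browser can be opened and the CSS selector or XPath path can then be checked for validity. If the CSS selector or XPath path is valid, the method continues by loading the target URL network resource into a new tab window and then extracting the target value information. If the CSS selector or XPath path is invalid, it can be adjusted.
Optionally, in step 31, verifying through the console whether the positioning path corresponding to the positioning path information is valid may include the following step:
entering the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be located successfully, determining that the positioning path corresponding to the positioning path information is valid.
The console of the browser can be opened and the CSS selector or XPath path entered in it. If the target DOM element can be located successfully, the CSS selector or XPath path is valid; if the target DOM element cannot be located, the CSS selector or XPath path is invalid.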
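The validity check in step 31 can be sketched as a small function that could be pasted into the console. This is an illustrative sketch, not the patent's own code: the name `isLocatorValid` and the rule "a locator starting with `/` or `(` is treated as XPath" are assumptions. It relies only on the standard `Document.querySelector` and `Document.evaluate` methods.

```javascript
// Returns true when the locator resolves to at least one element.
// `doc` is expected to behave like a browser Document.
function isLocatorValid(doc, locator) {
  try {
    if (locator.startsWith("/") || locator.startsWith("(")) {
      // Assumed to be an XPath expression.
      // 9 is the value of XPathResult.FIRST_ORDERED_NODE_TYPE.
      const result = doc.evaluate(locator, doc, null, 9, null);
      return result.singleNodeValue !== null;
    }
    // Otherwise treat the locator as a CSS selector.
    return doc.querySelector(locator) !== null;
  } catch (err) {
    // A syntactically invalid selector or XPath throws; report it as invalid.
    return false;
  }
}
```

In the console, `isLocatorValid(document, "div#main > span:nth-child(2)")` would report whether the selector currently matches an element on the page.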
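Optionally, in step 33, adjusting the positioning path information may include the following steps:
acquiring the positioning path information corresponding to the positioning path of a second node element under the DOM node, where the second node element and the first node element correspond to different child nodes under the DOM node;
adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
The electronic device can acquire the CSS selector or XPath path of the second node element under the DOM node to obtain an adjusted CSS selector or XPath path, and can also enter the adjusted CSS selector or XPath path in the console to determine whether it is valid. Adjusting the positioning path information in this way ensures that the positioning path information corresponding to the target value information is found.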
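The adjustment in step 33 can be pictured as trying the selectors of the other child elements under the same DOM node until one passes the validity check. The sketch below is one possible reading under that assumption; `childCandidates` and `adjustLocator` are hypothetical names, and the validity check is passed in as a function so the example stays self-contained.

```javascript
// List one candidate CSS selector per child element of the located DOM node.
function childCandidates(parentSelector, parentNode) {
  return Array.from(parentNode.children).map(
    (child, i) =>
      `${parentSelector} > ${child.tagName.toLowerCase()}:nth-child(${i + 1})`
  );
}

// Return the first candidate, other than the current locator, that the
// supplied validity check accepts; return null when none is valid.
function adjustLocator(current, candidates, isValid) {
  for (const candidate of candidates) {
    if (candidate !== current && isValid(candidate)) {
      return candidate;
    }
  }
  return null;
}
```

A `null` result signals that no child of the node yields a usable path, in which case the DOM node itself would need to be located again.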
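104. Load the target URL network resource into a new tab window.
The electronic device can open a new tab window through the console of the browser and then load the target URL network resource into the new tab window, so that the target value information can be extracted in the new tab window.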
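Step 104 can be sketched with the standard `window.open` call. The wrapper name `openTargetInNewTab` is hypothetical. The sketch assumes that pop-ups are allowed for the page; `window.open` returns `null` when the new window is blocked, which the sketch surfaces as an error. Note also, as an assumption beyond the source text, that a script can normally read the new tab's document only when the tab is same-origin with the page whose console is used.

```javascript
// Open `url` in a new tab from the given window object and return the handle.
function openTargetInNewTab(win, url) {
  const tab = win.open(url, "_blank");
  if (!tab) {
    // window.open returns null when the browser blocks the new window.
    throw new Error(`new tab for ${url} was blocked`);
  }
  return tab;
}
```

In the console this would be called as `const tab = openTargetInNewTab(window, location.href);`.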
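105. Extract the target value information in the new tab window according to the positioning path information, and store the target value information in a unified manner.
The target value information may include AJAX information and value information generated by JavaScript code. In a specific implementation, the electronic device can extract the target value information in the new tab window according to the positioning path information and then store it in the memory. AJAX information and value information generated by JavaScript code can thus be crawled with the browser alone. There is no need to install a Java or Python runtime environment or the dependency packages of an automated application testing framework, and no need to configure any distributed system infrastructure server or database. The target value information is located using only the browser's own functions, so it can be crawled in a relatively short time. The development cycle is short and operation is simple, which lowers the development threshold and the maintenance cost.
In addition, because most existing operating systems ship with a browser application, this solution is cross-platform and can handle penetration testing, security testing, and other temporary or targeted crawling projects on different platforms.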
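The source does not say what format the unified storage uses. As one possible illustration, the sketch below serializes extracted records into a single CSV text that could then be saved from the browser. The function name `toCsv` and the record fields used in the example are assumptions.

```javascript
// Serialize an array of flat records into CSV text (RFC 4180 style quoting).
// Column order is taken from the keys of the first record.
function toCsv(records) {
  if (records.length === 0) return "";
  const columns = Object.keys(records[0]);
  const escapeCell = (value) => {
    const text = value === null || value === undefined ? "" : String(value);
    // Quote cells containing a quote, comma, or line break; double any quotes.
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [columns.map(escapeCell).join(",")];
  for (const record of records) {
    lines.push(columns.map((c) => escapeCell(record[c])).join(","));
  }
  return lines.join("\r\n");
}
```

For example, `toCsv([{ url: "https://a", value: "1,5" }])` yields a header line `url,value` followed by `https://a,"1,5"`.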
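Optionally, in step 105, extracting the target value information in the new tab window according to the positioning path information may include the following steps:
51. Inject crawler code into the new tab window through the console;
52. Execute the crawler code, and extract the target value information according to the positioning path information.
The crawler code may be JavaScript code. JavaScript is an object-oriented programming language for the Web.
In a specific implementation, crawler code can be injected into the new tab window through the console and executed in the new tab window, and the target value information can be extracted according to the CSS selector or XPath path. In this way, value information dynamically generated by JavaScript code can be extracted, giving good dynamic information crawling capability.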
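Steps 51 and 52 can be sketched as generating a small crawler script and appending it to the new tab's document as a `script` element. This is an assumption-laden illustration rather than the patent's implementation: the names `buildCrawlerSource`, `injectCrawler`, and the result variable `__crawlResult` are hypothetical, the new tab is assumed to be same-origin and already loaded, and the page is assumed not to forbid inline scripts.

```javascript
// Generate crawler source that collects the trimmed text of every element
// matching `selector` into window.__crawlResult (a hypothetical variable).
function buildCrawlerSource(selector) {
  // JSON.stringify yields a correctly quoted JavaScript string literal.
  const quoted = JSON.stringify(selector);
  return (
    "window.__crawlResult = Array.from(" +
    `document.querySelectorAll(${quoted}), ` +
    "(node) => node.textContent.trim());"
  );
}

// Inject the crawler source into the tab; the browser executes an inline
// script element as soon as it is appended to the document.
function injectCrawler(tab, crawlerSource) {
  const doc = tab.document;
  const script = doc.createElement("script");
  script.textContent = crawlerSource;
  (doc.head || doc.documentElement).appendChild(script);
  return script;
}
```

After `injectCrawler(tab, buildCrawlerSource("div#main > span"))`, the extracted values would be readable from `tab.__crawlResult` in the console.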
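Optionally, the crawler code is JavaScript code, and in step 52, extracting the target value information according to the positioning path information may include the following steps:
parsing and rendering the target value information through the browser;
downloading the target value information according to the positioning path information, where the target value information includes AJAX information and value information generated by the JavaScript code, and the AJAX information is asynchronous JavaScript and XML (extensible markup language) information.
While the JavaScript code is executing, the browser parses and renders the target value information, and the target value information can then be downloaded according to the positioning path information. The browser can therefore parse and render the target value information without installing a Java or Python runtime environment or the dependency packages of an automated application testing framework, and without configuring any distributed system infrastructure server or database.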
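Because AJAX content appears only after the browser has fetched and rendered it, crawler code typically has to wait until the positioning path matches before reading the values. The sketch below shows one way to do that by polling; the polling approach, the function name `waitForValues`, and the default timings are assumptions, not details from the source.

```javascript
// Poll the document until `selector` matches at least one element, then
// resolve with the trimmed text of every match. Rejects after `timeoutMs`.
function waitForValues(doc, selector, { timeoutMs = 10000, intervalMs = 200 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const poll = () => {
      const nodes = Array.from(doc.querySelectorAll(selector));
      if (nodes.length > 0) {
        resolve(nodes.map((node) => node.textContent.trim()));
      } else if (Date.now() >= deadline) {
        reject(new Error(`no element matched ${selector} within ${timeoutMs} ms`));
      } else {
        setTimeout(poll, intervalMs);
      }
    };
    poll();
  });
}
```

In the console this could be used as `waitForValues(tab.document, "div#main > span").then(console.log)`, where `tab` is the handle of the new tab window.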
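It can be seen that the information crawling method described in the embodiments of this application opens a target uniform resource locator (URL) network resource in a browser and enters the target page corresponding to the target URL network resource; locates the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element; acquires the positioning path information of the target DOM element; loads the target URL network resource into a new tab window; and extracts the target value information in the new tab window according to the positioning path information and stores it in a unified manner. There is therefore no need to install a Java or Python runtime environment or the dependency packages of an automated application testing framework, and no need to configure any distributed system infrastructure server or database. The target value information is located using only the browser's own functions, so a single-machine crawler based on a pure browser environment can be implemented in a relatively short time. For Web developers, the development cycle is short and operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs, enabling low-cost, convenient, and efficient lightweight information crawling.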
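Consistent with the above, referring to FIG. 2, FIG. 2 is a schematic flowchart of another information crawling method provided by an embodiment of this application. The information crawling method described in this embodiment is applied to the electronic device shown in FIG. 1A or FIG. 1B, and may include the following steps:
Open the target uniform resource locator (URL) network resource in a browser and determine whether the target URL network resource requires account login. If the target URL network resource requires account login, acquire the login account information corresponding to the URL network resource; verify the login account information, and if the verification succeeds, enter the target page corresponding to the target URL network resource; locate the target DOM element where the target value information is located through the page element inspection function of the browser. If the target URL network resource does not require account login, directly locate the target DOM element where the target value information is located through the page element inspection function of the browser. Acquire the positioning path information of the target DOM element; verify through the console of the browser whether the positioning path corresponding to the positioning path information is valid; if so, load the target URL network resource into a new tab window; if not, adjust the positioning path information. Open the new tab window through the console, and then load the target URL network resource into the new tab window; inject crawler code into the new tab window through the console; execute the crawler code and extract the target value information according to the positioning path information; store the target value information in a unified manner.
For a detailed description of the above steps, refer to the information crawling method shown in FIG. 1C, which is not repeated here.
It can be seen that the information crawling method described in the embodiments of this application opens the target URL network resource in a browser and determines whether it requires account login. If login is required, it acquires the login account information corresponding to the URL network resource, verifies it, and on success enters the target page corresponding to the target URL network resource, then locates the target DOM element where the target value information is located through the page element inspection function of the browser. If login is not required, it locates the target DOM element directly through the page element inspection function of the browser. It then acquires the positioning path information of the target DOM element and verifies through the console of the browser whether the corresponding positioning path is valid; if so, it loads the target URL network resource into a new tab window, and if not, it adjusts the positioning path information. It opens the new tab window through the console, loads the target URL network resource into it, injects crawler code into it through the console, executes the crawler code, extracts the target value information according to the positioning path information, and stores the target value information in a unified manner. This enables low-cost, convenient, and efficient lightweight information crawling. In addition, this solution starts normal browsing behavior in a real browser and carries normal user account information, so existing anti-crawler techniques that rely on login restrictions have difficulty blocking it, which improves resistance to anti-crawler measures.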
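The overall flow of FIG. 2 can be summarized as a short orchestration sketch. Every step is supplied by the caller as a function, so the code shows only the control flow (login check, path validation with adjustment, new tab, extraction, storage). The name `runCrawl` and the shape of the `deps` object are hypothetical and are not part of the source.

```javascript
// Control-flow sketch of the FIG. 2 method. `deps` bundles caller-supplied
// async functions for each step; none of these names come from the source.
async function runCrawl(url, deps) {
  const page = await deps.openPage(url);

  // Login branch: only taken when the resource requires an account.
  if (await deps.needsLogin(page)) {
    const verified = await deps.login(page);
    if (!verified) throw new Error("login verification failed");
  }

  // Try candidate positioning paths in order; moving on to the next
  // candidate corresponds to adjusting the positioning path information.
  let locator = null;
  for (const candidate of await deps.candidateLocators(page)) {
    if (await deps.isLocatorValid(page, candidate)) {
      locator = candidate;
      break;
    }
  }
  if (locator === null) throw new Error("no valid positioning path found");

  // Load the resource in a new tab, extract by the validated path, store.
  const tab = await deps.openNewTab(url);
  const values = await deps.extract(tab, locator);
  await deps.store(values);
  return { locator, values };
}
```

Passing the steps in as functions keeps the sketch independent of any particular browser API and makes the branch structure of the flowchart easy to check in isolation.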
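The following are apparatuses for implementing the above information crawling method, specifically as follows:
Consistent with the above, referring to FIG. 3, FIG. 3 shows an electronic device provided by an embodiment of this application, including: a processor and a memory; and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the following steps:
opening a target uniform resource locator (URL) network resource in a browser, and entering the target page corresponding to the target URL network resource;
locating the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element;
acquiring positioning path information of the target DOM element;
loading the target URL network resource into a new tab window;
extracting the target value information in the new tab window according to the positioning path information, and storing the target value information in a unified manner.
It can be seen that the electronic device described in the embodiments of this application opens a target uniform resource locator (URL) network resource in a browser and enters the target page corresponding to the target URL network resource; locates the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element; acquires the positioning path information of the target DOM element; loads the target URL network resource into a new tab window; and extracts the target value information in the new tab window according to the positioning path information and stores it in a unified manner. There is therefore no need to install a Java or Python runtime environment or the dependency packages of an automated application testing framework, and no need to configure any distributed system infrastructure server or database. The target value information is located using only the browser's own functions, so a single-machine crawler based on a pure browser environment can be implemented in a relatively short time. For Web developers, the development cycle is short and operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs, enabling low-cost, convenient, and efficient lightweight information crawling.
In a possible example, in terms of locating the DOM element in the target page where the target value information is located, to obtain the target DOM element, the programs include instructions for executing the following step:
locating the target DOM element where the target value information is located through the page element inspection function of the browser.
In a possible example, in terms of acquiring the positioning path information of the target DOM element, the programs include instructions for executing the following steps:
locating the DOM node where the target value information is located;
acquiring the positioning path information corresponding to the positioning path of a first node element under the DOM node.
In a possible example, after the positioning path information of the target DOM element is acquired, the programs further include instructions for executing the following steps:
verifying, through the console of the browser, whether the positioning path corresponding to the positioning path information is valid;
if so, performing the operation of loading the target URL network resource into a new tab window;
if not, adjusting the positioning path information.
In a possible example, in terms of verifying through the console whether the positioning path corresponding to the positioning path information is valid, the programs include instructions for executing the following step:
entering the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be located successfully, determining that the positioning path corresponding to the positioning path information is valid.
In a possible example, in terms of adjusting the positioning path information, the programs include instructions for executing the following steps:
acquiring the positioning path information corresponding to the positioning path of a second node element under the DOM node, where the second node element and the first node element correspond to different child nodes under the DOM node;
adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
In a possible example, before the target URL network resource is loaded into the new tab window, the programs further include instructions for executing the following step:
opening the new tab window through the console;
in terms of extracting the target value information in the new tab window according to the positioning path information, the programs include instructions for executing the following steps:
injecting crawler code into the new tab window through the console;
executing the crawler code, and extracting the target value information according to the positioning path information.
In a possible example, the crawler code is JavaScript code, and in terms of extracting the target value information according to the positioning path information, the programs include instructions for executing the following steps:
parsing and rendering the target value information through the browser;
downloading the target value information according to the positioning path information, where the target value information includes AJAX information and value information generated by the JavaScript code, and the AJAX information is asynchronous JavaScript and XML (extensible markup language) information.
In a possible example, the programs further include instructions for executing the following steps:
if the target URL network resource requires account login, acquiring the login account information corresponding to the URL network resource;
verifying the login account information, and if the verification succeeds, performing the operation of entering the target page corresponding to the target URL network resource.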
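Referring to FIG. 4A, FIG. 4A is a schematic structural diagram of an information crawling apparatus provided by this embodiment. The information crawling apparatus is applied to the electronic device shown in FIG. 1A or FIG. 1B, and includes an opening unit 401, a positioning unit 402, an acquiring unit 403, a loading unit 404, an extracting unit 405, and a storage unit 406, wherein
the opening unit 401 is configured to open a target uniform resource locator (URL) network resource in a browser and enter the target page corresponding to the target URL network resource;
the positioning unit 402 is configured to locate the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element;
the acquiring unit 403 is configured to acquire positioning path information of the target DOM element;
the loading unit 404 is configured to load the target URL network resource into a new tab window;
the extracting unit 405 is configured to extract the target value information in the new tab window according to the positioning path information;
the storage unit 406 is configured to store the target value information in a unified manner.
It can be seen that the information crawling apparatus described in the embodiments of this application is applied to an electronic device. It opens a target uniform resource locator (URL) network resource in a browser and enters the target page corresponding to the target URL network resource; locates the document object model (DOM) element in the target page where the target value information is located, to obtain a target DOM element; acquires the positioning path information of the target DOM element; loads the target URL network resource into a new tab window; and extracts the target value information in the new tab window according to the positioning path information and stores it in a unified manner. There is therefore no need to install a Java or Python runtime environment or the dependency packages of an automated application testing framework, and no need to configure any distributed system infrastructure server or database. The target value information is located using only the browser's own functions, so a single-machine crawler based on a pure browser environment can be implemented in a relatively short time. For Web developers, the development cycle is short and operation is simple, which effectively lowers the development threshold and the configuration management and maintenance costs, enabling low-cost, convenient, and efficient lightweight information crawling.
In a possible example, in terms of locating the DOM element in the target page where the target value information is located, to obtain the target DOM element, the positioning unit 402 is specifically configured to:
locate the target DOM element where the target value information is located through the page element inspection function of the browser.
In a possible example, in terms of acquiring the positioning path information of the target DOM element, the acquiring unit 403 is specifically configured to:
locate the DOM node where the target value information is located;
acquire the positioning path information corresponding to the positioning path of a first node element under the DOM node.
In a possible example, as shown in FIG. 4B, FIG. 4B is a variant structure of the information crawling apparatus described in FIG. 4A. Compared with FIG. 4A, it may further include a verification unit 407 and an adjusting unit 408, wherein
the opening unit 401 is further configured to open the console of the browser;
the verification unit 407 is configured to verify through the console whether the positioning path corresponding to the positioning path information is valid;
if so, the loading unit 404 performs the operation of loading the target URL network resource into a new tab window;
the adjusting unit 408 is configured to adjust the positioning path information if the positioning path corresponding to the positioning path information is invalid.
In a possible example, in terms of verifying through the console whether the positioning path corresponding to the positioning path information is valid, the verification unit 407 is specifically configured to:
enter the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be located successfully, determine that the positioning path corresponding to the positioning path information is valid.
In a possible example, in terms of adjusting the positioning path information, the adjusting unit 408 is specifically configured to:
acquire the positioning path information corresponding to the positioning path of a second node element under the DOM node, where the second node element and the first node element correspond to different child nodes under the DOM node;
adjust the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
In a possible example, before the loading unit loads the target URL network resource into the new tab window,
the opening unit 401 is further configured to open the new tab window through the console;
in terms of extracting the target value information in the new tab window according to the positioning path information, the extracting unit 405 is specifically configured to:
inject crawler code into the new tab window through the console;
execute the crawler code, and extract the target value information according to the positioning path information.
In a possible example, the crawler code is JavaScript code, and in terms of extracting the target value information according to the positioning path information, the extracting unit 405 is specifically configured to:
parse and render the target value information through the browser;
download the target value information according to the positioning path information, where the target value information includes AJAX information and value information generated by the JavaScript code, and the AJAX information is asynchronous JavaScript and XML (extensible markup language) information.
In a possible example, the acquiring unit 403 is further configured to acquire the login account information corresponding to the URL network resource if the target URL network resource requires account login;
the opening unit is further configured to verify the login account information, and if the verification succeeds, perform the operation of entering the target page corresponding to the target URL network resource.
It can be understood that the functions of the program modules of the information crawling apparatus of this embodiment can be implemented according to the methods in the foregoing method embodiments. For the specific implementation process, refer to the related description of the foregoing method embodiments, which is not repeated here.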
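An embodiment of this application also provides a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any information crawling method described in the foregoing method embodiments.
An embodiment of this application also provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any information crawling method described in the foregoing method embodiments.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art should know that this application is not limited by the described order of actions, because according to this application some steps can be performed in another order or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art can understand that all or some of the steps in the methods of the foregoing embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, ROM, RAM, a magnetic disk, an optical disc, or the like.
The embodiments of this application have been described in detail above. Specific examples are used in this document to explain the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method of this application and its core idea. At the same time, a person of ordinary skill in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting this application.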
Claims (20)
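- An information crawling method, characterized in that the method includes: opening a target uniform resource locator (URL) network resource in a browser, and entering the target page corresponding to the target URL network resource; locating the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element; acquiring positioning path information of the target DOM element; loading the target URL network resource into a new tab window; extracting the target value information in the new tab window according to the positioning path information, and storing the target value information in a unified manner.
- The method according to claim 1, characterized in that locating the DOM element in the target page where the target value information is located, to obtain the target DOM element, includes: locating the target DOM element where the target value information is located through the page element inspection function of the browser.
- The method according to claim 2, characterized in that acquiring the positioning path information of the target DOM element includes: locating the DOM node where the target value information is located; acquiring the positioning path information corresponding to the positioning path of a first node element under the DOM node.
- The method according to claim 3, characterized in that, after the positioning path information of the target DOM element is acquired, the method further includes: verifying, through the console of the browser, whether the positioning path corresponding to the positioning path information is valid; if so, performing the operation of loading the target URL network resource into a new tab window; if not, adjusting the positioning path information.
- The method according to claim 4, characterized in that verifying through the console whether the positioning path corresponding to the positioning path information is valid includes: entering the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be located successfully, determining that the positioning path corresponding to the positioning path information is valid.
- The method according to claim 4, characterized in that adjusting the positioning path information includes: acquiring the positioning path information corresponding to the positioning path of a second node element under the DOM node, where the second node element and the first node element correspond to different child nodes under the DOM node; adjusting the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
- The method according to any one of claims 1-6, characterized in that, before the target URL network resource is loaded into the new tab window, the method further includes: opening the new tab window through the console; and extracting the target value information in the new tab window according to the positioning path information includes: injecting crawler code into the new tab window through the console; executing the crawler code, and extracting the target value information according to the positioning path information.
- The method according to claim 7, characterized in that the crawler code is JavaScript code, and extracting the target value information according to the positioning path information includes: parsing and rendering the target value information through the browser; downloading the target value information according to the positioning path information, where the target value information includes AJAX information and value information generated by the JavaScript code, and the AJAX information is asynchronous JavaScript and XML (extensible markup language) information.
- The method according to claim 1, characterized in that the method further includes: if the target URL network resource requires account login, acquiring the login account information corresponding to the URL network resource; verifying the login account information, and if the verification succeeds, performing the operation of entering the target page corresponding to the target URL network resource.
- An information crawling apparatus, characterized in that the apparatus includes an opening unit, a positioning unit, an acquiring unit, a loading unit, an extracting unit, and a storage unit, wherein the opening unit is configured to open a target uniform resource locator (URL) network resource in a browser and enter the target page corresponding to the target URL network resource; the positioning unit is configured to locate the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element; the acquiring unit is configured to acquire positioning path information of the target DOM element; the loading unit is configured to load the target URL network resource into a new tab window; the extracting unit is configured to extract the target value information in the new tab window according to the positioning path information; the storage unit is configured to store the target value information in a unified manner.
- The apparatus according to claim 10, characterized in that, in terms of locating the DOM element in the target page where the target value information is located, to obtain the target DOM element, the positioning unit is specifically configured to: locate the target DOM element where the target value information is located through the page element inspection function of the browser.
- The apparatus according to claim 11, characterized in that, in terms of acquiring the positioning path information of the target DOM element, the acquiring unit is specifically configured to: locate the DOM node where the target value information is located; acquire the positioning path information corresponding to the positioning path of a first node element under the DOM node.
- The apparatus according to any one of claims 10-12, characterized in that the apparatus further includes a verification unit and an adjusting unit, wherein the opening unit is further configured to open the console of the browser; the verification unit is configured to verify through the console whether the positioning path corresponding to the positioning path information is valid; if so, the loading unit performs the operation of loading the target URL network resource into a new tab window; the adjusting unit is configured to adjust the positioning path information if the positioning path corresponding to the positioning path information is invalid.
- The apparatus according to claim 13, characterized in that, in terms of verifying through the console whether the positioning path corresponding to the positioning path information is valid, the verification unit is specifically configured to: enter the positioning path corresponding to the positioning path information in the console, and if the target DOM element can be located successfully, determine that the positioning path corresponding to the positioning path information is valid.
- The apparatus according to any one of claims 10-14, characterized in that, in terms of adjusting the positioning path information, the adjusting unit is specifically configured to: acquire the positioning path information corresponding to the positioning path of a second node element under the DOM node, where the second node element and the first node element correspond to different child nodes under the DOM node; adjust the positioning path information to the positioning path information corresponding to the positioning path of the second node element.
- The apparatus according to any one of claims 10-15, characterized in that, before the loading unit loads the target URL network resource into the new tab window, the opening unit is further configured to open the new tab window through the console; in terms of extracting the target value information in the new tab window according to the positioning path information, the extracting unit is specifically configured to: inject crawler code into the new tab window through the console; execute the crawler code, and extract the target value information according to the positioning path information.
- An electronic device, characterized by including a browser, a console of the browser, a network resource loader, a network resource parser, and a memory, wherein the browser is configured to open a target uniform resource locator (URL) network resource and enter the target page corresponding to the target URL network resource; the console is configured to open a new tab window and load the URL network resource in the new tab window; the network resource parser is configured to locate the document object model (DOM) element in the target page where target value information is located, to obtain a target DOM element, and to acquire positioning path information of the target DOM element; the network resource loader is configured to load the target value information through the URL network resource; the network resource parser is further configured to extract the target value information in the new tab window according to the positioning path information; the memory is configured to store the target value information in a unified manner.
- An electronic device, characterized by including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for executing the steps in the method according to any one of claims 1-9.
- A computer-readable storage medium, characterized by storing a computer program for electronic data exchange, where the computer program causes a computer to execute the method according to any one of claims 1-9.
- A computer program product, characterized in that the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the method according to any one of claims 1-9.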
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
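CN202080096823.2A CN115087969A (zh) | 2020-05-14 | 2020-05-14 | Information crawling method and apparatus, electronic device, and storage medium |
PCT/CN2020/090329 WO2021226954A1 (zh) | 2020-05-14 | 2020-05-14 | Information crawling method and apparatus, electronic device, and storage medium |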
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/090329 WO2021226954A1 (zh) | 2020-05-14 | 2020-05-14 | Information crawling method and apparatus, electronic device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021226954A1 true WO2021226954A1 (zh) | 2021-11-18 |
Family
ID=78526260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/090329 WO2021226954A1 (zh) | 2020-05-14 | 2020-05-14 | Information crawling method and apparatus, electronic device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115087969A (zh) |
WO (1) | WO2021226954A1 (zh) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567530A (zh) * | 2011-12-31 | 2012-07-11 | 凤凰在线(北京)信息技术有限公司 | Intelligent extraction system and method for article-type web pages |
US20130041882A1 (en) * | 2000-12-14 | 2013-02-14 | International Business Machines Corporation | Technology for web site crawling, including action sequences for selecting non-hypertext-link parameters |
CN105354337A (zh) * | 2015-12-08 | 2016-02-24 | 北京奇虎科技有限公司 | Web crawler implementation method and web crawler system |
CN106484775A (zh) * | 2016-09-12 | 2017-03-08 | 北京量科邦信息技术有限公司 | Selenium-based crawler capture method and system |
CN107729385A (zh) * | 2017-09-19 | 2018-02-23 | 杭州安恒信息技术有限公司 | Method for collecting the complete data content of dynamic web pages |
2020
- 2020-05-14 CN CN202080096823.2A patent/CN115087969A/zh active Pending
- 2020-05-14 WO PCT/CN2020/090329 patent/WO2021226954A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130041882A1 (en) * | 2000-12-14 | 2013-02-14 | International Business Machines Corporation | Technology for web site crawling, including action sequences for selecting non-hypertext-link parameters |
CN102567530A (zh) * | 2011-12-31 | 2012-07-11 | 凤凰在线(北京)信息技术有限公司 | Intelligent extraction system and method for article-type web pages |
CN105354337A (zh) * | 2015-12-08 | 2016-02-24 | 北京奇虎科技有限公司 | Web crawler implementation method and web crawler system |
CN106484775A (zh) * | 2016-09-12 | 2017-03-08 | 北京量科邦信息技术有限公司 | Selenium-based crawler scraping method and system |
CN107729385A (zh) * | 2017-09-19 | 2018-02-23 | 杭州安恒信息技术有限公司 | Method for collecting the complete data content of dynamic web pages |
Also Published As
Publication number | Publication date |
---|---|
CN115087969A (zh) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
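US10003671B2 (en) | | Capturing and replaying application sessions using resource files |
CN106970790B (zh) | | Application program creation method, related device, and system |
CN108536594B (zh) | | Page testing method, apparatus, and storage device |
CN108345543B (zh) | | Data processing method, apparatus, device, and storage medium |
CN104598513B (zh) | | Data flow control method and system based on a web page framework |
US20160241589A1 (en) | | Method and apparatus for identifying malicious website |
CN106254436A (zh) | | Remote debugging method, related device, and system |
CN110489626A (zh) | | Information collection method and apparatus |
CN109408150A (zh) | | Quick app loading method and mobile terminal |
CN106021112A (zh) | | Program testing system, method, and apparatus |
CN105243407A (zh) | | Method and apparatus for reading and writing smart cards |
CN103870551B (zh) | | Method and apparatus for cross-domain data acquisition |
CN103617164B (zh) | | Web page prefetching method, apparatus, and terminal device |
CN107766358A (zh) | | Page sharing method and related apparatus |
CN106294839A (zh) | | Link redirection method and apparatus |
CN103455602B (zh) | | Video URL capturing method, apparatus, and terminal device |
CN106326489A (zh) | | Method and apparatus for updating network resources |
CN110445746A (zh) | | Cookie acquisition method, apparatus, and storage device |
WO2022127743A1 (zh) | | Content display method and terminal device |
CN105740419A (zh) | | Method and apparatus for obtaining dynamically loaded content in a web page |
CN108268232A (zh) | | Picture display method, apparatus, system, and storage medium |
CN110198324B (zh) | | Data monitoring method, apparatus, browser, and terminal |
CN108959062B (zh) | | Web page element acquisition method and apparatus |
CN107861827A (zh) | | Screen freeze detection method, mobile terminal, and computer-readable storage medium |
CN109145182A (zh) | | Data collection method, apparatus, computer device, and system |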
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20935612; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 170423) |
122 | EP: PCT application non-entry in European phase | Ref document number: 20935612; Country of ref document: EP; Kind code of ref document: A1 |