CN110569414A - Puppeteer-based website data collection method

Info

Publication number: CN110569414A
Application number: CN201910773517.7A
Authority: CN
Inventors: 曹特磊
Original assignee: Interactive (beijing) Technology Co Ltd
Current assignee: Interactive (beijing) Technology Co Ltd
Priority date: 2019-08-21
Filing date: 2019-08-21
Publication date: 2019-12-13

Abstract

The invention provides a website data collection method based on Puppeteer, characterized in that a target website is loaded in Puppeteer and data capture is carried out by a plurality of independent crawler processes, each of which performs the following steps: step 1, acquiring a data capture task and locking it; step 2, opening the target website with Puppeteer and extracting the target data after a preset event occurs; and step 3, storing the captured data, unlocking the task, and marking it as finished. Because the method works directly on the rendered page, it can obtain all of the valid information. The method can run in headless mode, several instances can run simultaneously, system resource usage is low, and distributed deployment is possible; overall stability and capture efficiency are very high, the method can be deployed on an ordinary Linux server, and the capture capability can be offered as a network service.

Description

Puppeteer-based website data collection method

Technical Field

The invention belongs to the field of website data collection, and particularly relates to a website data collection method based on puppeteer.

Background

Ordinary website data collection uses a crawler to issue the HTTP request corresponding to a website URL and then parses the response. An ordinary crawler can crawl traditional web content. However, most websites today use Ajax to obtain content and render their pages with JavaScript, so an ordinary crawler obtains either no valid data or only part of the data.

Some more advanced collection methods open a browser through Selenium and obtain web page content by locating elements. Selenium is a testing framework; it must run on an operating system with a display and cannot be deployed on an ordinary Linux server. As a result, distributed deployment cannot be achieved, stability and capture efficiency are very poor, and capture cannot be offered as a service.

Disclosure of Invention

The invention aims to provide a website data collection method based on Puppeteer that solves the technical problems described above.

The invention provides a website data collection method based on Puppeteer, characterized in that a target website is loaded in Puppeteer and data capture is carried out by a plurality of independent crawler processes, each of which performs the following steps:

Step 1, acquiring a data grabbing task and locking the grabbing task;

Step 2, opening a target website by using puppeteer, and extracting target data after a preset event occurs;

Step 3, storing the captured data, unlocking the task, and marking the capture task as finished.
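The three steps above can be sketched as a single pass of one crawler process. This is a minimal TypeScript sketch under stated assumptions, not the patent's implementation: `CrawlTask`, `TaskStore`, `runOnce`, and the injected `capture` and `save` functions are hypothetical names, and the in-memory store stands in for whatever task table a real deployment uses.

```typescript
// Hypothetical task record; the field names are illustrative assumptions.
interface CrawlTask {
  id: number;
  url: string;
  locked: boolean;
  done: boolean;
}

// Hypothetical in-memory stand-in for the task table.
class TaskStore {
  private readonly tasks: CrawlTask[];

  constructor(tasks: CrawlTask[]) {
    this.tasks = tasks;
  }

  // Step 1: find a task that is neither captured nor locked, and lock it.
  acquire(): CrawlTask | undefined {
    const task = this.tasks.find((t) => !t.done && !t.locked);
    if (task) task.locked = true;
    return task;
  }

  // End of step 3: unlock the task and mark it finished.
  finish(task: CrawlTask): void {
    task.locked = false;
    task.done = true;
  }
}

// One pass of a crawler process. `capture` stands in for step 2 (opening
// the site and extracting data); `save` stands in for storing the data.
function runOnce(
  store: TaskStore,
  capture: (url: string) => string,
  save: (taskId: number, data: string) => void,
): boolean {
  const task = store.acquire();
  if (!task) return false; // nothing to do; the process would go to sleep
  const data = capture(task.url);
  save(task.id, data);
  store.finish(task);
  return true;
}
```

Passing `capture` and `save` in as functions keeps the control flow visible without committing to a particular browser driver or storage backend.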

Further, the step 1 comprises:

Setting a timing trigger that queries the task table every 1 minute for data tasks that have not been captured; if such a task exists, locking it to prevent other processes from repeatedly executing the same capture task, and then invoking the data capture module; and if no task is found in the task table, the process goes to sleep and waits for the next wake-up by the timing trigger.

Further, the step 2 comprises:

After the capture module is awakened, it obtains the task type from the capture task and determines the target website URL according to the task type.

The capture module starts Chrome through Puppeteer and opens the target website URL;

While Chrome opens the target website URL, the following events are monitored: an HTTP request is sent, an HTTP request returns content, page loading completes, and page loading fails;

One or more data capture scripts are set, each capture script is associated with one or more of these events, and when a specific event occurs the capture script calls Puppeteer to capture data.

Further, the step 3 comprises:

Parsing the data content and then checking whether it is valid; if the data content is invalid, setting the task state to abnormal, placing the task in the task pool, and waiting for the data capture to be retried;

If the data content is valid, performing a completeness check on it; if the data is incomplete, setting the task state to pending capture and waiting for capture of the task data to continue;

If the data content is complete, pushing the data into a message queue and, at the same time, setting the task state to completed.

Compared with the prior art, the invention has the following beneficial effects:

1) The method works directly on the rendered page and can therefore obtain all of the valid information.

2) The invention is based on Puppeteer, a headless browser together with an API library for operating it. The invention can run in headless mode, can run several instances simultaneously, has low system resource usage, can be deployed in a distributed manner, has very high overall stability and capture efficiency, can be deployed on an ordinary Linux server, and can offer capture as a network service.

Drawings

FIG. 1 is an overall system framework diagram of the present invention;

FIG. 2 is a flow chart of the acquisition and locking of the data capture task of the present invention;

FIG. 3 is a flow chart of data capture according to the present invention;

FIG. 4 is a task data storage flow diagram of the present invention.

Detailed Description

The present invention is described in detail with reference to the embodiments shown in the drawings. It should be understood that these embodiments do not limit the present invention, and that functional, methodological, or structural equivalents or substitutions made by those skilled in the art on the basis of these embodiments fall within the scope of the present invention.

This embodiment provides a website data collection method based on Puppeteer, in which Puppeteer (a headless version of Chrome together with a JavaScript interface for operating it) drives Chrome to open a target website. Data is collected after the target website has loaded in Puppeteer. Several instances of the system can capture data at the same time: the system can be deployed in a distributed manner on multiple machines, and multiple processes on each server can capture data simultaneously. Each process captures tasks on its own and is independent of the others. The architecture of the overall system is shown in FIG. 1.

The whole system can be deployed on any Linux server. When the system starts, a main process is launched on the server. The main process starts N crawler subprocesses according to the number of CPU cores of the Linux server. Each crawler subprocess runs independently, and the subprocesses do not communicate with one another. The main process monitors the subprocesses and is responsible for keeping them alive and restarting them.
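The supervising role of the main process can be illustrated with a small sketch. The names `CrawlerWorker`, `Supervisor`, and `reap` are hypothetical, and the spawn function is injected so that the block runs without starting real processes; in a Node.js deployment the pool size would plausibly come from `os.cpus().length` and the workers from `child_process.fork`, but that wiring is an assumption and is not shown.

```typescript
// Hypothetical handle for one crawler subprocess.
interface CrawlerWorker {
  id: number;
  alive: boolean;
}

// Hypothetical supervisor that keeps a fixed number of workers alive.
class Supervisor {
  private workers: CrawlerWorker[] = [];
  private nextId = 1;
  private readonly spawn: (id: number) => CrawlerWorker;

  // `size` would be the CPU core count in the described system.
  constructor(size: number, spawn: (id: number) => CrawlerWorker) {
    this.spawn = spawn;
    for (let i = 0; i < size; i++) {
      this.workers.push(this.spawn(this.nextId++));
    }
  }

  // Replace every dead worker with a fresh one.
  // Returns the number of workers that were restarted.
  reap(): number {
    let restarted = 0;
    this.workers = this.workers.map((w) => {
      if (w.alive) return w;
      restarted++;
      return this.spawn(this.nextId++);
    });
    return restarted;
  }

  // Current worker ids, in slot order.
  ids(): number[] {
    return this.workers.map((w) => w.id);
  }
}
```

Because the workers never talk to each other, the supervisor only needs to know whether each one is alive; a periodic call to `reap()` is enough to restore the pool.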

Data capture inside each crawler process can be roughly divided into three steps:

Step 1, acquiring a data capture task and locking it;

Step 2, opening the target website with Puppeteer and extracting the target data after a preset event occurs;

Step 3, storing the captured data, unlocking the task, and marking it as completed.

The three steps are described in detail below.

The first step is the acquisition and locking of the data capture task, as shown in FIG. 2.

The system is provided with a timing trigger that queries the task table every 1 minute for data tasks that have not been captured. If such a task exists, it is locked so that other processes cannot repeatedly execute the same capture task, and the data capture module is then invoked.

If the task is not found in the task table, the process goes to sleep and waits for the next wake-up of the timing trigger.
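The polling and locking behaviour can be sketched as follows. `TaskRow`, `claimNext`, and `onTick` are hypothetical names, and the array stands in for the task table. The text does not say how the lock is implemented; in this sketch the check and the update happen in one synchronous step, and a task table shared between processes would need an equivalent atomic operation.

```typescript
type TaskState = "pending" | "locked" | "done";

// Hypothetical row of the task table.
interface TaskRow {
  id: number;
  state: TaskState;
  lockedBy: string | null;
}

// The text specifies that the task table is queried every 1 minute.
const POLL_INTERVAL_MS = 60_000;

// Lock the first pending task for `workerId`, or return null if none exists.
function claimNext(table: TaskRow[], workerId: string): TaskRow | null {
  for (const row of table) {
    if (row.state === "pending") {
      row.state = "locked";
      row.lockedBy = workerId;
      return row;
    }
  }
  return null;
}

// What the timing trigger does on each tick: claim a task and hand it to
// the capture module, or report that the process should go back to sleep.
function onTick(
  table: TaskRow[],
  workerId: string,
  capture: (row: TaskRow) => void,
): "captured" | "sleep" {
  const row = claimNext(table, workerId);
  if (row === null) return "sleep";
  capture(row);
  return "captured";
}
```

A timer such as `setInterval(() => onTick(table, "worker-1", capture), POLL_INTERVAL_MS)` would drive this in a long-running process; the worker id and the `capture` callback are placeholders.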

The second step is data capture, as shown in FIG. 3.

This step is the core of the Puppeteer-based capture.

After the capture module is awakened, it obtains the task type from the capture task and determines the target website URL according to the task type.
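Determining the URL from the task type can be as simple as a lookup table. The sketch below is illustrative only: the patent does not list any task types, so `productPage`, `searchResult`, the `example.com` URLs, and the `resolveUrl` helper are all hypothetical.

```typescript
// Hypothetical task types mapped to URL builders.
const URL_TEMPLATES: Record<string, (key: string) => string> = {
  productPage: (key) => `https://example.com/product/${encodeURIComponent(key)}`,
  searchResult: (key) => `https://example.com/search?q=${encodeURIComponent(key)}`,
};

// Resolve the target URL for a task; unknown task types are rejected
// instead of being silently ignored.
function resolveUrl(taskType: string, key: string): string {
  const build = URL_TEMPLATES[taskType];
  if (build === undefined) {
    throw new Error(`unknown task type: ${taskType}`);
  }
  return build(key);
}
```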

The capture module then starts Chrome through Puppeteer and opens the target website URL.

While Chrome opens the target website URL, various events are fired continuously. The system needs to monitor four of them:

An HTTP request is sent;

An HTTP request returns content;

Page loading completes;

Page loading fails.

One or more data capture scripts are set for each website. A capture script calls Puppeteer to capture the target data from HTTP request URLs, from the content returned by HTTP requests, and from the HTML elements of the page.

A capture script may be associated with one or more of the four events above and captures data when the associated event occurs.
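The association between capture scripts and events can be sketched as a small registry. `CaptureEvent`, `EventPayload`, and `ScriptRegistry` are hypothetical names. With Puppeteer, the first three events would plausibly be fed from the page's `request`, `response`, and `load` events, and a failed load from a rejected `page.goto()` call; that wiring is an assumption and is left out so the block runs without a browser.

```typescript
// The four monitored events, under illustrative names.
type CaptureEvent = "request" | "response" | "load" | "loadFailed";

// Payload handed to a capture script; the fields are assumptions.
interface EventPayload {
  url: string;
  body?: string;
}

// A capture script returns extracted data, or null if it found nothing.
type CaptureScript = (event: CaptureEvent, payload: EventPayload) => string | null;

class ScriptRegistry {
  private readonly byEvent = new Map<CaptureEvent, CaptureScript[]>();

  // Associate one script with one or more events.
  register(events: CaptureEvent[], script: CaptureScript): void {
    for (const ev of events) {
      const list = this.byEvent.get(ev) ?? [];
      list.push(script);
      this.byEvent.set(ev, list);
    }
  }

  // Run every script bound to `event` and collect the non-null results.
  dispatch(event: CaptureEvent, payload: EventPayload): string[] {
    const results: string[] = [];
    for (const script of this.byEvent.get(event) ?? []) {
      const out = script(event, payload);
      if (out !== null) results.push(out);
    }
    return results;
  }
}
```

A script bound to `response` can pick data out of an Ajax reply, while a script bound to `load` can read the rendered page; both go through the same `dispatch` call.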

If the page fails to open because of network problems, problems with the website, or similar causes, the capture task itself still counts as completed, but its data content is 'failure'.
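This failure rule can be expressed as a thin wrapper around the capture call. `CaptureResult` and `captureOrFail` are hypothetical names, and `open` is a synchronous stand-in for opening the page and running the capture scripts; a real browser call would be asynchronous.

```typescript
interface CaptureResult {
  taskCompleted: boolean;
  data: string;
}

// `open` is expected to throw when the page cannot be opened.
function captureOrFail(open: () => string): CaptureResult {
  try {
    return { taskCompleted: true, data: open() };
  } catch {
    // Network or website errors do not abort the task; the result is
    // recorded as the literal 'failure' described in the text.
    return { taskCompleted: true, data: "failure" };
  }
}
```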

The third step is storing the task data, as shown in FIG. 4.

First, the data content is parsed and then checked for validity.

If the data content is invalid, the task state is set to abnormal, and the task is placed in the task pool to wait for the data capture to be retried.

If the data content is valid, a completeness check is performed on it. If the data is incomplete, the task state is set to pending capture, and the task waits for capture of its data to continue.

If the data content is complete, the data is pushed into the Kafka message queue (Kafka is a common message queue implementation).

At the same time, the task state is set to completed.
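The storage decisions above form a small state machine, sketched below. The text does not define what valid or complete means, so this sketch assumes that valid data parses as a JSON object and that complete data contains every required field. `storeResult`, `Outcome`, and the array used as a queue are hypothetical; a real system would publish to Kafka through a client library.

```typescript
type TaskStatus = "abnormal" | "pending" | "completed";

interface Outcome {
  status: TaskStatus;
  queued: boolean;
}

// Parse, validate, check completeness, then queue the data.
function storeResult(
  raw: string,
  requiredFields: string[],
  queue: string[],
): Outcome {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: "abnormal", queued: false }; // invalid: retry later
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { status: "abnormal", queued: false }; // not a JSON object
  }
  const record = parsed as Record<string, unknown>;
  const complete = requiredFields.every((f) => record[f] !== undefined);
  if (!complete) {
    return { status: "pending", queued: false }; // capture continues later
  }
  queue.push(raw); // stand-in for pushing to the Kafka message queue
  return { status: "completed", queued: true };
}
```

Under these assumptions, the literal 'failure' recorded for a page that could not be opened does not parse as JSON, so the task is marked abnormal and becomes eligible for retry.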

The technical effects of the invention comprise:

1) The invention is based on Puppeteer, a headless browser together with an API library for operating it. Because it can run in headless mode, it can be deployed on a conventional Linux operating system.

2) The invention can run several instances simultaneously, has low system resource usage, can be deployed in a distributed manner, makes full use of server resources, and has very high overall stability and capture efficiency. Capacity can be expanded as a whole simply by adding new server resources.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A website data collection method based on Puppeteer, characterized in that a target website is loaded in Puppeteer and data capture is carried out by a plurality of independent crawler subprocesses, each of which performs the following steps:

Step 1, acquiring a data grabbing task and locking the grabbing task;

2. The Puppeteer-based website data collection method according to claim 1, wherein step 1 comprises:

3. The Puppeteer-based website data collection method according to claim 2, wherein step 2 comprises:

4. The Puppeteer-based website data collection method according to claim 3, wherein step 3 comprises: