CN110569414A - Puppeteer-based website data collection method - Google Patents


Info

Publication number: CN110569414A
Authority: CN (China)
Prior art keywords: data, task, puppeteer, grabbing, website
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN201910773517.7A
Other languages: Chinese (zh)
Inventor: 曹特磊
Current Assignee: Interactive (beijing) Technology Co Ltd (the listed assignees may be inaccurate)
Original Assignee: Interactive (beijing) Technology Co Ltd
Priority date: 2019-08-21 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2019-08-21
Publication date: 2019-12-13

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/4401 Bootstrapping
    • G06F9/4418 Suspend and resume; Hibernate and awake
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues

Abstract

The invention provides a website data collection method based on Puppeteer, in which a target website is loaded in Puppeteer and data is captured by a plurality of independent crawler subprocesses. Each crawler subprocess performs the following steps: step 1, acquiring a data capture task and locking it; step 2, opening the target website with Puppeteer and extracting the target data after a preset event occurs; and step 3, storing the captured data, unlocking the task, and marking it as completed. Because the method works on the rendered page, it can obtain all of the valid information on it. The method runs in headless mode (without a display), can run multiple instances simultaneously, uses few system resources, and supports distributed deployment. Its overall stability and capture efficiency are high, it can be deployed on an ordinary Linux server, and web capture can be provided as a service.

Description

Puppeteer-based website data collection method
Technical Field
The invention belongs to the field of website data collection, and particularly relates to a website data collection method based on puppeteer.
Background
Ordinary website data collection uses a crawler to issue the HTTP request corresponding to a website's URL and then parses the response. Such a crawler can crawl traditional web content. Most websites today, however, fetch their content with Ajax and render their pages with JavaScript, so an ordinary web crawler obtains no valid data or only part of the data.
Some more advanced collection methods open a browser through Selenium and read the web page content by locating elements. Selenium is a testing framework; it must run on an operating system with a graphical display and cannot be deployed on an ordinary Linux server. Consequently, distributed deployment is not possible, stability and capture efficiency are poor, and capture cannot be offered as a service.
Disclosure of Invention
The invention aims to provide a website data collection method based on Puppeteer that solves the above technical problems.
The invention provides a website data collection method based on Puppeteer, in which a target website is loaded in Puppeteer and data is captured by a plurality of independent crawler subprocesses, each of which performs the following steps:
Step 1, acquiring a data capture task and locking the capture task;
Step 2, opening the target website with Puppeteer, and extracting the target data after a preset event occurs;
Step 3, storing the captured data, unlocking the capture task, and marking it as completed.
Further, the step 1 comprises:
Setting a timing trigger that queries the task table once every minute for data tasks that have not yet been captured; if such a task exists, locking it so that other processes cannot execute the same capture task again, and then invoking the data capture module; and if no task is found in the task table, putting the process to sleep until the timing trigger next wakes it.
Further, the step 2 comprises:
After the capture module is woken up, it obtains the task type from the capture task and determines the target website URL according to the task type;
The capture module starts Chrome through Puppeteer and opens the target website URL;
While Chrome opens the target website URL, the following events are monitored: an HTTP request is sent, an HTTP request returns content, page loading completes, and page loading fails;
One or more data capture scripts are set, each capture script is associated with one or more of these events, and when a specific event occurs the capture script calls Puppeteer to capture the data.
Further, the step 3 comprises:
Parsing the data content, and then checking whether the data content is valid; if the data content is invalid, setting the task state to abnormal, returning the task to the task pool, and waiting for the data capture to be retried;
If the data content is valid, checking the data content for completeness; if the data is incomplete, setting the task state to "to be captured" and waiting for capture of the task data to continue;
If the data content is complete, pushing the data into a message queue and setting the task state to completed.
Compared with the prior art, the invention has the following beneficial effects:
1) Because the invention works on the rendered page, it can obtain all of the valid information on it.
2) The invention is based on Puppeteer, which pairs a headless browser with an API library for controlling it. The invention can therefore run without a display, can run multiple instances simultaneously, uses few system resources, supports distributed deployment, offers high overall stability and capture efficiency, can be deployed on an ordinary Linux server, and allows web capture to be provided as a service.
Drawings
FIG. 1 is an overall system framework diagram of the present invention;
FIG. 2 is a flow chart of the acquisition and locking of the data capture task of the present invention;
FIG. 3 is a flow chart of data capture according to the present invention;
FIG. 4 is a task data storage flow diagram of the present invention.
Detailed Description
The present invention is described in detail below with reference to the embodiments shown in the drawings. It should be understood that these embodiments do not limit the present invention, and that functional, methodological, or structural equivalents or substitutions made by those skilled in the art on the basis of these embodiments fall within the scope of the present invention.
This embodiment provides a website data collection method based on Puppeteer, which uses Puppeteer (a headless version of Chrome together with a JavaScript API for controlling it) to drive Chrome and open a target website. Data is collected after the target website has been loaded in Puppeteer. Multiple instances of the system can capture data at the same time: the deployment is distributed, it can span multiple machines, and multiple processes on each server can capture data simultaneously. Each process fetches its tasks on its own, independently of the others. The architecture of the overall system is shown in FIG. 1.
The whole system can be deployed on any Linux server. When the system starts, a main process is launched on the server. The main process starts N crawler subprocesses, where N is chosen according to the number of CPU cores reported by the Linux server. Each crawler subprocess runs independently, and the subprocesses do not communicate with one another. The main process monitors the subprocesses and is responsible for keeping them alive and restarting them.
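The main-process behaviour described above can be illustrated with a minimal sketch. It is not taken from the patent: the `Supervisor` class, the `WorkerHandle` shape, and the injected `spawn` function are hypothetical names, and the sketch assumes one subprocess per CPU core, with the core count passed in rather than read from the operating system.

```typescript
// Hypothetical handle for one crawler subprocess. A real system would wrap
// an operating-system process here; the names are illustrative only.
interface WorkerHandle {
  id: number;
  alive: boolean;
}

// Function that launches a subprocess. It is injected so the sketch stays
// independent of any particular process API.
type SpawnFn = (id: number) => WorkerHandle;

class Supervisor {
  private workers: WorkerHandle[] = [];
  private cpuCores: number;
  private spawn: SpawnFn;

  constructor(cpuCores: number, spawn: SpawnFn) {
    this.cpuCores = cpuCores;
    this.spawn = spawn;
  }

  // Assumption: N equals the core count, one crawler subprocess per core.
  start(): void {
    for (let id = 0; id < this.cpuCores; id++) {
      this.workers.push(this.spawn(id));
    }
  }

  // One monitoring pass: replace every subprocess that has died and
  // return the ids that were restarted.
  restartDead(): number[] {
    const restarted: number[] = [];
    this.workers.forEach((worker, index) => {
      if (!worker.alive) {
        this.workers[index] = this.spawn(worker.id);
        restarted.push(worker.id);
      }
    });
    return restarted;
  }

  // Number of subprocesses currently marked alive.
  aliveCount(): number {
    return this.workers.filter((worker) => worker.alive).length;
  }
}
```

In a Node.js deployment the injected `spawn` function would typically wrap a real process launcher, and `restartDead` would be called from a periodic health check; both choices are assumptions of this sketch.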
Data capture inside each crawler subprocess can be roughly divided into three steps:
Acquiring a data capture task and locking the capture task;
Opening the target website with Puppeteer, and extracting the target data after a preset event occurs;
Storing the captured data, unlocking the capture task, and marking it as completed.
These three steps are described in detail below.
The first step is the acquisition and locking of the data capture task, as shown in FIG. 2.
The system has a timing trigger that queries the task table once every minute for data tasks that have not yet been captured. If such a task exists, it is locked so that other processes cannot execute the same capture task again, and the data capture module is then invoked.
If no task is found in the task table, the process goes to sleep and waits for the timing trigger to wake it next time.
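The polling and locking step can be sketched as follows. This is a minimal illustration under stated assumptions: the task table is an in-memory array, the `TaskTable`, `CrawlTask`, and `onTimerTick` names are hypothetical, and the patent does not say how the lock is actually implemented.

```typescript
type TaskState = "pending" | "locked" | "done";

interface CrawlTask {
  id: number;
  type: string;
  state: TaskState;
  lockedBy: string | null;
}

// In-memory stand-in for the task table. Across several processes the
// check-and-lock below would have to be one atomic operation in a shared
// store (for example a conditional update in a database). That is an
// assumption of this sketch, not something the text specifies.
class TaskTable {
  private tasks: CrawlTask[];

  constructor(tasks: CrawlTask[]) {
    this.tasks = tasks;
  }

  // Find the first uncaptured task and lock it for `workerId`.
  // Returns null when there is nothing to do.
  claimNext(workerId: string): CrawlTask | null {
    for (const task of this.tasks) {
      if (task.state === "pending") {
        task.state = "locked";
        task.lockedBy = workerId;
        return task;
      }
    }
    return null;
  }
}

// One firing of the timing trigger: claim a task and hand it to the capture
// module, or report that the process should sleep until the next firing.
function onTimerTick(
  table: TaskTable,
  workerId: string,
  capture: (task: CrawlTask) => void
): "captured" | "sleep" {
  const task = table.claimNext(workerId);
  if (task === null) return "sleep";
  capture(task);
  return "captured";
}

// Polling interval from the text: once every minute.
const POLL_INTERVAL_MS = 60 * 1000;
```

A timer such as `setInterval` could call `onTimerTick` every `POLL_INTERVAL_MS` milliseconds. Because a locked task is no longer in the `pending` state, a second process polling the same table skips it.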
The second step is data capture, as shown in FIG. 3.
This step is the core of the Puppeteer-based capture.
After the capture module is woken up, it obtains the task type from the capture task and determines the target website URL according to the task type.
The capture module starts Chrome through Puppeteer and opens the target website URL.
While Chrome opens the target website URL, it fires events continuously; the system needs to monitor four of them:
An HTTP request is sent;
An HTTP request returns content;
Page loading completes;
Page loading fails.
One or more data capture scripts are set up for each website. A capture script calls Puppeteer to capture the target data from the HTTP request URL, the HTTP response content, or the HTML elements of the page.
A capture script can be associated with one or more of the four events above and captures data when the associated event occurs.
If the page fails to open because of a network problem, a problem with the website, or a similar cause, the capture task itself still counts as complete, but its data content is 'failure'.
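The association between capture scripts and events can be sketched as a small registry. The sketch deliberately does not call Puppeteer: `ScriptRegistry`, `CrawlEvent`, and `EventPayload` are hypothetical names, and the event labels are this sketch's own. In Puppeteer itself, a page emits `request`, `response`, and `load` events that can be subscribed to with `page.on(...)`, while a failed navigation surfaces as a rejected `page.goto(...)` promise; a browser layer would translate those into the `dispatch` calls below.

```typescript
// The four events the text says the system monitors. The labels are this
// sketch's own, not identifiers from the patent or from Puppeteer.
type CrawlEvent = "request" | "response" | "load" | "loadFailed";

// Whatever the browser layer passes along with an event (hypothetical shape).
interface EventPayload {
  url: string;
  body?: string;
}

// A capture script receives the payload and returns extracted data,
// or null when this event carries nothing of interest.
type CaptureScript = (payload: EventPayload) => string | null;

class ScriptRegistry {
  private scripts: { events: CrawlEvent[]; run: CaptureScript }[] = [];

  // Associate one script with one or more events.
  register(events: CrawlEvent[], run: CaptureScript): void {
    this.scripts.push({ events: events, run: run });
  }

  // Called when an event occurs. Runs every script associated with that
  // event and collects the non-null results.
  dispatch(event: CrawlEvent, payload: EventPayload): string[] {
    // Per the text, a failed page load still completes the task,
    // with the data content recorded as 'failure'.
    if (event === "loadFailed") return ["failure"];

    const results: string[] = [];
    for (const script of this.scripts) {
      if (script.events.indexOf(event) === -1) continue;
      const data = script.run(payload);
      if (data !== null) results.push(data);
    }
    return results;
  }
}
```

Keeping the registry separate from the browser layer means one script can listen to several events, and several scripts can share one event, which matches the one-or-more association described above.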
The third step is storing the task data, as shown in FIG. 4.
First, the data content is parsed and then checked for validity.
If the data content is invalid, the task state is set to abnormal, the task returns to the task pool, and the data capture waits to be retried.
If the data content is valid, it is checked for completeness. If the data is incomplete, the task state is set to "to be captured" and the task waits for capture of its data to continue.
If the data content is complete, the data is pushed into a Kafka message queue (Kafka is a widely used message queue implementation).
At the same time, the task state is set to completed.
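The storage decision can be sketched as a single function that returns the resulting task state. The validity and completeness rules are site-specific and are not given in the text, so the JSON shape (`items`, `total`) and both checks below are placeholders chosen for illustration. `MessageQueue` is a hypothetical interface standing in for a Kafka producer, whose client API is not reproduced here.

```typescript
type StoreState = "abnormal" | "to_be_captured" | "complete";

// Hypothetical queue interface standing in for a Kafka producer.
interface MessageQueue {
  push(message: string): void;
}

// Placeholder data shape: a list of items plus the total the site reports.
interface ParsedPage {
  items: string[];
  total: number;
}

// Parse the raw capture and apply a placeholder validity check.
// Returns null when the content is not valid.
function parsePage(raw: string): ParsedPage | null {
  try {
    const value = JSON.parse(raw);
    if (
      value !== null &&
      typeof value === "object" &&
      Array.isArray(value.items) &&
      typeof value.total === "number"
    ) {
      return { items: value.items, total: value.total };
    }
    return null;
  } catch (e) {
    return null;
  }
}

// Decide the task state and push complete data to the queue.
function storeTaskData(raw: string, queue: MessageQueue): StoreState {
  const parsed = parsePage(raw);
  // Invalid content: the task goes back to the pool and waits for a retry.
  if (parsed === null) return "abnormal";
  // Placeholder completeness check: fewer items than the reported total.
  if (parsed.items.length < parsed.total) return "to_be_captured";
  queue.push(JSON.stringify(parsed));
  return "complete";
}
```

Returning the state rather than writing it directly keeps the sketch free of any particular task-table API; the caller would persist the state and release the lock.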
The technical effects of the invention comprise:
1) The invention is based on Puppeteer, which pairs a headless browser with an API library for controlling it. Because the invention can run without a display, it can be deployed on a conventional Linux operating system.
2) The invention can run multiple instances simultaneously, uses few system resources, supports distributed deployment, makes full use of server resources, and offers high overall stability and capture efficiency. Capacity can be expanded as a whole simply by adding new server resources.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (4)

1. A website data collection method based on Puppeteer, characterized in that a target website is loaded in Puppeteer and data is captured by a plurality of independent crawler subprocesses, each crawler subprocess performing the following steps:
Step 1, acquiring a data capture task and locking the capture task;
Step 2, opening the target website with Puppeteer, and extracting the target data after a preset event occurs;
Step 3, storing the captured data, unlocking the capture task, and marking it as completed.
2. The Puppeteer-based website data collection method according to claim 1, wherein step 1 comprises:
Setting a timing trigger that queries the task table once every minute for data tasks that have not yet been captured; if such a task exists, locking it so that other processes cannot execute the same capture task again, and then invoking the data capture module; and if no task is found in the task table, putting the process to sleep until the timing trigger next wakes it.
3. The Puppeteer-based website data collection method according to claim 2, wherein step 2 comprises:
After the capture module is woken up, it obtains the task type from the capture task and determines the target website URL according to the task type;
The capture module starts Chrome through Puppeteer and opens the target website URL;
While Chrome opens the target website URL, the following events are monitored: an HTTP request is sent, an HTTP request returns content, page loading completes, and page loading fails;
One or more data capture scripts are set, each capture script is associated with one or more of these events, and when a specific event occurs the capture script calls Puppeteer to capture the data.
4. The Puppeteer-based website data collection method according to claim 3, wherein step 3 comprises:
Parsing the data content, and then checking whether the data content is valid; if the data content is invalid, setting the task state to abnormal, returning the task to the task pool, and waiting for the data capture to be retried;
If the data content is valid, checking the data content for completeness; if the data is incomplete, setting the task state to "to be captured" and waiting for capture of the task data to continue;
If the data content is complete, pushing the data into a message queue and setting the task state to completed.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910773517.7A (CN110569414A, en) | 2019-08-21 | 2019-08-21 | Puppeteer-based website data collection method

Publications (1)

Publication Number | Publication Date
CN110569414A (en) | 2019-12-13

Family

ID=68774103

Country Status (1)

Country Link
CN (1) CN110569414A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN1980243A (en) * | 2005-10-28 | 2007-06-13 | 埃森哲全球服务有限公司 | Service broker integration layer for supporting telecommunication client service requests
CN103856467A (en) * | 2012-12-06 | 2014-06-11 | 百度在线网络技术(北京)有限公司 | Method and distributed system for achieving safety scanning
CN109471979A (en) * | 2018-12-20 | 2019-03-15 | 北京奇安信科技有限公司 | A kind of method, system, equipment and medium grabbing dynamic page
CN109815384A (en) * | 2019-01-29 | 2019-05-28 | 携程旅游信息技术(上海)有限公司 | Method, system, equipment and the storage medium that crawler is realized

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
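李竹瑶 (Li Zhuyao): "Analysis and Design of a Weibo Public Opinion Collection System Based on a Node Crawler" (基于Node爬虫的微博舆情采集系统分析与设计), China Excellent Master's and Doctoral Dissertations Full-text Database, Information Science and Technology series (《中国优秀博硕士学位论文全文数据库 信息科技辑》) *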

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112948659A (en) * | 2021-03-09 | 2021-06-11 | 深圳九星互动科技有限公司 | Webpage data acquisition method, device, system and medium
CN113065055A (en) * | 2021-04-21 | 2021-07-02 | 平安国际智慧城市科技股份有限公司 | News information capturing method and device, electronic equipment and storage medium
CN113065055B (en) * | 2021-04-21 | 2024-04-02 | 深圳赛安特技术服务有限公司 | News information capturing method and device, electronic equipment and storage medium
CN113934914A (en) * | 2021-12-20 | 2022-01-14 | 成都橙视传媒科技股份公司 | Method for collecting batch encrypted data of news media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191213