CN110569414A - Puppeteer-based website data collection method - Google Patents
- Publication number: CN110569414A
- Application number: CN201910773517.7A
- Authority: CN (China)
- Prior art keywords: data, task, puppeteer, grabbing, website
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/4401—Bootstrapping
- G06F9/4418—Suspend and resume; Hibernate and awake
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
Abstract
The invention provides a Puppeteer-based website data collection method in which a target website is loaded in Puppeteer and data is captured by a plurality of independent crawler processes. Each crawler process performs the following steps: step 1, acquire a data capture task and lock it; step 2, open the target website with Puppeteer and extract the target data after a preset event occurs; step 3, store the captured data, unlock the task, and mark it as completed. Because the method works on the rendered page, it can acquire all of the effective information. The method runs in headless mode (without a graphical display), can run several instances at once, occupies few system resources, and supports distributed deployment. Its overall stability and capture efficiency are very high, it can be deployed on an ordinary Linux server, and the capture capability can be offered as a network service.
Description
Technical Field
The invention belongs to the field of website data collection, and particularly relates to a website data collection method based on puppeteer.
Background
Ordinary website data collection uses a crawler to issue the HTTP request for a website's URL and parse the response. Such a crawler can handle traditional web content. Most current websites, however, fetch their content with Ajax and render their pages with JavaScript, so an ordinary crawler obtains either no valid data or only part of it.
Some more advanced collection methods open a browser through Selenium and obtain page content by locating elements. Selenium is a testing framework; it must run on an operating system with a graphical display and cannot be deployed on an ordinary Linux server. As a result, distributed deployment is not possible, stability and capture efficiency are very poor, and capture cannot be offered as a service.
Disclosure of Invention
The invention aims to provide a Puppeteer-based website data collection method that solves the technical problems described above.
The invention provides a Puppeteer-based website data collection method in which a target website is loaded in Puppeteer and data is captured by a plurality of independent crawler processes, each crawler process performing the following steps:
Step 1, acquiring a data capture task and locking the capture task;
Step 2, opening the target website using Puppeteer, and extracting target data after a preset event occurs;
Step 3, storing the captured data, unlocking the capture task, and marking it as completed.
Further, step 1 comprises:
Setting a timing trigger that queries the task table once every minute for data tasks that have not yet been captured; if such a task exists, locking it to prevent other processes from executing the same capture task, and then invoking the data capture module; if no task is found in the task table, the process goes dormant and waits for the next wake-up by the timing trigger.
Further, step 2 comprises:
After the capture module is woken, it obtains the task type from the capture task and determines the target website URL according to the task type;
The capture module starts Chrome through Puppeteer and opens the target website URL;
While Chrome opens the target website URL, the following events are monitored: an HTTP request is sent, an HTTP request returns content, page loading completes, and page loading fails;
One or more data capture scripts are set and associated with one or more of these events; when a specific event occurs, the associated capture script calls Puppeteer to capture data.
Further, step 3 comprises:
Parsing the data content and then checking whether it is valid; if the data content is invalid, setting the task state to abnormal, placing the task in the task pool, and waiting for the capture to be retried;
If the data content is valid, performing a completeness check on it; if the data is incomplete, setting the task state to "to be captured" and waiting for capture of the task data to continue;
If the data content is complete, pushing the data into a message queue and setting the task state to completed.
Compared with the prior art, the invention has the following beneficial effects:
1) The invention works directly on the rendered page and can therefore acquire all of the effective information.
2) The invention is based on Puppeteer, a headless browser together with an API library for operating it. The invention can run without a graphical display, can run several instances at once, occupies few system resources, and can be deployed in a distributed manner. Its overall stability and capture efficiency are very high, it can be deployed on an ordinary Linux server, and the capture capability can be offered as a network service.
Drawings
FIG. 1 is an overall system framework diagram of the present invention;
FIG. 2 is a flow chart of the acquisition and locking of the data capture task of the present invention;
FIG. 3 is a flow chart of data capture according to the present invention;
FIG. 4 is a task data storage flow diagram of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings. These embodiments do not limit the invention, and functional, methodological, or structural equivalents or substitutions made by those skilled in the art on the basis of these embodiments fall within the scope of the invention.
This embodiment provides a Puppeteer-based website data collection method in which Chrome is driven through Puppeteer (a headless version of Chrome together with a JavaScript API for operating Chrome) to open a target website. Data is collected after the target website has loaded in Puppeteer. The system can run several capture instances at once, that is, it supports distributed deployment: it can be deployed on multiple machines, and multiple processes on each server can capture data simultaneously. Each process picks up tasks on its own, independently of the others. The overall system architecture is shown in FIG. 1:
The whole system can be deployed on any Linux server. When the system starts, a main process is launched on the server. The main process starts N crawler subprocesses according to the number of CPU cores of the Linux server. Each crawler subprocess runs independently, and the subprocesses do not communicate with one another. The main process monitors the subprocesses and is responsible for keeping them alive and restarting them.
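The main-process behaviour described above can be illustrated with a minimal TypeScript sketch. It is not taken from the patent: `CrawlerHandle`, `SpawnCrawler`, and `CrawlerSupervisor` are hypothetical names, the core count is passed in rather than read from the operating system, and a subprocess is modelled as a plain object with an `alive` flag instead of a real OS process.

```typescript
// Hypothetical handle for one crawler subprocess. A real system would wrap an
// operating-system process here; this sketch only tracks whether it is alive.
interface CrawlerHandle {
  id: number;
  alive: boolean;
}

// Hypothetical factory that launches crawler subprocess number `id`.
type SpawnCrawler = (id: number) => CrawlerHandle;

class CrawlerSupervisor {
  private readonly coreCount: number;
  private readonly spawn: SpawnCrawler;
  private crawlers: CrawlerHandle[] = [];
  restartCount = 0;

  constructor(coreCount: number, spawn: SpawnCrawler) {
    this.coreCount = coreCount;
    this.spawn = spawn;
  }

  // Start N crawler subprocesses, one per CPU core.
  start(): void {
    for (let id = 0; id < this.coreCount; id++) {
      this.crawlers.push(this.spawn(id));
    }
  }

  // One monitoring pass: replace every subprocess that is no longer alive.
  superviseOnce(): void {
    this.crawlers = this.crawlers.map((crawler) => {
      if (crawler.alive) return crawler;
      this.restartCount++;
      return this.spawn(crawler.id);
    });
  }

  // Number of subprocesses currently alive.
  aliveCount(): number {
    return this.crawlers.filter((crawler) => crawler.alive).length;
  }
}
```

In a deployment, `superviseOnce` would be called periodically or in response to process-exit notifications. Because the subprocesses share no state with one another, restarting one does not affect the others.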
The data capture inside each crawler process can be roughly divided into three steps:
1) Acquire a data capture task and lock it.
2) Open the target website with Puppeteer and extract the target data after a preset event occurs.
3) Store the captured data, unlock the task, and mark it as completed.
The three steps are described in detail below.
The first step is acquiring and locking the data capture task, as shown in FIG. 2.
The system has a timing trigger that queries the task table once every minute for data tasks that have not yet been captured. If such a task exists, it is locked so that other processes cannot execute the same capture task, and the data capture module is then invoked.
If no task is found in the task table, the process goes dormant and waits for the next wake-up by the timing trigger.
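The acquire-and-lock step can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the task table is an in-memory array, and `CaptureTask`, `claimNextTask`, and `pollOnce` are hypothetical names. In a distributed deployment the table would live in a shared store, and the find-and-lock would have to be a single atomic operation there; the source does not say which store is used.

```typescript
// Hypothetical row of the task table.
interface CaptureTask {
  id: string;
  type: string;
  captured: boolean;
  lockedBy: string | null; // id of the process holding the lock, if any
}

// From the source: the timing trigger fires once every minute.
const POLL_INTERVAL_MS = 60 * 1000;

// Find the first task that is neither captured nor locked, and lock it.
// Assumption: in a real shared task table this must be one atomic operation.
function claimNextTask(table: CaptureTask[], processId: string): CaptureTask | null {
  const task = table.find((t) => !t.captured && t.lockedBy === null);
  if (task === undefined) return null;
  task.lockedBy = processId;
  return task;
}

// One firing of the timing trigger. Returns false when no task was found,
// in which case the process stays dormant until the next trigger.
function pollOnce(
  table: CaptureTask[],
  processId: string,
  capture: (task: CaptureTask) => void
): boolean {
  const task = claimNextTask(table, processId);
  if (task === null) return false;
  capture(task); // invoke the data capture module
  return true;
}
```

A process would typically drive this with a timer, for example `setInterval(() => pollOnce(table, processId, capture), POLL_INTERVAL_MS)`.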
The second step is data capture, as shown in FIG. 3.
This step is the core of the Puppeteer-based capture.
After the capture module is woken, it obtains the task type from the capture task and determines the target website URL according to the task type.
The capture module starts Chrome through Puppeteer and opens the target website URL.
While Chrome opens the target website URL, many kinds of events fire continuously. The system needs to monitor four of them:
An HTTP request is sent;
An HTTP request returns content;
Page loading completes;
Page loading fails.
One or more data capture scripts are set for each website. A capture script calls Puppeteer to capture target data from the HTTP request URL, the HTTP response content, and the HTML elements of the page.
A capture script can be associated with one or more of the four events above and captures data when the associated event occurs.
If the page fails to open because of network problems, website problems, or similar causes, the capture task itself still counts as complete, but its data content is 'failure'.
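The association between capture scripts and events can be sketched as a small registry. This is a dependency-free illustration rather than Puppeteer code: `CrawlEvent`, `EventPayload`, `CaptureScript`, and `ScriptRegistry` are hypothetical names. In Puppeteer itself, the four monitored events roughly correspond to the page events `request`, `response`, and `load`, with failures surfacing as `requestfailed` or as a rejected `page.goto()` call; wiring those to `dispatch` is left out here.

```typescript
// The four events the system monitors, under hypothetical names.
type CrawlEvent = "requestSent" | "responseReceived" | "pageLoaded" | "pageLoadFailed";

// Hypothetical payload handed to a capture script.
interface EventPayload {
  url: string;
  body?: string; // response content or page HTML, when available
}

// A capture script returns the extracted data, or null if it found nothing.
type CaptureScript = (payload: EventPayload) => string | null;

class ScriptRegistry {
  private scripts = new Map<CrawlEvent, CaptureScript[]>();

  // Associate one capture script with one or more events.
  register(events: CrawlEvent[], script: CaptureScript): void {
    for (const name of events) {
      const list = this.scripts.get(name) ?? [];
      list.push(script);
      this.scripts.set(name, list);
    }
  }

  // Run every script associated with the event and collect what they extract.
  dispatch(name: CrawlEvent, payload: EventPayload): string[] {
    const results: string[] = [];
    for (const script of this.scripts.get(name) ?? []) {
      const data = script(payload);
      if (data !== null) results.push(data);
    }
    // From the source: a failed page open still completes the task,
    // with 'failure' as its data content.
    if (name === "pageLoadFailed" && results.length === 0) {
      results.push("failure");
    }
    return results;
  }
}
```

Keeping the scripts in a registry keyed by event lets each website have its own set of scripts without changing the dispatch logic.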
The third step is storing the task data, as shown in FIG. 4.
The data content is first parsed and then checked for validity.
If the data content is invalid, the task state is set to abnormal, the task is placed in the task pool, and it waits for the capture to be retried.
If the data content is valid, a completeness check is performed. If the data is incomplete, the task state is set to "to be captured" and the task waits for capture to continue.
If the data content is complete, the data is pushed into the Kafka message queue (Kafka is a common message queue implementation).
At the same time, the task state is set to completed.
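The storage decision above amounts to a three-way branch, sketched here in TypeScript. The names `ParsedContent`, `StoreOutcome`, and `storeTaskData` are hypothetical, the `valid` and `complete` flags stand in for whatever checks a real system performs, and a plain array stands in for the Kafka message queue.

```typescript
// Task states named in the description of the third step.
type StoreOutcome = "abnormal" | "to_be_captured" | "completed";

// Hypothetical result of parsing and checking the captured data.
interface ParsedContent {
  valid: boolean;    // outcome of the validity check
  complete: boolean; // outcome of the completeness check
  records: string[]; // the parsed data items
}

// Decide the next task state; push the data to the queue only when it is
// both valid and complete. `messageQueue` is a stand-in for Kafka.
function storeTaskData(content: ParsedContent, messageQueue: string[]): StoreOutcome {
  if (!content.valid) {
    return "abnormal"; // task goes to the task pool and waits for a retry
  }
  if (!content.complete) {
    return "to_be_captured"; // task waits for capture to continue
  }
  messageQueue.push(...content.records);
  return "completed";
}
```

Checking validity before completeness mirrors the order in the text: invalid data is never inspected for completeness and never reaches the queue.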
The technical effects of the invention include:
1) The invention is based on Puppeteer, a headless browser together with an API library for operating it. Because the invention can run without a graphical display, it can be deployed on a conventional Linux operating system.
2) The invention can run several instances at once, occupies few system resources, can be deployed in a distributed manner, and makes full use of server resources; its overall stability and capture efficiency are very high. Capacity can be expanded simply by adding server resources.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Claims (4)
1. A Puppeteer-based website data collection method, characterized in that a target website is loaded in Puppeteer and data is captured by a plurality of independent crawler subprocesses, each crawler subprocess performing the following steps:
Step 1, acquiring a data capture task and locking the capture task;
Step 2, opening the target website using Puppeteer, and extracting target data after a preset event occurs;
Step 3, storing the captured data, unlocking the capture task, and marking it as completed.
2. The Puppeteer-based website data collection method according to claim 1, wherein step 1 comprises:
Setting a timing trigger that queries the task table once every minute for data tasks that have not yet been captured; if such a task exists, locking it to prevent other processes from executing the same capture task, and then invoking the data capture module; if no task is found in the task table, the process goes dormant and waits for the next wake-up by the timing trigger.
3. The Puppeteer-based website data collection method according to claim 2, wherein step 2 comprises:
After the capture module is woken, it obtains the task type from the capture task and determines the target website URL according to the task type;
The capture module starts Chrome through Puppeteer and opens the target website URL;
While Chrome opens the target website URL, the following events are monitored: an HTTP request is sent, an HTTP request returns content, page loading completes, and page loading fails;
One or more data capture scripts are set and associated with one or more of these events; when a specific event occurs, the associated capture script calls Puppeteer to capture data.
4. The Puppeteer-based website data collection method according to claim 3, wherein step 3 comprises:
Parsing the data content and then checking whether it is valid; if the data content is invalid, setting the task state to abnormal, placing the task in the task pool, and waiting for the capture to be retried;
If the data content is valid, performing a completeness check on it; if the data is incomplete, setting the task state to "to be captured" and waiting for capture of the task data to continue;
If the data content is complete, pushing the data into a message queue and setting the task state to completed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910773517.7A CN110569414A (en) | 2019-08-21 | 2019-08-21 | Puppeteer-based website data collection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910773517.7A CN110569414A (en) | 2019-08-21 | 2019-08-21 | Puppeteer-based website data collection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110569414A (en) | 2019-12-13 |
Family
ID=68774103
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910773517.7A (Pending) CN110569414A (en) | 2019-08-21 | 2019-08-21 | Puppeteer-based website data collection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110569414A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1980243A (en) * | 2005-10-28 | 2007-06-13 | 埃森哲全球服务有限公司 | Service broker integration layer for supporting telecommunication client service requests |
CN103856467A (en) * | 2012-12-06 | 2014-06-11 | 百度在线网络技术(北京)有限公司 | Method and distributed system for achieving safety scanning |
CN109471979A (en) * | 2018-12-20 | 2019-03-15 | 北京奇安信科技有限公司 | A kind of method, system, equipment and medium grabbing dynamic page |
CN109815384A (en) * | 2019-01-29 | 2019-05-28 | 携程旅游信息技术(上海)有限公司 | Method, system, equipment and the storage medium that crawler is realized |
-
2019
- 2019-08-21 CN CN201910773517.7A patent/CN110569414A/en active Pending
Non-Patent Citations (1)
Title |
---|
李竹瑶: "基于Node爬虫的微博舆情采集系统分析与设计", 《中国优秀博硕士学位论文全文数据库 信息科技辑》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112948659A (en) * | 2021-03-09 | 2021-06-11 | 深圳九星互动科技有限公司 | Webpage data acquisition method, device, system and medium |
CN113065055A (en) * | 2021-04-21 | 2021-07-02 | 平安国际智慧城市科技股份有限公司 | News information capturing method and device, electronic equipment and storage medium |
CN113065055B (en) * | 2021-04-21 | 2024-04-02 | 深圳赛安特技术服务有限公司 | News information capturing method and device, electronic equipment and storage medium |
CN113934914A (en) * | 2021-12-20 | 2022-01-14 | 成都橙视传媒科技股份公司 | Method for collecting batch encrypted data of news media |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110569414A (en) | Puppeteer-based website data collection method | |
CN107317724B (en) | Data acquisition system and method based on cloud computing technology | |
CN107895009B (en) | Distributed internet data acquisition method and system | |
CN103605764B (en) | A kind of network crawler system and web crawlers multitask execution and dispatching method | |
CN101799751B (en) | Method for building monitoring agent software of host machine | |
CN109934361B (en) | Automatic operation and maintenance platform model based on container and big data | |
CN101594261B (en) | Forgery website monitoring method and system thereof | |
CN111181767A (en) | Monitoring and fault self-healing system and method for complex system | |
CN110347899B (en) | Distributed internet data acquisition system and method based on event-driven model | |
WO2019169761A1 (en) | Automated testing method and apparatus, and storage medium | |
CN101651707A (en) | Method for automatically acquiring user behavior log of network | |
CN104539053A (en) | Power dispatching automation polling robot and method based on reptile technology | |
CN109660532B (en) | Distributed agricultural network data acquisition method and acquisition system thereof | |
CN110781143A (en) | Method and device for querying and extracting server logs | |
CN106407219B (en) | Crawling method and device for webpage links | |
CN105224441B (en) | Virtual machine information acquisition device, method and virtual machine information maintaining method and system | |
CN104462158A (en) | Data grabbing method and data grabbing system | |
CN103428212A (en) | Malicious code detection and defense method | |
WO2023231704A1 (en) | Algorithm running method, apparatus and device, and storage medium | |
CN113656673A (en) | Master-slave distributed content crawling robot for advertisement delivery | |
CN102024042B (en) | Method, device and system for monitoring picture showing effect | |
CN103886033B (en) | Intelligent vertical searching device and method for safety industry chain | |
US20070101338A1 (en) | Detection, diagnosis and resolution of deadlocks and hangs | |
CN107291938B (en) | Order inquiry system and method | |
JP2003228498A (en) | History data collecting system and history data collecting program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191213 |