CN111078975A

CN111078975A - Multi-node incremental data acquisition system and acquisition method

Info

Publication number: CN111078975A
Application number: CN201911338747.7A
Authority: CN
Inventors: 邢文涛
Original assignee: Beijing Tianyuan Innovation Technology Co ltd
Current assignee: Beijing Tianyuan Innovation Technology Co ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-04-28
Anticipated expiration: 2039-12-23
Also published as: CN111078975B

Abstract

The embodiment of the invention provides a multi-node incremental data acquisition system and an acquisition method. The system comprises an acquisition node, a task distribution and deduplication node, and a data storage node. In the method, at least one acquisition node receives an acquisition task issued by at least one task distribution and deduplication node, parses the address of the website to be acquired according to the acquisition task, and acquires data; the acquired data is sent to at least one data storage node for storage, and the acquisition state is fed back to at least one task distribution and deduplication node for a state update. By using Pyppeteer acquisition, Redis task distribution and Kafka distributed storage technology, the embodiment of the invention solves the disk IO bottleneck caused by storing a large amount of data, realizes task distribution and effective deduplication, improves the acquisition performance of Pyppeteer, and avoids the bottleneck of network requests.

Description

Multi-node incremental data acquisition system and acquisition method

Technical Field

The invention relates to the technical field of data acquisition, in particular to a multi-node incremental data acquisition system and an acquisition method.

Background

At present, most network data acquisition tools do not support websites that require a login before data can be acquired, although most websites only allow data to be browsed after logging in. In addition, most websites that require paging to browse data render their pages dynamically through Ajax, which increases the difficulty of data acquisition.

In addition, because the volume of network data to be acquired is huge, the requirement on acquisition speed is high. In the prior art the acquired data is usually sent directly to the corresponding storage device, so the storage bottleneck limits the data read/write speed, and the accuracy of data transmission cannot be ensured.

Disclosure of Invention

The embodiment of the invention provides a multi-node incremental data acquisition system and an acquisition method, which are used for solving the disk IO read/write bottleneck that, in the prior art, limits the actual acquisition speed when the amount of acquired data is huge.

In a first aspect, an embodiment of the present invention provides a multi-node incremental data acquisition system, including:

an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:

the acquisition node is used for acquiring cookie information of the website to be acquired after logging in the website to be acquired based on authentication parameters according to the website address of the website to be acquired, downloading the content of a target webpage, analyzing all target addresses needing to be acquired in the target webpage, transmitting preset parameter values of the target addresses to the task distribution and deduplication node, receiving acquisition tasks issued by the task distribution and deduplication node, and acquiring data according to the acquisition tasks;

the task distribution and deduplication node is used for receiving the target address analyzed by the acquisition node, maintaining an acquisition task queue, acquiring a historical download queue of all historical webpage acquisition records, judging whether the target address is added to the historical download queue according to whether the target address record exists in the historical download queue, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;

the data storage node is used for receiving the collected data of the collecting node.

The collection node is further used for executing the operation on the data storage node according to the state result of data collection, and sending the state result of data collection to the task distribution and deduplication node.

The acquisition node is further configured to execute an operation on the data storage node according to a state result of data acquisition, and send the state result of data acquisition to the task distribution and deduplication node, and specifically includes:

if the data acquisition is judged to be successful, the acquired data is sent to the data storage node, and the successful state of the data acquisition is sent to the task distribution and duplicate removal node;

and if the data acquisition is judged to have failed, sending the failure state of the data acquisition to the task distribution and deduplication node.

Wherein, the determining whether to add the target address to the historical download queue according to whether the record of the target address exists in the historical download queue specifically includes:

if the task distribution and deduplication node does not find the target address in the historical download queue, adding the target address to the acquisition task queue;

and if the task distribution and deduplication node finds the target address in the historical download queue, forbidding the target address from being added to the acquisition task queue.
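
The deduplication rule above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the implementation of the embodiment: the names `submit_target`, `history` and `task_queue` are hypothetical, and a plain `set` and `deque` stand in for the historical download queue and the acquisition task queue.

```python
from collections import deque


def submit_target(address, history, task_queue):
    """Queue `address` only if it has no record in the history.

    Returns True when the address was added to the task queue.
    """
    if address in history:
        # A record exists: adding the address again is forbidden.
        return False
    task_queue.append(address)
    return True


# Hypothetical state: one address was already acquired earlier.
history = {"https://example.com/a"}
task_queue = deque()

added_new = submit_target("https://example.com/b", history, task_queue)
added_old = submit_target("https://example.com/a", history, task_queue)
```

The text only describes a check against the historical download queue. A real deployment would probably also have to handle an address that is submitted twice before its first acquisition finishes; that case is outside this sketch.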

The task distribution and deduplication node is further used for waiting for the plurality of idle acquisition nodes to return acquisition results;

if the acquisition is successful, adding the webpage address corresponding to the acquisition result to the historical download queue, and giving a first mark to the historical download queue;

and if the acquisition fails, incrementing by 1 the failure count, held in the task distribution and deduplication node, of the webpage address corresponding to the acquisition result; if the download failure count exceeds a preset download count threshold, stopping the acquisition, adding the webpage address corresponding to the acquisition result to the historical download queue, and giving a second mark to the historical download queue.

In a second aspect, an embodiment of the present invention provides a multi-node incremental data acquisition method, including:

an acquisition node logs in to the website to be acquired according to the website address of the website to be acquired and based on authentication parameters, obtains cookie information of the website to be acquired, downloads the content of a target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to a task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and performs data acquisition according to the acquisition tasks;

the task distribution and deduplication node receives the target address parsed by the acquisition node, maintains an acquisition task queue, acquires a historical download queue of all historical webpage acquisition records, judges whether the target address is added to the historical download queue according to whether a record of the target address exists in the historical download queue, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;

and a data storage node receives the acquired data of the acquisition node.

Wherein, after the acquisition node logs in to the website to be acquired according to the website address of the website to be acquired and based on authentication parameters, obtains cookie information of the website to be acquired, downloads the content of a target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and performs data acquisition according to the acquisition tasks, the method further comprises:

and executing the operation on the data storage node according to the result of data acquisition, and sending the state result of the data acquisition to the task distribution and deduplication node.

The executing the operation on the data storage node according to the result of data acquisition, and sending the state result of data acquisition to the task distribution and deduplication node specifically include:

According to the multi-node incremental data acquisition system and the acquisition method provided by the embodiment of the invention, Pyppeteer acquisition, Redis task distribution and Kafka distributed storage technology are used to realize task distribution and effective deduplication and to solve the disk IO bottleneck caused by storing a large amount of data. Multi-node asynchronous acquisition improves the acquisition performance of Pyppeteer, incremental crawling avoids the bottleneck of network requests, and the memory space of the server is used effectively.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a block diagram of a multi-node incremental data acquisition system according to an embodiment of the present invention;

FIG. 2 is a flowchart of a multi-node incremental data acquisition method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to solve a series of problems occurring in data acquisition in the prior art, embodiments of the present invention provide a multi-node incremental data acquisition system and an acquisition method, where the acquisition system is divided into a Pyppeteer acquisition module, a Redis task distribution and deduplication module, and a Kafka data collection module. Pyppeteer logs in to the target website and obtains the login Cookie data, which provides the login credential for data acquisition; it completes the acquisition tasks issued by Redis and finally stores the acquired data in a Kafka message queue. The main task of Redis is to collect the acquisition tasks transmitted by the Pyppeteer module and to maintain the acquisition queues, which are mainly divided into a task collection queue and a historical download queue. The task collection queue mainly collects the acquisition tasks parsed and transmitted by the Pyppeteer module and records acquisition logs, so that repeated acquisition is avoided. Pyppeteer provides a feature-rich API (application program interface), is convenient for development, can display the login page visually so that a person can interact with the running program, and can acquire data rendered dynamically by Ajax. The Kafka data collection queue is mainly used for storing the data generated by Pyppeteer acquisition tasks. By using a plurality of Pyppeteer acquisition nodes, a Redis cluster and a Kafka cluster, the scheme can effectively improve acquisition efficiency and the stability of task distribution and data storage, and incremental acquisition of data is achieved through the Redis message queue.

Fig. 1 is a structural diagram of a multi-node incremental data acquisition system according to an embodiment of the present invention, as shown in fig. 1, including:

Specifically, the acquisition node is configured to log in to the website to be acquired at its website address by using authentication parameters, for example by submitting a user name and a password, and then to obtain the Cookie information of the website as the credential for subsequent data acquisition; to download the content of the target webpage, parse all network addresses to be acquired in the webpage, and transmit the HASH values of the addresses, namely the preset parameter values, to the task distribution and deduplication node; and to receive the target addresses to be acquired from the task distribution node and acquire the data.
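
The parsing and hashing step described above can be illustrated with the standard library alone. This is a sketch under assumptions: the embodiment does not name a hash function, so SHA-256 is used here as a placeholder, and `LinkParser` and `address_hash` are hypothetical names. Login and page download through Pyppeteer are left out; the HTML is a hard-coded sample.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def address_hash(url):
    """Hex digest sent as the preset parameter value (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Sample page content standing in for a downloaded target webpage.
html = '<a href="/news/1">one</a> <a href="https://example.com/news/2">two</a>'
parser = LinkParser("https://example.com/index")
parser.feed(html)

# Map each parsed address to the HASH value that would be transmitted.
hashes = {url: address_hash(url) for url in parser.links}
```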

The task distribution and deduplication node is used for receiving the target webpage addresses parsed by the acquisition node, and for maintaining an acquisition task queue { HASH, webpage address, whether the task is issued, failure count } and a historical download queue { HASH, webpage address, whether acquisition succeeded } that records all historical webpage acquisitions, where HASH is a HASH value generated from the webpage address. When the task distribution node receives task data sent by the acquisition node, it checks the historical download queue for an existing record of the task, so that a target address is not added to the download queue repeatedly; after this check, the tasks in the acquisition task queue are sent to idle acquisition nodes for execution.
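
The two record layouts named above map naturally onto small data classes. The following is only an illustrative sketch: the field names are translations chosen for readability, `make_task` is a hypothetical helper, and SHA-256 is an assumed hash function, since the text does not specify one.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Entry of the acquisition task queue."""
    url_hash: str          # HASH value generated from the webpage address
    address: str           # webpage address
    issued: bool = False   # whether the task has been issued to a node
    failures: int = 0      # failure count so far


@dataclass
class HistoryRecord:
    """Entry of the historical download queue."""
    url_hash: str
    address: str
    success: bool          # whether acquisition succeeded


def make_task(address):
    """Build a fresh task record for `address` (hash function is an assumption)."""
    digest = hashlib.sha256(address.encode("utf-8")).hexdigest()
    return TaskRecord(url_hash=digest, address=address)


task = make_task("https://example.com/page")
done = HistoryRecord(url_hash=task.url_hash, address=task.address, success=True)
```

Keying both queues by the HASH value keeps the history lookup independent of the address length.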

And the data storage node receives the acquired data of the acquisition node and stores the acquired data into the corresponding topic.
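
Storing acquired data in a topic can be sketched as below. To keep the example self-contained, a hypothetical `InMemoryProducer` stands in for a real Kafka producer; the topic name `pages` and the JSON message layout are likewise assumptions, not details from the embodiment.

```python
import json


class InMemoryProducer:
    """Stand-in for a Kafka producer; keeps the messages of each topic in a dict."""

    def __init__(self):
        self.topics = {}

    def send(self, topic, value):
        self.topics.setdefault(topic, []).append(value)


def publish_page(producer, topic, address, content):
    """Serialize one acquired page as JSON bytes and send it to `topic`."""
    message = json.dumps({"address": address, "content": content}).encode("utf-8")
    producer.send(topic, value=message)
    return message


producer = InMemoryProducer()
publish_page(producer, "pages", "https://example.com/news/1", "<html>one</html>")
stored = [json.loads(m) for m in producer.topics["pages"]]
```

Because only `send(topic, value=...)` is used, a real producer object exposing the same call could be passed in instead of the stand-in.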

The embodiment of the invention is based on Pyppeteer acquisition, Redis task distribution and Kafka distributed storage technology. It realizes an efficient and stable network data acquisition mode, effectively solves problems such as login and asynchronous data loading in the network data acquisition process, makes effective use of network bandwidth, avoids repeated acquisition of network resources, and solves problems such as the disk IO bottleneck.

Based on the above embodiment, the collection node is further configured to execute an operation on the data storage node according to a status result of data collection, and send the status result of data collection to the task distribution and deduplication node.

Specifically, when the acquisition node acquires data, if the acquisition is successful, the acquisition result is sent to the data storage node, and the state of successful task acquisition is sent to the task distribution node; and if the collection fails, sending the state of the task collection failure to the task distribution node.
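
The success and failure branches of the acquisition node can be sketched as one function. All names here (`run_task`, `fake_fetch`, `store`, `report`) are hypothetical, and the callbacks stand in for the data storage node and the task distribution node.

```python
def run_task(address, fetch, store, report):
    """Acquire one address, store the data on success, and report the state."""
    try:
        data = fetch(address)
    except Exception:
        # Failure: nothing is stored, only the failed state is reported.
        report(address, False)
        return False
    store(address, data)
    report(address, True)
    return True


stored = {}   # stands in for the data storage node
states = []   # stands in for the state updates sent to the task distribution node


def fake_fetch(address):
    """Simulated download that fails for addresses ending in /bad."""
    if address.endswith("/bad"):
        raise IOError("simulated download error")
    return "<html>ok</html>"


def store(address, data):
    stored[address] = data


def report(address, ok):
    states.append((address, ok))


run_task("https://example.com/good", fake_fetch, store, report)
run_task("https://example.com/bad", fake_fetch, store, report)
```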

Based on any of the above embodiments, the task distribution and deduplication node is further configured to wait for the plurality of idle collection nodes to return collection results;

Specifically, when the task distribution node receives task data sent by the acquisition node, it checks the historical download queue for an existing record of the task. If the webpage address does not exist there, the webpage address is added to the acquisition task queue; if the webpage address exists, the webpage address is prohibited from being added to the acquisition task queue. The task distribution node distributes the tasks in the acquisition task queue to the idle acquisition nodes (marking the "whether the task is issued" attribute of each distributed task as true) and waits for the acquisition nodes to return their acquisition results. If the acquisition succeeds, the webpage address is added to the historical download queue of Redis and the "whether acquisition succeeded" field of the queue entry is marked true, namely the first mark. If the download fails, the failure count of the webpage in Redis is increased by 1; a default download count threshold n is set, and if the failure count exceeds n, the download task for the webpage is abandoned, the webpage is recorded in the historical download queue, and the "whether acquisition succeeded" field of the queue entry is marked false, namely the second mark.
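
The handling of returned results, including the threshold n, can be sketched with plain dictionaries in place of Redis. The value `MAX_FAILURES = 3` and the name `handle_result` are assumptions, and so is the choice to put a failed task back on the queue for another attempt, which the text implies but does not spell out.

```python
MAX_FAILURES = 3  # the threshold n; the value 3 is an assumption


def handle_result(address, ok, failures, history, task_queue):
    """Update the dispatcher state after a node returns a result.

    failures:   dict mapping address -> failure count
    history:    dict mapping address -> success flag
                (True is the first mark, False is the second mark)
    task_queue: list of addresses waiting to be dispatched
    """
    if ok:
        history[address] = True
        failures.pop(address, None)
        return
    failures[address] = failures.get(address, 0) + 1
    if failures[address] > MAX_FAILURES:
        # Give up: record the address in the history with the failure mark.
        history[address] = False
        failures.pop(address)
    else:
        # Assumption: a failed task is queued again for another attempt.
        task_queue.append(address)


failures, history, task_queue = {}, {}, []
for _ in range(MAX_FAILURES + 1):
    handle_result("https://example.com/flaky", False, failures, history, task_queue)
handle_result("https://example.com/fine", True, failures, history, task_queue)
```

With n = 3 the flaky address is retried three times and abandoned on the fourth failure, because the text says the task is dropped only when the count exceeds n.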

Fig. 2 is a flowchart of a multi-node incremental data acquisition method according to an embodiment of the present invention, as shown in fig. 2, including:

S1, the acquisition node logs in to the website to be acquired according to the website address of the website to be acquired and based on authentication parameters, obtains cookie information of the website to be acquired, downloads the content of a target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to a task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks;

S2, the task distribution and deduplication node receives the target address parsed by the acquisition node, maintains an acquisition task queue, acquires a historical download queue of all historical webpage acquisition records, judges whether the target address is added to the historical download queue according to whether a record of the target address exists in the historical download queue, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;

S3, the data storage node receives the acquired data of the acquisition node.

Specifically, in step S1, at least one acquisition node logs in to the website to be acquired at its website address by using authentication parameters, for example by submitting a user name and a password, and then obtains the Cookie information of the website as the credential for subsequent data acquisition; it downloads the content of the target webpage, parses all network addresses to be acquired in the webpage, and transmits the HASH values of the addresses, namely the preset parameter values, to the task distribution and deduplication node; and it receives the target addresses to be acquired from the task distribution node and acquires the data. The acquisition nodes use Pyppeteer as the acquisition tool to collect the website addresses to be acquired from the target website. Multiple nodes execute acquisition tasks asynchronously and push the collected addresses to the Redis cluster nodes, and Redis judges whether each collected address needs to be added to the message queue. This effectively prevents the same webpage address from being accessed repeatedly, ensures that the data acquired by the acquisition nodes is not duplicated, and keeps the storage on the Kafka cluster nodes effective.
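
The idea of several nodes executing acquisition tasks asynchronously can be sketched with `asyncio`. This is an illustration inside a single process, not the multi-machine deployment of the embodiment: `acquisition_node`, `main` and `fake_fetch` are hypothetical names, an `asyncio.Queue` stands in for the Redis task queue, and a short sleep stands in for the Pyppeteer page download.

```python
import asyncio


async def acquisition_node(name, tasks, results, fetch):
    """Pull addresses from the shared queue and acquire them one by one."""
    while True:
        address = await tasks.get()
        try:
            results.append((name, address, await fetch(address)))
        finally:
            tasks.task_done()


async def main(addresses, node_count=3):
    tasks = asyncio.Queue()
    for address in addresses:
        tasks.put_nowait(address)
    results = []

    async def fake_fetch(address):
        await asyncio.sleep(0.01)  # stands in for network latency
        return f"content of {address}"

    nodes = [
        asyncio.create_task(acquisition_node(f"node-{i}", tasks, results, fake_fetch))
        for i in range(node_count)
    ]
    await tasks.join()  # wait until every queued task has been processed
    for node in nodes:
        node.cancel()   # the nodes loop forever, so stop them explicitly
    await asyncio.gather(*nodes, return_exceptions=True)
    return results


addresses = [f"https://example.com/p{i}" for i in range(6)]
results = asyncio.run(main(addresses))
```

Each address is taken from the queue by exactly one node, so no page is acquired twice even though the nodes run concurrently.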

In step S2, after the acquisition nodes acquire data, the acquired data is stored in a data storage node and the acquisition state is fed back to the task distribution node to update the state of the acquisition task. At least one task distribution and deduplication node receives the target webpage addresses parsed by the acquisition node, and maintains an acquisition task queue { HASH, webpage address, whether the task is issued, failure count } and a historical download queue { HASH, webpage address, whether acquisition succeeded } that records all historical webpage acquisitions, where HASH is a HASH value generated from the webpage address. When the task distribution node receives task data sent by the acquisition node, it checks the historical download queue for an existing record of the task, so that a target address is not added to the download queue repeatedly; after this check, the tasks in the acquisition task queue are sent to idle acquisition nodes for execution. The at least one task distribution and deduplication node can be built as a Redis cluster environment with master-slave backup, which improves the timeliness and stability with which the task distribution and deduplication node receives data and distributes tasks. Here, Redis is an open-source, log-type Key-Value database written in ANSI C that supports networking, is memory-based and can be persistent, and provides APIs for multiple languages.

In step S3, at least one data storage node receives the acquired data of multiple acquisition nodes, and the multiple data storage nodes may be built as a Kafka cluster environment to achieve high throughput and storage stability for data storage. Here, Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of consumers on a website. Such actions (web browsing, searching and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, this data is usually handled by log processing and log aggregation. For log data and offline analysis systems such as Hadoop that also require real-time processing, Kafka is a feasible solution. The purpose of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages through a cluster.

The embodiment of the invention realizes an efficient and stable network data acquisition mode by performing data acquisition at multiple nodes, distributing tasks, removing duplication and storing data, and effectively solves the problems of login, asynchronous data loading and the like in the network data acquisition process.

According to any of the above embodiments, the method further includes, after step S1:

According to the embodiment of the invention, by adopting Pyppeteer acquisition, Redis task distribution and Kafka distributed storage technology, the problem of acquiring data from the many websites that require a login is solved, the acquisition of data rendered dynamically by Ajax, task distribution and effective deduplication are realized, and the disk IO bottleneck caused by storing a large amount of data is avoided.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A multi-node incremental data acquisition system, comprising: an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:

2. The multi-node incremental data collection system of claim 1, wherein the collection node is further configured to perform operations on the data storage node according to a status result of data collection, and to send the status result of data collection to the task distribution and deduplication node.

3. The multi-node incremental data collection system of claim 2, wherein the collection node is further configured to perform an operation on the data storage node according to a status result of data collection, and send the status result of data collection to the task distribution and deduplication node, specifically including:

4. The system according to claim 1, wherein the determining whether to add the target address to the historical download queue according to whether there is a record of the target address in the historical download queue comprises:

5. The multi-node incremental data collection system of claim 1, wherein the task distribution and deduplication node is further configured to wait for the plurality of idle collection nodes to return collection results;

6. A multi-node incremental data acquisition method is characterized by comprising the following steps:

7. The multi-node incremental data acquisition method according to claim 6, wherein the acquiring node logs in the website to be acquired according to the website address of the website to be acquired and based on the authentication parameter, acquires cookie information of the website to be acquired, downloads content of a target webpage, analyzes all target addresses required to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives an acquisition task issued by the task distribution and deduplication node, and performs data acquisition according to the acquisition task, and then further comprises:

8. The multi-node incremental data collection method according to claim 7, wherein the performing the operation on the data storage node according to the result of data collection and sending the status result of data collection to the task distribution and deduplication node specifically includes: