CN111078975B - Multi-node incremental data acquisition system and acquisition method

Info

Publication number: CN111078975B
Application number: CN201911338747.7A
Authority: CN
Inventors: 邢文涛
Original assignee: Beijing Tianyuan Innovation Technology Co ltd
Current assignee: Beijing Tianyuan Innovation Technology Co ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2023-04-28
Anticipated expiration: 2039-12-23
Also published as: CN111078975A

Abstract

Embodiments of the invention provide a multi-node incremental data acquisition system and a multi-node incremental data acquisition method. The system comprises an acquisition node, a task distribution and deduplication node, and a data storage node. In the method, at least one acquisition node receives at least one acquisition task issued by the task distribution and deduplication node, parses the address of the website to be acquired according to the acquisition task, and performs data acquisition; it then sends the acquired data to at least one data storage node for storage and feeds the acquisition state back to at least one task distribution and deduplication node for a state update. By combining Pyppeteer acquisition, Redis task distribution, and Kafka distributed storage, the embodiments acquire data from websites that require login, achieve task distribution with effective deduplication, avoid the disk I/O bottleneck caused by storing large volumes of data, improve the acquisition performance of Pyppeteer, and avoid network request bottlenecks.

Description

Multi-node incremental data acquisition system and acquisition method

Technical Field

The invention relates to the technical field of data acquisition, in particular to a multi-node incremental data acquisition system and a multi-node incremental data acquisition method.

Background

At present, most network data acquisition tools offer poor support for websites that require login before their data can be acquired, yet most websites require login to browse their data. In addition, most websites require paging through results, and their data is rendered dynamically through Ajax, which makes data acquisition more difficult.

Moreover, because network data acquisition involves huge volumes of data, it places high demands on acquisition speed. In the prior art, sending the acquired data to the corresponding storage device runs into a storage bottleneck that slows data reading and writing, and the accuracy of data transmission cannot be guaranteed.

Disclosure of Invention

Embodiments of the invention provide a multi-node incremental data acquisition system and acquisition method to solve the prior-art problem that the huge volume of acquired data causes a disk I/O read-write bottleneck, which limits the actual acquisition speed.

In a first aspect, an embodiment of the present invention provides a multi-node incremental data collection system, including:

an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:

the acquisition node is used for logging in to the website to be acquired, according to the address of that website and based on authentication parameters, and then acquiring the cookie information of the website; downloading the content of a target webpage; parsing all target addresses to be acquired in the target webpage; transmitting preset parameter values of the target addresses to the task distribution and deduplication node; receiving acquisition tasks issued by the task distribution and deduplication node; and acquiring data according to the acquisition tasks;

the task distribution and deduplication node is used for receiving the target addresses parsed by the acquisition node, maintaining an acquisition task queue, obtaining a history download queue that records all historical webpage acquisitions, determining whether to add a target address to the history download queue according to whether the history download queue already holds a record of that target address, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;

the data storage node is used for receiving the acquired data of the acquisition node.

The acquisition node is further configured to perform an operation on the data storage node according to the status result of data acquisition, and to send the status result of data acquisition to the task distribution and deduplication node.

Performing an operation on the data storage node according to the status result of data acquisition, and sending that status result to the task distribution and deduplication node, specifically includes:

if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success state of the data acquisition to the task distribution and deduplication node;

and if the data acquisition is determined to have failed, sending the failure state of the data acquisition to the task distribution and deduplication node.

Determining whether to add the target address to the history download queue according to whether a record of the target address exists in the history download queue specifically includes:

if the task distribution and deduplication node does not find the target address in the history download queue, adding the target address to the acquisition task queue;

and if the task distribution and deduplication node finds the target address in the history download queue, prohibiting the target address from being added to the acquisition task queue.

The task distribution and deduplication node is also used for waiting for the plurality of idle acquisition nodes to return acquisition results;

if the acquisition is successful, adding the webpage address corresponding to the acquisition result to the history download queue, and giving its entry in the history download queue a first mark;

if the acquisition fails, increasing by 1 the failure count held in the task distribution and deduplication node for the webpage address corresponding to the acquisition result; if the number of download failures exceeds a preset download count threshold, stopping acquisition, adding the webpage address corresponding to the acquisition result to the history download queue, and giving its entry in the history download queue a second mark.

In a second aspect, an embodiment of the present invention provides a multi-node incremental data collection method, including:

an acquisition node logs in to the website to be acquired, according to the address of that website and based on authentication parameters, and then acquires the cookie information of the website; downloads the content of a target webpage; parses all target addresses to be acquired in the target webpage; transmits preset parameter values of the target addresses to a task distribution and deduplication node; receives acquisition tasks issued by the task distribution and deduplication node; and acquires data according to the acquisition tasks;

the task distribution and deduplication node receives the target addresses parsed by the acquisition node, maintains an acquisition task queue, obtains a history download queue that records all historical webpage acquisitions, determines whether to add a target address to the history download queue according to whether the history download queue already holds a record of that target address, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;

and a data storage node receives the data acquired by the acquisition node.

After the acquisition node logs in to the website to be acquired according to the address of that website and based on authentication parameters, acquires the cookie information of the website, downloads the content of the target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks, the method further comprises:

performing an operation on the data storage node according to the data acquisition result, and sending the status result of data acquisition to the task distribution and deduplication node.

Performing the operation on the data storage node according to the data acquisition result, and sending the status result of data acquisition to the task distribution and deduplication node, specifically includes:

The multi-node incremental data acquisition system and acquisition method provided by the embodiments of the invention are based on Pyppeteer acquisition, Redis task distribution, and Kafka distributed storage. They solve the problems of acquiring data from websites that require login, of task distribution and effective deduplication, and of the disk I/O bottleneck caused by storing large volumes of data. Multi-node asynchronous acquisition improves the acquisition performance of Pyppeteer, incremental crawling avoids network request bottlenecks, and the memory space of the server is used effectively.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a block diagram of a multi-node incremental data acquisition system provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a multi-node incremental data collection method according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

To solve the series of data acquisition problems in the prior art, embodiments of the invention provide a multi-node incremental data acquisition system and acquisition method. Pyppeteer logs in to the target website and obtains the login Cookie data, which serves as the login credential for data acquisition; it completes the acquisition tasks issued by Redis and finally stores the acquired data in a Kafka message queue. Redis mainly receives the acquisition tasks transmitted by the Pyppeteer module and maintains the acquisition queues, which are divided into a task acquisition queue and a history download queue. The task acquisition queue mainly handles the acquisition tasks parsed and transmitted by the Pyppeteer module, records acquisition logs, and avoids repeated acquisition. Pyppeteer is a web automation testing framework based on Chromium. It provides a feature-rich API, is convenient to develop with, can display the login page visually so that a person can interact with the running program, and can acquire data rendered dynamically through Ajax. The Kafka data collection queue mainly stores the data produced by the Pyppeteer acquisition tasks. By using multiple Pyppeteer acquisition nodes together with Redis and Kafka clusters, the scheme effectively improves acquisition efficiency and the stability of task distribution and data storage, and achieves incremental acquisition of data through the Redis message queue.

FIG. 1 is a block diagram of a multi-node incremental data acquisition system according to an embodiment of the present invention, as shown in FIG. 1, including:

Specifically, the acquisition node logs in to the website to be acquired, according to the address of that website, by passing authentication parameters such as a user name and password, and then obtains the Cookie information of the website for later data acquisition. It downloads the content of a target webpage, parses all network addresses to be acquired in the webpage, and transmits the HASH values of those addresses (the preset parameter values) to the task distribution and deduplication node. It then receives from the task distribution node the target addresses that need to be acquired, and acquires the data.
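The parsing and hashing step can be illustrated with a minimal sketch that uses only the Python standard library. It is not the patented implementation: the login and download steps are left out, the helper names `LinkExtractor`, `url_hash`, and `extract_tasks` are hypothetical, and SHA-256 is an assumption, since the description only speaks of a "HASH value" of each address.

```python
import hashlib
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def url_hash(url: str) -> str:
    # Assumption: SHA-256 hex digest; the source does not name a hash function.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def extract_tasks(base_url: str, html: str) -> Dict[str, str]:
    """Return {hash: url} for every link found in the downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return {url_hash(u): u for u in parser.links}


page = (
    '<a href="/list?page=2">next</a>'
    '<a href="/item/7">item</a>'
    '<a href="/item/7">same item again</a>'
)
tasks = extract_tasks("https://example.com/list", page)
for key, url in tasks.items():
    print(key[:12], url)
```

Keying the result by hash means a link that appears twice on the same page is submitted only once, before the task distribution and deduplication node performs its own check.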

The task distribution and deduplication node receives the target webpage addresses parsed by the acquisition node, maintains an acquisition task queue {HASH, webpage address, whether the task has been issued, failure count}, and records a history download queue {HASH, webpage address, whether acquisition succeeded} covering all historical webpage acquisitions, where HASH is the hash value generated from the webpage address. When the task distribution node receives task data sent by an acquisition node, it checks the history download queue for an existing historical task, so that a target address is not added to the download queue repeatedly; after this check, the tasks in the acquisition task queue are sent to idle acquisition nodes for execution.

The data storage node receives the data acquired by the acquisition node and stores it in the corresponding topic.
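The two queue layouts can be sketched as plain Python records held in dictionaries keyed by HASH. This is an in-memory illustration under stated assumptions, not the Redis data model of the embodiment: the class and function names are hypothetical, SHA-256 stands in for the unspecified hash, and skipping addresses that are still pending in the task queue is an added assumption (the description only mentions checking the history download queue).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass
class TaskRecord:
    """One entry of the acquisition task queue."""
    url_hash: str
    url: str
    issued: bool = False   # whether the task has been issued to a node
    failures: int = 0      # number of failed attempts so far


@dataclass
class HistoryRecord:
    """One entry of the history download queue."""
    url_hash: str
    url: str
    success: bool          # whether acquisition succeeded


def url_hash(url: str) -> str:
    # Assumption: SHA-256; the source only says "HASH value".
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def submit(url: str,
           task_queue: Dict[str, TaskRecord],
           history: Dict[str, HistoryRecord]) -> bool:
    """Queue `url` unless it is already known. Returns True if it was added."""
    key = url_hash(url)
    if key in history:
        return False       # already acquired (or abandoned) earlier
    if key in task_queue:
        return False       # assumption: a pending task is not queued twice
    task_queue[key] = TaskRecord(key, url)
    return True


task_queue: Dict[str, TaskRecord] = {}
history: Dict[str, HistoryRecord] = {}
done = "https://example.com/item/1"
history[url_hash(done)] = HistoryRecord(url_hash(done), done, success=True)

print(submit(done, task_queue, history))                          # already in history
print(submit("https://example.com/item/2", task_queue, history))  # new address
```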

Based on Pyppeteer acquisition, Redis task distribution, and Kafka distributed storage, the embodiment of the invention achieves an efficient and stable way of acquiring network data. It effectively solves problems such as login and asynchronous data loading during network data acquisition, makes effective use of network bandwidth, avoids acquiring the same network resources repeatedly, and resolves the disk I/O bottleneck.

Based on the above embodiment, the acquisition node is further configured to perform an operation on the data storage node according to the status result of data acquisition, and to send the status result of data acquisition to the task distribution and deduplication node.

Specifically, when the acquisition node acquires data, if the acquisition succeeds, it sends the acquisition result to the data storage node and sends the success state of the task to the task distribution node; if the acquisition fails, it sends the failure state of the task to the task distribution node.
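The success and failure branches can be sketched as a single function that takes the download, storage, and reporting steps as callables. The function name `run_task` and the stub `fake_fetch` are hypothetical, and treating an exception or a `None` result as a failed acquisition is an assumption, since the description does not define how failure is detected.

```python
from typing import Callable, List, Optional, Tuple


def run_task(
    url: str,
    fetch: Callable[[str], Optional[str]],
    store: Callable[[str], None],
    report: Callable[[str, bool], None],
) -> bool:
    """Run one acquisition task and report its outcome."""
    try:
        data = fetch(url)
    except Exception:
        data = None            # assumption: any exception counts as a failure
    if data is None:
        report(url, False)     # failure: nothing is sent to the storage node
        return False
    store(data)                # success: hand the data to the storage node
    report(url, True)          # then report the success state
    return True


stored: List[str] = []
reports: List[Tuple[str, bool]] = []


def fake_fetch(url: str) -> Optional[str]:
    # Hypothetical downloader: only addresses ending in "/ok" succeed.
    return "<html>ok</html>" if url.endswith("/ok") else None


def record_report(url: str, success: bool) -> None:
    reports.append((url, success))


run_task("https://example.com/ok", fake_fetch, stored.append, record_report)
run_task("https://example.com/bad", fake_fetch, stored.append, record_report)
print(stored)
print(reports)
```

Passing the three steps in as callables keeps the branch logic independent of whichever storage and task-distribution clients a real deployment would use.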

Based on any one of the above embodiments, the task distribution and deduplication node is further configured to wait for the plurality of idle acquisition nodes to return an acquisition result;

Specifically, when the task distribution node receives task data sent by an acquisition node, it checks the history download queue for an existing record of that task. If no record exists, the webpage address is added to the acquisition task queue; if the webpage address already exists, adding it to the acquisition task queue is prohibited. The task distribution node distributes the tasks in the acquisition task queue to the idle acquisition nodes (setting the "task issued" attribute of each such task in the task queue to true) and waits for the acquisition nodes to return their results. If acquisition succeeds, the webpage address is added to the Redis history download queue and the "acquisition succeeded" field of its entry is set to true, which is the first mark. If the download fails, the failure count of the webpage in the Redis value is increased by 1. A default download count threshold n is set; if the number of download failures exceeds n, the download task for the webpage is abandoned, the webpage is recorded in the history download queue, and the "acquisition succeeded" field of its entry is set to false, which is the second mark.
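The task life cycle described here (deduplicate, issue, then either record success or count failures up to the threshold n) can be sketched with an in-memory class. This is an illustration only: the embodiment keeps this state in Redis, whereas the hypothetical `TaskDistributor` below uses plain dictionaries keyed by webpage address, and making a failed task eligible for re-issue is an assumption about how retries are scheduled.

```python
from typing import Dict, Optional


class TaskDistributor:
    """In-memory stand-in for the task distribution and deduplication node."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures      # the threshold n
        self.tasks: Dict[str, dict] = {}      # url -> {"issued", "failures"}
        self.history: Dict[str, bool] = {}    # url -> acquisition succeeded?

    def submit(self, url: str) -> bool:
        """Queue a new address; refuse addresses that are already known."""
        if url in self.history or url in self.tasks:
            return False
        self.tasks[url] = {"issued": False, "failures": 0}
        return True

    def next_task(self) -> Optional[str]:
        """Issue the first task that has not been issued yet."""
        for url, task in self.tasks.items():
            if not task["issued"]:
                task["issued"] = True
                return url
        return None

    def report(self, url: str, success: bool) -> None:
        """Apply the result returned by an acquisition node."""
        task = self.tasks.get(url)
        if task is None:
            return
        if success:
            del self.tasks[url]
            self.history[url] = True          # first mark
            return
        task["failures"] += 1
        if task["failures"] > self.max_failures:
            del self.tasks[url]
            self.history[url] = False         # second mark: abandoned
        else:
            task["issued"] = False            # assumption: eligible for retry


node = TaskDistributor(max_failures=1)
node.submit("https://example.com/a")
node.submit("https://example.com/b")
node.report(node.next_task(), True)           # "a" succeeds
url_b = node.next_task()
node.report(url_b, False)                     # first failure: retried
node.report(node.next_task(), False)          # second failure exceeds n = 1
print(node.history)
```

Because abandoned addresses are also written to the history with the second mark, a later `submit` of the same address is refused, which is what makes the crawl incremental across runs.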

FIG. 2 is a flowchart of a multi-node incremental data collection method according to an embodiment of the present invention, as shown in FIG. 2, including:

S1, the acquisition node logs in to the website to be acquired, according to the address of that website and based on authentication parameters, and then acquires the cookie information of the website; downloads the content of a target webpage; parses all target addresses to be acquired in the target webpage; transmits preset parameter values of the target addresses to the task distribution and deduplication node; receives acquisition tasks issued by the task distribution and deduplication node; and acquires data according to the acquisition tasks;

S2, the task distribution and deduplication node receives the target addresses parsed by the acquisition node, maintains an acquisition task queue, obtains a history download queue that records all historical webpage acquisitions, determines whether to add a target address to the history download queue according to whether the history download queue already holds a record of that target address, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;

and S3, the data storage node receives the data acquired by the acquisition node.

Specifically, in step S1, at least one acquisition node logs in to the website to be acquired, according to the address of that website, by passing authentication parameters such as a user name and password, and then obtains the Cookie information of the website for later data acquisition. It downloads the content of a target webpage, parses all network addresses to be acquired in the webpage, and transmits the HASH values of those addresses (the preset parameter values) to the task distribution and deduplication node. It then receives from the task distribution node the target addresses that need to be acquired, and acquires the data. The acquisition nodes use Pyppeteer as their acquisition tool: they collect the addresses of the target website that need to be acquired, and a plurality of nodes execute the acquisition tasks asynchronously and push the collected addresses to the Redis cluster node. Redis determines whether each collected address needs to be added to the message queue. This effectively avoids repeated access to the same webpage address, ensures that the data acquired by the acquisition nodes is not duplicated, and keeps the data stored on the Kafka cluster nodes useful.

In step S2, after the plurality of acquisition nodes acquire data, the data is stored in the data storage node and the acquisition status is fed back to the task distribution node to update the status of the acquisition task. At least one task distribution and deduplication node receives the target webpage addresses parsed by the acquisition nodes, maintains an acquisition task queue {HASH, webpage address, whether the task has been issued, failure count}, and records a history download queue {HASH, webpage address, whether acquisition succeeded}, where HASH is the hash value generated from the webpage address. When the task distribution node receives task data sent by an acquisition node, it checks the history download queue for an existing historical task, so that a target address is not added to the download queue repeatedly; after this check, the tasks in the acquisition task queue are sent to idle acquisition nodes for execution. Here, the task distribution and deduplication nodes can be built as a Redis cluster environment with master-slave backup, which improves the timeliness and stability with which the task distribution and deduplication node receives data and distributes tasks. Redis is an open-source, network-capable, log-structured key-value database written in ANSI C that can run in memory and can also be persistent, and it provides APIs for multiple languages.

In step S3, at least one data storage node receives the data acquired by the plurality of acquisition nodes. The data storage nodes can be built as a Kafka cluster environment to achieve high throughput and stable data storage. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of a consumer-scale website. Such actions (web browsing, searching, and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, this data is usually handled through log processing and log aggregation. That is a viable solution for log data and offline analysis systems such as Hadoop, but it is limiting where real-time processing is required. Kafka aims to unify online and offline message processing through the parallel loading mechanism of Hadoop, and to provide real-time messages through a cluster.
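The publish-subscribe behaviour relied on in step S3 can be illustrated with a small in-memory stand-in. It is not Kafka and does not use any Kafka client API: the hypothetical `TopicLog` class only shows the idea of producers appending to a named topic while each consumer group reads from its own offset. The topic name `pages` and the group names are also assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class TopicLog:
    """Append-only log per topic, with one read offset per consumer group."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[str]] = defaultdict(list)
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, message: str) -> int:
        """Append a message and return its offset within the topic."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, group: str, topic: str, max_records: int = 10) -> List[str]:
        """Return the next unread messages for this consumer group."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_records]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


broker = TopicLog()
# Acquisition nodes act as producers and append acquired pages to a topic.
broker.publish("pages", '{"url": "https://example.com/item/1"}')
broker.publish("pages", '{"url": "https://example.com/item/2"}')

# Two independent consumer groups each see the full stream once.
print(len(broker.poll("indexer", "pages")))
print(len(broker.poll("indexer", "pages")))
print(len(broker.poll("archiver", "pages")))
```

Writing to an append-only log and letting consumers read at their own pace is what decouples the acquisition nodes from downstream storage speed in this kind of design.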

By performing data acquisition, task distribution and deduplication, and data storage across multiple nodes, the embodiment of the invention achieves an efficient and stable way of acquiring network data, and effectively solves problems such as login and asynchronous data loading during network data acquisition.
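Multi-node asynchronous acquisition can be sketched with `asyncio`, where each coroutine plays the role of one acquisition node pulling from a shared task queue. This is a single-process illustration under stated assumptions: the function names are hypothetical, `fetch` is a stub that sleeps instead of driving a browser, and in the described system the shared queue would live in Redis rather than in memory.

```python
import asyncio
from typing import Dict, List


async def fetch(url: str) -> str:
    # Hypothetical stand-in for a real page download; the sleep yields
    # control to other nodes the way waiting on network I/O would.
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def acquisition_node(tasks: asyncio.Queue, results: Dict[str, str]) -> None:
    """One node: take a task, acquire it, repeat until cancelled."""
    while True:
        url = await tasks.get()
        try:
            results[url] = await fetch(url)
        finally:
            tasks.task_done()


async def crawl(urls: List[str], node_count: int = 3) -> Dict[str, str]:
    tasks: asyncio.Queue = asyncio.Queue()
    for url in dict.fromkeys(urls):       # drop duplicates, keep order
        tasks.put_nowait(url)
    results: Dict[str, str] = {}
    nodes = [
        asyncio.create_task(acquisition_node(tasks, results))
        for _ in range(node_count)
    ]
    await tasks.join()                    # wait until every task is done
    for node in nodes:
        node.cancel()                     # idle nodes are no longer needed
    await asyncio.gather(*nodes, return_exceptions=True)
    return results


results = asyncio.run(crawl([
    "https://example.com/1",
    "https://example.com/2",
    "https://example.com/1",
]))
print(sorted(results))
```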

Based on any of the above embodiments, the method further includes, after step S1:

By using Pyppeteer acquisition, Redis task distribution, and Kafka distributed storage, the embodiment of the invention solves the problems of acquiring data from the many websites that require login, acquiring data rendered dynamically through Ajax, task distribution and effective deduplication, and the disk I/O bottleneck caused by writing large volumes of data to storage.

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-node incremental data acquisition system, comprising: an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:

2. The multi-node incremental data collection system of claim 1 wherein the collection node is further configured to perform operations on the data storage node based on status results of data collection and to send status results of the data collection to the task distribution and deduplication node.

3. The multi-node incremental data collection system of claim 2 wherein the collection node is further configured to perform operations on the data storage node based on status results of data collection and to send status results of data collection to the task distribution and deduplication node, and in particular comprising:

4. The multi-node incremental data collection system of claim 1 wherein the determining whether to add the target address to the history download queue based on whether a record of the target address exists in the history download queue comprises:

5. The multi-node incremental data collection system of claim 1 wherein the task distribution and deduplication node is further configured to wait for the plurality of idle collection nodes to return a collection result;

6. A multi-node incremental data acquisition method, comprising:

and receiving the acquired data of the acquisition node by a data storage node.

7. The multi-node incremental data collection method of claim 6, wherein after the collection node logs in to the website to be collected based on authentication parameters and obtains the cookie information of the website, downloads the content of the target webpage, parses all target addresses to be collected in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives collection tasks issued by the task distribution and deduplication node, and performs data collection according to the collection tasks, the method further comprises:

8. The multi-node incremental data collection method of claim 7 wherein the performing the operation on the data storage node based on the results of the data collection and sending the status results of the data collection to the task distribution and deduplication node comprises: