CN111078975A - Multi-node incremental data acquisition system and acquisition method - Google Patents

Multi-node incremental data acquisition system and acquisition method

Publication number
CN111078975A
Authority
CN
China
Prior art keywords
node
acquisition
data
task distribution
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911338747.7A
Other languages
Chinese (zh)
Other versions
CN111078975B (en)
Inventor
邢文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tianyuan Innovation Technology Co ltd
Original Assignee
Beijing Tianyuan Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tianyuan Innovation Technology Co ltd
Priority to CN201911338747.7A
Publication of CN111078975A
Application granted
Publication of CN111078975B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a multi-node incremental data acquisition system and an acquisition method. The system comprises a collection node, a task distribution and deduplication node, and a data storage node. The method comprises: receiving, by at least one acquisition node, an acquisition task issued by at least one task distribution and deduplication node, parsing the address of the website to be acquired according to the acquisition task, and acquiring data; and sending the acquired data to at least one data storage node for storage, and feeding the acquisition state back to at least one task distribution and deduplication node for a state update. By using Pyppeteer acquisition, Redis task distribution and Kafka distributed storage, the embodiment of the invention solves the disk IO bottleneck caused by storing a large amount of data, realizes task distribution and effective deduplication, improves the acquisition performance of Pyppeteer, and avoids the network request bottleneck.

Description

Multi-node incremental data acquisition system and acquisition method
Technical Field
The invention relates to the technical field of data acquisition, in particular to a multi-node incremental data acquisition system and an acquisition method.
Background
At present, most network data acquisition tools do not support websites that require a login before data can be acquired, yet most websites only allow data to be browsed after logging in. Moreover, most websites whose data must be browsed page by page render their pages dynamically through Ajax, which makes data acquisition more difficult.
In addition, because the volume of data acquired from the network is huge, the requirement on acquisition speed is high. In the prior art the acquired data is usually sent directly to the corresponding storage device, so the storage bottleneck limits the data read/write speed and the accuracy of data transmission cannot be guaranteed.
Disclosure of Invention
The embodiment of the invention provides a multi-node incremental data acquisition system and an acquisition method, which are used to solve the prior-art problem that the huge volume of acquired data causes a disk IO read/write bottleneck that limits the actual acquisition speed.
In a first aspect, an embodiment of the present invention provides a multi-node incremental data acquisition system, including:
a collection node, a task distribution and deduplication node, and a data storage node; wherein:
the acquisition node is used for acquiring cookie information of the website to be acquired after logging in the website to be acquired based on authentication parameters according to the website address of the website to be acquired, downloading the content of a target webpage, analyzing all target addresses needing to be acquired in the target webpage, transmitting preset parameter values of the target addresses to the task distribution and deduplication node, receiving acquisition tasks issued by the task distribution and deduplication node, and acquiring data according to the acquisition tasks;
the task distribution and deduplication node is used for receiving the target address analyzed by the acquisition node, maintaining an acquisition task queue, acquiring a historical download queue of all historical webpage acquisition records, judging whether the target address is added to the historical download queue according to whether the target address record exists in the historical download queue, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
the data storage node is used for receiving the collected data of the collecting node.
The collection node is further used for executing the operation on the data storage node according to the state result of data collection, and sending the state result of data collection to the task distribution and deduplication node.
The acquisition node is further configured to execute an operation on the data storage node according to a state result of data acquisition, and send the state result of data acquisition to the task distribution and deduplication node, and specifically includes:
if the data acquisition is judged to be successful, the acquired data is sent to the data storage node, and the success state of the data acquisition is sent to the task distribution and deduplication node;
and if the data acquisition is judged to have failed, the failure state of the data acquisition is sent to the task distribution and deduplication node.
Wherein, the determining whether to add the target address to the historical download queue according to whether the record of the target address exists in the historical download queue specifically includes:
if the task distribution and deduplication node does not acquire the target address from the historical download queue, adding the target address to the acquisition task queue;
and if the task distribution and deduplication node acquires the target address from the historical download queue, forbidding to add the target address to the acquisition task queue.
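The admission rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain in-memory containers stand in for the Redis queues, and the names `submit_target`, `history` and `task_queue` are hypothetical. Rejecting an address that is already pending is an extra assumption that the text does not state.

```python
from collections import OrderedDict


def submit_target(url: str, history: dict, task_queue: OrderedDict) -> bool:
    """Queue a target address unless the historical download queue records it."""
    if url in history:
        # A record exists in the historical download queue: adding is forbidden.
        return False
    if url in task_queue:
        # Assumption (not in the source): a pending address is not queued twice.
        return False
    task_queue[url] = {"issued": False, "failures": 0}
    return True


history = {"https://example.com/a": True}   # address already acquired earlier
task_queue = OrderedDict()
results = [
    submit_target(u, history, task_queue)
    for u in ("https://example.com/a", "https://example.com/b", "https://example.com/b")
]
```

Only the address with no historical record ends up in `task_queue`, which is what makes the acquisition incremental.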
The task distribution and deduplication node is further configured to wait for the plurality of idle acquisition nodes to return acquisition results;
if the acquisition is successful, the webpage address corresponding to the acquisition result is added to the historical download queue and given a first mark in the historical download queue;
and if the acquisition fails, the failure count of the webpage address corresponding to the acquisition result in the task distribution and deduplication node is increased by 1; if the download failure count exceeds a preset download count threshold, the acquisition is stopped, and the webpage address corresponding to the acquisition result is added to the historical download queue and given a second mark in the historical download queue.
In a second aspect, an embodiment of the present invention provides a multi-node incremental data acquisition method, including:
an acquisition node logs in to the website to be acquired, according to the address of that website and based on authentication parameters, and then acquires the cookie information of the website, downloads the content of a target webpage, analyzes all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to a task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks;
the task distribution and deduplication node receives the target address analyzed by the acquisition node, maintains an acquisition task queue and a historical download queue that records all historical webpage acquisitions, judges whether the target address is added to the historical download queue according to whether a record of the target address exists in the historical download queue, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
and a data storage node receives the acquired data of the acquisition node.
Wherein, after the acquisition node logs in to the website to be acquired, according to the address of that website and based on authentication parameters, acquires the cookie information of the website, downloads the content of a target webpage, analyzes all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks, the method further comprises:
and executing the operation on the data storage node according to the result of data acquisition, and sending the state result of the data acquisition to the task distribution and deduplication node.
The executing the operation on the data storage node according to the result of data acquisition, and sending the state result of data acquisition to the task distribution and deduplication node specifically include:
if the data acquisition is judged to be successful, the acquired data is sent to the data storage node, and the success state of the data acquisition is sent to the task distribution and deduplication node;
and if the data acquisition is judged to have failed, the failure state of the data acquisition is sent to the task distribution and deduplication node.
According to the multi-node incremental data acquisition system and acquisition method provided by the embodiment of the invention, Pyppeteer acquisition, Redis task distribution and Kafka distributed storage are used to realize task distribution and effective deduplication and to solve the disk IO bottleneck caused by storing a large amount of data; multi-node asynchronous acquisition improves the acquisition performance of Pyppeteer; and incremental crawling avoids the network request bottleneck and makes effective use of the server's memory space.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a block diagram of a multi-node incremental data acquisition system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a multi-node incremental data acquisition method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the series of problems that occur in data acquisition in the prior art, embodiments of the present invention provide a multi-node incremental data acquisition system and an acquisition method, where the acquisition system is divided into a Pyppeteer acquisition module, a Redis task distribution and deduplication module, and a Kafka data collection module. Pyppeteer logs in to the target website and obtains the login Cookie data, which provides the login credential for data acquisition; it completes the acquisition tasks issued by Redis and finally stores the acquired data in a Kafka message queue. The main task of Redis is to collect the acquisition tasks transmitted by the Pyppeteer module and to maintain the acquisition queues, which are divided into a task collection queue and a historical download queue. The task collection queue collects the acquisition tasks parsed and transmitted by the Pyppeteer module, and the acquisition log is recorded so that repeated acquisition is avoided. Pyppeteer provides a feature-rich API, is convenient to develop with, can display the login page visually so that a person can interact with the running program, and can obtain data rendered dynamically through Ajax. The Kafka data collection queue is mainly used for storing the data generated by Pyppeteer acquisition tasks. By using a plurality of Pyppeteer acquisition nodes, a Redis cluster and a Kafka cluster, the scheme can effectively improve acquisition efficiency and the stability of task distribution and data storage, and incremental acquisition of data is achieved through the Redis message queue.
Fig. 1 is a structural diagram of a multi-node incremental data acquisition system according to an embodiment of the present invention, as shown in fig. 1, including:
a collection node, a task distribution and deduplication node, and a data storage node; wherein:
the acquisition node is used for acquiring cookie information of the website to be acquired after logging in the website to be acquired based on authentication parameters according to the website address of the website to be acquired, downloading the content of a target webpage, analyzing all target addresses needing to be acquired in the target webpage, transmitting preset parameter values of the target addresses to the task distribution and deduplication node, receiving acquisition tasks issued by the task distribution and deduplication node, and acquiring data according to the acquisition tasks;
the task distribution and deduplication node is used for receiving the target address analyzed by the acquisition node, maintaining an acquisition task queue, acquiring a historical download queue of all historical webpage acquisition records, judging whether the target address is added to the historical download queue according to whether the target address record exists in the historical download queue, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
the data storage node is used for receiving the collected data of the collecting node.
Specifically, the acquisition node logs in to the website to be acquired, according to its address, using authentication parameters such as a user name and a password, and then obtains the website's Cookie information as the credential for subsequent data acquisition. It downloads the content of the target webpage, parses all network addresses in the page that need to be acquired, and transmits the HASH values of those addresses (the preset parameter values) to the task distribution and deduplication node. It then receives from the task distribution node the target addresses to be acquired and acquires the data.
The task distribution and deduplication node receives the target webpage addresses parsed by the acquisition node and maintains two queues: an acquisition task queue {HASH, webpage address, whether the task has been issued, failure count} and a historical download queue {HASH, webpage address, whether acquisition succeeded} that records all historical webpage acquisitions, where HASH is a hash value generated from the webpage address. When the task distribution node receives task data sent by an acquisition node, it checks the historical download queue for an existing record of the task, which prevents a target address from being added to the download queue repeatedly; after this check, the tasks in the acquisition task queue are sent to idle acquisition nodes for execution.
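The two record layouts named above can be written down as small Python data classes. This is a sketch under assumptions: the text does not say which hash function produces HASH, so SHA-256 is used here purely for illustration, and the names `url_hash`, `TaskRecord`, `HistoryRecord` and `new_task` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def url_hash(url: str) -> str:
    """Derive the HASH field from a webpage address (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


@dataclass
class TaskRecord:
    """One entry of the acquisition task queue."""
    hash: str
    url: str
    issued: bool = False   # whether the task has been issued to a node
    failures: int = 0      # failure count


@dataclass
class HistoryRecord:
    """One entry of the historical download queue."""
    hash: str
    url: str
    success: bool          # whether acquisition succeeded


def new_task(url: str) -> TaskRecord:
    return TaskRecord(hash=url_hash(url), url=url)


task = new_task("https://example.com/list?page=2")
done = HistoryRecord(hash=task.hash, url=task.url, success=True)
```

Because the hash is computed from the address alone, the same address always maps to the same key, so a lookup by HASH is enough to detect a repeat.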
And the data storage node receives the acquired data of the acquisition node and stores the acquired data into the corresponding topic.
The embodiment of the invention is based on Pyppeteer acquisition, Redis task distribution and Kafka distributed storage. It realizes an efficient and stable way of acquiring network data, effectively solves problems such as login and asynchronous data loading in the network data acquisition process, makes effective use of network bandwidth, avoids repeated acquisition of network resources, and solves problems such as the disk IO bottleneck.
Based on the above embodiment, the collection node is further configured to execute an operation on the data storage node according to a status result of data collection, and send the status result of data collection to the task distribution and deduplication node.
The acquisition node is further configured to execute an operation on the data storage node according to a state result of data acquisition, and send the state result of data acquisition to the task distribution and deduplication node, and specifically includes:
if the data acquisition is judged to be successful, the acquired data is sent to the data storage node, and the success state of the data acquisition is sent to the task distribution and deduplication node;
and if the data acquisition is judged to have failed, the failure state of the data acquisition is sent to the task distribution and deduplication node.
Specifically, when the acquisition node acquires data, if the acquisition is successful, the acquisition result is sent to the data storage node, and the state of successful task acquisition is sent to the task distribution node; and if the collection fails, sending the state of the task collection failure to the task distribution node.
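The reporting behaviour of the acquisition node can be sketched as follows. The callables `fetch`, `store` and `report` are hypothetical stand-ins for the page download, the data storage node and the task distribution node; treating any raised exception as an acquisition failure is an assumption made for the sketch.

```python
from typing import Callable


def run_task(url: str,
             fetch: Callable[[str], str],
             store: Callable[[str, str], None],
             report: Callable[[str, bool], None]) -> bool:
    """Acquire one address, store the data on success, and always report the state."""
    try:
        data = fetch(url)
    except Exception:
        # Failure: nothing goes to storage, only the failure state is reported.
        report(url, False)
        return False
    store(url, data)     # success: the data goes to the data storage node
    report(url, True)    # and the success state goes to the task distribution node
    return True


# Hypothetical in-memory stand-ins for the other two nodes.
stored = {}
statuses = []


def fake_fetch(url: str) -> str:
    if url.endswith("/bad"):
        raise IOError("simulated download failure")
    return "<html>ok</html>"


for target in ("https://example.com/ok", "https://example.com/bad"):
    run_task(target, fake_fetch, stored.__setitem__,
             lambda u, ok: statuses.append((u, ok)))
```

The failed address never reaches `stored`, but both outcomes appear in `statuses`, so the task distribution node can update its queues either way.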
Based on any of the above embodiments, the task distribution and deduplication node is further configured to wait for the plurality of idle collection nodes to return collection results;
if the acquisition is successful, the webpage address corresponding to the acquisition result is added to the historical download queue and given a first mark in the historical download queue;
and if the acquisition fails, the failure count of the webpage address corresponding to the acquisition result in the task distribution and deduplication node is increased by 1; if the download failure count exceeds a preset download count threshold, the acquisition is stopped, and the webpage address corresponding to the acquisition result is added to the historical download queue and given a second mark in the historical download queue.
Specifically, when the task distribution node receives task data sent by an acquisition node, it checks the historical download queue for an existing record of the task. If the webpage address does not exist there, it is added to the acquisition task queue; if it does exist, adding it to the acquisition task queue is prohibited. The task distribution node distributes the tasks in the acquisition task queue to each idle acquisition node (setting the "task issued" attribute of the task in the task queue to true) and waits for the acquisition node to return the acquisition result. If the acquisition is successful, the webpage address is added to the historical download queue in Redis and its "acquisition successful" field is set to true, which is the first mark. If the download fails, the failure count of the webpage in Redis is increased by 1. A default threshold n is set for the number of downloads; if the number of download failures exceeds n, the download task for the webpage is abandoned, the webpage is kept in the historical download queue, and its "acquisition successful" field is set to false, which is the second mark.
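The dispatch and result-handling rules in this paragraph can be modeled in memory as below. This is a sketch, not the Redis implementation: the class name `Distributor` and its methods are hypothetical, and clearing the "issued" flag after a failure so that the task can be issued again is an assumption about how retries happen.

```python
from collections import OrderedDict


class Distributor:
    """In-memory model of the task distribution and deduplication node."""

    def __init__(self, urls, max_failures):
        self.tasks = OrderedDict((u, {"issued": False, "failures": 0}) for u in urls)
        self.history = {}              # address -> "acquisition successful" field
        self.max_failures = max_failures   # the threshold n

    def dispatch(self, idle_nodes):
        """Give each idle node one task that has not been issued yet."""
        pending = [u for u, t in self.tasks.items() if not t["issued"]]
        assignments = {}
        for node, url in zip(idle_nodes, pending):
            self.tasks[url]["issued"] = True
            assignments[node] = url
        return assignments

    def report(self, url, success):
        """Apply the acquisition result returned by a node."""
        task = self.tasks[url]
        if success:
            del self.tasks[url]
            self.history[url] = True       # first mark
            return
        task["failures"] += 1
        if task["failures"] > self.max_failures:
            del self.tasks[url]
            self.history[url] = False      # second mark: task abandoned
        else:
            task["issued"] = False         # assumption: eligible for re-issue


a, b = "https://example.com/a", "https://example.com/b"
d = Distributor([a, b], max_failures=2)
first = d.dispatch(["node-1", "node-2"])
d.report(a, True)
d.report(b, False)
for _ in range(2):
    d.dispatch(["node-1"])
    d.report(b, False)
```

With n = 2, address `b` is abandoned on its third failure, because only then does the failure count exceed the threshold.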
Fig. 2 is a flowchart of a multi-node incremental data acquisition method according to an embodiment of the present invention, as shown in fig. 2, including:
S1, the acquisition node logs in to the website to be acquired, according to the address of that website and based on authentication parameters, acquires the cookie information of the website, downloads the content of a target webpage, analyzes all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to a task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks;
S2, the task distribution and deduplication node receives the target address analyzed by the acquisition node, maintains an acquisition task queue and a historical download queue that records all historical webpage acquisitions, judges whether the target address is added to the historical download queue according to whether a record of the target address exists in the historical download queue, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
S3, the data storage node receives the acquired data of the acquisition node.
Specifically, in step S1, at least one acquisition node logs in to the website to be acquired, according to its address, using authentication parameters such as a user name and a password, and then obtains the website's Cookie information as the credential for subsequent data acquisition. It downloads the content of the target webpage, parses all network addresses in the page that need to be acquired, and transmits the HASH values of those addresses (the preset parameter values) to the task distribution and deduplication node. It then receives from the task distribution node the target addresses to be acquired and acquires the data. The acquisition nodes, which use Pyppeteer as their tool, gather the addresses that need to be acquired from the target website; a plurality of nodes execute the acquisition tasks asynchronously and push the addresses to the Redis cluster nodes, and Redis judges whether each address needs to be added to the message queue. This effectively avoids repeated access to the same webpage address, ensures that the data acquired by the acquisition nodes is not duplicated, and ensures that the storage in the Kafka cluster nodes is effective.
In step S2, after a plurality of acquisition nodes acquire data, the acquired data is stored in a data storage node and the acquisition state is fed back to a task distribution node to update the state of the acquisition task. At least one task distribution and deduplication node receives the target webpage addresses parsed by the acquisition nodes and maintains an acquisition task queue {HASH, webpage address, whether the task has been issued, failure count} and a historical download queue {HASH, webpage address, whether acquisition succeeded} that records all historical webpage acquisitions, where HASH is a hash value generated from the webpage address. When the task distribution node receives task data sent by an acquisition node, it checks the historical download queue for an existing record of the task, which prevents a target address from being added to the download queue repeatedly; after this check, the tasks in the acquisition task queue are sent to idle acquisition nodes for execution. The at least one task distribution and deduplication node can be built as a Redis cluster environment with master-slave backup, which improves the timeliness and stability with which the node receives data and distributes tasks. Here, Redis is an open-source, log-type Key-Value database written in ANSI C that supports networking, can be memory-based as well as persistent, and provides APIs for multiple languages.
In step S3, at least one data storage node receives the acquired data of the plurality of acquisition nodes, and a plurality of data storage nodes may be built as a Kafka cluster environment to achieve high throughput and stable data storage. Here, Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of a consumer-scale website. Such actions (web browsing, searching and other user actions) are a key factor in many social functions on the modern web. Because of throughput requirements, this data is usually handled by processing logs and log aggregation. That is a feasible solution for log data and offline analysis systems such as Hadoop, but it is limiting when real-time processing is required. The aim of Kafka is to unify online and offline message processing through Hadoop's parallel loading mechanism, and also to provide real-time messages across a cluster.
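The address-parsing part of step S1 can be illustrated with the Python standard library. This is a simplified sketch that assumes the page HTML has already been downloaded (in the described system Pyppeteer would supply the rendered page) and that target addresses are the `href` values of anchor tags, which the text does not specify. The names `LinkCollector` and `extract_targets` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link addresses from a page, without duplicates."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)  # resolve relative links
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_targets(html: str, base_url: str) -> list:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = ('<a href="/list?page=2">next</a> '
        '<a href="item/7">item</a> '
        '<a href="/list?page=2">again</a>')
targets = extract_targets(page, "https://example.com/list?page=1")
```

Removing duplicates within a single page is only a local optimization; the authoritative check against earlier acquisitions still happens in the task distribution and deduplication node.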
The embodiment of the invention realizes an efficient and stable network data acquisition mode by performing data acquisition at multiple nodes, distributing tasks, removing duplication and storing data, and effectively solves the problems of login, asynchronous data loading and the like in the network data acquisition process.
According to any of the above embodiments, the method further includes, after step S1:
and executing the operation on the data storage node according to the result of data acquisition, and sending the state result of the data acquisition to the task distribution and deduplication node.
The executing the operation on the data storage node according to the result of data acquisition, and sending the state result of data acquisition to the task distribution and deduplication node specifically include:
if the data acquisition is judged to be successful, the acquired data is sent to the data storage node, and the success state of the data acquisition is sent to the task distribution and deduplication node;
and if the data acquisition is judged to have failed, the failure state of the data acquisition is sent to the task distribution and deduplication node.
Specifically, when the acquisition node acquires data, if the acquisition is successful, the acquisition result is sent to the data storage node, and the state of successful task acquisition is sent to the task distribution node; and if the collection fails, sending the state of the task collection failure to the task distribution node.
By adopting Pyppeteer acquisition, Redis task distribution and Kafka distributed storage, the embodiment of the invention solves the problem of acquiring data from the many websites that require a login, realizes the acquisition of Ajax dynamically rendered data, task distribution and effective deduplication, and avoids the disk IO bottleneck caused by storing a large amount of data.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A multi-node incremental data acquisition system, comprising: an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:
the acquisition node is used for acquiring cookie information of the website to be acquired after logging in the website to be acquired based on authentication parameters according to the website address of the website to be acquired, downloading the content of a target webpage, analyzing all target addresses needing to be acquired in the target webpage, transmitting preset parameter values of the target addresses to the task distribution and deduplication node, receiving acquisition tasks issued by the task distribution and deduplication node, and acquiring data according to the acquisition tasks;
the task distribution and deduplication node is used for receiving the target address analyzed by the acquisition node, maintaining an acquisition task queue, acquiring a historical download queue of all historical webpage acquisition records, judging whether the target address is added to the historical download queue according to whether the target address record exists in the historical download queue, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
the data storage node is used for receiving the collected data of the collecting node.
2. The multi-node incremental data acquisition system of claim 1, wherein the acquisition node is further configured to perform an operation on the data storage node according to a status result of data acquisition, and to send the status result of data acquisition to the task distribution and deduplication node.
3. The multi-node incremental data acquisition system of claim 2, wherein the acquisition node being further configured to perform an operation on the data storage node according to a status result of data acquisition, and to send the status result of data acquisition to the task distribution and deduplication node, specifically includes:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success state of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure state of the data acquisition to the task distribution and deduplication node.
4. The system according to claim 1, wherein the determining whether to add the target address to the historical download queue according to whether there is a record of the target address in the historical download queue comprises:
if the task distribution and deduplication node does not acquire the target address from the historical download queue, adding the target address to the acquisition task queue;
and if the task distribution and deduplication node acquires the target address from the historical download queue, refraining from adding the target address to the acquisition task queue.
5. The multi-node incremental data acquisition system of claim 1, wherein the task distribution and deduplication node is further configured to wait for the plurality of idle acquisition nodes to return acquisition results;
if the acquisition is successful, adding the webpage address corresponding to the acquisition result to the historical download queue, and giving a first mark to the historical download queue;
and if the acquisition fails, incrementing by 1 the failure count, kept in the task distribution and deduplication node, of the webpage address corresponding to the acquisition result; if the download failure count exceeds a preset download count threshold, stopping the acquisition, adding the webpage address corresponding to the acquisition result to the historical download queue, and giving a second mark to the historical download queue.
6. A multi-node incremental data acquisition method is characterized by comprising the following steps:
logging in, by an acquisition node, to a website to be acquired according to the website address of the website to be acquired and based on authentication parameters, then acquiring cookie information of the website to be acquired, downloading the content of a target webpage, analyzing all target addresses to be acquired in the target webpage, transmitting preset parameter values of the target addresses to a task distribution and deduplication node, receiving acquisition tasks issued by the task distribution and deduplication node, and performing data acquisition according to the acquisition tasks;
receiving the target address analyzed by the acquisition node by the task distribution and deduplication node, maintaining an acquisition task queue, acquiring a historical download queue of all historical webpage acquisition records, judging whether the target address is added to the historical download queue according to whether the target address record exists in the historical download queue, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
and receiving the acquisition data of the acquisition node by a data storage node.
7. The multi-node incremental data acquisition method according to claim 6, wherein, after the acquisition node logs in to the website to be acquired according to the website address of the website to be acquired and based on the authentication parameters, acquires cookie information of the website to be acquired, downloads the content of a target webpage, analyzes all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives an acquisition task issued by the task distribution and deduplication node, and performs data acquisition according to the acquisition task, the method further comprises:
and executing the operation on the data storage node according to the result of data acquisition, and sending the state result of the data acquisition to the task distribution and deduplication node.
8. The multi-node incremental data acquisition method according to claim 7, wherein the executing the operation on the data storage node according to the result of data acquisition and sending the state result of data acquisition to the task distribution and deduplication node specifically includes:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success state of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure state of the data acquisition to the task distribution and deduplication node.
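As an illustration only, the result handling recited in claim 5 (a first mark on success, a failure counter, and a second mark once the counter exceeds a preset threshold) could be sketched as below. The class name `ResultTracker`, the mark values, the default threshold of 3, and the choice to attach the mark to each recorded address are all assumptions of this sketch rather than details taken from the claims.

```python
from typing import Dict

MAX_FAILURES = 3          # hypothetical preset download count threshold
MARK_SUCCESS = "first"    # hypothetical value for the "first mark"
MARK_GAVE_UP = "second"   # hypothetical value for the "second mark"


class ResultTracker:
    """Tracks acquisition results on the task distribution and deduplication side."""

    def __init__(self, max_failures: int = MAX_FAILURES) -> None:
        self.max_failures = max_failures
        self.failures: Dict[str, int] = {}   # url -> failure count
        self.history: Dict[str, str] = {}    # url -> mark (stand-in for the history queue)

    def record(self, url: str, success: bool) -> bool:
        """Record one result; return True if the URL should be retried."""
        if success:
            self.history[url] = MARK_SUCCESS
            return False
        # Failure: increase the failure count by 1.
        self.failures[url] = self.failures.get(url, 0) + 1
        if self.failures[url] > self.max_failures:
            # Threshold exceeded: stop acquiring and record the URL with the second mark.
            self.history[url] = MARK_GAVE_UP
            return False
        return True
```

Because the comparison is strict ("exceeds"), a threshold of 2 in this sketch allows two retries and gives up on the third failure.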
CN201911338747.7A 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method Active CN111078975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911338747.7A CN111078975B (en) 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911338747.7A CN111078975B (en) 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method

Publications (2)

Publication Number Publication Date
CN111078975A true CN111078975A (en) 2020-04-28
CN111078975B CN111078975B (en) 2023-04-28

Family

ID=70316843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911338747.7A Active CN111078975B (en) 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method

Country Status (1)

Country Link
CN (1) CN111078975B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115794539A (en) * 2022-09-20 2023-03-14 北京世纪国源科技股份有限公司 Log incremental monitoring method, device and equipment for space-time data API service

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484828A (en) * 2016-09-29 2017-03-08 西南科技大学 A kind of distributed interconnection data Fast Acquisition System and acquisition method
US20180198762A1 (en) * 2017-01-11 2018-07-12 Red Hat, Inc. Distribution of secure data with entitlement enforcement
US20180351816A1 (en) * 2017-06-02 2018-12-06 Yan Li Methods and apparatus for parameter tuning using a cloud service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
齐卫宁; 王劲林: "Secure Cloud Storage Platform in the Big Data Era" (大数据时代的安全云存储平台) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115794539A (en) * 2022-09-20 2023-03-14 北京世纪国源科技股份有限公司 Log incremental monitoring method, device and equipment for space-time data API service
CN115794539B (en) * 2022-09-20 2023-09-01 北京世纪国源科技股份有限公司 Log incremental monitoring method, device and equipment for space-time data API service

Also Published As

Publication number Publication date
CN111078975B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN107895009B (en) Distributed internet data acquisition method and system
JP4327481B2 (en) Database system, server, inquiry input method and data update method
WO2020087082A1 (en) Trace and span sampling and analysis for instrumented software
Kotenko et al. Aggregation of elastic stack instruments for collecting, storing and processing of security information and events
US12019634B1 (en) Reassigning a processing node from downloading to searching a data group
Laboshin et al. The big data approach to collecting and analyzing traffic data in large scale networks
US20130185429A1 (en) Processing Store Visiting Data
CN107580052B (en) Self-evolution network self-adaptive crawler method and system
CN112087520B (en) Data processing method, device, equipment and computer readable storage medium
CN105260388A (en) Optimization method of distributed vertical crawler service system
US11892976B2 (en) Enhanced search performance using data model summaries stored in a remote data store
CN105515836A (en) Log processing method, device and server
US11687487B1 (en) Text files updates to an active processing pipeline
CN112988741A (en) Real-time service data merging method and device and electronic equipment
CN113656673A (en) Master-slave distributed content crawling robot for advertisement delivery
CN108337100B (en) Cloud platform monitoring method and device
CN111159135A (en) Data processing method and device, electronic equipment and storage medium
CN104503983A (en) Method and device for providing website certification data for search engine
WO2022187008A1 (en) Asynchronous replication of linked parent and child records across data storage regions
CN111078975B (en) Multi-node incremental data acquisition system and acquisition method
CN111130882A (en) Monitoring system and method of network equipment
CN116028192A (en) Multi-source heterogeneous data acquisition method, device and storage medium
US20220245091A1 (en) Facilitating generation of data model summaries
Jeřábek et al. Big data network flow processing using Apache Spark
Abead et al. A comparative study of HDFS replication approaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant