CN111078975B - Multi-node incremental data acquisition system and acquisition method - Google Patents

Multi-node incremental data acquisition system and acquisition method

Publication number
CN111078975B
Authority
CN
China
Prior art keywords
node
acquisition
data
task
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911338747.7A
Other languages
Chinese (zh)
Other versions
CN111078975A (en
Inventor
邢文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tianyuan Innovation Technology Co ltd
Original Assignee
Beijing Tianyuan Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tianyuan Innovation Technology Co ltd
Priority to CN201911338747.7A
Publication of CN111078975A
Application granted
Publication of CN111078975B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the invention provide a multi-node incremental data acquisition system and a multi-node incremental data acquisition method. The system comprises an acquisition node, a task distribution and deduplication node, and a data storage node. In the method, at least one acquisition node receives at least one acquisition task issued by the task distribution and deduplication node, parses the address of the website to be acquired according to the acquisition task, and performs data acquisition; the acquired data is sent to at least one data storage node for storage, and the acquisition state is fed back to at least one task distribution and deduplication node for a state update. By combining Pyppeteer acquisition, Redis task distribution and Kafka distributed storage, the embodiments address the acquisition of data from websites that require login, implement task distribution with effective deduplication, avoid the disk IO bottleneck caused by storing large amounts of data, improve Pyppeteer acquisition performance, and avoid a network request bottleneck.

Description

Multi-node incremental data acquisition system and acquisition method
Technical Field
The invention relates to the technical field of data acquisition, in particular to a multi-node incremental data acquisition system and a multi-node incremental data acquisition method.
Background
At present, most network data acquisition tools offer poor support for websites that require login before their data can be accessed, yet most websites require login for browsing. In addition, most websites present their data across multiple pages, and the data is rendered dynamically through Ajax, which makes data acquisition more difficult.
In addition, because the volume of data involved in network data acquisition is huge and the required acquisition speed is high, the prior art encounters a storage bottleneck when acquired data is sent to the corresponding storage device; this slows data reading and writing, and the accuracy of data transmission cannot be guaranteed.
Disclosure of Invention
Embodiments of the invention provide a multi-node incremental data acquisition system and an acquisition method, which address the disk IO read-write bottleneck that, in the prior art, results from the huge volume of acquired data and limits the actual acquisition speed.
In a first aspect, an embodiment of the present invention provides a multi-node incremental data collection system, including:
an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:
the acquisition node is configured to log in to a website to be acquired, according to the address of the website and based on authentication parameters, and then acquire cookie information of the website, download the content of a target webpage, parse all target addresses to be acquired in the target webpage, transmit preset parameter values of the target addresses to the task distribution and deduplication node, receive acquisition tasks issued by the task distribution and deduplication node, and acquire data according to the acquisition tasks;
the task distribution and deduplication node is configured to receive the target addresses parsed by the acquisition node, maintain an acquisition task queue, obtain a history download queue holding all historical webpage acquisition records, determine whether to add a target address to the history download queue according to whether the history download queue already has a record of the target address, and distribute the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
the data storage node is configured to receive the data acquired by the acquisition node.
The acquisition node is further configured to perform an operation on the data storage node according to the status result of data acquisition, and to send the status result to the task distribution and deduplication node.
Performing the operation on the data storage node according to the status result of data acquisition, and sending the status result to the task distribution and deduplication node, specifically includes:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success status of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure status of the data acquisition to the task distribution and deduplication node.
Determining whether to add the target address to the history download queue according to whether a record of the target address exists in the history download queue specifically includes:
if the task distribution and deduplication node does not find the target address in the history download queue, adding the target address to the acquisition task queue;
and if the task distribution and deduplication node finds the target address in the history download queue, refusing to add the target address to the acquisition task queue.
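The admission rule above can be sketched as follows. This is a minimal illustration and not the patented implementation: a plain dictionary and a deque stand in for the Redis history download queue and acquisition task queue, and the class name `Deduplicator`, the helper `url_hash`, and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib
from collections import deque


def url_hash(url: str) -> str:
    """Hash a web address (SHA-1 is an assumption; the text only says HASH)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


class Deduplicator:
    """In-memory stand-in for the task distribution and deduplication node."""

    def __init__(self):
        self.history = {}     # hash -> url, plays the role of the history download queue
        self.tasks = deque()  # (hash, url) pairs, plays the role of the acquisition task queue

    def submit(self, url: str) -> bool:
        """Queue the address unless the history already has a record of it."""
        key = url_hash(url)
        if key in self.history:
            return False      # already recorded: refuse to add it again
        self.tasks.append((key, url))
        return True
```

As in the text, only the history is consulted, so an address that is still pending in the task queue could be submitted twice; a deployment would probably also want to check pending tasks.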
The task distribution and deduplication node is further configured to wait for the plurality of idle acquisition nodes to return acquisition results;
if the acquisition succeeds, adding the webpage address corresponding to the acquisition result to the history download queue, and assigning it a first mark in the history download queue;
if the acquisition fails, incrementing by 1 the failure count held in the task distribution and deduplication node for the webpage address corresponding to the acquisition result; if the failure count exceeds a preset download count threshold, stopping acquisition, adding the webpage address to the history download queue, and assigning it a second mark in the history download queue.
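The handling of returned results can be illustrated with a short sketch. The names `Task`, `handle_result` and `MAX_FAILURES`, and the threshold value 3, are hypothetical; the history is modelled as a dictionary whose boolean value plays the role of the first mark (True) or second mark (False).

```python
from dataclasses import dataclass

MAX_FAILURES = 3  # hypothetical value for the preset download count threshold


@dataclass
class Task:
    url: str
    failures: int = 0


def handle_result(task: Task, success: bool, history: dict,
                  max_failures: int = MAX_FAILURES) -> bool:
    """Return True when the task is finished and recorded, False when it may be retried."""
    if success:
        history[task.url] = True    # first mark: acquired successfully
        return True
    task.failures += 1              # failure count increases by 1
    if task.failures > max_failures:
        history[task.url] = False   # second mark: abandoned after too many failures
        return True
    return False                    # below the threshold: leave it queued for another attempt
```

Recording abandoned addresses in the history means the deduplication check also keeps them from being queued again.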
In a second aspect, an embodiment of the present invention provides a multi-node incremental data collection method, including:
an acquisition node logs in to a website to be acquired, according to the address of the website and based on authentication parameters, and then acquires cookie information of the website, downloads the content of a target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to a task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks;
the task distribution and deduplication node receives the target addresses parsed by the acquisition node, maintains an acquisition task queue, obtains a history download queue holding all historical webpage acquisition records, determines whether to add a target address to the history download queue according to whether the history download queue already has a record of the target address, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
and a data storage node receives the data acquired by the acquisition node.
After the step in which the acquisition node logs in to the website to be acquired according to the address of the website and based on authentication parameters, acquires cookie information of the website, downloads the content of a target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks, the method further comprises:
performing an operation on the data storage node according to the data acquisition result, and sending the status result of data acquisition to the task distribution and deduplication node.
Performing the operation on the data storage node according to the data acquisition result, and sending the status result of data acquisition to the task distribution and deduplication node, specifically includes:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success status of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure status of the data acquisition to the task distribution and deduplication node.
The multi-node incremental data acquisition system and acquisition method provided by the embodiments of the invention are based on Pyppeteer acquisition, Redis task distribution and Kafka distributed storage. They address the acquisition of data from websites that require login, task distribution with effective deduplication, and the disk IO bottleneck caused by storing large amounts of data. Multi-node asynchronous acquisition improves the acquisition performance of Pyppeteer, incremental crawling avoids a network request bottleneck, and the memory space of the server is used effectively.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a multi-node incremental data acquisition system provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a multi-node incremental data collection method according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To solve the above problems of data acquisition in the prior art, embodiments of the invention provide a multi-node incremental data acquisition system and an acquisition method. Pyppeteer logs in to the target website and obtains the login Cookie data, which provides login credentials for data acquisition; it completes the acquisition tasks issued by Redis and finally stores the acquired data in a Kafka message queue. Redis mainly receives the acquisition tasks transmitted by the Pyppeteer module and maintains the acquisition queues, which are divided into a task acquisition queue and a history download queue. The task acquisition queue mainly handles the acquisition tasks parsed and transmitted by the Pyppeteer module, records acquisition logs, and avoids repeated acquisition. Here, Pyppeteer is a web automation testing framework based on Chromium that provides a feature-rich API and is convenient to develop with; it can display the login page visually, supports interaction between a person and the running program, and can acquire data rendered dynamically by Ajax. The Kafka data collection queue is mainly used to store the data produced by Pyppeteer acquisition tasks. By using multiple Pyppeteer acquisition nodes together with Redis and Kafka clusters, the scheme can effectively improve acquisition efficiency, improve the stability of task distribution and data storage, and achieve incremental data acquisition through the Redis message queue.
FIG. 1 is a block diagram of a multi-node incremental data acquisition system according to an embodiment of the present invention. As shown in FIG. 1, the system includes:
an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:
the acquisition node is configured to log in to a website to be acquired, according to the address of the website and based on authentication parameters, and then acquire cookie information of the website, download the content of a target webpage, parse all target addresses to be acquired in the target webpage, transmit preset parameter values of the target addresses to the task distribution and deduplication node, receive acquisition tasks issued by the task distribution and deduplication node, and acquire data according to the acquisition tasks;
the task distribution and deduplication node is configured to receive the target addresses parsed by the acquisition node, maintain an acquisition task queue, obtain a history download queue holding all historical webpage acquisition records, determine whether to add a target address to the history download queue according to whether the history download queue already has a record of the target address, and distribute the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
the data storage node is configured to receive the data acquired by the acquisition node.
Specifically, the acquisition node logs in to the website to be acquired, according to its address, by passing authentication parameters such as a user name and password, and then obtains the Cookie information of the website for subsequent data acquisition. It downloads the content of a target webpage, parses all network addresses to be acquired in the webpage, and transmits the HASH values of these addresses, which are the preset parameter values, to the task distribution and deduplication node. It then receives the target addresses that the task distribution node requires to be acquired and performs data acquisition.
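The parsing and hashing step can be sketched with the Python standard library alone. This is an illustration under stated assumptions: the page content is taken as an already downloaded HTML string (in the described system Pyppeteer would supply it), target addresses are assumed to be the `href` values of anchor tags, and SHA-1 is an arbitrary choice of hash; `LinkExtractor` and `extract_targets` are hypothetical names.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute address of every <a href=...> on a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


def extract_targets(html: str, base_url: str):
    """Return (hash, address) pairs to send to the task distribution and deduplication node."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return [(hashlib.sha1(u.encode("utf-8")).hexdigest(), u) for u in parser.links]
```

Sending a fixed-length hash alongside each address gives the deduplication node a compact key to look up.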
The task distribution and deduplication node receives the target webpage addresses parsed by the acquisition node, maintains an acquisition task queue {HASH, webpage address, whether the task has been issued, failure count}, and records a history download queue {HASH, webpage address, whether acquisition succeeded} of all historical webpage acquisition records, where HASH is the hash value generated from the webpage address. When the task distribution node receives task data sent by an acquisition node, it checks the history download queue for an existing history task so that a target address is not added to the download queue repeatedly; after this check, tasks in the acquisition task queue are sent to idle acquisition nodes for execution.
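The two record layouts can be written down as small data classes. The field names, the SHA-1 hash and the JSON serialization are assumptions made for illustration; the text specifies only which pieces of information each queue entry carries, not how they are encoded in Redis.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class TaskRecord:
    """Entry of the acquisition task queue."""
    hash_value: str
    url: str
    issued: bool = False   # whether the task has been issued
    failures: int = 0      # failure count


@dataclass
class HistoryRecord:
    """Entry of the history download queue."""
    hash_value: str
    url: str
    success: bool          # whether acquisition succeeded


def hash_url(url: str) -> str:
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def new_task(url: str) -> TaskRecord:
    return TaskRecord(hash_url(url), url)


def to_history(task: TaskRecord, success: bool) -> HistoryRecord:
    return HistoryRecord(task.hash_value, task.url, success)


def serialize(record) -> str:
    """One possible encoding for storing a record as a string value."""
    return json.dumps(asdict(record), sort_keys=True)
```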
The data storage node receives the data acquired by the acquisition node and stores it in the corresponding topic.
Based on Pyppeteer acquisition, Redis task distribution and Kafka distributed storage, the embodiments of the invention provide an efficient and stable approach to network data acquisition. They effectively solve problems such as login and asynchronous data loading during network data acquisition, make effective use of network bandwidth, avoid repeated acquisition of network resources, and relieve the disk IO bottleneck.
Based on the above embodiment, the acquisition node is further configured to perform an operation on the data storage node according to the status result of data acquisition, and to send the status result to the task distribution and deduplication node.
Performing the operation on the data storage node according to the status result of data acquisition, and sending the status result to the task distribution and deduplication node, specifically includes:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success status of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure status of the data acquisition to the task distribution and deduplication node.
Specifically, when the acquisition node acquires data, if the acquisition succeeds, it sends the acquisition result to the data storage node and sends the success status of the task to the task distribution node; if the acquisition fails, it sends the failure status of the task to the task distribution node.
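The success and failure branches on the acquisition node can be sketched as below. The function `run_task` is hypothetical; `fetch` is any callable that downloads a page and raises on failure, and two plain lists stand in for the data storage node and for the status channel to the task distribution node.

```python
def run_task(url, fetch, storage, status_reports):
    """Acquire one address and report the outcome.

    On success the data goes to storage and a success status is reported;
    on failure only the failure status is reported.
    """
    try:
        data = fetch(url)
    except Exception:
        status_reports.append((url, False))  # failure status only, nothing is stored
        return False
    storage.append((url, data))              # send acquired data to the storage node
    status_reports.append((url, True))       # then report success to the distribution node
    return True
```

Treating any exception as an acquisition failure is a simplification; a real node might distinguish timeouts from login errors.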
Based on any one of the above embodiments, the task distribution and deduplication node is further configured to wait for the plurality of idle acquisition nodes to return acquisition results;
if the acquisition succeeds, adding the webpage address corresponding to the acquisition result to the history download queue, and assigning it a first mark in the history download queue;
if the acquisition fails, incrementing by 1 the failure count held in the task distribution and deduplication node for the webpage address corresponding to the acquisition result; if the failure count exceeds a preset download count threshold, stopping acquisition, adding the webpage address to the history download queue, and assigning it a second mark in the history download queue.
Specifically, when the task distribution node receives task data sent by an acquisition node, it checks the history download queue for a record of the task. If no record exists, the webpage address is added to the acquisition task queue; if the webpage address already exists, adding it to the acquisition task queue is prohibited. The task distribution node distributes the tasks in the acquisition task queue to the idle acquisition nodes (setting the "whether the task has been issued" attribute of each distributed task to true) and waits for the acquisition nodes to return their results. If the acquisition succeeds, the webpage address is added to the Redis history download queue and its "whether acquisition succeeded" field is set to true, which is the first mark. If the download fails, the failure count of the webpage in the Redis value is incremented by 1. A default download count threshold n is set; if the number of download failures exceeds n, the download task for the webpage is abandoned, the webpage is kept in the history download queue, and its "whether acquisition succeeded" field is set to false, which is the second mark.
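The distribution step, in which unissued tasks are handed to idle nodes and flagged as issued, can be sketched as follows. The function `dispatch`, the dictionary keys and the node identifiers are hypothetical; tasks are plain dictionaries rather than Redis entries.

```python
def dispatch(tasks, idle_nodes):
    """Assign unissued tasks to idle nodes, one task per node.

    tasks: list of dicts with at least 'url' and 'issued' keys.
    idle_nodes: list of node identifiers that are currently idle.
    Returns a mapping of node identifier to the address it should acquire.
    """
    assignments = {}
    pending = (t for t in tasks if not t["issued"])
    for node, task in zip(idle_nodes, pending):
        task["issued"] = True          # mark the task as issued
        assignments[node] = task["url"]
    return assignments
```

For example, with four tasks of which the first is already issued and two idle nodes, the second and third tasks are assigned and the fourth stays pending until a node becomes idle again.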
FIG. 2 is a flowchart of a multi-node incremental data acquisition method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
S1, an acquisition node logs in to a website to be acquired, according to the address of the website and based on authentication parameters, and then acquires cookie information of the website, downloads the content of a target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to a task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks;
S2, the task distribution and deduplication node receives the target addresses parsed by the acquisition node, maintains an acquisition task queue, obtains a history download queue holding all historical webpage acquisition records, determines whether to add a target address to the history download queue according to whether the history download queue already has a record of the target address, and distributes the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
S3, a data storage node receives the data acquired by the acquisition node.
Specifically, in step S1, at least one acquisition node logs in to the website to be acquired, according to its address, by passing authentication parameters such as a user name and password, and then obtains the Cookie information of the website for subsequent data acquisition. It downloads the content of a target webpage, parses all network addresses to be acquired in the webpage, and transmits the HASH values of these addresses, which are the preset parameter values, to the task distribution and deduplication node. It then receives the target addresses that the task distribution node requires to be acquired and performs data acquisition. The acquisition nodes use Pyppeteer as their acquisition tool to collect the addresses of the target website that need to be acquired; multiple nodes execute acquisition tasks asynchronously and push the collected addresses to the Redis cluster node, and Redis determines whether each address needs to be added to the message queue. This effectively avoids repeated access to the same webpage address, ensures that the data acquired by the acquisition nodes is not duplicated, and keeps the storage of the Kafka cluster nodes effective.
In step S2, after the acquisition nodes have acquired data, the data is stored in the data storage node and the acquisition status is fed back to the task distribution node to update the status of the acquisition task. At least one task distribution and deduplication node receives the target webpage addresses parsed by the acquisition nodes, maintains an acquisition task queue {HASH, webpage address, whether the task has been issued, failure count}, and records a history download queue {HASH, webpage address, whether acquisition succeeded} of all historical webpage acquisition records, where HASH is the hash value generated from the webpage address. When the task distribution node receives task data sent by an acquisition node, it checks the history download queue for an existing history task so that a target address is not added to the download queue repeatedly; after this check, tasks in the acquisition task queue are sent to idle acquisition nodes for execution. Here, at least one task distribution and deduplication node can form a Redis cluster environment with master-slave backup, which improves the timeliness and stability with which the task distribution and deduplication node receives data and distributes tasks. Redis is an open-source, log-type key-value database written in ANSI C; it supports networking, can run in memory with optional persistence, and provides APIs for multiple languages.
In step S3, at least one data storage node receives the data acquired by the plurality of acquisition nodes; multiple data storage nodes may form a Kafka cluster environment to achieve high throughput and stable data storage. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of consumers on a website. Such actions (web browsing, searching and other user actions) are a key factor in many social functions on the modern web, and because of throughput requirements this data is usually handled through log processing and log aggregation. That is a viable solution for log data and offline analysis systems such as Hadoop, but it is limiting where real-time processing is required. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism and to provide real-time messages through a cluster.
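The publish-subscribe model mentioned here can be illustrated with a toy in-memory log. This is not Kafka and uses none of its client API; the class `TopicLog` and its methods are hypothetical and only show the idea of an append-only log per topic that independent consumer groups read at their own offsets.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log with per-group read offsets."""

    def __init__(self):
        self._log = defaultdict(list)     # topic -> list of messages in arrival order
        self._offsets = defaultdict(int)  # (group, topic) -> index of next unread message

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        self._log[topic].append(message)
        return len(self._log[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[(group, topic)]
        batch = self._log[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch
```

Because each group keeps its own offset, an acquisition node can publish once while several downstream consumers read the same data independently.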
In the embodiments of the invention, data acquisition, task distribution with deduplication, and data storage are each performed across multiple nodes, which yields an efficient and stable approach to network data acquisition and effectively solves problems such as login and asynchronous data loading during network data acquisition.
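The asynchronous multi-node execution described for step S1 can be sketched with `asyncio`. This is a single-process simulation under stated assumptions: each coroutine worker stands in for one acquisition node, a shared `asyncio.Queue` stands in for the Redis task queue, and `collect` is a placeholder for the page download that a real node would perform with Pyppeteer. All function names are hypothetical.

```python
import asyncio


async def collect(url):
    """Placeholder for a page download; yields control like real network IO would."""
    await asyncio.sleep(0)
    return f"content of {url}"


async def worker(name, queue, results):
    """One simulated acquisition node: take tasks until the queue is empty."""
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results.append((name, url, await collect(url)))


async def run_nodes(urls, node_count=3):
    """Run several simulated nodes concurrently over a shared task queue."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    await asyncio.gather(*(worker(f"node-{i}", queue, results) for i in range(node_count)))
    return results
```

Calling `asyncio.run(run_nodes(["u1", "u2", "u3"], node_count=2))` returns one result per address; since every task is taken from the queue exactly once, no address is acquired twice.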
Based on any of the above embodiments, the method further includes, after step S1:
performing an operation on the data storage node according to the data acquisition result, and sending the status result of data acquisition to the task distribution and deduplication node.
Performing the operation on the data storage node according to the data acquisition result, and sending the status result of data acquisition to the task distribution and deduplication node, specifically includes:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node, and sending the success status of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure status of the data acquisition to the task distribution and deduplication node.
Specifically, when the acquisition node acquires data, if the acquisition succeeds, it sends the acquisition result to the data storage node and sends the success status of the task to the task distribution node; if the acquisition fails, it sends the failure status of the task to the task distribution node.
By using Pyppeteer acquisition, Redis task distribution and Kafka distributed storage, the embodiments of the invention address the acquisition of data from many websites that require login, the acquisition of data rendered dynamically by Ajax, task distribution with effective deduplication, and the disk IO bottleneck caused by writing large amounts of data to storage.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A multi-node incremental data acquisition system, comprising: an acquisition node, a task distribution and deduplication node, and a data storage node; wherein:
the acquisition node is configured to: according to the URL of a website to be acquired, obtain cookie information of the website after logging in to the website based on authentication parameters; download the content of a target webpage; parse all target addresses to be acquired in the target webpage; transmit preset parameter values of the target addresses to the task distribution and deduplication node; receive acquisition tasks issued by the task distribution and deduplication node; and acquire data according to the acquisition tasks;
the task distribution and deduplication node is configured to: receive the target addresses parsed by the acquisition node; maintain an acquisition task queue; obtain a history download queue of all historical webpage acquisition records; determine, according to whether a record of the target address exists in the history download queue, whether to add the target address to the history download queue; and distribute the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
the data storage node is configured to receive the data acquired by the acquisition node.
2. The multi-node incremental data acquisition system of claim 1, wherein the acquisition node is further configured to perform an operation on the data storage node based on a status result of data acquisition and to send the status result of the data acquisition to the task distribution and deduplication node.
3. The multi-node incremental data acquisition system of claim 2, wherein the acquisition node being further configured to perform the operation on the data storage node based on the status result of data acquisition and to send the status result of the data acquisition to the task distribution and deduplication node specifically comprises:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node and sending the success status of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure status of the data acquisition to the task distribution and deduplication node.
4. The multi-node incremental data acquisition system of claim 1, wherein determining whether to add the target address to the history download queue according to whether a record of the target address exists in the history download queue comprises:
if the task distribution and deduplication node does not find the target address in the history download queue, adding the target address to the acquisition task queue;
and if the task distribution and deduplication node finds the target address in the history download queue, prohibiting the target address from being added to the acquisition task queue.
5. The multi-node incremental data acquisition system of claim 1, wherein the task distribution and deduplication node is further configured to wait for the plurality of idle acquisition nodes to return acquisition results;
if the acquisition succeeds, adding the webpage address corresponding to the acquisition result to the history download queue, and giving a first mark to the history download queue;
if the acquisition fails, incrementing by 1 the failure count, kept in the task distribution and deduplication node, of the webpage address corresponding to the acquisition result; and if the download failure count exceeds a preset download count threshold, stopping acquisition, adding the webpage address corresponding to the acquisition result to the history download queue, and giving a second mark to the history download queue.
6. A multi-node incremental data acquisition method, comprising the following steps:
an acquisition node, according to the URL of a website to be acquired, obtaining cookie information of the website after logging in to the website based on authentication parameters, downloading the content of a target webpage, parsing all target addresses to be acquired in the target webpage, transmitting preset parameter values of the target addresses to a task distribution and deduplication node, receiving acquisition tasks issued by the task distribution and deduplication node, and acquiring data according to the acquisition tasks;
the task distribution and deduplication node receiving the target addresses parsed by the acquisition node, maintaining an acquisition task queue, obtaining a history download queue of all historical webpage acquisition records, determining, according to whether a record of the target address exists in the history download queue, whether to add the target address to the history download queue, and distributing the tasks in the acquisition task queue to a plurality of idle acquisition nodes;
and a data storage node receiving the data acquired by the acquisition node.
7. The multi-node incremental data acquisition method of claim 6, wherein the step in which the acquisition node obtains cookie information of the website to be acquired after logging in to the website based on authentication parameters, downloads the content of the target webpage, parses all target addresses to be acquired in the target webpage, transmits preset parameter values of the target addresses to the task distribution and deduplication node, receives acquisition tasks issued by the task distribution and deduplication node, and acquires data according to the acquisition tasks further comprises:
performing an operation on the data storage node according to the result of the data acquisition, and sending the status result of the data acquisition to the task distribution and deduplication node.
8. The multi-node incremental data acquisition method of claim 7, wherein performing the operation on the data storage node according to the result of the data acquisition and sending the status result of the data acquisition to the task distribution and deduplication node comprises:
if the data acquisition is determined to be successful, sending the acquired data to the data storage node and sending the success status of the data acquisition to the task distribution and deduplication node;
and if the data acquisition is determined to have failed, sending the failure status of the data acquisition to the task distribution and deduplication node.
CN201911338747.7A 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method Active CN111078975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911338747.7A CN111078975B (en) 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method

Publications (2)

Publication Number Publication Date
CN111078975A CN111078975A (en) 2020-04-28
CN111078975B true CN111078975B (en) 2023-04-28

Family

ID=70316843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911338747.7A Active CN111078975B (en) 2019-12-23 2019-12-23 Multi-node incremental data acquisition system and acquisition method

Country Status (1)

Country Link
CN (1) CN111078975B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant