CN112199567A - Distributed data acquisition method, system, server and storage medium

Distributed data acquisition method, system, server and storage medium

Info

Publication number
CN112199567A
Authority
CN
China
Prior art keywords
crawler
data acquisition
task
distributed
acquisition request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011035041.6A
Other languages
Chinese (zh)
Inventor
豆兴捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yolanda Technology Co ltd
Original Assignee
Shenzhen Yolanda Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yolanda Technology Co ltd filed Critical Shenzhen Yolanda Technology Co ltd
Priority to CN202011035041.6A priority Critical patent/CN112199567A/en
Publication of CN112199567A publication Critical patent/CN112199567A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/54: Interprogram communication
    • G06F9/546: Message passing systems or structures, e.g. queues
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a distributed data acquisition method, which is executed by a sub-node crawler engine of a distributed crawler system and comprises the following steps: reading a shared task queue from a sub-node server of the distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server; determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request; distributing the target webpage link and the data acquisition request to one or more downloaders to acquire target data; and returning queue update information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server for synchronous updating. The invention shares the task queue among all sub-nodes of the distributed crawler system, so that each sub-node can obtain the updated task queue in real time, reads and writes are fast, blocking is avoided, and the crawling speed is improved.

Description

Distributed data acquisition method, system, server and storage medium
Technical Field
The embodiment of the invention relates to the field of data crawling, in particular to a distributed data acquisition method, a distributed data acquisition system, a server and a storage medium.
Background
With the advent of the big data era, enterprises need to track and understand market trends in order to stay competitive, which often requires data analysis. Data is the basis of analysis, and its sources are not limited to business data; more and more data resources must be obtained from the Internet. Acquiring data resources manually requires a large investment of labor and time, both of which can be greatly reduced by acquiring data automatically. With the development of science and technology, web crawlers play an important role in automatic acquisition. At present, a popular web crawler framework is Scrapy, which integrates functions such as task scheduling, deduplication, webpage downloading, data parsing and data storage.
However, fast and efficient data acquisition sometimes calls for distributed crawling, and Scrapy can only run on a single machine. Although Scrapy-Redis is distributed, its scheduling mechanism can reduce the crawling speed and occupies a large amount of Redis storage space. Meanwhile, to cope with a growing number of crawling tasks, the original framework needs to be optimized for rapid deployment and stability.
Disclosure of Invention
The invention provides a distributed data acquisition method in which the sub-node servers of a distributed crawler system share a task queue, making data crawling fast and stable and reducing the Redis storage space occupied.
In a first aspect, the present invention provides a distributed data acquisition method, executed by a sub-node crawler engine of a distributed crawler system, including:
reading a shared task queue from a sub-node server of a distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server;
determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request;
distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data;
and returning queue updating information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server and performs synchronous updating.
Further, if the data acquisition request includes a key request parameter, distributing the target webpage link, the data acquisition request and the key request parameter to one or more downloaders includes:
acquiring one or more first IPs from a preset IP proxy pool;
assigning the one or more first IPs to the one or more downloaders;
and distributing the target webpage link, the data acquisition request and the key request parameter to one or more downloaders, so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on the first IP to acquire target data.
Further, after distributing the target webpage link, the data acquisition request and the key request parameter to one or more downloaders so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on the first IP to acquire target data, the method further includes:
judging whether the downloader has acquired the target data;
if not, acquiring a second IP different from the first IP from the IP proxy pool;
and sending the second IP to the downloader so that the downloader initiates the data acquisition request on the target webpage link of the Internet based on the second IP to acquire target data.
Further, reading the shared task queue from the sub-node server of the distributed crawler system to obtain one or more crawler tasks further includes:
monitoring the shared task queue in real time to judge whether a crawler task exists in the shared task queue;
and if so, acquiring the crawler task.
Further, after distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data, the method further includes:
judging whether the target data includes newly added task parameters;
if so, generating a crawler task based on the task parameters;
and adding the crawler task to the shared task queue.
Further, the distributed crawler system is based on a docker architecture, and the method further includes: expanding or removing crawler services through the docker architecture based on user requirements.
In a second aspect, the present invention provides a distributed data acquisition system, comprising:
an obtaining module, configured to read a shared task queue from a sub-node server of the distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server;
the task parameter determining module is used for determining task parameters of the crawler task, and the task parameters comprise a target webpage link and a data acquisition request;
the acquisition module is used for distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data;
and the synchronous updating module is used for returning queue updating information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server and performs synchronous updating.
Further, the data acquisition request includes a key request parameter, and the acquisition module is further configured to acquire a first IP from a preset IP proxy pool; assigning the first IP to the downloader; and distributing the target webpage link, the data acquisition request and the key request parameters to one or more downloaders, so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on different first IPs to acquire target data.
In a third aspect, the present invention provides a server, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement a distributed data acquisition method as described in any one of the above.
In a fourth aspect, the present invention provides a terminal readable storage medium, on which a program is stored, wherein the program, when executed by a processor, is capable of implementing a distributed data acquisition method as described in any one of the above.
The invention shares the task queue among all sub-nodes of the distributed crawler system, so that each sub-node can obtain the updated task queue in real time, reads and writes are fast, blocking is avoided, and the crawling speed is improved.
Drawings
Fig. 1 is a flowchart of a distributed data acquisition method according to the first embodiment.
Fig. 2 is a flowchart of a distributed data acquisition method according to the second embodiment.
Fig. 3 is a flowchart of an alternative embodiment of the second embodiment.
Fig. 4 is a flowchart of a distributed data acquisition method according to a third embodiment.
Fig. 5 is a system block diagram of the fourth embodiment.
Fig. 6 is a system block diagram of an alternative embodiment of the fourth embodiment.
Fig. 7 is a block diagram of a server in the fifth embodiment.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
Furthermore, the terms "first," "second," and the like may be used herein to describe various orientations, actions, steps, elements, or the like, but the orientations, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, the first feature information may be the second feature information or the third feature information, and similarly, the second feature information and the third feature information may be the first feature information without departing from the scope of the present application. The first characteristic information, the second characteristic information and the third characteristic information are characteristic information of the distributed crawler system, but the first characteristic information, the second characteristic information and the third characteristic information are not the same characteristic information. The terms "first", "second", etc. are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "plurality", "batch" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
The terms and abbreviations used in the following examples have the following meanings:
docker: an open-source application container engine that lets developers package an application and its dependencies into a portable container and publish it to any popular Linux machine; it can also provide virtualization. Containers are fully sandboxed and have no interfaces to one another. Because docker is portable and lightweight, distributed crawler services can be deployed rapidly and expanded or removed as needed.
Scrapy: a screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from pages. Scrapy is widely used, for example in data mining, monitoring and crawler services.
Scrapy-Redis: a distributed crawler system that uses Redis as its communication medium; it reads and writes quickly, has a negligible effect on crawling speed, and is widely used.
URL: the Uniform Resource Locator, i.e. the web page address, is an address used to describe the standard Resource on the internet. Each file on the internet has a unique URL.
Example one
This embodiment provides a distributed data acquisition method executed by a sub-node crawler engine of a distributed crawler system. Common crawler frameworks include Scrapy, Scrapy-Redis and the like. In this embodiment and the following embodiments, the distributed crawler system is an improvement on the Scrapy framework: the crawler service additionally obtains task parameters from a shared task queue, and then constructs requests and collects data.
In this embodiment and the following embodiments, the application scenario is a distributed crawler system that includes a main server and a sub-node server for each sub-node, and each sub-node server includes a sub-node crawler engine. Before crawling starts, the main server adds crawler tasks to the shared task queue based on data acquisition requirements, and the sub-node servers obtain the crawler tasks from the main server. As shown in fig. 1, the method comprises the following steps:
S101, reading a shared task queue from a sub-node server of a distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server.
In this step, reading the shared task queue includes: when sufficient computing power is available, the crawler engine reads the crawler tasks from the shared task queue in order through middleware and executes them. Optionally, the shared task queue stores the task parameters of each crawler task as key-value pairs, in which case this step may instead be: determining a keyword based on the data acquisition requirements of the sub-node; and acquiring the crawler task from the task queue based on the keyword.
In this step, the task parameters of the crawler task may be a complete URL or only the main request parameters.
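The key-value storage and keyword lookup described in this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: an in-memory ordered dictionary stands in for the shared task queue, and the names `SharedTaskStore`, `add_task`, `pop_next` and `pop_by_keyword` are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class SharedTaskStore:
    """In-memory stand-in for a key-value shared task queue (hypothetical)."""

    def __init__(self) -> None:
        # Insertion order is preserved, so tasks can also be read in sequence.
        self._tasks: "OrderedDict[str, dict]" = OrderedDict()

    def add_task(self, key: str, params: dict) -> None:
        """Store the task parameters under a key."""
        self._tasks[key] = params

    def pop_next(self) -> Optional[dict]:
        """Read tasks in order: remove and return the oldest task, if any."""
        if not self._tasks:
            return None
        _, params = self._tasks.popitem(last=False)
        return params

    def pop_by_keyword(self, keyword: str) -> Optional[dict]:
        """Remove and return the first task whose key contains the keyword."""
        for key in self._tasks:
            if keyword in key:
                # Returning immediately avoids iterating a mutated dict.
                return self._tasks.pop(key)
        return None
```

A sub-node interested only in one category of tasks would call `pop_by_keyword` with that category as the keyword; a sub-node without a preference would call `pop_next`.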
In an alternative embodiment, step S101 further comprises: monitoring the shared task queue in real time to judge whether a crawler task exists in the shared task queue; and if so, acquiring the crawler task.
By monitoring the task queue in real time, this step picks up newly added crawler tasks promptly.
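One way to picture the real-time monitoring is a polling loop. The sketch below assumes a standard-library `queue.Queue` in place of the shared task queue; the function name `monitor_queue` and the parameters `poll_interval` and `max_idle_polls` are hypothetical and exist only so that the example terminates.

```python
import queue
from typing import Callable


def monitor_queue(
    shared_queue: "queue.Queue[dict]",
    handle_task: Callable[[dict], None],
    poll_interval: float = 0.1,
    max_idle_polls: int = 3,
) -> int:
    """Poll the queue and hand every task found to handle_task.

    A real crawler service would loop indefinitely; max_idle_polls only
    exists so that this sketch stops. Returns the number of tasks handled.
    """
    handled = 0
    idle_polls = 0
    while idle_polls < max_idle_polls:
        try:
            # Block for at most poll_interval seconds waiting for a task.
            task = shared_queue.get(timeout=poll_interval)
        except queue.Empty:
            idle_polls += 1
            continue
        idle_polls = 0  # a task arrived, so reset the idle counter
        handle_task(task)
        handled += 1
    return handled
```

Blocking on `get` with a timeout lets the loop react to a new task as soon as it arrives without spinning the CPU while the queue is empty.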
S102, determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request.
In this step, the target web page link may be a URL link, for example. The data acquisition request is used for sending a data request to the target webpage link so as to acquire the fed back data.
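The task parameters can be modeled as a small data structure. The sketch below is an assumption about one possible layout, not the format used by the invention: the class names `DataAcquisitionRequest` and `CrawlerTask`, the raw record keys (`url`, `method`, `headers`, `params`) and the helper names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict
from urllib.parse import urlencode, urlparse


@dataclass
class DataAcquisitionRequest:
    method: str = "GET"
    headers: Dict[str, str] = field(default_factory=dict)
    params: Dict[str, str] = field(default_factory=dict)  # key request parameters


@dataclass
class CrawlerTask:
    target_url: str
    request: DataAcquisitionRequest


def parse_task(raw: dict) -> CrawlerTask:
    """Build a CrawlerTask from a raw task record read from the queue."""
    url = raw["url"]
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a valid web page link: {url!r}")
    request = DataAcquisitionRequest(
        method=raw.get("method", "GET").upper(),
        headers=dict(raw.get("headers", {})),
        params=dict(raw.get("params", {})),
    )
    return CrawlerTask(target_url=url, request=request)


def full_url(task: CrawlerTask) -> str:
    """Append the key request parameters to the target link as a query string."""
    if not task.request.params:
        return task.target_url
    separator = "&" if "?" in task.target_url else "?"
    return task.target_url + separator + urlencode(task.request.params)
```

Validating the link when the task is parsed means a malformed task is rejected before it reaches a downloader.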
S103, the target webpage link and the data acquisition request are distributed to one or more downloaders, so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data.
In this step, the crawler engine also monitors the shared task queue in real time through custom middleware and obtains crawler tasks from it, which ensures that the crawler service is not closed and thus guarantees its stability.
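Distribution to several downloaders can be sketched with a thread pool. This is an illustration only: `fake_download` is a hypothetical placeholder that returns canned text instead of issuing a network request, and `distribute` and `num_downloaders` are assumed names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def fake_download(url: str) -> str:
    """Placeholder downloader; a real one would issue an HTTP request."""
    return f"<html>content of {url}</html>"


def distribute(
    urls: List[str],
    download: Callable[[str], str] = fake_download,
    num_downloaders: int = 4,
) -> Dict[str, str]:
    """Hand each target link to a pool of downloaders and collect the results."""
    with ThreadPoolExecutor(max_workers=num_downloaders) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page.
        pages = list(pool.map(download, urls))
    return dict(zip(urls, pages))
```

Passing the download function as a parameter keeps the distribution logic independent of how a page is actually fetched.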
After this step, the method further comprises: the crawler engine sends a storage request containing the target data to the item pipeline, so that the item pipeline stores the target data persistently.
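Persistent storage of the target data can be illustrated with the standard-library `sqlite3` module. The sketch is an assumption about one possible storage back end; the source does not say which store is used. The class name `SQLitePipeline` and the table layout are hypothetical, and the class does not depend on Scrapy.

```python
import json
import sqlite3


class SQLitePipeline:
    """Stores each crawled item as a JSON row (illustrative only)."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, data TEXT)"
        )

    def process_item(self, item: dict) -> dict:
        """Persist one item; a repeated URL overwrites the earlier row."""
        self.conn.execute(
            "INSERT OR REPLACE INTO items (url, data) VALUES (?, ?)",
            (item["url"], json.dumps(item, ensure_ascii=False)),
        )
        self.conn.commit()
        return item

    def count(self) -> int:
        """Number of distinct URLs stored so far."""
        return self.conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Keying rows by URL means that re-crawling a page updates its record instead of creating a duplicate.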
S104, returning queue update information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server for synchronous updating.
The shared task queue of the sub-node server is obtained from the main server, and when the shared task queue of the main server is updated, the copy on the sub-node server is updated accordingly.
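The feedback and synchronization between a sub-node server and the main server can be sketched with two small in-memory classes. This is a simplified model under stated assumptions: tasks are plain strings, there is no network transport, and the names `MainServer`, `SubNodeServer`, `apply_update`, `pull` and `report` are hypothetical.

```python
from collections import deque
from typing import Deque, List


class MainServer:
    """Holds the authoritative copy of the shared task queue."""

    def __init__(self, tasks: List[str]) -> None:
        self.queue: Deque[str] = deque(tasks)

    def apply_update(self, done: List[str], added: List[str]) -> None:
        """Drop finished tasks and append newly discovered ones."""
        for task in done:
            if task in self.queue:
                self.queue.remove(task)
        for task in added:
            if task not in self.queue:
                self.queue.append(task)


class SubNodeServer:
    """Keeps a local copy of the queue obtained from the main server."""

    def __init__(self, main: MainServer) -> None:
        self.main = main
        self.queue: Deque[str] = deque()
        self.pull()

    def pull(self) -> None:
        """Refresh the local copy from the main server."""
        self.queue = deque(self.main.queue)

    def report(self, done: List[str], added: List[str]) -> None:
        """Feed queue update information back, then resynchronize."""
        self.main.apply_update(done, added)
        self.pull()
```

After one sub-node calls `report`, any other sub-node sees the same queue the next time it calls `pull`.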
This embodiment shares the task queue among all sub-nodes of the distributed crawler system, so that each sub-node can obtain the updated task queue in real time, reads and writes are fast, blocking is avoided, and the crawling speed is improved.
Example two
This embodiment adds a preset IP proxy pool on the basis of the above embodiment. A crawler process may be blocked by the anti-crawling measures of a target webpage: the website bans the IP or intercepts the request, so that the data of the target webpage cannot be acquired. In this embodiment, as shown in fig. 2, the specific steps are as follows:
S201, reading a shared task queue from a sub-node server of the distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server.
S202, determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request.
S2031, acquiring one or more first IPs from a preset IP proxy pool.
In this step, the IP proxy pool stores a number of available proxy IPs, from which the crawler engine draws a proxy IP when needed. Because the IP addresses in the pool expire quickly, the pool keeps generating new IPs through dynamic dialing to keep the proxy IP addresses fresh.
Specifically, the proxy pool includes four modules: a dialing module, a storage module, a detection module and an interface module. The IP proxy pool is created as follows. First, ADSL hosts are set up as proxy hosts, and each host must clear its own proxy from the storage module before dialing. The dialing module dials periodically to obtain a new IP address. The storage module stores the IP and port obtained by the dialing module, using the Redis hash data type to store the IP and port corresponding to each host. The detection module checks the validity of the IP: when the dialing module obtains a new IP, it is first checked whether the IP can access the external network, and if the access succeeds, the IP is an available proxy IP. After the IP and port are determined, the interface module lets the crawler service call the IP proxy pool, so that a downloader uses the proxy IP to initiate a request to the target page, which improves the stability of data acquisition.
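The storage, detection and interface modules described above can be sketched in memory. This is an illustration under stated assumptions: a Python dictionary stands in for the Redis hash, the validity check is injected as a function rather than performing a real network probe, dialing itself is not modeled, and the names `ProxyPool`, `on_redial` and `get_random` are hypothetical.

```python
import random
from typing import Callable, Dict, Optional


class ProxyPool:
    """In-memory sketch of the storage, detection and interface modules."""

    def __init__(self, is_usable: Callable[[str], bool]) -> None:
        # host name -> "ip:port", mirroring one hash field per ADSL host.
        self._proxies: Dict[str, str] = {}
        self._is_usable = is_usable

    def on_redial(self, host: str, ip: str, port: int) -> bool:
        """Called after a host redials: replace its old proxy with the new one."""
        self._proxies.pop(host, None)   # clear the stale proxy first
        proxy = f"{ip}:{port}"
        if not self._is_usable(proxy):  # detection module
            return False
        self._proxies[host] = proxy     # storage module
        return True

    def get_random(self, exclude: Optional[str] = None) -> Optional[str]:
        """Interface module: return a random proxy, optionally avoiding one."""
        candidates = [p for p in self._proxies.values() if p != exclude]
        return random.choice(candidates) if candidates else None
```

Keeping one entry per host means a redial automatically replaces that host's expired address, so the pool never hands out an address the host no longer holds.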
S2032, allocating the one or more first IPs to the one or more downloaders.
S2033, the target webpage link, the data acquisition request and the key request parameter are distributed to one or more downloaders, so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on the first IP to acquire target data.
In an alternative embodiment, as shown in fig. 3, further comprising:
S2034, judging whether the downloader has acquired the target data.
S2035, if not, obtaining a second IP different from the first IP from the IP proxy pool.
S2036, sending the second IP to the downloader, so that the downloader initiates the data acquisition request based on the second IP on a target webpage link of the Internet to acquire target data.
In steps S2034 to S2036, the downloader sends the crawler request to the Internet using an IP drawn at random from the proxy pool. If the target data is not acquired, the IP is taken to have been intercepted by a network firewall; a second IP, different from the first IP, is then obtained from the IP proxy pool and the data crawling task is executed again using the second IP.
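The switch from a first IP to a second IP can be sketched as a retry loop. The example makes several assumptions: the fetch function is supplied by the caller and signals failure by returning `None`, proxies are tried in list order rather than at random so that the behavior is deterministic, and the names `fetch_with_proxy_retry` and `max_attempts` are hypothetical.

```python
from typing import Callable, List, Optional


def fetch_with_proxy_retry(
    url: str,
    proxies: List[str],
    fetch: Callable[[str, str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try proxies one by one until fetch returns data or attempts run out."""
    tried = set()
    for _ in range(max_attempts):
        # Only consider proxies that differ from every one already tried.
        remaining = [p for p in proxies if p not in tried]
        if not remaining:
            break
        proxy = remaining[0]
        tried.add(proxy)
        data = fetch(url, proxy)
        if data is not None:
            return data
    return None
```

Tracking the proxies already tried guarantees that each retry uses an IP different from the ones that failed.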
S204, returning queue update information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server for synchronous updating.
In this embodiment, a preset IP proxy pool is added on the basis of the above embodiment. When a crawler request fails to obtain the expected return data, a new IP is obtained from the proxy pool and the request is initiated again, which avoids failing to obtain the expected data because the website's firewall has blocked the IP.
Example three
The distributed crawler system is built on a docker cluster; the docker architecture makes it more convenient to add and remove crawler services. As shown in fig. 4, the steps are as follows:
S301, reading a shared task queue from a sub-node server of the distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server.
S302, determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request.
S303, distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data.
S304, returning queue updating information to the sub-node servers so that the sub-node servers feed the updated shared task queue back to the main server and perform synchronous updating.
S3051, judging whether the target data comprises the newly added task parameters.
S3052, if yes, generating the crawler task based on the task parameters.
S3053, adding the crawler task to the shared task queue.
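Steps S3051 to S3053 can be illustrated by extracting links from a downloaded page and appending the unseen ones to the queue. This sketch assumes that the newly added task parameters are hyperlinks in HTML; the source does not restrict them to that form. The names `LinkExtractor` and `enqueue_new_tasks` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Deque, List, Set
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def enqueue_new_tasks(
    page_url: str, html: str, task_queue: Deque[str], seen: Set[str]
) -> int:
    """Add links not seen before to the task queue; return how many were added."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    added = 0
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            task_queue.append(link)
            added += 1
    return added
```

The `seen` set provides simple deduplication so that the same link is not queued twice.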
In this embodiment, the distributed crawler system is deployed through a docker architecture. Specifically, the steps include: configuring a base image for the crawler task; reading the code of the crawler service, and generating a container based on the base image and the code; and adding the container to deploy and run the crawler service. Here, modules such as the engine, downloader, pipeline, scheduler and/or spiders together form a crawler service, and docker builds containers from the base image and the crawler service code so that crawler services can be rapidly expanded or removed as needed. The crawler service covers tasks such as monitoring and reading the shared task queue, configuring the engine, downloading and acquiring data, persisting data, and/or adding or deleting entries in the target task list.
In this embodiment, a base image is configured for the crawler task through the docker architecture, which makes it simpler to expand or remove crawler services quickly and improves configuration efficiency.
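Expanding and removing crawler services can be pictured as composing docker command lines. The sketch below only builds argument lists and does not run them; the container naming scheme `crawler-<n>`, the image tag and the function names are hypothetical, and the source does not state which docker commands the system actually issues.

```python
from typing import List


def build_command(image_tag: str, context_dir: str = ".") -> List[str]:
    """docker build: bake the crawler code on top of the base image."""
    return ["docker", "build", "-t", image_tag, context_dir]


def scale_up_commands(image_tag: str, current: int, target: int) -> List[List[str]]:
    """One `docker run -d` per additional crawler container."""
    return [
        ["docker", "run", "-d", "--name", f"crawler-{i}", image_tag]
        for i in range(current, target)
    ]


def scale_down_commands(current: int, target: int) -> List[List[str]]:
    """One `docker rm -f` per container to remove, newest first."""
    return [
        ["docker", "rm", "-f", f"crawler-{i}"]
        for i in range(current - 1, target - 1, -1)
    ]
```

Each returned list could be passed to `subprocess.run` on a host where docker is installed; returning the commands instead of executing them keeps the sketch free of side effects.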
Example four
As shown in fig. 5, the present embodiment provides a distributed data acquisition system 4, which includes the following modules:
an obtaining module 401, configured to read a shared task queue from a child node server of a distributed crawler system to obtain one or more crawler tasks, where the shared task queue is obtained by the child node server from a main server. The module is further configured to: monitoring the shared task queue in real time to judge whether the crawler task exists in the shared task queue; and if so, acquiring the crawler task.
A task parameter determining module 402, configured to determine task parameters of the crawler task, where the task parameters include a target web page link and a data obtaining request.
An acquisition module 403, configured to distribute the target webpage link and the data acquisition request to one or more downloaders, so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data. The acquisition module is further configured to: acquire one or more first IPs from a preset IP proxy pool; assign the one or more first IPs to the one or more downloaders; and distribute the target webpage link, the data acquisition request and the key request parameter to one or more downloaders, so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on the first IP to acquire target data.
A synchronous update module 404, configured to return queue update information to the sub-node server, so that the sub-node server feeds the updated shared task queue back to the main server for synchronous updating.
In an alternative embodiment, as shown in fig. 6, further comprising:
an IP switching module 405, configured to: after the target webpage link, the data acquisition request and the key request parameter have been distributed to one or more downloaders so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on the first IP to acquire target data, judge whether the downloader has acquired the target data; if not, acquire a second IP different from the first IP from the IP proxy pool; and send the second IP to the downloader so that the downloader initiates the data acquisition request on the target webpage link of the Internet based on the second IP to acquire target data.
In an alternative embodiment, further comprising:
a task adding module 406, configured to: after the target webpage link and the data acquisition request have been distributed to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data, judge whether the target data includes newly added task parameters; if so, generate a crawler task based on the task parameters; and add the crawler task to the shared task queue.
The distributed data acquisition system provided by this embodiment can execute the distributed data acquisition method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example five
This embodiment provides a schematic structural diagram of a server. As shown in fig. 7, the server includes a processor 501, a memory 502, an input device 503 and an output device 504. The server may have one or more processors 501; one processor 501 is taken as an example in the figure. The processor 501, the memory 502, the input device 503 and the output device 504 in the device/terminal/server may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 7.
The memory 502 is a computer-readable storage medium that can store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the distributed data acquisition method in the embodiments of the present invention (e.g., the obtaining module 401, the task parameter determining module 402, etc.). The processor 501 performs the various functional applications and data processing of the device/terminal/server by running the software programs, instructions and modules stored in the memory 502, that is, it implements the above method.
The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 502 may further include memory located remotely from the processor 501, which may be linked to a device/terminal/server through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 503 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the server. The output device 504 may include a display device such as a display screen.
Embodiment five of the present invention provides a server that can execute the distributed data acquisition method provided in any embodiment of the present invention, and that has the functional modules and beneficial effects corresponding to the executed method.
EXAMPLE six
The sixth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the distributed data acquisition method according to any embodiment of the present invention, the method comprising:
reading a shared task queue from a sub-node server of a distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server;
determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request;
distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data;
and returning queue updating information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server and performs synchronous updating.
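As a minimal illustrative sketch, the four steps above could be arranged as follows in Python. All names here (`CrawlerTask`, `SubNodeServer`, `run_engine_once`, `fake_downloader`) are hypothetical, the shared task queue is modeled as an in-memory deque, the round-robin distribution is an assumption, and no real network access is performed; the embodiment does not prescribe any of these details.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class CrawlerTask:
    # Task parameters: a target webpage link and a data acquisition request.
    target_link: str
    request: Dict[str, str]


class SubNodeServer:
    """Hypothetical stand-in for a sub-node server holding the shared task queue."""

    def __init__(self, tasks: List[CrawlerTask]) -> None:
        # In the described system this queue would be obtained from the main server.
        self.shared_queue: Deque[CrawlerTask] = deque(tasks)
        self.updates: List[Dict[str, int]] = []

    def read_tasks(self, limit: int) -> List[CrawlerTask]:
        # Step 1: hand out up to `limit` crawler tasks from the shared task queue.
        count = min(limit, len(self.shared_queue))
        return [self.shared_queue.popleft() for _ in range(count)]

    def report_update(self, update: Dict[str, int]) -> None:
        # Step 4: record queue update information. Feeding it back to the
        # main server is outside the scope of this sketch.
        self.updates.append(update)


# Assumed downloader signature: (target link, request) -> target data.
Downloader = Callable[[str, Dict[str, str]], str]


def fake_downloader(link: str, request: Dict[str, str]) -> str:
    # Placeholder that performs no network access.
    return f"{request['method']} {link}"


def run_engine_once(node: SubNodeServer, downloaders: List[Downloader],
                    limit: int = 10) -> List[str]:
    if not downloaders:
        raise ValueError("at least one downloader is required")
    results: List[str] = []
    tasks = node.read_tasks(limit)
    for index, task in enumerate(tasks):
        # Step 2: determine the task parameters of the crawler task.
        link, request = task.target_link, task.request
        # Step 3: distribute to a downloader (round-robin is an assumption).
        downloader = downloaders[index % len(downloaders)]
        results.append(downloader(link, request))
    # Step 4: return queue update information to the sub-node server.
    node.report_update({"completed": len(tasks),
                        "remaining": len(node.shared_queue)})
    return results
```

Calling `run_engine_once(node, [fake_downloader], limit=2)` on a node holding three tasks processes the first two and leaves one in the shared queue, which is reflected in the recorded update.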
The computer-readable storage medium of the embodiments of the invention may employ any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted over any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made by those skilled in the art without departing from the scope of protection of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the invention; the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A distributed data acquisition method, executed by a sub-node crawler engine of a distributed crawler system, characterized by comprising the following steps:
reading a shared task queue from a sub-node server of a distributed crawler system to obtain one or more crawler tasks, wherein the shared task queue is obtained by the sub-node server from a main server;
determining task parameters of the crawler task, wherein the task parameters comprise a target webpage link and a data acquisition request;
distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data;
and returning queue updating information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server and performs synchronous updating.
2. The distributed data acquisition method of claim 1, wherein, if the data acquisition request includes a key request parameter, the distributing of the target webpage link and the data acquisition request to one or more downloaders comprises:
acquiring one or more first IPs from a preset IP proxy pool;
assigning the one or more first IPs to the one or more downloaders;
and distributing the target webpage link, the data acquisition request and the key request parameters to one or more downloaders, so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on the first IP to acquire target data.
3. The distributed data acquisition method according to claim 2, wherein, after the distributing of the target webpage link, the data acquisition request and the key request parameter to one or more downloaders so that each downloader initiates, based on the first IP, the data acquisition request including the key request parameter on the target webpage link of the Internet to acquire target data, the method further comprises:
judging whether the downloader has acquired the target data;
if not, acquiring a second IP different from the first IP from the IP proxy pool;
and sending the second IP to the downloader so that the downloader initiates the data acquisition request on the target webpage link of the Internet based on the second IP to acquire target data.
4. The distributed data acquisition method of claim 1, wherein the reading of the shared task queue from the sub-node server of the distributed crawler system to obtain one or more crawler tasks further comprises:
monitoring the shared task queue in real time to judge whether the crawler task exists in the shared task queue;
and if so, acquiring the crawler task.
5. The distributed data acquisition method of claim 1, wherein, after the distributing of the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire the target data, the method further comprises:
judging whether the target data comprises newly added task parameters;
if yes, generating a new crawler task based on the newly added task parameters;
and adding the new crawler task to the shared task queue.
6. The distributed data acquisition method of claim 1, wherein the distributed crawler system is based on a Docker architecture, and the method further comprises: scaling out and/or tearing down crawler services through the Docker architecture based on user requirements.
7. A distributed data acquisition system, comprising:
an acquisition module, used for reading a shared task queue from a sub-node server of a distributed crawler system to acquire one or more crawler tasks, wherein the shared task queue is acquired by the sub-node server from a main server;
the task parameter determining module is used for determining task parameters of the crawler task, and the task parameters comprise a target webpage link and a data acquisition request;
the collection module is used for distributing the target webpage link and the data acquisition request to one or more downloaders so that the downloaders initiate the data acquisition request on the target webpage link of the Internet to acquire target data;
and the synchronous updating module is used for returning queue updating information to the sub-node server so that the sub-node server feeds the updated shared task queue back to the main server and performs synchronous updating.
8. The distributed data acquisition system of claim 7, wherein the data acquisition request includes a key request parameter, and the collection module is further configured to: acquire a first IP from a preset IP proxy pool; assign the first IP to the downloader; and distribute the target webpage link, the data acquisition request and the key request parameter to one or more downloaders, so that each downloader initiates the data acquisition request on the target webpage link of the Internet based on a different first IP to acquire target data.
9. A server comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the distributed data acquisition method according to any one of claims 1 to 6.
10. A terminal readable storage medium, on which a program is stored, wherein the program, when executed by a processor, implements the distributed data acquisition method according to any one of claims 1 to 6.
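For illustration only, the proxy handling recited in claims 2 and 3 and the task discovery recited in claim 5 might be sketched in Python as below. Everything named here is hypothetical: the `Download` callable signature, the convention that a downloader returns `None` when no target data is acquired, the representation of the IP proxy pool as a list of strings, and the idea that newly added task parameters appear under a `"links"` key with a default `GET` request. The sketch also keeps trying further IPs from the pool, which generalizes the single second-IP retry recited in claim 3.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional

# Hypothetical downloader signature:
# (target link, request, proxy IP) -> target data, or None on failure.
TargetData = Dict[str, List[str]]
Download = Callable[[str, Dict[str, str], str], Optional[TargetData]]


def fetch_with_proxy_retry(link: str, request: Dict[str, str],
                           proxy_pool: List[str],
                           download: Download) -> Optional[TargetData]:
    """Try a first IP; if no target data is acquired, retry with a different IP."""
    tried = set()
    for proxy_ip in proxy_pool:
        if proxy_ip in tried:
            continue  # a replacement IP must differ from those already used
        tried.add(proxy_ip)
        data = download(link, request, proxy_ip)
        if data is not None:
            return data
    return None  # every IP in the pool failed


def enqueue_discovered_tasks(target_data: TargetData,
                             shared_queue: Deque[Dict[str, object]]) -> int:
    """Add one crawler task per newly found link; return the number added."""
    new_links = target_data.get("links", [])
    for new_link in new_links:
        # The default GET request is an assumption made for this sketch.
        shared_queue.append({"target_link": new_link,
                             "request": {"method": "GET"}})
    return len(new_links)
```

Separating the retry logic from the downloader keeps the choice of IP in the crawler engine, which matches the claims' description of the engine sending IPs to the downloader rather than the downloader choosing them itself.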
CN202011035041.6A 2020-09-27 2020-09-27 Distributed data acquisition method, system, server and storage medium Pending CN112199567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011035041.6A CN112199567A (en) 2020-09-27 2020-09-27 Distributed data acquisition method, system, server and storage medium

Publications (1)

Publication Number Publication Date
CN112199567A true CN112199567A (en) 2021-01-08

Family

ID=74007442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011035041.6A Pending CN112199567A (en) 2020-09-27 2020-09-27 Distributed data acquisition method, system, server and storage medium

Country Status (1)

Country Link
CN (1) CN112199567A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110307467A1 (en) * 2010-06-10 2011-12-15 Stephen Severance Distributed web crawler architecture
CN103856467A (en) * 2012-12-06 2014-06-11 百度在线网络技术(北京)有限公司 Method and distributed system for achieving safety scanning
CN106484886A (en) * 2016-10-17 2017-03-08 金蝶软件(中国)有限公司 A kind of method of data acquisition and its relevant device
CN109033195A (en) * 2018-06-28 2018-12-18 上海盛付通电子支付服务有限公司 The acquisition methods of webpage information obtain equipment and computer-readable medium
CN110442769A (en) * 2019-08-05 2019-11-12 深圳乐信软件技术有限公司 Distributed data crawls system, method, apparatus, equipment and storage medium
CN110929128A (en) * 2019-12-11 2020-03-27 北京启迪区块链科技发展有限公司 Data crawling method, device, equipment and medium
CN111104578A (en) * 2019-12-18 2020-05-05 北京阿尔山区块链联盟科技有限公司 Crawler system, method and server

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297449A (en) * 2021-05-21 2021-08-24 南京大学 Method and system for realizing streaming crawler
CN113282372A (en) * 2021-05-31 2021-08-20 平安国际智慧城市科技股份有限公司 Deployment method, device, equipment and storage medium of data collection cluster
CN113282372B (en) * 2021-05-31 2022-08-26 平安国际智慧城市科技股份有限公司 Deployment method, device, equipment and storage medium of data collection cluster
CN113821705A (en) * 2021-08-30 2021-12-21 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN113821705B (en) * 2021-08-30 2024-02-20 湖南大学 Webpage content acquisition method, terminal equipment and readable storage medium
CN114417200A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Network data acquisition method and device and electronic equipment
CN114417200B (en) * 2022-01-04 2023-04-14 马上消费金融股份有限公司 Network data acquisition method and device and electronic equipment
CN115174559A (en) * 2022-07-01 2022-10-11 抖音视界(北京)有限公司 Data acquisition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination