CN111092921B - Data acquisition method, device and storage medium - Google Patents

Data acquisition method, device and storage medium

Info

Publication number
CN111092921B
CN111092921B CN201811240829.3A CN201811240829A CN111092921B CN 111092921 B CN111092921 B CN 111092921B CN 201811240829 A CN201811240829 A CN 201811240829A CN 111092921 B CN111092921 B CN 111092921B
Authority
CN
China
Prior art keywords
node
data acquisition
scheduling
nodes
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811240829.3A
Other languages
Chinese (zh)
Other versions
CN111092921A (en)
Inventor
曹六一
张丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201811240829.3A
Publication of CN111092921A
Application granted
Publication of CN111092921B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a data acquisition method, a device, and a storage medium. A scheduling node sends a scheduling task to a master control node; the master control node receives running state information sent by each data acquisition node and distributes the scheduling task to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. Because the master control node manages the data acquisition nodes uniformly and balances the load across them, the responsiveness of data acquisition and the utilization rate of machine resources are improved. The data acquisition nodes can be distributed across different machine rooms, so the bandwidth and multiple IP addresses of multiple machine rooms are fully used, and dynamic expansion and contraction of the nodes are supported.

Description

Data acquisition method, device and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a data acquisition method, an apparatus, and a storage medium.
Background
The main resources that data acquisition depends on are bandwidth, IP addresses, processors, and memory. When hardware resources are relatively low in cost, the processor and the memory generally do not become the bottleneck limiting the scale of an acquisition system; the real bottleneck is usually bandwidth and IP addresses. Large-scale downloading of network content such as web pages requires sufficient network bandwidth, and a website usually limits the number of times a given IP address can access it per unit time, so large-scale, highly timely collection requires a sufficient number of IP addresses.
In existing open-source distributed acquisition frameworks, the acquisition nodes must be deployed on machines in the same machine room in order to reduce public network bandwidth usage and keep transmission efficient. Because of this deployment requirement, users with multiple machine rooms cannot conveniently exploit the bandwidth and multiple IP addresses of those machine rooms. Meanwhile, if several acquisition programs are deployed on one machine, the system resources consumed by each program cannot be controlled; if only one acquisition program is deployed per machine, machine resource utilization becomes unbalanced and resources are wasted.
Disclosure of Invention
The invention provides a data acquisition method, a data acquisition device and a storage medium, which are used for improving the resource utilization rate in the data acquisition process and realizing load balance.
The first aspect of the present invention provides a data acquisition method, which is applied to a distributed acquisition system, wherein the distributed acquisition system comprises a scheduling node, a master control node, and a plurality of data acquisition nodes, and the method comprises:
the master control node receives the scheduling tasks sent by the scheduling nodes;
the master control node receives the running state information sent by each data acquisition node;
and the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling tasks.
Further, the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to the collection task and a preset scheduling strategy.
Further, the step in which the master control node allocates the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node includes:
the master control node allocates the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information, and the running state information of each data acquisition node.
Further, the master control node is configured with at least one standby master control node, so that when the master control node fails, one standby master control node is selected from the at least one standby master control node to replace it.
A second aspect of the present invention provides a data acquisition method, which is applied to a distributed acquisition system, where the distributed acquisition system includes a scheduling node, a master control node, and a plurality of data acquisition nodes, and the method includes:
the data acquisition node acquires the running state information of the data acquisition node;
the data acquisition nodes send the running state information to the master control node, so that the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node;
and the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task.
Further, the step in which the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task includes:
after receiving the scheduling task, the data acquisition node distributes tasks to its download plug-in and analysis plug-in according to the scheduling task, so that the download plug-in and the analysis plug-in each execute their assigned tasks independently.
Further, the step in which the data acquisition node sends the running state information to the master control node includes:
the data acquisition node acquires the running state information of each download plug-in and analysis plug-in, and sends the running state information of each plug-in to the master control node at a preset period.
The third aspect of the present invention provides a master control node, which is applied to a distributed acquisition system, where the distributed acquisition system includes a scheduling node, the master control node, and a plurality of data acquisition nodes, and the master control node includes:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement: receiving a scheduling task sent by the scheduling node; receiving operation state information sent by each data acquisition node; and distributing the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node so as to enable the data acquisition nodes to execute the scheduling task.
Further, the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to the collection task and a preset scheduling strategy.
Further, when allocating the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node, the processor is configured to:
distribute the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information, and the running state information of each data acquisition node.
Further, the master control node is configured with at least one standby master control node, so that when the master control node fails, one standby master control node is selected from the at least one standby master control node to replace it.
A fourth aspect of the present invention provides a data acquisition node, which is applied to a distributed acquisition system, where the distributed acquisition system includes a scheduling node, a master control node, and a plurality of data acquisition nodes, and each data acquisition node includes:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement: acquiring running state information of the data acquisition node itself; sending the running state information to the master control node, so that the master control node allocates the scheduling task to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node; and receiving the scheduling task distributed by the master control node and executing the scheduling task.
Further, when receiving the scheduling task distributed by the master control node and executing the scheduling task, the processor is configured to:
after the scheduling task is received, distribute tasks to the download plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the download plug-in and the analysis plug-in each execute their assigned tasks independently.
Further, when sending the running state information to the master control node, the processor is configured to:
acquire the running state information of each download plug-in and each analysis plug-in, and send the running state information of each plug-in to the master control node at a preset period.
A fifth aspect of the present invention is to provide a computer-readable storage medium having stored thereon a computer program;
which when executed by a processor implements the method according to the first aspect.
A sixth aspect of the present invention is to provide a computer-readable storage medium having stored thereon a computer program;
which when executed by a processor implements the method according to the second aspect.
According to the data acquisition method, the data acquisition device and the storage medium, the scheduling task is sent to the master control node through the scheduling node, the master control node receives the running state information sent by each data acquisition node, and the scheduling task is distributed to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. According to the invention, the data acquisition nodes are uniformly managed through the master control node, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and multi-IP address advantages of multiple machine rooms are fully utilized, and the dynamic expansion and contraction of the nodes are supported.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is an architecture diagram of a distributed acquisition system according to an embodiment of the present invention;
Fig. 2 is a flowchart of a data acquisition method according to an embodiment of the present invention;
Fig. 3 is a flowchart of a data acquisition method according to another embodiment of the present invention;
Fig. 4 is a structural diagram of a master control node according to an embodiment of the present invention;
Fig. 5 is a structural diagram of a data acquisition node according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data acquisition method provided by the invention can be applied to the distributed acquisition system shown in Fig. 1. As shown in Fig. 1, the distributed acquisition system includes scheduling nodes, a master control node, and a plurality of data acquisition nodes. A scheduling node is responsible for the core task scheduling of each acquisition product: it calculates the target webpages to be downloaded according to the collection task and determines the scheduling order of each target webpage according to a specific scheduling strategy. In addition, it can decide the refresh period of a webpage from the webpage's download history, and, to ensure the success rate of webpage downloading, it applies a retry strategy whose retry interval grows over time. For example, scheduling node 1 is a news scheduling node and scheduling node 2 is an adaptive scheduling node; different scheduling programs can be used for different information sources. The master control node (SpiderService node) is the core node of the distributed acquisition system. It acquires the processing capacity information of each data acquisition node in advance, receives the running state information sent by each data acquisition node, and distributes the scheduling tasks received from the scheduling nodes to the data acquisition nodes according to a preset strategy, the processing capacity information, and the running state information of each data acquisition node.
The data acquisition nodes (SpiderProxy nodes) have downloading and analysis capabilities and can be placed in different machine rooms. As shown in Fig. 1, data acquisition nodes 1, 2, and 3 are located in a company machine room, data acquisition nodes 4, 5, and 6 on cloud servers, and data acquisition nodes 7, 8, and 9 in a client machine room, so the bandwidth and multiple IP addresses of multiple machine rooms can be fully used. After registering with the master control node, a data acquisition node makes its downloading and analysis capabilities available to the master control node, sends its running state information to the master control node at a preset period, and, after receiving a scheduling task distributed by the master control node, downloads and analyzes according to that task.
Furthermore, the master control node can be configured with at least one standby master control node, so that when the master control node fails, one of the standby master control nodes is selected to replace it. In addition, the distributed acquisition system can also include an agent node (agent), which is responsible for the high availability of the distributed acquisition system and provides an interface for querying the primary node. All master control nodes are registered in the system's reliable coordination service (zookeeper). The agent node runs an election over all master control nodes according to the node data in zookeeper: one node is elected primary and the rest are standby nodes. When the primary node fails, the agent node runs the election again and publishes the result through the interface.
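The election among registered master control nodes can be illustrated with a small in-memory sketch. It does not use zookeeper; the registry is a plain dictionary, and the `MasterRecord` type, its fields, and the rule "the live node with the smallest registration sequence wins" are assumptions made for illustration only, since the text does not state how the agent node chooses the primary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MasterRecord:
    """Registration entry for one master control node (hypothetical fields)."""
    node_id: str
    sequence: int       # registration order, assumed to be assigned by the registry
    alive: bool = True  # whether the node is currently considered healthy


def elect_primary(registry: Dict[str, MasterRecord]) -> Optional[str]:
    """Return the id of the elected primary, or None if no node is alive.

    Assumed rule: the live node with the smallest registration sequence wins.
    """
    live = [record for record in registry.values() if record.alive]
    if not live:
        return None
    return min(live, key=lambda record: record.sequence).node_id


registry = {
    "master-a": MasterRecord("master-a", sequence=1),
    "master-b": MasterRecord("master-b", sequence=2),
    "master-c": MasterRecord("master-c", sequence=3),
}
primary = elect_primary(registry)

# Simulate a failure of the primary; the agent node re-runs the election.
registry["master-a"].alive = False
new_primary = elect_primary(registry)
```

In this sketch the remaining nodes act as standbys simply by not being returned; a real deployment would rely on the coordination service to detect failures.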
The data acquisition method is described in detail below with reference to specific embodiments.
Fig. 2 is a flowchart of a data acquisition method according to an embodiment of the present invention. The embodiment provides a data acquisition method, wherein an execution main body is a master control node, and the method comprises the following specific steps:
S101, the master control node receives the scheduling task sent by the scheduling node.
In this embodiment, when the scheduling node receives a collection task, it may derive the scheduling task from the collection task and a predetermined scheduling strategy; the scheduling task may include the target webpages and the scheduling order of the target webpages. Specifically, the target webpages may be obtained from the collection task, and their scheduling order may then be obtained according to the predetermined scheduling strategy. The predetermined scheduling strategy may be, for example: tasks are processed by priority, with high-priority tasks executed first and low-priority tasks executed afterwards, and tasks of the same priority executed first come, first served. Further, the scheduling task may also include a webpage refresh period, which the scheduling node may obtain from the webpage download history; the scheduling task is executed at least once in each refresh period so that the latest data is obtained. Further, to ensure the success rate of webpage downloading, the scheduling node may add to the scheduling task a retry strategy whose interval grows over time: for example, the task is retried 2 s after the first failure, 3 s after the second failure, and so on. The interval can of course be set according to actual needs.
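The priority ordering and the growing retry interval described above can be sketched as follows. The class and function names are hypothetical, and the linear growth of the interval (2 s, 3 s, 4 s, ...) is an assumption that extrapolates the two values given in the text.

```python
import heapq
import itertools
from typing import List, Tuple


class ScheduleQueue:
    """Orders target pages by priority, then first come, first served."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, url: str, priority: int) -> None:
        # Larger priority values run first, so the priority is negated
        # for Python's min-heap.
        heapq.heappush(self._heap, (-priority, next(self._arrival), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def retry_interval(failures: int, base_seconds: float = 2.0,
                   step_seconds: float = 1.0) -> float:
    """Wait time before the next retry after `failures` consecutive failures."""
    if failures < 1:
        raise ValueError("failures must be at least 1")
    return base_seconds + step_seconds * (failures - 1)


queue = ScheduleQueue()
queue.push("http://a.example/1", priority=1)
queue.push("http://b.example/1", priority=5)
queue.push("http://c.example/1", priority=5)
order = [queue.pop() for _ in range(len(queue))]
```

Here the two priority-5 pages come out in arrival order ahead of the priority-1 page, and `retry_interval(1)` and `retry_interval(2)` give the 2 s and 3 s waits mentioned in the text.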
S102, the master control node receives the running state information sent by each data acquisition node.
In this embodiment, the data acquisition node may periodically send running state information, such as the utilization rate of physical resources of the data acquisition node, for example, the utilization rate of a memory, a CPU, and the like, to the master control node. Of course, the master control node may also actively acquire respective operation state information from each data acquisition node after receiving the scheduling task.
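As an illustration of what a running state report might carry, the sketch below serializes a small record to JSON and parses it back on the receiving side. The field names and the JSON encoding are assumptions; the text only says that utilization figures such as memory and CPU usage are reported.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunState:
    """Running state of one data acquisition node (hypothetical fields)."""
    node_id: str
    cpu_percent: float     # CPU utilization, 0 to 100
    memory_percent: float  # memory utilization, 0 to 100
    timestamp: float       # time the sample was taken, seconds since the epoch


def encode_report(state: RunState) -> str:
    """Serialize a report for sending to the master control node."""
    for value in (state.cpu_percent, state.memory_percent):
        if not 0.0 <= value <= 100.0:
            raise ValueError("utilization must be between 0 and 100")
    return json.dumps(asdict(state), sort_keys=True)


def decode_report(payload: str) -> RunState:
    """Parse a report on the master control node side."""
    return RunState(**json.loads(payload))


sample = RunState("proxy-1", cpu_percent=37.5, memory_percent=62.0,
                  timestamp=1540000000.0)
round_trip = decode_report(encode_report(sample))
```

A node would send such a payload once per preset period, or in reply to a request from the master control node.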
S103, the master control node allocates the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task.
In this embodiment, the master control node may obtain the processing capacity information of each data acquisition node in advance, for example the downloading capacity information and analysis capacity information of the data acquisition node. Each data acquisition node may report its processing capacity information when it registers with the master control node, or the master control node may actively request it from each data acquisition node. The master control node can then select data acquisition nodes for the scheduling task according to the preset strategy, the processing capacity information, and the running state information of each data acquisition node, and send the scheduling task to the selected data acquisition nodes, which execute it, for example by downloading and analyzing data according to the target webpages contained in the scheduling task and their scheduling order. This balances the load across the data acquisition nodes and improves the responsiveness of the distributed acquisition system and the utilization of machine resources.
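One way to combine processing capacity and running state when choosing a node is sketched below. The text does not specify the preset strategy, so the `NodeInfo` fields, the 90 percent utilization limits, and the rule "prefer the node with the largest share of free capacity" are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodeInfo:
    """What the master control node knows about one data acquisition node."""
    node_id: str
    capacity: int          # concurrent tasks the node can handle (from registration)
    running: int           # tasks currently running (from the latest report)
    cpu_percent: float     # latest reported CPU utilization
    memory_percent: float  # latest reported memory utilization


def pick_node(nodes: Iterable[NodeInfo], cpu_limit: float = 90.0,
              memory_limit: float = 90.0) -> Optional[NodeInfo]:
    """Choose a node for the next task, or None if every node is saturated."""
    candidates = [
        n for n in nodes
        if 0 <= n.running < n.capacity
        and n.cpu_percent < cpu_limit
        and n.memory_percent < memory_limit
    ]
    if not candidates:
        return None
    # Prefer the largest share of free capacity; break ties by lower CPU usage.
    return max(candidates,
               key=lambda n: ((n.capacity - n.running) / n.capacity, -n.cpu_percent))


nodes = [
    NodeInfo("proxy-1", capacity=10, running=8, cpu_percent=40.0, memory_percent=50.0),
    NodeInfo("proxy-2", capacity=20, running=5, cpu_percent=60.0, memory_percent=55.0),
    NodeInfo("proxy-3", capacity=10, running=0, cpu_percent=95.0, memory_percent=30.0),
]
chosen = pick_node(nodes)
```

In this example `proxy-3` is idle but excluded for high CPU usage, so the task goes to `proxy-2`, which has the larger free share of the remaining two.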
More specifically, the master control node may allocate the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information, and the running state information of each data acquisition node. Under the webpage domain name scattering strategy, two adjacent tasks must not belong to the same domain name as long as tasks for other domain names are available.
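The domain name scattering rule can be sketched as a greedy reordering: at each step, take a task from a domain other than the previous one whenever such a task exists. The function name and the choice of the domain with the most pending tasks are assumptions; the text only states the adjacency constraint.

```python
from collections import deque
from typing import Deque, Dict, List, Optional
from urllib.parse import urlparse


def scatter_by_domain(urls: List[str]) -> List[str]:
    """Reorder URLs so adjacent tasks differ in domain whenever another domain remains."""
    buckets: Dict[str, Deque[str]] = {}
    for url in urls:
        buckets.setdefault(urlparse(url).netloc, deque()).append(url)

    result: List[str] = []
    last: Optional[str] = None
    while buckets:
        # Domains other than the previous one; fall back to all domains
        # only when the previous domain is the only one left.
        choices = [d for d in buckets if d != last] or list(buckets)
        # Take from the domain with the most pending tasks (ties go to the
        # domain seen first, because max returns the first maximal element).
        domain = max(choices, key=lambda d: len(buckets[d]))
        result.append(buckets[domain].popleft())
        if not buckets[domain]:
            del buckets[domain]
        last = domain
    return result


tasks = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
    "http://c.example/1",
]
scattered = scatter_by_domain(tasks)
```

Tasks within one domain keep their original relative order; when only one domain is left, its remaining tasks are emitted consecutively, which matches the "on the premise of having other domain name tasks" condition.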
In the data acquisition method provided by this embodiment, the scheduling task is sent to the master control node by the scheduling node, the master control node receives the running state information sent by each data acquisition node, and allocates the scheduling task to the data acquisition node according to a preset strategy, the processing capability information of each data acquisition node obtained in advance, and the running state information of each data acquisition node, so that the data acquisition node executes the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and the multi-IP address advantages of multiple machine rooms are fully utilized, and dynamic expansion and contraction of the nodes are supported.
Fig. 3 is a flowchart of a data acquisition method according to another embodiment of the present invention. This embodiment provides a data acquisition method in which the execution subject is a data acquisition node, and the method comprises the following specific steps:
S201, the data acquisition node acquires the running state information of the data acquisition node.
In this embodiment, the data acquisition node may actively acquire or respond to the request of the master control node, and acquire the running state information of the data acquisition node, for example, the utilization rate of the physical resource of the data acquisition node, such as the utilization rates of the memory, the CPU, and the like. Specifically, the data acquisition node is provided with a download plug-in and an analysis plug-in, and can acquire running state information of each download plug-in and analysis plug-in.
S202, the data acquisition nodes send the running state information to the master control node, so that the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of the data acquisition nodes and the running state information of the data acquisition nodes.
In this embodiment, the data acquisition nodes may periodically send the running state information to the master control node, or the master control node may actively request the running state information from each data acquisition node after receiving the scheduling task, and the data acquisition nodes return the running state information according to the request. Specifically, the data acquisition node can send the data to the master control node through a data bus. And then the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information and the operating state information of the data acquisition nodes. The processing capability information of the data acquisition nodes can be, for example, downloading capability information, analysis capability information and the like of the data acquisition nodes, and each data acquisition node can report the respective processing capability information to the master control node when each data acquisition node registers to the master control node, or certainly, the master control node can actively request each data acquisition node.
S203, the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task.
In this embodiment, after receiving the scheduling task, the data acquisition node executes the scheduling task, for example, downloads and analyzes data according to a target webpage included in the scheduling task and a scheduling sequence of the target webpage.
Specifically, after receiving the scheduling task, the data acquisition node may allocate tasks to its download plug-ins and analysis plug-ins according to the scheduling task, so that the download plug-ins and analysis plug-ins each execute their assigned tasks independently. The data acquisition node allocates these tasks according to the processing capacity and running state of each plug-in, and keeps assigning tasks to a plug-in as long as that plug-in has the capacity to process more. In this embodiment, each plug-in of the data acquisition node may be an independent system process, and each plug-in supports hot plugging and upgrading.
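The allocation of tasks to plug-ins inside a data acquisition node can be sketched as below. The `Plugin` type, the `max_tasks` limit standing in for processing capacity, and the "least loaded relative to its limit" rule are assumptions; real plug-ins would be separate processes rather than in-memory objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plugin:
    """A download or analysis plug-in on a data acquisition node (hypothetical)."""
    name: str
    kind: str       # "download" or "parse"
    max_tasks: int  # how many tasks the plug-in can hold at once
    assigned: List[str] = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.assigned) < self.max_tasks


def dispatch(urls: List[str], plugins: List[Plugin], kind: str) -> List[str]:
    """Assign URLs to plug-ins of the given kind; return URLs that must wait."""
    pending: List[str] = []
    for url in urls:
        free = [p for p in plugins if p.kind == kind and p.has_room()]
        if not free:
            pending.append(url)
            continue
        # Pick the plug-in that is least loaded relative to its own limit.
        target = min(free, key=lambda p: len(p.assigned) / p.max_tasks)
        target.assigned.append(url)
    return pending


plugins = [
    Plugin("dl-1", "download", max_tasks=2),
    Plugin("dl-2", "download", max_tasks=1),
    Plugin("parse-1", "parse", max_tasks=2),
]
waiting = dispatch(["u1", "u2", "u3", "u4"], plugins, kind="download")
```

With these limits the two download plug-ins absorb three tasks and the fourth is returned as pending, to be assigned once a plug-in reports free capacity again.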
In the data acquisition method provided by this embodiment, the scheduling task is sent to the master control node by the scheduling node, the master control node receives the running state information sent by each data acquisition node, and allocates the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner and balances the load across them, so that the responsiveness of data acquisition and the utilization rate of machine resources are improved; the data acquisition nodes can be distributed in different machine rooms, the bandwidth and multiple IP addresses of multiple machine rooms are fully used, and dynamic expansion and contraction of the nodes are supported. In addition, because the data acquisition nodes already provide downloading capability as a distributed download cluster, only the necessary data analysis plug-ins need to be developed, which shortens the development cycle. Where resources such as memory and CPU permit, a data acquisition node can mount multiple analysis plug-ins, so each data acquisition node can offer several data downloading and data analysis capabilities.
Fig. 4 is a structural diagram of a master control node according to an embodiment of the present invention. The master control node provided in this embodiment may execute the processing flow on the master control node side of the data acquisition method described above. As shown in Fig. 4, the master control node 40 includes a memory 41 and a processor 42, and may also include a communication interface 43.
A memory 41 for storing a computer program;
a processor 42 for executing the computer program stored in the memory 41 to implement: receiving a scheduling task sent by the scheduling node; receiving operation state information sent by each data acquisition node; and distributing the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the operating state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task.
Further, the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, both of which are acquired by the scheduling node according to the acquisition task and a preset scheduling strategy.
Further, when the processor 42 allocates the scheduling task to the data acquisition nodes according to a preset policy, the pre-acquired processing capability information of each data acquisition node, and the operating state information of each data acquisition node, the processor 42 is configured to:
and distributing the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information and the running state information of each data acquisition node.
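The webpage domain name scattering strategy is defined in the claims as: two adjacent tasks cannot belong to the same domain name as long as tasks for other domain names exist. A minimal Python sketch of one way to produce such an ordering is given below. The function name `scatter_by_domain`, the use of the URL host as the "domain name", and the tie-breaking rule (take the domain with the most pending tasks) are illustrative assumptions and are not taken from the embodiment; the sketch also reorders tasks freely, whereas a real master control node would have to reconcile this with the scheduling sequence supplied by the scheduling node.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def scatter_by_domain(urls):
    """Reorder URLs so adjacent tasks differ in domain whenever another domain is pending."""
    # Group tasks per domain, keeping the original order inside each domain.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc].append(url)

    ordered, last = [], None
    while queues:
        # Prefer any domain other than the one just emitted; fall back only if none remains.
        candidates = [d for d in queues if d != last] or list(queues)
        # Assumption: draining the largest backlog first reduces forced repeats at the tail.
        domain = max(candidates, key=lambda d: len(queues[d]))
        ordered.append(queues[domain].popleft())
        if not queues[domain]:
            del queues[domain]
        last = domain
    return ordered
```

When only one domain has tasks left, the sketch emits them back to back, which matches the "as long as tasks for other domain names exist" condition.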
Further, the master control node is configured with at least one standby master control node, so that when the master control node fails, one of the standby master control nodes is selected to replace it.
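The replacement of a failed master control node can be pictured with the following Python sketch. The embodiment does not state the selection rule, so the rule used here (the earliest-registered surviving node wins) is purely an assumption, as are the `Registry` class and its method names; the in-memory dictionary merely stands in for whatever coordination system holds the node data.

```python
import itertools


class Registry:
    """In-memory stand-in for the coordination system (hypothetical)."""

    def __init__(self):
        self._seq = itertools.count()
        self._nodes = {}  # node_id -> registration sequence number

    def register(self, node_id):
        self._nodes[node_id] = next(self._seq)

    def remove(self, node_id):
        # Called when a node is detected as failed; unknown ids are ignored.
        self._nodes.pop(node_id, None)

    def elect(self):
        """Return the earliest-registered live node, or None if none is registered."""
        if not self._nodes:
            return None
        return min(self._nodes, key=self._nodes.get)
```

For example, after registering `master-1`, `standby-1` and `standby-2` in that order, `elect()` returns `master-1`; once `master-1` is removed, it returns `standby-1`.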
The master control node provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in fig. 2; its specific functions are not described herein again.
According to the master control node provided by the embodiment of the invention, the scheduling task is sent to the master control node through the scheduling node, the master control node receives the running state information sent by each data acquisition node, and the scheduling task is distributed to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and the advantages of multiple IP addresses of multiple machine rooms are fully utilized, and dynamic expansion and contraction of the nodes are supported.
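As an illustration of allocating tasks from processing capability and running state, the Python sketch below assigns each task to the healthy node with the lowest relative load. The `NodeState` fields (`capacity`, `running`, `healthy`) and the least-loaded rule are assumptions chosen for the example; the embodiment only says that a preset strategy, the processing capability information and the running state information are used.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    node_id: str
    capacity: int          # assumed meaning: maximum concurrent tasks (processing capability)
    running: int = 0       # assumed meaning: tasks in progress, from the latest status report
    healthy: bool = True   # assumed meaning: node reported in within the expected period

    @property
    def load(self):
        return self.running / self.capacity


def assign(tasks, nodes):
    """Give each task to the healthy node with the lowest relative load."""
    plan = {}
    for task in tasks:
        free = [n for n in nodes if n.healthy and n.running < n.capacity]
        if not free:
            break  # remaining tasks stay queued until a later status report
        target = min(free, key=lambda n: n.load)
        target.running += 1
        plan.setdefault(target.node_id, []).append(task)
    return plan
```

Using relative load rather than the raw task count lets nodes with different capabilities fill up proportionally, which is one plausible reading of "load balancing" here.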
Fig. 5 is a structural diagram of a data acquisition node according to an embodiment of the present invention. The data acquisition node provided in this embodiment may execute the processing flow provided in the data acquisition method embodiment on the data acquisition node side. As shown in fig. 5, the data acquisition node includes a memory 51 and a processor 52, and may further include a communication interface 53.
A memory 51 for storing a computer program;
a processor 52 for executing the computer program stored in the memory 51 to implement: acquiring running state information of the data acquisition node itself; sending the running state information to the master control node, so that the master control node allocates the scheduling task to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node; and receiving the scheduling task distributed by the master control node and executing the scheduling task.
Further, when the processor 52 receives the scheduling task assigned by the master control node and executes the scheduling task, the processor 52 is configured to:
after the scheduling task is received, distributing tasks for the downloading plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the downloading plug-in and the analysis plug-in can respectively and independently execute the distributed tasks.
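The independent execution of the download plug-in and the analysis plug-in can be sketched in Python as two workers connected by a queue. Feeding downloaded pages to the analysis plug-in through a queue is an assumption made for the example, and `run_node`, `download` and `parse` are hypothetical names; the embodiment only states that the two plug-ins execute their assigned tasks independently.

```python
import queue
import threading


def run_node(urls, download, parse):
    """Run a download worker and a parse worker independently, linked by a queue."""
    pages = queue.Queue()
    results = []

    def download_worker():
        for url in urls:
            pages.put((url, download(url)))
        pages.put(None)  # sentinel: no more pages will arrive

    def parse_worker():
        # Consume pages as they become available, without waiting for all downloads.
        while (item := pages.get()) is not None:
            url, body = item
            results.append((url, parse(body)))

    workers = [
        threading.Thread(target=download_worker),
        threading.Thread(target=parse_worker),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Because neither worker calls the other directly, a slow parser does not block downloading, and either side could be replaced by a different plug-in without changing the other.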
Further, when the processor 52 sends the running state information to the master control node, the processor 52 is configured to:
and acquiring the running state information of each downloading plug-in and each analyzing plug-in, and sending the running state information of each plug-in to the master control node in a preset period.
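A minimal Python sketch of this periodic status reporting follows. The report layout, the `send` callback and the function names are assumptions made for illustration; the embodiment does not specify a message format or transport.

```python
import time


def collect_status(node_id, plugins):
    """Build one status report covering every plug-in on this node."""
    return {
        "node_id": node_id,
        "timestamp": time.time(),
        # Each plug-in is represented by a callable returning its current state.
        "plugins": {name: get_status() for name, get_status in plugins.items()},
    }


def report_periodically(node_id, plugins, send, period_s, rounds):
    """Send a status report every `period_s` seconds, `rounds` times in total."""
    for i in range(rounds):
        send(collect_status(node_id, plugins))
        if i < rounds - 1:
            time.sleep(period_s)
```

In a real node the loop would run until shutdown rather than for a fixed number of rounds; the bound is kept here only so the example terminates.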
The data acquisition node provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in fig. 3, and specific functions are not described herein again.
According to the data acquisition node provided by the embodiment of the invention, the scheduling node sends the scheduling task to the master control node; the master control node receives the running state information sent by each data acquisition node and distributes the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner and balances the load across them, which improves the responsiveness of data acquisition and the utilization of machine resources. The data acquisition nodes can be distributed across different machine rooms, making full use of the bandwidth and the multiple IP addresses of those machine rooms, and nodes can be added or removed dynamically. In addition, the data acquisition nodes provide the downloading capability and together form a distributed download cluster, so only the necessary data analysis plug-ins need to be developed, which shortens the development cycle. Resources such as memory and CPU permitting, a data acquisition node can mount multiple analysis plug-ins, and each data acquisition node can have multiple data downloading and data analysis capabilities.
In addition, the present embodiment further provides a computer-readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the data acquisition method at the master control node side described in the foregoing embodiment, and the implementation principle and the technical effect are similar, and are not described herein again.
In addition, the present embodiment further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the data acquisition method on the data acquisition node side in the foregoing embodiment, and the implementation principle and the technical effect are similar, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A data acquisition method is characterized by being applied to a distributed acquisition system, wherein the distributed acquisition system comprises a plurality of scheduling nodes, a master control node, a plurality of data acquisition nodes and an agent node; the plurality of data acquisition nodes are arranged in different machine rooms; the method comprises the following steps:
the master control node receives the scheduling tasks sent by the scheduling nodes; the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to an acquisition task and a preset scheduling strategy;
the master control node receives the running state information sent by each data acquisition node;
the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling tasks;
the agent node registers all the master control nodes in a reliable coordination system of the distributed acquisition system and then performs an election among the master control nodes according to node data in the reliable coordination system;
the method for distributing the scheduling tasks to the data acquisition nodes by the master control node according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node comprises the following steps:
the master control node distributes the scheduling tasks to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information and the running state information of each data acquisition node; under the webpage domain name scattering strategy, two adjacent tasks cannot be tasks for the same domain name as long as tasks for other domain names exist.
2. A method according to claim 1, wherein the master control node is configured with at least one standby master control node, so that one standby master control node is selected from the at least one standby master control node to replace the master control node in the event of a failure of the master control node.
3. A data acquisition method is characterized by being applied to a distributed acquisition system, wherein the distributed acquisition system comprises a plurality of scheduling nodes, a master control node, a plurality of data acquisition nodes and an agent node; the plurality of data acquisition nodes are arranged in different machine rooms; the method comprises the following steps:
the data acquisition node acquires the running state information of the data acquisition node;
the data acquisition nodes send the running state information to the master control node, so that the master control node allocates scheduling tasks to the data acquisition nodes according to a webpage domain name scattering strategy, pre-acquired processing capacity information of the data acquisition nodes and the running state information of the data acquisition nodes; under the webpage domain name scattering strategy, two adjacent tasks cannot be tasks for the same domain name as long as tasks for other domain names exist; the scheduling task is a task which is sent to the master control node by the scheduling node and comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are obtained by the scheduling node according to an acquisition task and a preset scheduling strategy;
the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task;
and the agent node registers all the master control nodes in the reliable coordination system of the distributed acquisition system and then performs an election among the master control nodes according to the node data in the reliable coordination system.
4. The method according to claim 3, wherein the data acquisition node receives the scheduling task distributed by the general control node and executes the scheduling task, and the method comprises the following steps:
after the data acquisition node receives the scheduling task, distributing tasks for the download plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the download plug-in and the analysis plug-in can respectively and independently execute the distributed tasks.
5. The method of claim 4, wherein the data collection node sends the operating state information to the general control node, and the method comprises:
and the data acquisition node acquires the running state information of each download plug-in and the analysis plug-in, and sends the running state information of each plug-in to the master control node in a preset period.
6. The general control node is applied to a distributed acquisition system, wherein the distributed acquisition system comprises a plurality of scheduling nodes, a general control node, a plurality of data acquisition nodes and agent nodes; the plurality of data acquisition nodes are arranged in different machine rooms; the total control node comprises:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement: receiving a scheduling task sent by the scheduling node; receiving running state information sent by each data acquisition node; distributing the scheduling task to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task; the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to an acquisition task and a preset scheduling strategy;
the agent node registers all the master control nodes in a reliable coordination system of the distributed acquisition system and then performs an election among the master control nodes according to node data in the reliable coordination system;
when the processor allocates the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the operating state information of each data acquisition node, the processor is configured to:
distributing the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information and the running state information of each data acquisition node; under the webpage domain name scattering strategy, two adjacent tasks cannot be tasks for the same domain name as long as tasks for other domain names exist.
7. A master control node according to claim 6, wherein the master control node is configured with at least one standby master control node, so that one standby master control node is selected from the at least one standby master control node to replace the master control node in the event of a failure of the master control node.
8. A data acquisition node is applied to a distributed acquisition system, wherein the distributed acquisition system comprises a plurality of scheduling nodes, a master control node, a plurality of data acquisition nodes and an agent node; the plurality of data acquisition nodes are arranged in different machine rooms; the data acquisition node comprises:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement: acquiring running state information of the data acquisition node itself; sending the running state information to the master control node, so that the master control node allocates a scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node; under the webpage domain name scattering strategy, two adjacent tasks cannot be tasks for the same domain name as long as tasks for other domain names exist; the scheduling task is a task which is sent to the master control node by the scheduling node and comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are obtained by the scheduling node according to an acquisition task and a preset scheduling strategy; receiving the scheduling task distributed by the master control node and executing the scheduling task;
and the agent node registers all the master control nodes in the reliable coordination system of the distributed acquisition system and then performs an election among the master control nodes according to the node data in the reliable coordination system.
9. The data acquisition node of claim 8, wherein when the processor receives the scheduling task assigned by the master control node and executes the scheduling task, the processor is configured to:
after the scheduling task is received, distributing tasks for the downloading plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the downloading plug-in and the analysis plug-in can respectively and independently execute the distributed tasks.
10. The data acquisition node of claim 9, wherein when the processor sends the running state information to the master control node, the processor is configured to:
and acquiring the running state information of each downloading plug-in and each analyzing plug-in, and sending the running state information of each plug-in to the master control node in a preset period.
11. A computer-readable storage medium, having stored thereon a computer program;
the computer program, when executed by a processor, implements the method of claim 1 or 2.
12. A computer-readable storage medium, having stored thereon a computer program;
the computer program, when executed by a processor, implementing the method of any one of claims 3-4.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811240829.3A CN111092921B (en) 2018-10-24 2018-10-24 Data acquisition method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811240829.3A CN111092921B (en) 2018-10-24 2018-10-24 Data acquisition method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111092921A CN111092921A (en) 2020-05-01
CN111092921B true CN111092921B (en) 2022-05-10

Family

ID=70392420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811240829.3A Active CN111092921B (en) 2018-10-24 2018-10-24 Data acquisition method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111092921B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708624B (en) * 2020-06-16 2023-09-29 北京百度网讯科技有限公司 Concurrency allocation method, device, equipment and storage medium based on multiple transmitters
CN111885159B (en) * 2020-07-22 2022-06-14 曙光信息产业(北京)有限公司 Data acquisition method and device, electronic equipment and storage medium
CN112765121A (en) * 2021-01-08 2021-05-07 北京虹信万达科技有限公司 Administration and application system based on big data service
CN112905336A (en) * 2021-02-04 2021-06-04 深圳融安网络科技有限公司 Data acquisition method, device, equipment and storage medium
CN112637368B (en) * 2021-03-10 2021-05-14 江苏金恒信息科技股份有限公司 Distributed industrial data acquisition system and method

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040040115A (en) * 2002-11-06 2004-05-12 주식회사넷꼬모 Network-based question and answer service method and its system capable of load balancing
US7774782B1 (en) * 2003-12-18 2010-08-10 Google Inc. Limiting requests by web crawlers to a web host
CN102073683A (en) * 2010-12-22 2011-05-25 四川大学 Distributed real-time news information acquisition system
CN102339290A (en) * 2010-07-22 2012-02-01 北大方正集团有限公司 Method and device for directionally acquiring webpage data information
CN103310012A (en) * 2013-07-02 2013-09-18 北京航空航天大学 Distributed web crawler system
CN103631922A (en) * 2013-12-03 2014-03-12 南通大学 Hadoop cluster-based large-scale Web information extraction method and system
CN103870329A (en) * 2014-03-03 2014-06-18 同济大学 Distributed crawler task scheduling method based on weighted round-robin algorithm
CN105447088A (en) * 2015-11-06 2016-03-30 杭州掘数科技有限公司 Volunteer computing based multi-tenant professional cloud crawler
CN106897126A (en) * 2015-12-21 2017-06-27 北京奇虎科技有限公司 A kind of picture grasping means and server
CN107066569A (en) * 2017-04-07 2017-08-18 武汉大学 A kind of method of distributed network crawler system and information crawler
CN107203623A (en) * 2017-05-26 2017-09-26 山东省科学院情报研究所 The load balancing adjusting method of network crawler system
CN108205541A (en) * 2016-12-16 2018-06-26 北大方正集团有限公司 The dispatching method and device of distributed network reptile task

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101370024B (en) * 2007-08-15 2012-10-31 北京灵图软件技术有限公司 Distributed information collection method and system
CN105338028B (en) * 2014-07-30 2018-12-07 浙江宇视科技有限公司 Main and subordinate node electoral machinery and device in a kind of distributed server cluster
CN106202108B (en) * 2015-05-06 2019-09-06 阿里巴巴集团控股有限公司 Web crawlers grabs method for allocating tasks and device and data grab method and device
CN104965933B (en) * 2015-07-30 2018-12-25 北京奇虎科技有限公司 Distribution method, distributor and the URL detection system of URL Detection task
CN105095463B (en) * 2015-07-30 2018-09-11 北京奇虎科技有限公司 Visiting method, the apparatus and system of material chained address
CN106021005B (en) * 2016-05-10 2019-01-22 北京金山安全软件有限公司 Method and device for providing application service and electronic equipment
CN106126346B (en) * 2016-07-05 2019-02-26 东北大学 A kind of large-scale distributed data collection system and method
CN106375342A (en) * 2016-10-21 2017-02-01 用友网络科技股份有限公司 Zookeeper-technology-based system cluster method and system
CN106980678A (en) * 2017-03-30 2017-07-25 温馨港网络信息科技(苏州)有限公司 Data analysing method and system based on zookeeper technologies

Also Published As

Publication number Publication date
CN111092921A (en) 2020-05-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230609

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.