Disclosure of Invention
The invention provides a data acquisition method, a data acquisition device and a storage medium, which are used for improving the resource utilization rate in the data acquisition process and realizing load balance.
The first aspect of the present invention provides a data acquisition method, which is applied to a distributed acquisition system, wherein the distributed acquisition system comprises a scheduling node, a master control node, and a plurality of data acquisition nodes, and the method comprises:
the master control node receives the scheduling tasks sent by the scheduling nodes;
the master control node receives the running state information sent by each data acquisition node;
and the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling tasks.
Further, the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to the collection task and a preset scheduling strategy.
Further, the step of allocating, by the master control node, the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node, and the running state information of each data acquisition node, includes:
and the master control node allocates the scheduling tasks to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information and the running state information of each data acquisition node.
Further, the master control node is configured with at least one standby master control node, so that when the master control node fails, one standby master control node is selected from the at least one standby master control node to replace it.
A second aspect of the present invention provides a data acquisition method, which is applied to a distributed acquisition system, where the distributed acquisition system includes a scheduling node, a master control node, and a plurality of data acquisition nodes, and the method includes:
the data acquisition node acquires the running state information of the data acquisition node;
the data acquisition nodes send the running state information to the master control node, so that the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node;
and the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task.
Further, the step in which the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task includes:
after the data acquisition node receives the scheduling task, distributing tasks for the download plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the download plug-in and the analysis plug-in can respectively and independently execute the distributed tasks.
Further, the step in which the data acquisition node sends the running state information to the master control node includes:
the data acquisition node acquires the running state information of each download plug-in and each analysis plug-in, and sends the running state information of each plug-in to the master control node at a preset period.
A third aspect of the present invention provides a master control node, which is applied to a distributed acquisition system, where the distributed acquisition system includes a scheduling node, the master control node, and a plurality of data acquisition nodes, and the master control node includes:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement: receiving a scheduling task sent by the scheduling node; receiving operation state information sent by each data acquisition node; and distributing the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node so as to enable the data acquisition nodes to execute the scheduling task.
Further, the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to the collection task and a preset scheduling strategy.
Further, when the processor allocates the scheduling task to the data acquisition nodes according to a preset policy, the pre-acquired processing capability information of each data acquisition node, and the operating state information of each data acquisition node, the processor is configured to:
and distributing the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information and the running state information of each data acquisition node.
Further, the master control node is configured with at least one standby master control node, so that when the master control node fails, one standby master control node is selected from the at least one standby master control node to replace it.
A fourth aspect of the present invention provides a data acquisition node, which is applied to a distributed acquisition system, where the distributed acquisition system includes a scheduling node, a master control node, and a plurality of data acquisition nodes, and each data acquisition node includes:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement: acquiring running state information of the data acquisition node itself; sending the running state information to the master control node, so that the master control node allocates the scheduling task to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node; and receiving the scheduling task distributed by the master control node and executing the scheduling task.
Further, when the processor receives the scheduling task distributed by the master control node and executes the scheduling task, the processor is configured to:
after the scheduling task is received, distributing tasks for the downloading plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the downloading plug-in and the analysis plug-in can respectively and independently execute the distributed tasks.
Further, when the processor sends the running state information to the master control node, the processor is configured to:
and acquiring the running state information of each downloading plug-in and each analyzing plug-in, and sending the running state information of each plug-in to the master control node in a preset period.
A fifth aspect of the present invention is to provide a computer-readable storage medium having stored thereon a computer program;
which when executed by a processor implements the method according to the first aspect.
A sixth aspect of the present invention is to provide a computer-readable storage medium having stored thereon a computer program;
which when executed by a processor implements the method according to the second aspect.
According to the data acquisition method, the data acquisition device and the storage medium, the scheduling task is sent to the master control node through the scheduling node, the master control node receives the running state information sent by each data acquisition node, and the scheduling task is distributed to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. According to the invention, the data acquisition nodes are uniformly managed through the master control node, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and multi-IP address advantages of multiple machine rooms are fully utilized, and the dynamic expansion and contraction of the nodes are supported.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The data acquisition method provided by the invention can be applied to the distributed acquisition system shown in Fig. 1. As shown in Fig. 1, the distributed acquisition system includes a scheduling node, a master control node, and a plurality of data acquisition nodes. The scheduling node handles the core scheduling work for each collection product: it calculates the target webpages to be downloaded according to the collection task and determines the scheduling sequence of each target webpage according to a specific scheduling strategy. In addition, it can decide the refresh cycle of a webpage according to the download history of that webpage and, in order to ensure the success rate of webpage downloading, it applies a retry strategy based on a time growth characteristic. Different scheduling programs can be used for different information sources; for example, scheduling node 1 is a news scheduling node and scheduling node 2 is an adaptive scheduling node. The master control node (SpiderService node) is the core node of the distributed acquisition system. It acquires the processing capacity information of each data acquisition node in advance, receives the running state information sent by each data acquisition node, and distributes the scheduling tasks received from the scheduling node to the data acquisition nodes according to a preset strategy, the processing capacity information of each data acquisition node, and the running state information.
The data acquisition nodes (SpiderProxy nodes) have downloading and analyzing capabilities and can be deployed in different machine rooms. As shown in Fig. 1, data acquisition nodes 1, 2 and 3 are deployed in a company machine room, data acquisition nodes 4, 5 and 6 on a cloud server, and data acquisition nodes 7, 8 and 9 in a client machine room, so that the bandwidth and multi-IP address advantages of multiple machine rooms can be fully utilized. After registering with the master control node, a data acquisition node provides its downloading and analyzing capabilities to the master control node, sends its running state information to the master control node at a preset period, and, after receiving a scheduling task distributed by the master control node, performs downloading and analysis according to that scheduling task.
Furthermore, the master control node can be configured with at least one standby master control node, so that when the master control node fails, one of the standby master control nodes is selected to replace it. In addition, the distributed acquisition system can also include an agent node (agent), which is responsible for the high availability of the distributed acquisition system and provides an interface for querying the main node. All master control nodes are registered in a reliable coordination system (ZooKeeper) of the distributed acquisition system. The agent node runs an election among all master control nodes according to the node data in ZooKeeper: one node is elected as the main node and the rest are standby nodes. When the main node fails, the agent node runs the election again and publishes the election result through the interface.
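As a non-limiting illustration, the election of a main node by the agent node may be sketched in Python as follows. The sketch does not use a real ZooKeeper client; an in-memory dictionary stands in for the coordination system. The names `MasterRegistration`, `AgentNode`, `mark_failed` and `query_main`, as well as the rule that the earliest-registered live node becomes the main node, are assumptions made for illustration only, since the election rule itself is not specified above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MasterRegistration:
    """Hypothetical registry entry for one master control node."""
    node_id: str
    sequence: int       # assumed registration order; smaller means registered earlier
    alive: bool = True


class AgentNode:
    """Toy agent node; a plain dict replaces the coordination system."""

    def __init__(self) -> None:
        self.registry: Dict[str, MasterRegistration] = {}
        self.main: Optional[str] = None

    def register(self, node_id: str) -> None:
        self.registry[node_id] = MasterRegistration(node_id, len(self.registry))
        self.elect()

    def mark_failed(self, node_id: str) -> None:
        self.registry[node_id].alive = False
        if node_id == self.main:
            self.elect()  # re-elect only when the main node itself fails

    def elect(self) -> None:
        alive = [r for r in self.registry.values() if r.alive]
        # Assumed rule: the earliest-registered live node becomes the main node.
        self.main = min(alive, key=lambda r: r.sequence).node_id if alive else None

    def query_main(self) -> Optional[str]:
        """Interface through which other nodes look up the current main node."""
        return self.main
```

In this sketch a failure of a standby node leaves the main node unchanged, while a failure of the main node triggers a new election among the remaining live nodes.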
The data acquisition process is described in detail below with reference to specific embodiments.
Fig. 2 is a flowchart of a data acquisition method according to an embodiment of the present invention. The embodiment provides a data acquisition method, wherein an execution main body is a master control node, and the method comprises the following specific steps:
S101, the master control node receives the scheduling task sent by the scheduling node.
In this embodiment, when the scheduling node receives a collection task, it may obtain the scheduling task according to the collection task and a predetermined scheduling policy, and the scheduling task may include a target webpage and a scheduling order of the target webpage. Specifically, the target webpage may be obtained according to the collection task, and the scheduling order of the target webpage may then be obtained according to the predetermined scheduling policy. The predetermined scheduling policy may be: tasks are processed by priority, with higher-priority tasks executed first and lower-priority tasks executed afterwards, and tasks of the same priority are executed on a first-come, first-served basis. Further, the scheduling task may also include a webpage refresh period, which the scheduling node may obtain according to the webpage download history; specifically, the scheduling task is executed at least once within each refresh period, so as to obtain the latest data. Further, in order to ensure the success rate of webpage downloading, the scheduling node may add to the scheduling task a retry strategy based on a time growth characteristic: for example, the task is retried after an interval of 2 s following the first failure, after an interval of 3 s following the second failure, and so on. The interval may of course be set according to actual needs.
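By way of example only, the priority-based, first-come-first-served ordering and the growing retry interval described in this embodiment may be sketched in Python as follows. The names `SchedulingQueue` and `retry_interval` are hypothetical, a larger number is assumed to mean a higher priority, and the linear continuation of the intervals beyond the 2 s and 3 s given above (4 s, 5 s, ...) is an assumption.

```python
import heapq
import itertools
from typing import Iterator, List, Tuple


class SchedulingQueue:
    """Orders target webpages by priority, then by arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = itertools.count()  # increasing tie-breaker for equal priorities

    def add(self, url: str, priority: int) -> None:
        # Negate the priority so the largest value pops first from the min-heap.
        heapq.heappush(self._heap, (-priority, next(self._arrival), url))

    def drain(self) -> Iterator[str]:
        while self._heap:
            yield heapq.heappop(self._heap)[2]


def retry_interval(failure_count: int,
                   base_seconds: float = 2.0,
                   step_seconds: float = 1.0) -> float:
    """Wait before the next retry: 2 s after the first failure, 3 s after the second."""
    if failure_count < 1:
        raise ValueError("failure_count must be at least 1")
    return base_seconds + step_seconds * (failure_count - 1)
```

Both `base_seconds` and `step_seconds` are parameters, which reflects the statement that the interval may be set according to actual needs.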
S102, the master control node receives the running state information sent by each data acquisition node.
In this embodiment, the data acquisition node may periodically send running state information to the master control node, for example the utilization rate of its physical resources such as memory and CPU. Of course, the master control node may also actively acquire the running state information from each data acquisition node after receiving the scheduling task.
S103, the master control node allocates the scheduling tasks to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling tasks.
In this embodiment, the master control node may obtain the processing capacity information of each data acquisition node in advance, for example the downloading capacity information and analysis capacity information of the data acquisition node. Each data acquisition node may report its processing capacity information to the master control node when it registers with the master control node, or the master control node may actively request this information from each data acquisition node. Furthermore, the master control node can select data acquisition nodes for the scheduling task according to a preset strategy, the processing capacity information and the running state information of each data acquisition node, and send the scheduling task to the selected data acquisition nodes, which then execute it, for example by downloading and analyzing data according to the target webpage contained in the scheduling task and the scheduling sequence of the target webpage. In this way the load is balanced across the data acquisition nodes, and the response capacity of the distributed acquisition system and the utilization rate of machine resources are improved.
More specifically, the master control node may allocate the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information of each data acquisition node, and the running state information. Under the webpage domain name scattering strategy, two adjacent tasks must not belong to the same domain name as long as tasks for other domain names are available.
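For illustration, one possible way of combining processing capacity information and running state information when allocating tasks is sketched below in Python. The scoring rule in `headroom` (free task slots multiplied by the idle share of the busier of CPU and memory) and the names `NodeInfo`, `headroom` and `allocate` are assumptions of this sketch; the embodiment does not prescribe a particular formula.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeInfo:
    node_id: str
    capacity: int         # assumed: maximum concurrent tasks, reported at registration
    cpu_usage: float      # 0.0 to 1.0, from the latest running state report
    memory_usage: float   # 0.0 to 1.0, from the latest running state report
    assigned: int = 0     # tasks assigned during this allocation round


def headroom(node: NodeInfo) -> float:
    """Assumed score: free task slots scaled by the idle share of the busier resource."""
    free_slots = node.capacity - node.assigned
    idle = 1.0 - max(node.cpu_usage, node.memory_usage)
    return free_slots * idle


def allocate(tasks: List[str], nodes: List[NodeInfo]) -> Dict[str, List[str]]:
    """Assign each task to the node that currently has the most headroom."""
    plan: Dict[str, List[str]] = {n.node_id: [] for n in nodes}
    for task in tasks:
        best = max(nodes, key=headroom)
        if headroom(best) <= 0:
            break  # every node is saturated; remaining tasks wait for a later round
        plan[best.node_id].append(task)
        best.assigned += 1
    return plan
```

Because the score is recomputed after every assignment, a node with a large capacity but a busy CPU receives tasks at a slower rate than an idle node, which is one simple way of approximating a balanced load.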
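The webpage domain name scattering strategy may, for example, be sketched in Python as follows. The function name `scatter_by_domain` is hypothetical, and the choice of always taking the next task from the domain name with the most remaining tasks is an assumption of this sketch; the strategy itself only requires that adjacent tasks differ in domain name whenever tasks for another domain name exist.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional
from urllib.parse import urlsplit


def scatter_by_domain(urls: List[str]) -> List[str]:
    """Reorder URLs so adjacent entries differ in host name whenever possible."""
    queues: Dict[str, Deque[str]] = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname or ""].append(url)

    ordered: List[str] = []
    last: Optional[str] = None
    while queues:
        # Prefer any domain other than the one just emitted; fall back only
        # when the previous domain is the only one that still has tasks.
        candidates = [d for d in queues if d != last] or list(queues)
        domain = max(candidates, key=lambda d: len(queues[d]))
        ordered.append(queues[domain].popleft())
        if not queues[domain]:
            del queues[domain]
        last = domain
    return ordered
```

When only tasks of a single domain name remain, the sketch emits them consecutively, which corresponds to the proviso that the restriction applies only while other domain name tasks are available.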
In the data acquisition method provided by this embodiment, the scheduling task is sent to the master control node by the scheduling node, the master control node receives the running state information sent by each data acquisition node, and allocates the scheduling task to the data acquisition node according to a preset strategy, the processing capability information of each data acquisition node obtained in advance, and the running state information of each data acquisition node, so that the data acquisition node executes the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and the multi-IP address advantages of multiple machine rooms are fully utilized, and dynamic expansion and contraction of the nodes are supported.
Fig. 3 is a flowchart of a data acquisition method according to an embodiment of the present invention. The embodiment provides a data acquisition method, wherein an execution main body is a data acquisition node, and the method comprises the following specific steps:
S201, the data acquisition node acquires its own running state information.
In this embodiment, the data acquisition node may acquire its running state information either on its own initiative or in response to a request from the master control node, for example the utilization rate of its physical resources such as memory and CPU. Specifically, the data acquisition node is provided with download plug-ins and analysis plug-ins, and can acquire the running state information of each download plug-in and each analysis plug-in.
S202, the data acquisition nodes send the running state information to the master control node, so that the master control node distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of the data acquisition nodes and the running state information of the data acquisition nodes.
In this embodiment, the data acquisition nodes may periodically send the running state information to the master control node, or the master control node may actively request the running state information from each data acquisition node after receiving the scheduling task, in which case the data acquisition nodes return the running state information in response to the request. Specifically, the data acquisition node can send the running state information to the master control node through a data bus. The master control node then distributes the scheduling tasks to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information and the running state information of the data acquisition nodes. The processing capacity information of a data acquisition node can be, for example, its downloading capacity information and analysis capacity information. Each data acquisition node may report its processing capacity information to the master control node when it registers with the master control node, or the master control node may actively request this information from each data acquisition node.
S203, the data acquisition node receives the scheduling task distributed by the master control node and executes the scheduling task.
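Purely as an illustration, periodic reporting of running state information may be sketched in Python as follows. The message fields, the JSON encoding, the 5-second default period and the names `RunningState` and `StateReporter` are assumptions; the `sample` and `send` callables are hypothetical placeholders for whatever resource probe and data bus a real node would use.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Tuple


@dataclass
class RunningState:
    node_id: str
    cpu_usage: float      # fraction of CPU in use, 0.0 to 1.0
    memory_usage: float   # fraction of memory in use, 0.0 to 1.0
    timestamp: float


class StateReporter:
    """Samples resource usage and pushes it to the master control node."""

    def __init__(
        self,
        node_id: str,
        sample: Callable[[], Tuple[float, float]],   # returns (cpu, memory)
        send: Callable[[str], None],                  # stands in for the data bus
        period_seconds: float = 5.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.node_id = node_id
        self._sample = sample
        self._send = send
        self._period = period_seconds
        self._sleep = sleep

    def build_report(self) -> str:
        cpu, memory = self._sample()
        state = RunningState(self.node_id, cpu, memory, time.time())
        return json.dumps(asdict(state))

    def run(self, cycles: int) -> None:
        """Send one report per period for the given number of cycles."""
        for _ in range(cycles):
            self._send(self.build_report())
            self._sleep(self._period)
```

The same `build_report` method could equally be called when the master control node actively requests the running state information, instead of from the periodic loop.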
In this embodiment, after receiving the scheduling task, the data acquisition node executes the scheduling task, for example, downloads and analyzes data according to a target webpage included in the scheduling task and a scheduling sequence of the target webpage.
Specifically, after receiving the scheduling task, the data acquisition node may allocate tasks to its download plug-ins and analysis plug-ins according to the scheduling task, so that the download plug-ins and the analysis plug-ins execute the allocated tasks independently of one another. In particular, the data acquisition node allocates tasks according to the processing capacity and running state of each plug-in, and continues to allocate tasks to a plug-in as long as that plug-in has the capacity to process more tasks. In this embodiment, each plug-in of the data acquisition node may be an independent system process, and each plug-in supports hot plugging and upgrading.
In the data acquisition method provided by this embodiment, the scheduling task is sent to the master control node by the scheduling node, the master control node receives the running state information sent by each data acquisition node, and allocates the scheduling task to the data acquisition node according to a preset strategy, the processing capability information of each data acquisition node obtained in advance, and the running state information of each data acquisition node, so that the data acquisition node executes the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and the advantages of multiple IP addresses of multiple machine rooms are fully utilized, and dynamic expansion and contraction of the nodes are supported. In addition, the data acquisition nodes provide the downloading capacity and together form a distributed downloading cluster, so that only the necessary data analysis plug-ins need to be developed, which shortens the research and development period. As long as resources such as memory and CPU permit, a data acquisition node can mount multiple analysis plug-ins, so each data acquisition node can have multiple data downloading and data analysis capabilities.
Fig. 4 is a structural diagram of a master control node according to an embodiment of the present invention. The master control node provided in this embodiment may execute the processing flow provided in the above data acquisition method embodiment on the master control node side. As shown in Fig. 4, the master control node 40 includes a memory 41 and a processor 42, and may also include a communication interface 43.
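As a minimal sketch of the allocation of tasks to plug-ins within one data acquisition node, the following Python code may be considered. The names `Plugin` and `assign`, the `max_tasks` field and the least-loaded selection rule are assumptions made for illustration; in a real node each plug-in may be an independent system process rather than an in-memory object.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Plugin:
    name: str
    kind: str                 # "download" or "analysis"
    max_tasks: int            # assumed processing capacity of the plug-in
    running: List[str] = field(default_factory=list)

    def has_capacity(self) -> bool:
        return len(self.running) < self.max_tasks


def assign(task: str, kind: str, plugins: List[Plugin]) -> Optional[Plugin]:
    """Give the task to the least-loaded plug-in of the requested kind, if any has room."""
    candidates = [p for p in plugins if p.kind == kind and p.has_capacity()]
    if not candidates:
        return None  # caller keeps the task queued until a plug-in frees up
    chosen = min(candidates, key=lambda p: len(p.running) / p.max_tasks)
    chosen.running.append(task)
    return chosen
```

A return value of `None` indicates that no plug-in of the requested kind currently has spare capacity, so allocation to that kind of plug-in resumes only after a running task finishes.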
A memory 41 for storing a computer program;
a processor 42 for executing the computer program stored in the memory 41 to implement: receiving a scheduling task sent by the scheduling node; receiving operation state information sent by each data acquisition node; and distributing the scheduling task to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the operating state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task.
Further, the scheduling task comprises a target webpage and a scheduling sequence of the target webpage, and the target webpage and its scheduling sequence are acquired by the scheduling node according to the collection task and a preset scheduling strategy.
Further, when the processor 42 allocates the scheduling task to the data acquisition nodes according to a preset policy, the pre-acquired processing capability information of each data acquisition node, and the operating state information of each data acquisition node, the processor 42 is configured to:
and distributing the scheduling task to the data acquisition nodes according to a webpage domain name scattering strategy, the processing capacity information and the running state information of each data acquisition node.
Further, the master control node is configured with at least one standby master control node, so that when the master control node fails, one standby master control node is selected from the at least one standby master control node to replace it.
The master control node provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in Fig. 2, and specific functions are not described herein again.
According to the master control node provided by the embodiment of the invention, the scheduling task is sent to the master control node through the scheduling node, the master control node receives the running state information sent by each data acquisition node, and the scheduling task is distributed to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and the advantages of multiple IP addresses of multiple machine rooms are fully utilized, and dynamic expansion and contraction of the nodes are supported.
Fig. 5 is a structural diagram of a data acquisition node according to an embodiment of the present invention. The data collection node provided in this embodiment may execute the processing flow provided in the data collection method embodiment on the data collection node side, as shown in fig. 5, where the data collection node includes a memory 51 and a processor 52. A communication interface 53 may also be included.
A memory 51 for storing a computer program;
a processor 52 for executing the computer program stored in the memory 51 to implement: acquiring running state information of the data acquisition node itself; sending the running state information to the master control node, so that the master control node allocates the scheduling task to the data acquisition nodes according to a preset strategy, pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node; and receiving the scheduling task distributed by the master control node and executing the scheduling task.
Further, when the processor 52 receives the scheduling task assigned by the master control node and executes the scheduling task, the processor 52 is configured to:
after the scheduling task is received, distributing tasks for the downloading plug-in and the analysis plug-in of the data acquisition node according to the scheduling task, so that the downloading plug-in and the analysis plug-in can respectively and independently execute the distributed tasks.
Further, when the processor 52 sends the running state information to the master control node, the processor 52 is configured to:
and acquiring the running state information of each downloading plug-in and each analyzing plug-in, and sending the running state information of each plug-in to the master control node in a preset period.
The data acquisition node provided in the embodiment of the present invention may be specifically configured to execute the method embodiment provided in fig. 3, and specific functions are not described herein again.
According to the data acquisition node provided by the embodiment of the invention, the scheduling task is sent to the master control node through the scheduling node, the master control node receives the running state information sent by each data acquisition node, and the scheduling task is distributed to the data acquisition nodes according to a preset strategy, the pre-acquired processing capacity information of each data acquisition node and the running state information of each data acquisition node, so that the data acquisition nodes execute the scheduling task. In this embodiment, the master control node manages the data acquisition nodes in a unified manner, and the load is balanced for each data acquisition node, so that the response capability of data acquisition and the utilization rate of machine resources are improved, the data acquisition nodes can be distributed in different machine rooms, the bandwidth and the multi-IP address advantages of multiple machine rooms are fully utilized, and dynamic expansion and contraction of the nodes are supported. In addition, the data acquisition nodes provide the downloading capacity and together form a distributed downloading cluster, so that only the necessary data analysis plug-ins need to be developed, which shortens the research and development period. As long as resources such as memory and CPU permit, a data acquisition node can mount multiple analysis plug-ins, so each data acquisition node can have multiple data downloading and data analysis capabilities.
In addition, the present embodiment further provides a computer-readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the data acquisition method at the master control node side described in the foregoing embodiment, and the implementation principle and the technical effect are similar, and are not described herein again.
In addition, the present embodiment further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the data acquisition method on the data acquisition node side in the foregoing embodiment, and the implementation principle and the technical effect are similar, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.