CN111722980B - Data collection systems and methods - Google Patents
- Publication number
- CN111722980B (CN202010529803.1A)
- Authority
- CN
- China
- Prior art keywords
- node
- telegraf
- leader
- data collection
- leader node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3089—Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
- G06F11/3093—Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Telephonic Communication Services (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of big data, and in particular to a data collection system and method.
Background
Telegraf is a tool for collecting, in real time, performance metrics for server-side infrastructure and middleware. In the prior art, a single Telegraf process is typically deployed on one server to monitor the infrastructure and middleware distributed across multiple servers.
If the server hosting that Telegraf process goes down or suffers a network failure, the Telegraf process can no longer collect the relevant data in real time. Data that is not collected in real time is permanently lost, which adversely affects subsequent data analysis.
Summary of the Invention
To address at least one of the above technical problems in the prior art, embodiments of the present invention provide a data collection system and method.
In a first aspect, an embodiment of the present invention provides a data collection system comprising multiple servers, a ZooKeeper distributed framework module, and a Redis database, wherein:
A Telegraf node is deployed on each of the multiple servers; among the Telegraf nodes in the data collection system, exactly one acts as the Leader node, and the Leader node is used to start a Telegraf subprocess to perform data collection;
Each Telegraf node includes an event listening node and a message acquisition node; the event listening node is used to monitor whether the trigger condition of a Leader node election event holds and, when it detects that the trigger condition holds, to send a message triggering the Leader node election event to the election event blocking queue; the Leader node election event is used to trigger the Telegraf nodes in the data collection system to elect one Telegraf node as the Leader node;
The message acquisition node is used, if it obtains the message triggering the Leader node election event from the election event blocking queue, to act as the elected Leader node and start the corresponding Telegraf subprocess to perform data collection;
The ZooKeeper distributed framework module includes the temporary fields registered by the Telegraf nodes on the ZooKeeper distributed framework module; the temporary fields serve as the basis on which the event listening node determines whether the trigger condition of the Leader node election event holds;
The Redis database includes the election event blocking queue, which stores the message triggering the Leader node election event.
Optionally, the Redis database further includes a distributed lock, which is used to determine which Telegraf node sends the trigger message of the election event to the election event blocking queue.
Optionally, the ZooKeeper distributed framework module further includes a leader field, which stores the identification information of the server on which the Leader node is located.
In a second aspect, an embodiment of the present invention provides a data collection method, applied to the data collection system of the first aspect, comprising:
Monitoring whether the trigger condition of a Leader node election event holds and, upon detecting that the trigger condition holds, sending a message triggering the Leader node election event to the election event blocking queue; the Leader node election event is used to trigger the Telegraf nodes in the data collection system to elect one Telegraf node as the Leader node;
If the message triggering the Leader node election event is obtained from the election event blocking queue, acting as the elected Leader node and starting the corresponding Telegraf subprocess to perform data collection.
Optionally, the trigger condition of the Leader node election event is specifically:
The temporary field registered by the Leader node on the ZooKeeper distributed framework module does not exist; or the difference between the current time and the last reporting time stored in the temporary field registered by the Leader node on the ZooKeeper distributed framework module exceeds a preset threshold.
Optionally, sending the message triggering the Leader node election event to the election event blocking queue includes:
Multiple Telegraf nodes compete for the distributed lock of the Redis database;
The Telegraf node that acquires the distributed lock sends the message triggering the Leader node election event to the election event blocking queue of the Redis database.
Optionally, after acting as the elected Leader node and starting the corresponding Telegraf subprocess to perform data collection, the method further includes:
If the Telegraf subprocess fails to start normally, triggering the Leader node election event again.
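The fallback on a failed start can be sketched as follows. This is only an illustration, not the patent's implementation: `start_subprocess` and `retrigger_election` are hypothetical hooks supplied by the caller, and treating any raised exception as an abnormal start is an assumption.

```python
from typing import Callable


def start_as_leader(start_subprocess: Callable[[], None],
                    retrigger_election: Callable[[], None]) -> bool:
    """Try to start the collection subprocess; re-trigger the election on failure.

    Both callables are hypothetical hooks, not names from the patent.
    Returns True if the subprocess started, False if a new election was triggered.
    """
    try:
        start_subprocess()
    except Exception:
        # Assumption: any exception during startup counts as an abnormal start,
        # so leadership is handed back by triggering the election event again.
        retrigger_election()
        return False
    return True
```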
Optionally, the method further includes:
The ZooKeeper distributed framework module sends the Leader node an event notification to shut down the Telegraf subprocess corresponding to the Leader node.
Optionally, the method further includes:
By setting the leader field in the ZooKeeper distributed framework module, one of the multiple Telegraf nodes is designated as the Leader node to perform data collection.
Optionally, designating one of the multiple Telegraf nodes to perform data collection by setting the leader field in the ZooKeeper distributed framework module includes:
Setting the leader field in the ZooKeeper distributed framework module to the identification information of the server on which one of the multiple Telegraf nodes is located;
Each Telegraf node reads the leader field in the ZooKeeper distributed framework module; if the leader field holds the identification information of the server on which that Telegraf node is located, the Telegraf node starts its corresponding Telegraf subprocess to perform data collection.
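The leader-field comparison can be illustrated with a small sketch. The function names are hypothetical, the value of the leader field is passed in as a plain string rather than read from ZooKeeper, and using the local host name as the server identifier with a case-insensitive comparison is an assumption (the text allows either a host name or an IP address).

```python
import socket
from typing import Optional


def local_server_id() -> str:
    """One possible server identifier: the local host name (assumption)."""
    return socket.gethostname()


def is_designated_leader(leader_field: Optional[str], local_id: str) -> bool:
    """Return True when the leader field names the server this node runs on."""
    if not leader_field:
        # Field missing or empty: no node has been designated yet.
        return False
    # Host names are compared case-insensitively; IP addresses are unaffected.
    return leader_field.strip().lower() == local_id.strip().lower()
```

A node would start its subprocess only when `is_designated_leader(value_read_from_leader_field, local_server_id())` is true.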
The data collection system provided by embodiments of the present invention deploys a Telegraf node on each of multiple servers and, when the current Leader node becomes abnormal, elects a new Leader node to perform data collection, thereby avoiding the loss of real-time monitoring data.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a schematic structural diagram of the data collection system in an embodiment of the present invention;
Figure 2 is a schematic flow chart of the data collection method in an embodiment of the present invention;
Figure 3 is another schematic flow chart of the data collection method in an embodiment of the present invention;
Figure 4 is yet another schematic flow chart of the data collection method in an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Figure 1 is a schematic structural diagram of the data collection system provided by an embodiment of the present invention. As shown in Figure 1, the system includes:
Multiple servers 110, each with a Telegraf node 111 deployed on it; among the Telegraf nodes 111 in the data collection system, exactly one acts as the Leader node, and the Leader node is used to start a Telegraf subprocess 112 to perform data collection;
Each Telegraf node 111 includes an event listening node and a message acquisition node; the event listening node is used to monitor whether the trigger condition of a Leader node election event holds and, when it detects that the trigger condition holds, to send a message triggering the Leader node election event to the election event blocking queue; the Leader node election event is used to trigger the Telegraf nodes 111 in the data collection system to elect one Telegraf node as the Leader node;
The message acquisition node is used, if it obtains the message triggering the Leader node election event from the election event blocking queue, to act as the elected Leader node and start the corresponding Telegraf subprocess 112 to perform data collection;
A ZooKeeper distributed framework module 120, which includes multiple temporary fields 121 registered by the Telegraf nodes on the ZooKeeper distributed framework module; the temporary fields serve as the basis on which the event listening node determines whether the trigger condition of the Leader node election event holds;
A Redis database 130, which includes an election event blocking queue 131 that stores the message triggering the Leader node election event.
Specifically, embodiments of the present invention are applied in a multi-server environment in which a Telegraf node 111 is deployed on each server 110, so the environment contains multiple mutually independent Telegraf nodes 111. One important function of a Telegraf node 111 is to start its corresponding Telegraf subprocess 112 to perform data collection. More generally, a Telegraf node 111 can manage the life cycle of its Telegraf subprocess 112, for example by starting and stopping it.
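The parent and child relationship described above can be sketched as a small lifecycle manager. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `ChildProcessManager` and the five-second grace period are hypothetical, and the command to run is supplied by the caller.

```python
import subprocess
from typing import List, Optional


class ChildProcessManager:
    """Start and stop a single child process on behalf of a parent node."""

    def __init__(self, command: List[str]) -> None:
        self.command = command
        self._proc: Optional[subprocess.Popen] = None

    def is_running(self) -> bool:
        # poll() returns None while the child is still alive.
        return self._proc is not None and self._proc.poll() is None

    def start(self) -> None:
        # Starting twice is a no-op so that only one child ever exists.
        if not self.is_running():
            self._proc = subprocess.Popen(self.command)

    def stop(self, timeout: float = 5.0) -> None:
        if self._proc is None:
            return
        if self._proc.poll() is None:
            self._proc.terminate()          # request a graceful exit first
            try:
                self._proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self._proc.kill()           # force exit if the child hangs
                self._proc.wait()
        self._proc = None
```

In a deployment the command would presumably be the Telegraf binary with its configuration file, for example `["telegraf", "--config", "/path/to/telegraf.conf"]`; the path here is a placeholder.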
The Telegraf subprocess 112 is a process that monitors performance metrics of server-side infrastructure and middleware. It provides a set of out-of-the-box monitoring data collection plug-ins for many kinds of server-side infrastructure and middleware; users configure the plug-ins they need and tune each plug-in's parameters, after which it can be put into production use. In embodiments of the present invention, the Telegraf node 111 can act as the parent process of the Telegraf subprocess 112. In practice, the Telegraf node 111 is generally implemented as a lightweight script, for example in Bash or Python.
Specifically, the Telegraf node 111 also determines whether the trigger condition of an election event among the multiple Telegraf nodes 111 holds, and listens to the Redis database 130 for the trigger message of the election event. The election event works as follows: among the multiple Telegraf nodes 111, only one, the Leader node, starts a Telegraf subprocess and is responsible for data collection. When the Leader node runs into any kind of abnormality, the Telegraf nodes 111 must elect a new Telegraf node as the sole Leader node, which takes over the data collection task from the original Leader node.
Further, when the Leader node becomes abnormal, the Telegraf node 111 determines whether the trigger condition of the election event holds. When it holds, a trigger message of the election event appears in the Redis database 130. The Telegraf nodes 111 listen for this trigger message in the Redis database 130 in real time, so they learn promptly that an election event has been triggered and can take part in it.
Specifically, the Redis database 130 serves as a lightweight message queue in embodiments of the present invention. Because of its shared-memory working mode, it performs well when the volume of communicated data is small, which makes it suitable as the target container for election event notification and listening. The Redis database 130 includes the election event blocking queue 131, which receives the trigger message of the election event among the multiple Telegraf nodes; that is, a Telegraf node sends the trigger message of the election event to the election event blocking queue 131.
Triggering an election event requires exactly one trigger message. To prevent multiple Telegraf nodes from sending trigger messages to the election event blocking queue at the same time, the Redis database 130 also contains a distributed lock 132. The Telegraf nodes must first compete for the distributed lock, and only the node that acquires it may send the trigger message to the election event blocking queue, which guarantees that the trigger message is unique.
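The lock-then-publish step can be illustrated without a running Redis server. The sketch below is an assumption-laden stand-in: `InMemoryStore`, the key name `election-lock`, and the message `elect` are all hypothetical, and the set-if-absent method only imitates what a Redis `SET key value NX` would provide.

```python
import threading
from collections import deque
from typing import Deque, Dict, List, Tuple


class InMemoryStore:
    """In-process stand-in for the lock key and the queue held in Redis."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._keys: Dict[str, str] = {}
        self.queue: Deque[str] = deque()

    def set_if_absent(self, key: str, value: str) -> bool:
        """Imitate set-if-not-exists: only the first caller succeeds."""
        with self._guard:
            if key in self._keys:
                return False
            self._keys[key] = value
            return True

    def push(self, message: str) -> None:
        with self._guard:
            self.queue.append(message)


def try_trigger_election(store: InMemoryStore, node_id: str) -> bool:
    """Send the trigger message only if this node wins the lock."""
    if not store.set_if_absent("election-lock", node_id):
        return False                     # another node already holds the lock
    store.push("elect")                  # exactly one trigger message is queued
    return True


def race(node_ids: List[str]) -> Tuple[InMemoryStore, Dict[str, bool]]:
    """Let several nodes compete concurrently and report who sent the message."""
    store = InMemoryStore()
    results: Dict[str, bool] = {}

    def worker(node_id: str) -> None:
        results[node_id] = try_trigger_election(store, node_id)

    threads = [threading.Thread(target=worker, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store, results
```

The sketch never releases the lock. A real lock would presumably carry an expiry (for example the `PX` option of Redis `SET`) so that it can be contended again in a later election.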
The data collection system in embodiments of the present invention also includes the ZooKeeper distributed framework module 120. As a coordination service, the ZooKeeper distributed framework module 120 acts as the target container for state synchronization in the system, and does so through the following fields:
The initialization field, which indicates whether the system has completed the initialization process;
The leader field, which stores the identification information of the server on which the Leader node is located, such as the server's host name or IP address;
The temporary fields, that is, the multiple temporary fields registered by the Telegraf nodes on the ZooKeeper distributed framework module. If a Telegraf node is the Leader node, its temporary field stores the Leader node's last reporting time. If a Telegraf node is not the Leader node, its temporary field stores the node's health status: true means the node is currently running normally and is eligible to be elected Leader node; otherwise the value is false.
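The three kinds of fields can be pictured as a small data model. This is only a sketch of the state described above: the class and attribute names are hypothetical, timestamps are assumed to be floating-point seconds, and in a deployment the temporary fields would presumably be ZooKeeper ephemeral nodes rather than dictionary entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class CoordinationState:
    """Snapshot of the fields kept in the coordination service."""

    initialized: bool = False                 # initialization field
    leader: Optional[str] = None              # leader field: host name or IP address
    # One temporary field per node: the Leader stores its last reporting time
    # (float), every other node stores its health status (bool).
    temporary: Dict[str, Union[float, bool]] = field(default_factory=dict)

    def eligible_candidates(self) -> List[str]:
        """Nodes other than the Leader whose health status is true."""
        return sorted(
            node for node, value in self.temporary.items()
            if node != self.leader and value is True
        )
```

For example, with `leader="host-a"` and temporary fields `{"host-a": 1700000000.0, "host-b": True, "host-c": False}`, only `host-b` is eligible.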
Further, the last reporting time of the Leader node is defined as follows: while collecting data, the Leader node reports, at a preset frequency, the most recent time at which it collected data. The last reporting time is therefore the last time the Leader node collected and reported data normally, and it reflects the health of the Leader node.
The data collection system provided by embodiments of the present invention deploys a Telegraf node on each of multiple servers and, when the current Leader node becomes abnormal, elects a new Leader node to perform data collection, thereby avoiding the loss of real-time monitoring data.
On the basis of the above embodiment, Figure 2 is a flow chart of the data collection method provided by an embodiment of the present invention. The method is applied to the data collection system provided by the above embodiment. As shown in Figure 2, the method includes:
S201, monitoring whether the trigger condition of a Leader node election event holds and, upon detecting that the trigger condition holds, sending a message triggering the Leader node election event to the election event blocking queue; the Leader node election event is used to trigger the Telegraf nodes in the data collection system to elect one Telegraf node as the Leader node;
Specifically, the scenario is the data collection system of embodiments of the present invention, in which a Telegraf node deployed on one of the servers, namely the Leader node, is performing the data collection task through its Telegraf subprocess.
To keep data collection from failing when that server goes down or suffers a network failure, a new Leader node must be elected to perform data collection. First, each candidate node determines whether the trigger condition of an election event among the multiple Telegraf nodes holds. The candidate nodes are the Telegraf nodes other than the Leader node. The Leader node, like the candidate nodes, can also evaluate the trigger condition; however, the new Leader node chosen by the election event usually comes from among the candidate nodes.
Further, the election event has at least two trigger conditions: the temporary field registered by the Leader node on the ZooKeeper distributed framework module does not exist; or the difference between the current time and the last reporting time stored in that temporary field exceeds a preset threshold. If either condition holds, the election event is triggered and a new Leader node must be elected.
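The two trigger conditions reduce to a single predicate. In this sketch the function name is hypothetical, a missing temporary field is modelled as `None`, and times are assumed to be seconds expressed as floats; "exceeds" is read as a strict comparison.

```python
from typing import Optional


def election_needed(last_report: Optional[float], now: float, threshold: float) -> bool:
    """Return True when either trigger condition of the election event holds.

    last_report is None when the Leader's temporary field does not exist.
    """
    if last_report is None:
        return True                               # condition 1: field is missing
    return (now - last_report) > threshold        # condition 2: report is too old
```

With a 30-second threshold, a last report 50 seconds old triggers an election, while one exactly 30 seconds old does not.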
Specifically, for the first trigger condition: every Telegraf node, including the Leader node, registers a corresponding temporary field in the ZooKeeper distributed framework module. If the Leader node's temporary field changes from existing to not existing, the candidate nodes receive a notification event from the ZooKeeper distributed framework module through ZooKeeper's event notification mechanism. This indicates that the Leader node is no longer present in the system; possible causes include the Leader node's server going down, or the Leader node losing contact with the ZooKeeper distributed framework module because of a network failure.
Specifically, for the second trigger condition: the Leader node stores its last reporting time in its temporary field in the ZooKeeper distributed framework module and updates it at a preset frequency as data collection proceeds. Meanwhile, each candidate node starts a timed checking thread that checks this last reporting time in the ZooKeeper distributed framework module. If the difference between the current time and the last reporting time exceeds the preset threshold, reporting by the Telegraf subprocess managed by the Leader node is suffering from problems such as network congestion or packet loss, which is also a situation in which an election event must be triggered.
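The timed checking thread on a candidate node might look like the sketch below. It is illustrative only: `StalenessWatcher`, `read_last_report`, and `on_stale` are hypothetical names, the read callback stands in for a lookup of the Leader's temporary field, and the clock is injectable so the check can be exercised without waiting.

```python
import threading
import time
from typing import Callable, Optional


class StalenessWatcher:
    """Periodically check the Leader's last reporting time from a candidate node."""

    def __init__(self,
                 read_last_report: Callable[[], Optional[float]],
                 on_stale: Callable[[], None],
                 threshold: float,
                 interval: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.read_last_report = read_last_report   # hypothetical field lookup
        self.on_stale = on_stale                   # hypothetical election trigger
        self.threshold = threshold
        self.interval = interval
        self.clock = clock
        self._stop = threading.Event()

    def check_once(self) -> bool:
        """Run one check; call on_stale and return True if the report is too old."""
        last = self.read_last_report()
        stale = last is None or (self.clock() - last) > self.threshold
        if stale:
            self.on_stale()
        return stale

    def run(self) -> None:
        # Event.wait returns True once stop() has been called, ending the loop.
        while not self._stop.wait(self.interval):
            if self.check_once():
                break                              # one trigger per detection

    def start(self) -> threading.Thread:
        thread = threading.Thread(target=self.run, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()
```

Stopping after the first detection is a design choice of this sketch; a longer-lived node would presumably restart the watcher once a new Leader has been elected.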
Further, after a candidate node determines that the trigger condition of the election event among the multiple Telegraf nodes holds, it must send the trigger message of the election event to the election event blocking queue of the Redis database. The election event blocking queue is the message container that receives the trigger message of the election event among the multiple Telegraf nodes; that is, in embodiments of the present invention a Telegraf node sends the trigger message to the election event blocking queue, and the sender may be either the Leader node or a candidate node.
Triggering an election event requires exactly one trigger message. To prevent multiple Telegraf nodes from sending trigger messages to the election event blocking queue at the same time, the Redis database also contains a distributed lock. The Telegraf nodes must first compete for the distributed lock, and only the node that acquires it may send the trigger message to the election event blocking queue, which guarantees that the trigger message is unique.
S202, if the message triggering the Leader node election event is obtained from the election event blocking queue, acting as the elected Leader node and starting the corresponding Telegraf subprocess to perform data collection.
Specifically, after one Telegraf node has sent the trigger message of the election event to the election event blocking queue, the exclusive way in which the trigger message is consumed means that only one of the multiple Telegraf nodes can receive it. Once one Telegraf node has received the trigger message, no other node can receive it. In other words, only one node wins the vote in the election event and becomes the elected node, that is, the new Leader node.
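The exclusivity of a blocking pop can be demonstrated in-process. The sketch below substitutes Python's `queue.Queue` for the Redis list (an assumption; with Redis the corresponding operation would presumably be a blocking pop such as `BLPOP`), and `run_election` and the message `elect` are hypothetical names.

```python
import queue
import threading
from typing import List


def run_election(node_ids: List[str], wait_seconds: float = 0.5) -> List[str]:
    """Publish one trigger message and return the nodes that received it."""
    election_queue: queue.Queue = queue.Queue()
    winners: List[str] = []
    winners_guard = threading.Lock()

    def listen(node_id: str) -> None:
        try:
            # Blocking pop: each message is handed to exactly one waiter.
            election_queue.get(timeout=wait_seconds)
        except queue.Empty:
            return                       # no message received: remain a candidate
        with winners_guard:
            winners.append(node_id)      # this node becomes the new Leader

    threads = [threading.Thread(target=listen, args=(n,)) for n in node_ids]
    for t in threads:
        t.start()
    election_queue.put("elect")          # the single trigger message
    for t in threads:
        t.join()
    return winners
```

Whichever listener is served first wins, so the identity of the winner is not deterministic; only the count is.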
The elected node, as the new Leader node, takes over data collection from the original Leader node. Concretely, the elected node first starts its corresponding Telegraf sub-process and then uses that sub-process to collect data.
The data collection method provided by this embodiment of the present invention elects a new Leader node to collect data when the current Leader node becomes abnormal, which avoids the loss of real-time monitoring data.
On the basis of any of the above embodiments, Figure 3 is a flow chart of the data collection method provided by an embodiment of the present invention. As shown in Figure 3, the method is the complete flow of the initialization phase and includes:
S301, manual random election;
Specifically, once the servers, infrastructure, and middleware of the whole system environment are ready, network operations staff designate the Telegraf node on any one server as the Leader node by manually setting the leader field in the ZooKeeper distributed framework module.
The leader field in the ZooKeeper distributed framework module stores the identification information of the server hosting the Leader node, such as the server's host name or IP address. Every Telegraf node can learn which node is the Leader node from this field.
S302, start the Telegraf nodes;
Specifically, the initialization field in the ZooKeeper distributed framework module indicates whether the system has completed the initialization flow. At the start of the initialization task, this field must therefore be manually set to false to indicate that initialization is not yet complete.
Meanwhile, the Telegraf nodes deployed on different servers are generally script programs written in lightweight scripting languages such as Bash or Python. To start the scripts of multiple Telegraf nodes at the same time during the initialization phase, an automated operations tool such as Ansible can be used to start the Telegraf nodes in batches.
S303, register temporary fields;
Specifically, the ZooKeeper distributed framework module is the target container for state synchronization in the system. Its temporary fields are created when Telegraf nodes register on the module, which synchronizes state across the different Telegraf nodes. After a Telegraf node starts, it therefore registers with the ZooKeeper distributed framework module using identification information such as its host name or the IP address of its server, and obtains a corresponding temporary field.
S304, obtain Leader information;
Specifically, because the leader field in the ZooKeeper distributed framework module is set manually by network operations staff, a Telegraf node does not yet know whether it is the Leader node designated in the initialization task. Each Telegraf node therefore reads the Leader information from ZooKeeper to find out.
S305, the Leader node starts the Telegraf sub-process;
Specifically, after a Telegraf node reads the Leader information from ZooKeeper, if it finds that it is the Leader node, it starts its corresponding Telegraf sub-process. In this way the Telegraf node designated by network operations staff in the initialization task carries out data collection.
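Steps S303 to S305 can be sketched as follows. This is a minimal illustration that uses a plain dictionary in place of the ZooKeeper distributed framework module; the key names (`leader`, `nodes`, `health`), the function `start_node`, and the host names are assumptions made for the example, and the sub-process launch is passed in as a callable rather than shown.

```python
def start_node(store, own_id, start_child):
    """Register this node, read the leader field, and start collection if appointed.

    store       -- dict standing in for the ZooKeeper distributed framework module
    own_id      -- host name or IP address identifying this node's server
    start_child -- callable that launches the Telegraf sub-process
    """
    # S303: register a temporary field under this node's identifier.
    store.setdefault("nodes", {})[own_id] = {"health": None}
    # S304: read the manually configured leader field.
    if store.get("leader") == own_id:
        # S305: only the designated Leader node starts the sub-process.
        start_child()
        return "leader"
    return "candidate"


store = {"leader": "host-1", "initialized": False}
started = []
for host in ("host-1", "host-2"):
    role = start_node(store, host, lambda h=host: started.append(h))
    print(host, role)
```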
S306, listen on the election event message queue;
Specifically, while the initialization task designates the current Leader node for data collection, it must also prepare for election events that may be started later. Each Telegraf node needs to learn promptly when an election event occurs, which is achieved by having each Telegraf node listen for the trigger message of the election event.
Specifically, each Telegraf node starts a consumer thread, which may, for example, use the Redis list commands in an LPUSH/BRPOP pattern to listen on the election event blocking queue in Redis, which initially holds no messages. As soon as a trigger message for an election event arrives in the election event blocking queue, it is detected by the Telegraf nodes listening on the queue.
S307, initial status check;
Specifically, the Telegraf sub-process is the process that collects data directly. To make sure that it shows no abnormality after startup, the Leader node scans the sub-process's log file after starting it to determine whether the Telegraf sub-process is running normally.
S308, update the temporary field status;
Specifically, after a Telegraf node has registered with the ZooKeeper distributed framework module and obtained its temporary field, the status of that field must be updated. If the Telegraf node is the Leader node, its temporary field stores the Leader node's last reporting time. If it is not the Leader node, its temporary field stores the node's health status: true means that the node is currently running normally and is eligible to be elected Leader node; otherwise the value is false.
Therefore, in this step, the Leader node sets the health status of its temporary field on the ZooKeeper distributed framework module to true once the initial status check has passed. Every other node sets its health status to true once it has confirmed that it is listening on the election event message queue successfully. If a node cannot listen normally, or if the Leader node cannot start its Telegraf sub-process normally, the corresponding health status is set to false.
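The rule for the initial health value can be written as a small function. This is one reading of the text, not a definitive implementation; the function name `initial_health` and its boolean parameters are hypothetical.

```python
def initial_health(is_leader, listening_ok, child_ok=False):
    """Return the health value a node would write to its temporary field.

    is_leader    -- whether this node is the designated Leader node
    listening_ok -- whether the node is listening on the election queue
    child_ok     -- whether the Telegraf sub-process passed the initial check
                    (only meaningful for the Leader node)
    """
    if not listening_ok:
        return False  # any node that cannot listen is unhealthy
    if is_leader:
        return child_ok  # the Leader also needs a working sub-process
    return True  # a listening candidate is healthy


for case in [(True, True, True), (True, True, False), (False, True), (False, False)]:
    print(case, initial_health(*case))
```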
S309, abnormality alarm detection and execution.
Specifically, a preset time (for example, 30 seconds) after startup, each Telegraf node competes for the distributed lock in Redis. The node that obtains the lock starts an abnormality alarm monitoring thread, which scans the health value of every temporary field on the ZooKeeper distributed framework module. If any value is false, an alarm is triggered and the system environment problem must be resolved manually. Otherwise the ZooKeeper initialization field is set to true, marking initialization of the data collection system as complete.
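The scan performed by the node that wins the lock might look like the sketch below. A dictionary again stands in for the ZooKeeper distributed framework module, the key names are assumptions, and lock acquisition and the actual alarm channel are left out. The sketch treats any value other than `True` (including a field that was never updated) as unhealthy, which is slightly stricter than the text.

```python
def finish_initialization(store):
    """Scan health values and return the unhealthy node ids.

    An empty result means every node is healthy, in which case the
    initialization field is set to True. A non-empty result is what
    would be reported in the alarm.
    """
    unhealthy = sorted(
        node_id
        for node_id, field in store["nodes"].items()
        if field.get("health") is not True
    )
    if not unhealthy:
        store["initialized"] = True
    return unhealthy


store = {
    "initialized": False,
    "nodes": {"host-1": {"health": True}, "host-2": {"health": False}},
}
problems = finish_initialization(store)
print("alarm for:", problems, "initialized:", store["initialized"])
```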
The data collection method provided by this embodiment of the present invention designates the Telegraf node responsible for data collection through the initialization task and implements state synchronization and message listening among the multiple Telegraf nodes. This keeps the data collection task running normally while avoiding the loss of real-time monitoring data.
On the basis of any of the above embodiments, Figure 4 is a flow chart of the data collection method provided by an embodiment of the present invention. As shown in Figure 4, the method is the complete flow of the election event phase and includes:
S401, determine whether an election event is triggered;
To prevent data collection from failing when the server hosting the original Leader node goes down or suffers a network failure, this embodiment of the present invention elects a new Leader node to collect data in that situation. First, each candidate node checks whether the trigger condition for an election event among the multiple Telegraf nodes is met. The candidate nodes are the Telegraf nodes other than the Leader node. The Leader node, like the candidate nodes, is also able to evaluate the trigger condition, but the new Leader node is normally elected from among the candidate nodes.
Further, the election event in this embodiment of the present invention has at least two trigger conditions: the temporary field registered by the Leader node on the ZooKeeper distributed framework module does not exist; or the difference between the current time and the last reporting time stored in that temporary field exceeds a preset threshold. If either condition holds, the trigger condition is met and a new Leader node must be elected.
Specifically, for the first trigger condition: in this embodiment every Telegraf node, including the Leader node, registers a corresponding temporary field in the ZooKeeper distributed framework module. If the temporary field of the Leader node changes from existing to not existing, this indicates, through ZooKeeper's event notification mechanism, that the Leader node is no longer present in the system. Possible causes include the server hosting the Leader node going down, or the Leader node losing contact with the ZooKeeper distributed framework module because of a network failure. In this case the candidate nodes receive a notification event from the ZooKeeper distributed framework module.
Specifically, for the second trigger condition: the Leader node stores its last reporting time in its temporary field on the ZooKeeper distributed framework module and updates it at a preset frequency as the data collection task proceeds. Each candidate node starts a periodic check thread that reads this last reporting time from the ZooKeeper distributed framework module. If the difference between the current time and the last reporting time exceeds the preset threshold, this indicates that reporting by the Telegraf sub-process managed by the Leader node is affected by problems such as network congestion or packet loss, which is also a situation in which an election event must be triggered.
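The two trigger conditions can be combined in a single check, sketched below. The function name `election_needed`, the field name `last_report_time`, and the numeric values are illustrative assumptions; in a running system the current time would typically come from `time.time()` and the field from a read against ZooKeeper.

```python
def election_needed(leader_field, now, max_silence_seconds):
    """Return True if either trigger condition for an election event holds.

    leader_field        -- the Leader node's temporary field, or None if it
                           no longer exists
    now                 -- current time in seconds
    max_silence_seconds -- preset threshold for the reporting gap
    """
    # Condition 1: the Leader node's temporary field does not exist.
    if leader_field is None:
        return True
    # Condition 2: the last reporting time is older than the threshold.
    return now - leader_field["last_report_time"] > max_silence_seconds


print(election_needed(None, now=1000.0, max_silence_seconds=60))
print(election_needed({"last_report_time": 990.0}, now=1000.0, max_silence_seconds=60))
print(election_needed({"last_report_time": 900.0}, now=1000.0, max_silence_seconds=60))
```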
S402, send the election event trigger to the Redis blocking queue;
Specifically, once a candidate node determines that the trigger condition for an election event among the multiple Telegraf nodes is met, it sends a trigger message for the election event to the election event blocking queue in the Redis database. The election event blocking queue is the message container that receives trigger messages for election events among the Telegraf nodes. In other words, in this embodiment of the present invention a Telegraf node sends the trigger message to the election event blocking queue, and the sender may be either the Leader node or a candidate node.
Only one trigger message is needed to trigger an election event, so the trigger message must be unique. To prevent multiple Telegraf nodes from sending trigger messages to the election event blocking queue at the same time, the Redis database also holds a distributed lock. The Telegraf nodes must first compete for the distributed lock, and only the node that acquires it may send the trigger message to the election event blocking queue. The node releases the distributed lock after sending the trigger message, which ensures that the trigger message is unique.
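The acquire, send, release sequence can be sketched with in-process stand-ins: a `threading.Lock` in place of the Redis distributed lock and a `collections.deque` in place of the election event blocking queue. The function name `try_send_trigger` and the message layout are hypothetical. The text does not say how a node that acquires the lock later is kept from sending a second message, and this sketch does not address that either.

```python
import threading
from collections import deque


def try_send_trigger(lock, election_queue, node_id):
    """Send one trigger message if this node wins the lock; return True if sent."""
    # Non-blocking attempt: a node that loses the race does not send.
    if not lock.acquire(blocking=False):
        return False
    try:
        election_queue.appendleft({"event": "election", "sent_by": node_id})
        return True
    finally:
        lock.release()  # released once the message has been sent


lock = threading.Lock()
election_queue = deque()

lock.acquire()  # simulate another node currently holding the lock
blocked = try_send_trigger(lock, election_queue, "node-b")
lock.release()
sent = try_send_trigger(lock, election_queue, "node-a")
print(blocked, sent, len(election_queue))
```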
S403, eligible candidate nodes compete for the vote;
Specifically, after one Telegraf node has sent the trigger message to the election event blocking queue, the exclusive way in which the trigger message is listened for means that only one of the multiple Telegraf nodes can receive it. Once one Telegraf node has received the trigger message, no other node can receive it. That single node has thereby won the vote in the election event and becomes the elected node, that is, the new Leader node.
S404, start the Telegraf sub-process;
Specifically, after a Telegraf node has been elected as the new Leader node, it starts its corresponding Telegraf sub-process. With the original Leader node abnormal, this sub-process replaces the original Leader node's Telegraf sub-process for data collection, so that data collection continues without interruption.
S405, discard candidacy;
Specifically, after the elected node starts the Telegraf sub-process, it must check whether the sub-process started successfully. If startup was abnormal, for example because the Telegraf sub-process did not start normally, or because a log scan after startup shows that reports cannot be received normally, the node terminates the Telegraf sub-process and the corresponding consumer thread, that is, it discards its candidacy. It then immediately places a new election event trigger message in the election event blocking queue in the Redis database and sets the health status of its own temporary field to false.
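The withdrawal path can be sketched as below. A dictionary and a `deque` stand in for the ZooKeeper distributed framework module and the Redis election event blocking queue; the function name `confirm_or_withdraw`, the key names, and the two stop callables are assumptions made for the illustration.

```python
from collections import deque


def confirm_or_withdraw(node_id, child_ok, store, election_queue,
                        stop_child, stop_consumer):
    """Keep the leadership if the sub-process is healthy, otherwise withdraw.

    Returns True if the node stays elected, False if it discarded its candidacy.
    """
    if child_ok:
        return True
    # Discard candidacy: stop the sub-process and the consumer thread.
    stop_child()
    stop_consumer()
    # Hand the election on by queueing a new trigger message.
    election_queue.appendleft("election-trigger")
    # Mark this node as ineligible.
    store["nodes"][node_id]["health"] = False
    return False


store = {"nodes": {"host-2": {"health": True}}}
election_queue = deque()
actions = []
kept = confirm_or_withdraw(
    "host-2", False, store, election_queue,
    stop_child=lambda: actions.append("child stopped"),
    stop_consumer=lambda: actions.append("consumer stopped"),
)
print(kept, actions, list(election_queue), store["nodes"]["host-2"])
```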
S406, synchronize ZooKeeper status information;
Specifically, if the elected node detects that the Telegraf sub-process started successfully, the elected node formally becomes the new Leader node and begins data collection. It must also update the leader field on the ZooKeeper distributed framework module, replacing its content with the identification information of the server hosting the elected node. In addition, as the new Leader node, the elected node must update the last reporting time in its corresponding temporary field.
S407, the original Leader node steps down smoothly;
Specifically, after the elected node updates the leader field on the ZooKeeper distributed framework module, the original Leader node receives an event notification from the ZooKeeper distributed framework module. At that point it checks whether its own Telegraf sub-process is still running and, if so, shuts it down, so that the single Telegraf sub-process in the system is switched over smoothly.
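A handler for the leader-field change notification might look like the following sketch. `StubChild` is a stand-in that mimics only the `poll()` and `terminate()` methods of `subprocess.Popen` so that the example runs without launching Telegraf; the class, the function name `on_leader_changed`, and the host names are hypothetical.

```python
class StubChild:
    """Stand-in exposing the poll()/terminate() subset of subprocess.Popen."""

    def __init__(self):
        self.returncode = None  # None means the process is still running

    def poll(self):
        return self.returncode

    def terminate(self):
        self.returncode = -15


def on_leader_changed(own_id, new_leader_id, child):
    """React to a leader-field update; return True if a sub-process was stopped."""
    if new_leader_id == own_id or child is None:
        return False  # this node is the new Leader, or it has nothing running
    if child.poll() is None:
        # The original Leader's sub-process is still running: shut it down so
        # that only one Telegraf sub-process collects data.
        child.terminate()
        return True
    return False


old_leader_child = StubChild()
stopped = on_leader_changed("host-1", "host-2", old_leader_child)
print(stopped, old_leader_child.poll())
```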
S408, abnormality alarm detection and execution;
Specifically, this step may optionally be performed after step S407. The elected node checks the number of temporary fields on the ZooKeeper distributed framework module and the number whose health value is true. If either number is below its specified threshold, the system is currently in an abnormal state, and an alarm is triggered to prompt operations staff to intervene early.
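The post-election check compares two counts against thresholds, as in this sketch. The function name `cluster_alarm`, the key names, and the threshold values are assumptions; the text only says that each count has a specified threshold.

```python
def cluster_alarm(nodes, min_registered, min_healthy):
    """Return True if the cluster should raise an alarm.

    nodes          -- mapping of node id to its temporary field
    min_registered -- threshold for the number of temporary fields
    min_healthy    -- threshold for the number of fields whose health is True
    """
    registered = len(nodes)
    healthy = sum(1 for field in nodes.values() if field.get("health") is True)
    return registered < min_registered or healthy < min_healthy


nodes = {
    "host-1": {"health": True},
    "host-2": {"health": False},
    "host-3": {"health": True},
}
print(cluster_alarm(nodes, min_registered=3, min_healthy=2))
print(cluster_alarm(nodes, min_registered=3, min_healthy=3))
```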
The data collection method provided by this embodiment of the present invention elects a new Leader node to collect data when the current Leader node becomes abnormal, which avoids the loss of real-time monitoring data.
The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software together with the necessary general-purpose hardware platform, or by hardware. On this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010529803.1A CN111722980B (en) | 2020-06-11 | 2020-06-11 | Data collection systems and methods |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111722980A CN111722980A (en) | 2020-09-29 |
CN111722980B true CN111722980B (en) | 2023-10-20 |
Family
ID=72567968
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010529803.1A Active CN111722980B (en) | 2020-06-11 | 2020-06-11 | Data collection systems and methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111722980B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN111722980A (en) | 2020-09-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |