WO2023123801A1 - Log aggregation system, and method for improving availability of log aggregation system - Google Patents

Info

Publication number
WO2023123801A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
nodes
broadcast message
abnormal
cluster
Prior art date
Application number
PCT/CN2022/091817
Other languages
French (fr)
Chinese (zh)
Inventor
陆玉平
邓瑞明
蔡攀龙
Original Assignee
上海川源信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海川源信息科技有限公司
Publication of WO2023123801A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes

Definitions

  • the present application relates to the technical field of cloud computing and storage, and in particular to a log aggregation system and a method for improving the availability of the log aggregation system.
  • Private Clouds are clouds built solely for a customer (such as a large enterprise) using technologies such as cloud computing.
  • Enterprise private cloud integrates advanced technologies such as cloud computing and big data management and represents a new service model. It can integrate resources and improve resource utilization while reducing resource consumption and enterprise costs, and has therefore developed rapidly in recent years.
  • A log aggregation monitoring system (such as Loki) can be used to compress and store unstructured log data and to index only the metadata of the log data (including timestamps, labels, etc.).
  • The inventor found, in the process of implementing the solution of this application, that the existing log aggregation monitoring system lacks measures to guarantee its own high availability.
  • When the log aggregation monitoring system fails, it cannot detect and report the problem in time, so a large amount of log data may accumulate on the monitored object, degrading the performance of the monitored object and wasting hard disk space.
  • the present application provides a log aggregation system and a method for improving the availability of the log aggregation system, so as to solve the reliability problem of the log aggregation system itself.
  • A log aggregation system is provided. The log aggregation system is used to collect log information of monitored objects, and includes a reverse proxy component and a core service cluster composed of multiple nodes;
  • the reverse proxy component is used to receive the log information of the monitored object, select a node from the multiple nodes of the core service cluster as the target node according to the first preset policy, and send the log information to the target node;
  • the nodes of the core service cluster are used to perform preset processing on the received log information, and each node monitors the node status of each node through mutual detection.
  • the multiple nodes include at least three nodes.
  • the node states are divided into active nodes, abnormal nodes, and unavailable nodes;
  • the nodes monitor each node's node status by detecting each other, including:
  • If the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and broadcast in the cluster a message that the first node is an abnormal node;
  • If, when the second preset duration expires, the node status of the first node in this node is still an abnormal node, and broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset duration, mark the node status of the first node as an unavailable node, and broadcast in the cluster a message that the first node is an unavailable node.
  • the detecting whether the first node is an active node includes:
  • receiving broadcast messages from other nodes that the first node is an abnormal node multiple times within the second preset duration includes:
  • the counter is incremented by 1;
  • each node monitors the node status of each node through mutual detection, and further includes:
  • each node monitors the node status of each node through mutual detection, and further includes:
  • the system also includes:
  • a storage component configured to store data processed by the core service cluster
  • An alarm component configured to send alarm information according to a second preset strategy when it is found that the log information of the monitored object is abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
  • the data visualization component is used to display the log information and/or the alarm information.
  • A method for improving the availability of the log aggregation system is provided. The method is used for nodes in the log aggregation system;
  • the log aggregation system is used to collect the log information of the monitored object, and includes a reverse proxy component and a core service cluster composed of a plurality of nodes;
  • the reverse proxy component is used to receive the log information of the monitored object, select a node from the plurality of nodes as the target node according to a first preset policy, and send the log information to the target node;
  • the node is used to perform preset processing on the received log information, and each node monitors the node status of each node through mutual detection; the node status is divided into active nodes, abnormal nodes, and unavailable nodes;
  • the methods include:
  • If the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and broadcast in the cluster a message that the first node is an abnormal node;
  • If, when the second preset duration expires, the node status of the first node in this node is still an abnormal node, and broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset duration, mark the node status of the first node as an unavailable node, and broadcast in the cluster a message that the first node is an unavailable node.
  • the multiple nodes include at least three nodes.
  • the detecting whether the first node is an active node includes:
  • receiving broadcast messages from other nodes that the first node is an abnormal node multiple times within the second preset duration includes:
  • the counter is incremented by 1;
  • the method also includes:
  • the method also includes:
  • the log aggregation system further includes:
  • a storage component configured to store data processed by the core service cluster
  • An alarm component configured to send alarm information according to a second preset strategy when it is found that the log information of the monitored object is abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
  • the data visualization component is used to display the log information and/or the alarm information.
  • a log aggregation system including a reverse proxy component and a cluster
  • the cluster adopts a distributed design and can include multiple nodes.
  • FIG. 1 is a schematic diagram of a log aggregation system provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of a scene of an embodiment of the present application.
  • Figure 3 is a schematic diagram of the interior of the Loki cluster in the embodiment of the present application.
  • Fig. 4 is a schematic diagram of a method for improving the availability of a log aggregation system provided by an embodiment of the present application.
  • Fig. 1 is a schematic diagram of a log aggregation system provided by an embodiment of the present application.
  • the log aggregation system can be used to collect log information of monitored objects.
  • the log aggregation system includes a reverse proxy component and a core service cluster composed of multiple nodes.
  • The reverse proxy component is used to receive the log information of the monitored object, select a node from the multiple nodes of the core service cluster as the target node according to the first preset policy, and send the log information to the target node.
  • This embodiment does not limit which strategy the reverse proxy component uses to select the target node from the multiple nodes of the core service cluster, that is, the specific content of the first preset policy.
  • The first preset policy may include one or more strategies such as round robin, node weight distribution, least connections, and best performance.
  • the nodes of the core service cluster are used to perform preset processing on the received log information, and each node monitors the node status of each node through mutual detection.
  • the plurality of nodes may specifically include at least three nodes.
  • a node can be, for example, a server.
  • this embodiment does not limit the processing of the received log information by the node, that is, the specific content of the preset processing.
  • the preset processing may include parsing logs, indexing log content, adding tags, removing duplicates, compressing and storing, and so on.
  • For example, the date of the log is extracted for indexing, to facilitate subsequent retrieval of logs by date.
  • Another example is to extract the source of the log for indexing, for instance to distinguish the logs of the Apache service from the logs of the database. Logs can also be labeled to distinguish Metrics from Log, or to distinguish the Info, Warning, and Error levels, and so on.
  • the node states can be specifically divided into three types: active nodes, abnormal nodes, and unavailable nodes;
  • the nodes monitor the node status of each node through mutual detection, which may specifically include:
  • A node may select the first node from the other nodes in the cluster (excluding itself) either by polling or at random.
  • the detecting whether the first node is an active node may include:
  • The first node can be probed with a PING.
  • If the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and broadcast in the cluster a message that the first node is an abnormal node.
  • If, when the second preset duration expires, the node status of the first node in this node is still an abnormal node, and broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset duration, mark the node status of the first node as an unavailable node, and broadcast in the cluster a message that the first node is an unavailable node.
  • receiving multiple broadcast messages from other nodes that the first node is an abnormal node within the second preset time length may specifically include:
  • the counter is incremented by 1;
  • a node actively detects the status of other nodes.
  • a node may also be detected by other nodes, or receive messages broadcast by other nodes, so each node monitors the node status of each node through mutual detection, which can also include:
  • each node monitors the node state of each node through mutual detection, and may also include:
  • the system may also include:
  • a storage component configured to store data processed by the core service cluster
  • An alarm component configured to send alarm information according to a second preset strategy when it is found that the log information of the monitored object is abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
  • the data visualization component is used to display the log information and/or the alarm information.
  • a log aggregation system including a reverse proxy component and a cluster
  • the cluster adopts a distributed design and may include multiple nodes.
  • This achieves log collection, processing and load balancing, and the redundant deployment of multiple nodes together with mutual monitoring between nodes improves the stability of the system and ensures the high availability of the system itself. A problem with one or two nodes does not affect the services the system provides externally. In addition, alerts can be raised for detected problems, so that the operation and maintenance team or technical team can find and locate problems in time.
  • The following takes Loki as an example and further describes the solution of this application in combination with specific application scenarios.
  • The Loki application scenario is only exemplary; the solution can also be applied to other scenarios in practice.
  • For example, the three-node deployment method of this application can also be adopted to ensure the high availability of the Prometheus cluster.
  • Loki is a log aggregation system, which mainly compresses and stores unstructured log data and only indexes the metadata of log data (including: timestamp, labels, etc.).
  • the Loki service receives the log data pushed from the Promtail component, and distributes the log data to the internal Ingester component.
  • the index data and log content data of the log data are stored by the Ingester component.
  • Loki can only passively accept log pushes from Promtail components.
  • The Loki main service cannot guarantee its own high availability.
  • If the monitored object also fails, Loki cannot detect and report the problem in time, which causes a large amount of log data to back up on the monitored object, affecting the performance of the monitored object and wasting hard disk space.
  • A Loki service core cluster is constructed, which can be composed of three nodes that monitor each other's status. If a problem occurs in one or two nodes of the Loki service core cluster, the log collection and aggregation services the cluster provides externally are not affected, and the alarm component alerts the operation and maintenance team or technical team to the fault and requests manual intervention.
  • Fig. 2 is a schematic diagram of a scenario of the embodiment of the present application
  • Fig. 3 is a schematic diagram of the interior of the Loki cluster in the embodiment of the present application.
  • the Loki cluster is externally connected to components such as Promtail, Node Exporter, alarm component, data visualization component, and storage pool (including index Index, log Log, and indicator Metrics).
  • the Prometheus cluster is a tool cluster used to capture the Metrics (indicators) data of the monitoring target.
  • the Loki cluster may include reverse proxy components and Loki core service clusters. The following is a detailed introduction:
  • Loki core service cluster: connects to the monitoring target, and collects and stores the various logs of the monitoring target. It can also serve log queries from the visualization component.
  • 3Node nodes
  • Reverse proxy component: provides the reverse proxy function.
  • Clients (the visualization component and the Promtail component) send requests to the reverse proxy component, and the reverse proxy component forwards each request to an appropriate backend Loki node according to the preset policy.
  • Active/standby disaster recovery can be implemented inside the reverse proxy component.
  • the active and standby are connected by a heartbeat line. For backend Loki nodes.
  • Promtail component: responsible for collecting the logs of monitored objects and sending them to the reverse proxy component of the Loki cluster, so that the reverse proxy component forwards the log transmission request to an appropriate Loki node according to the preset policy; the logs are then processed and stored in the storage pool.
  • Monitoring object: that is, the monitored object, or monitoring target.
  • The Promtail and Node Exporter components need to be installed on the monitoring object so that the Loki cluster and the Prometheus cluster can obtain data.
  • The Promtail component actively pushes log data to the Loki cluster, while the Node Exporter component exposes an HTTP interface of the monitored object to Prometheus, so that Prometheus can scrape metrics periodically.
  • Storage pool: used to store the log information and metrics obtained by the Loki cluster. Considering the read and write performance of log and metric data, it is recommended to use the storage pool of an object storage service. To improve the retrieval efficiency of Loki logs, it is recommended that the Index of the log data retrieved by Loki be placed in a high-performance in-memory database.
  • Alarm component: used to receive alarm information and send it to preset contacts. By configuring different hooks, the alarm information can be sent to various IM communication tools (such as WeChat, Slack, Teams, Lync, etc.), email, phone, SMS, etc.
  • the alert component may specifically use the Alertmanager component, which supports query of logs and indicators, and provides flexible alert methods.
  • When an abnormality is detected in the logs or metrics of the monitored object (such as Metrics information, including CPU usage, memory usage, hard disk usage, network card throughput, hard disk IOPS, API response performance, network request error rate, etc.), the alarm information is pushed to the Alertmanager component.
  • After Alertmanager receives an alert, it can perform configuration, aggregation, deduplication, and noise reduction, and finally send the alert.
  • The Alertmanager component can send alarm information to the operation and maintenance team or technical team by email, SMS, or phone, so that they can find and locate problems in time. It can also link to third-party operation and maintenance monitoring platforms, such as xMatters, Sumo Logic, and Splunk.
  • Data visualization component: used to aggregate and display monitoring data (such as system Metrics data, for example a server's CPU usage, memory usage, hard disk usage, and network card throughput collected every 5 minutes and drawn as a line chart from these values), log data, and alarm information.
  • When the visualization component needs to display log or alarm information, it sends a query request for that information to the reverse proxy component. After receiving the request, the reverse proxy component forwards it to the Query-frontend component, which communicates with the Querier service on each Loki node to obtain the logs or alarm information stored in the storage pool and replies to the reverse proxy component; the result is finally displayed in the interface of the visualization component.
  • the above Loki cluster uses a redundant three-node cluster internally, and uses a reverse proxy component including active and standby disaster recovery externally to provide reverse proxy functions, and only exposes one interface to users.
  • a complete set of Loki core services Distributor, Ingester, and Querier can be deployed on each node. Since these three components belong to the existing technology, they will not be described in detail.
  • a node monitoring component is newly deployed on each node, so that each node can monitor the node status of each node through mutual detection.
  • Because the node monitoring component runs in memory, a UPS power supply can be installed in the storage device, together with a policy under which, when the storage device detects that mains power has been lost for more than 10 minutes, a safe shutdown of the system is forced on UPS power and the data in memory is dumped to disk.
  • the status of nodes can be divided into three types: active nodes, abnormal nodes and unavailable nodes.
  • the node failure detection principle of the node monitoring component may include:
  • After a node (for example, node A) starts, it selects (for example, by polling) another node (for example, node B) and sends it a PING message at regular time intervals.
  • If the PING message fails, node A can send the PING message to node B again, or randomly select another node (such as node C) and send it an indirect PING request. Node C, on receiving the indirect PING request, sends a PING to node B at the address given in the request message and returns the PING result to the source of the indirect request, that is, node A.
  • If node A has not received any ACK message from node B when the detection times out, it marks the status of node B as an abnormal node.
  • Node A starts a timer and broadcasts an abnormal warning for node B. If, during this period, it receives the same abnormal warning for node B from other nodes, node A increments its local abnormal count by 1. If, when the timer expires, the state of node B is still not active and the abnormal count meets the requirement, node A marks node B locally as an unavailable node.
  • If the suspected node (such as node B) receives an abnormal warning about itself sent by other nodes, it immediately broadcasts that it is an active node, thereby clearing the abnormal-node label held for it on the other nodes.
  • When a node (such as node A) leaves the cluster, it broadcasts to the cluster that it is an unavailable node; when node A marks another node (such as node B) as an unavailable node, it likewise broadcasts to the cluster that that node is an unavailable node.
  • When other nodes (such as node C) receive a broadcast message that a node (such as node B) is an unavailable node, they compare it with the local record. If node B is already an unavailable node in the local record, the message is ignored. If node B is not an unavailable node in the local record, the original local record is deleted, node B is marked as an unavailable node, and the message that node B is an unavailable node is broadcast again so that it propagates further.
  • If a node (for example, node B) receives a broadcast message that it is itself an unavailable node, this means the node has been partitioned from the other nodes.
  • The node then broadcasts that it is an active node, in order to correct the incorrect state flag stored for it on the other nodes.
  • The nodes in the cluster are connected to one another, and the status of each node can be confirmed through the node monitoring component. If another node or a node's own service fails, the alarm component can send the alarm information to third-party operation and maintenance monitoring platforms such as xMatters, Sumo Logic, or Splunk, or send it directly to the operation and maintenance team or technical team by email, SMS, or phone, so that they can discover and locate problems in a timely manner.
  • an arbitration mechanism can also be added to the node monitoring component.
  • the arbitration mechanism is realized by mutual monitoring of each node, so as to avoid the risk of data split brain in the cluster under severe conditions.
  • a log aggregation system including a reverse proxy component and a cluster
  • the cluster adopts a distributed design and may include multiple nodes.
  • This achieves log collection, processing and load balancing, and the redundant deployment of multiple nodes together with mutual monitoring between nodes improves the stability of the system and ensures the high availability of the system itself. A problem with one or two nodes does not affect the services the system provides externally. In addition, alerts can be raised for detected problems, so that the operation and maintenance team or technical team can find and locate problems in time.
  • Fig. 4 is a schematic diagram of a method for improving the availability of a log aggregation system provided by an embodiment of the present application.
  • The method can be used for nodes in a log aggregation system. The log aggregation system is used to collect log information of monitored objects and includes a reverse proxy component and a core service cluster composed of a plurality of nodes. The reverse proxy component is used to receive the log information of the monitored object, select a node from the plurality of nodes as the target node according to a first preset policy, and send the log information to the target node. The node is used to perform preset processing on the received log information, and each node monitors the node status of each node through mutual detection.
  • the node status is divided into active nodes, abnormal nodes, and unavailable nodes;
  • the method may include:
  • Step S401: at every first preset time interval, select another node as the first node, and detect whether the first node is an active node.
  • Step S402: if the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and broadcast in the cluster a message that the first node is an abnormal node.
  • Step S403: if a broadcast message that the first node is an active node is received within a second preset time period after sending the broadcast message that the first node is an abnormal node, mark the node status of the first node as an active node.
  • Step S404: if, when the second preset time period expires, the node status of the first node in this node is still an abnormal node, and broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time period, mark the node status of the first node as an unavailable node, and broadcast in the cluster a message that the first node is an unavailable node.
  • the multiple nodes may specifically include at least three nodes.
  • the detecting whether the first node is an active node may specifically include:
  • the broadcast message that the first node is an abnormal node sent by other nodes is received multiple times within the second preset time period may specifically include:
  • the counter is incremented by 1;
  • the method may further include:
  • the method may further include:
  • the log aggregation system further includes:
  • a storage component configured to store data processed by the core service cluster
  • An alarm component configured to send alarm information according to a second preset strategy when it is found that the log information of the monitored object is abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
  • the data visualization component is used to display the log information and/or the alarm information.
  • A method for improving the availability of the log aggregation system is provided. The method is used for nodes in the log aggregation system, and the log aggregation system includes a reverse proxy component and a core service cluster composed of a plurality of nodes.
  • The cluster adopts a distributed design and can include multiple nodes.
  • The reverse proxy component and the cluster cooperate with each other. This not only achieves collection, processing and load balancing of the logs of the monitored objects, but also, through the redundant deployment of multiple nodes and mutual monitoring between nodes, improves the stability of the system and ensures the high availability of the system itself. A problem with one or two nodes does not affect the services the system provides externally. In addition, alerts can be raised for detected problems, so that the operation and maintenance team or technical team can find and locate problems in time.
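
The mutual-detection steps S401 to S404, together with the handling of "active" and "unavailable" broadcasts described for the node monitoring component, can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `NodeMonitor`, `window`, `required`, and `outbox` are hypothetical, the network transport (PING, broadcast delivery) is left out and replaced by method calls, and the default threshold of one confirmation is an assumed value for the "multiple times" condition. The source does not say what happens when the timer expires without enough confirmations; the sketch simply leaves the node marked abnormal.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


@dataclass
class Suspicion:
    deadline: float         # when the second preset duration expires
    confirmations: int = 0  # abnormal broadcasts received from other nodes


@dataclass
class NodeMonitor:
    name: str
    peers: list
    window: float = 5.0  # hypothetical "second preset duration", in seconds
    required: int = 1    # hypothetical confirmation threshold
    states: dict = field(default_factory=dict)
    suspicions: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)  # (kind, subject) broadcasts to send

    def __post_init__(self):
        for peer in self.peers:
            self.states.setdefault(peer, State.ACTIVE)

    def on_probe_result(self, peer, alive, now):
        """S401/S402: record the outcome of probing `peer`."""
        if alive:
            self._mark_active(peer)
        elif self.states[peer] is not State.UNAVAILABLE and peer not in self.suspicions:
            self.states[peer] = State.ABNORMAL
            self.suspicions[peer] = Suspicion(deadline=now + self.window)
            self.outbox.append(("abnormal", peer))

    def on_broadcast(self, kind, subject, now):
        """Handle a broadcast received from another node."""
        if subject == self.name:
            # Refutation: this node is being reported, so it announces it is active.
            if kind in ("abnormal", "unavailable"):
                self.outbox.append(("active", self.name))
            return
        if kind == "active":
            self._mark_active(subject)  # S403
        elif kind == "abnormal":
            suspicion = self.suspicions.get(subject)
            if suspicion is not None and now <= suspicion.deadline:
                suspicion.confirmations += 1  # "the counter is incremented by 1"
        elif kind == "unavailable":
            # Ignore if already recorded; otherwise record and propagate again.
            if self.states.get(subject) is not State.UNAVAILABLE:
                self.states[subject] = State.UNAVAILABLE
                self.suspicions.pop(subject, None)
                self.outbox.append(("unavailable", subject))

    def on_tick(self, now):
        """S404: resolve suspicions whose timer has expired."""
        for peer, suspicion in list(self.suspicions.items()):
            if now >= suspicion.deadline:
                del self.suspicions[peer]
                if suspicion.confirmations >= self.required:
                    self.states[peer] = State.UNAVAILABLE
                    self.outbox.append(("unavailable", peer))

    def _mark_active(self, peer):
        self.states[peer] = State.ACTIVE
        self.suspicions.pop(peer, None)
```

In a three-node cluster, node A would call `on_probe_result("B", alive=False, now=t)` after a failed PING, feed any messages from node C into `on_broadcast`, and call `on_tick` periodically; whatever accumulates in `outbox` is what node A would broadcast to the cluster.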

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Human Computer Interaction (AREA)
  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Abstract

Provided in the present application are a log aggregation system, etc. The system is used for collecting log information of a monitored object, wherein a reverse proxy assembly is used for receiving the log information of the monitored object, selecting, according to a first preset policy, one node from a plurality of nodes of a core service cluster as a target node, and sending the log information to the target node; and the nodes of the core service cluster are used for performing preset processing on the received log information, and each node monitors the node state of another node by means of mutual detection. By means of the present application, a distributed design is used, and a reverse proxy assembly and a cluster cooperate with each other, such that not only can the log collection processing and load balancing of a monitored object be realized, but the stability of the system is improved by means of redundant deployment of a plurality of nodes and mutual monitoring between the nodes, thereby ensuring the high availability of the system itself. In addition, an alarm can be raised regarding a detected problem, thereby facilitating an operation and maintenance team or a technical team to discover and position the problem in a timely manner.
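
The "first preset policy" by which the reverse proxy picks a target node is left open by the application, which names round robin and a least-connection method among the options. A minimal sketch of those two options follows. The class names, the `select` and `release` methods, and the idea of passing in the set of currently healthy nodes are all assumptions made for illustration; the source does not state how the proxy learns node health.

```python
class RoundRobinPolicy:
    """Rotate through the nodes in order, skipping nodes not in `healthy`."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._next = 0

    def select(self, healthy):
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node in healthy:
                return node
        raise RuntimeError("no healthy node available")


class LeastConnectionsPolicy:
    """Pick the healthy node with the fewest in-flight requests."""

    def __init__(self, nodes):
        self.active = {node: 0 for node in nodes}

    def select(self, healthy):
        candidates = [node for node in self.active if node in healthy]
        if not candidates:
            raise RuntimeError("no healthy node available")
        node = min(candidates, key=lambda n: self.active[n])
        self.active[node] += 1  # one more request in flight on this node
        return node

    def release(self, node):
        # Called when the forwarded request has completed.
        self.active[node] -= 1
```

With three nodes, `RoundRobinPolicy(["n1", "n2", "n3"])` hands out `n1`, `n2`, `n3` in turn, and keeps serving from the remaining nodes when one or two are excluded from `healthy`, which matches the stated goal that a problem with one or two nodes should not interrupt the external service.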

Description

A log aggregation system and a method for improving the availability of the log aggregation system
Technical field
The present application relates to the technical field of cloud computing and storage, and in particular to a log aggregation system and a method for improving the availability of the log aggregation system.
Background art
A private cloud is a cloud built with technologies such as cloud computing for the exclusive use of a single customer (such as a large enterprise). An enterprise private cloud combines cloud computing, big data management, and other advanced technologies into a new service model. It can consolidate resources and improve resource utilization while reducing resource consumption and enterprise costs, and has therefore developed rapidly in recent years.
In an enterprise private cloud, processing and managing log data is an important task. For example, a log aggregation and monitoring system (such as Loki) can be used to compress and store unstructured log data and to index only the metadata of the log data (timestamps, labels, and so on).
However, in the course of implementing the solution of the present application, the inventors found that existing log aggregation and monitoring systems lack safeguards for their own high availability. When such a system fails, it cannot detect and report the problem in time, so a large amount of log data may back up on the monitored objects, degrading their performance and wasting disk space.
Summary of the invention
The present application provides a log aggregation system and a method for improving the availability of the log aggregation system, so as to solve the reliability problem of the log aggregation system itself.
According to a first aspect of the embodiments of the present application, a log aggregation system is provided. The log aggregation system is used to collect log information of monitored objects, and includes a reverse proxy component and a core service cluster composed of multiple nodes;
The reverse proxy component is used to receive the log information of a monitored object, select one node from the multiple nodes of the core service cluster as a target node according to a first preset policy, and send the log information to the target node;
The nodes of the core service cluster are used to perform preset processing on the received log information, and the nodes probe one another to monitor the node state of each node.
Optionally, the multiple nodes include at least three nodes.
Optionally, the node state is one of active node, abnormal node, and unavailable node;
The nodes probing one another to monitor the node state of each node includes:
For each node:
At intervals of a first preset duration, selecting one other node as a first node, and probing whether the first node is an active node;
If the first node is not an active node, marking the node state of the first node as abnormal node on the present node, and sending within the cluster a broadcast message stating that the first node is an abnormal node;
If a broadcast message stating that the first node is an active node is received within a second preset duration after the broadcast message stating that the first node is an abnormal node was sent, marking the node state of the first node as active node;
If, when the second preset duration expires, the node state of the first node on the present node is still abnormal node, and broadcast messages stating that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration, marking the node state of the first node as unavailable node, and sending within the cluster a broadcast message stating that the first node is an unavailable node.
Optionally, probing whether the first node is an active node includes:
Sending a probe message to the first node;
If no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to the present node, wherein the indirect probe request includes the address of the first node;
If a correct response from the first node is still not received, determining that the first node is not an active node.
Optionally, receiving broadcast messages from other nodes stating that the first node is an abnormal node multiple times within the second preset duration includes:
Starting a counter after sending the broadcast message stating that the first node is an abnormal node;
Within the second preset duration, incrementing the counter by 1 each time a broadcast message stating that the first node is an abnormal node is received from another node;
When the counter is greater than a preset value, determining that broadcast messages stating that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration.
Optionally, the nodes probing one another to monitor the node state of each node further includes:
For each node:
When a broadcast message stating that a third node is an unavailable node is received, if the third node is not yet marked as unavailable node on the present node, marking the third node as unavailable node on the present node, and sending within the cluster a broadcast message stating that the third node is an unavailable node, so that the message is propagated further;
When a broadcast message stating that the present node is an abnormal node, or that the present node is an unavailable node, is received, sending within the cluster a broadcast message stating that the present node is an active node, so as to correct the node state that other nodes have recorded for the present node.
Optionally, the nodes probing one another to monitor the node state of each node further includes:
For each node:
When the present node leaves the cluster, sending within the cluster a broadcast message stating that the present node is an unavailable node.
Optionally, the system further includes:
A storage component, used to store the data processed by the core service cluster;
An alarm component, used to send alarm information according to a second preset policy when the log information of a monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;
A data visualization component, used to display the log information and/or the alarm information.
According to a second aspect of the embodiments of the present application, a method for improving the availability of a log aggregation system is provided. The method is performed by a node in the log aggregation system. The log aggregation system is used to collect log information of monitored objects, and includes a reverse proxy component and a core service cluster composed of a plurality of the nodes. The reverse proxy component is used to receive the log information of a monitored object, select one node from the plurality of nodes as a target node according to a first preset policy, and send the log information to the target node. The nodes are used to perform preset processing on the received log information, and the nodes probe one another to monitor the node state of each node, the node state being one of active node, abnormal node, and unavailable node;
The method includes:
At intervals of a first preset duration, selecting one other node as a first node, and probing whether the first node is an active node;
If the first node is not an active node, marking the node state of the first node as abnormal node on the present node, and sending within the cluster a broadcast message stating that the first node is an abnormal node;
If a broadcast message stating that the first node is an active node is received within a second preset duration after the broadcast message stating that the first node is an abnormal node was sent, marking the node state of the first node as active node;
If, when the second preset duration expires, the node state of the first node on the present node is still abnormal node, and broadcast messages stating that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration, marking the node state of the first node as unavailable node, and sending within the cluster a broadcast message stating that the first node is an unavailable node.
Optionally, the multiple nodes include at least three nodes.
Optionally, probing whether the first node is an active node includes:
Sending a probe message to the first node;
If no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to the present node, wherein the indirect probe request includes the address of the first node;
If a correct response from the first node is still not received, determining that the first node is not an active node.
Optionally, receiving broadcast messages from other nodes stating that the first node is an abnormal node multiple times within the second preset duration includes:
Starting a counter after sending the broadcast message stating that the first node is an abnormal node;
Within the second preset duration, incrementing the counter by 1 each time a broadcast message stating that the first node is an abnormal node is received from another node;
When the counter is greater than a preset value, determining that broadcast messages stating that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration.
Optionally, the method further includes:
When a broadcast message stating that a third node is an unavailable node is received, if the third node is not yet marked as unavailable node on the present node, marking the third node as unavailable node on the present node, and sending within the cluster a broadcast message stating that the third node is an unavailable node, so that the message is propagated further;
When a broadcast message stating that the present node is an abnormal node, or that the present node is an unavailable node, is received, sending within the cluster a broadcast message stating that the present node is an active node, so as to correct the node state that other nodes have recorded for the present node.
Optionally, the method further includes:
When the present node leaves the cluster, sending within the cluster a broadcast message stating that the present node is an unavailable node.
Optionally, the log aggregation system further includes:
A storage component, used to store the data processed by the core service cluster;
An alarm component, used to send alarm information according to a second preset policy when the log information of a monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;
A data visualization component, used to display the log information and/or the alarm information.
The technical solutions provided by the embodiments of the present application may have the following beneficial effects:
The embodiments of the present application provide a log aggregation system that includes a reverse proxy component and a cluster. The cluster adopts a distributed design and may include multiple nodes, and the reverse proxy component and the cluster cooperate with each other. The system can therefore collect and process the logs of the monitored objects with load balancing, and the redundant deployment of multiple nodes together with mutual monitoring between nodes improves the stability of the system and ensures the high availability of the system itself. A problem on one or two nodes does not affect the services the system provides externally. In addition, detected problems can raise alarms, so that the operation and maintenance team or technical team can find and locate problems in time.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present application.
Description of drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort. In addition, these introductions do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a schematic diagram of a log aggregation system provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of a scenario of an embodiment of the present application;
Fig. 3 is a schematic diagram of the interior of the Loki cluster in an embodiment of the present application;
Fig. 4 is a schematic diagram of a method for improving the availability of a log aggregation system provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described in detail below with reference to the drawings in the embodiments of the present application. Where the drawings are referred to, the same numerals in different drawings denote the same or similar elements unless otherwise stated. Obviously, the embodiments described below are only some of the embodiments of the present application, not all of them, and the implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Where the terms "first", "second", "third", and the like appear in the description, the claims, and the above drawings of the embodiments of the present application, they are used to distinguish different objects, not to define a specific order. In the embodiments of the present application, words such as "exemplary" or "for example" are used to present an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be construed as more preferred or more advantageous than other embodiments or designs. Rather, such words are intended to present the related concepts in a concrete manner.
Fig. 1 is a schematic diagram of a log aggregation system provided by an embodiment of the present application. The log aggregation system can be used to collect log information of monitored objects. The log aggregation system includes a reverse proxy component and a core service cluster composed of multiple nodes.
The reverse proxy component is used to receive the log information of a monitored object, select one node from the multiple nodes of the core service cluster as a target node according to a first preset policy, and send the log information to the target node.
This embodiment does not limit the specific policy by which the reverse proxy component selects a target node from the multiple nodes of the core service cluster, that is, the specific content of the first preset policy. Those skilled in the art may choose and design it according to different requirements and scenarios, and any such choice or design that can be used here remains within the spirit and scope of protection of the present application.
As an example, the first preset policy may include one or more of round robin, allocation by node weight, least connections, best performance, and similar policies.
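As an illustration only, three of the policies named above can be sketched in Python as follows. The `Node` class, its `weight` and `active_connections` fields, and the function names are assumptions made for this sketch; the application itself does not prescribe any data structure or policy implementation.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    weight: int = 1               # hypothetical static weight
    active_connections: int = 0   # hypothetical live connection count


def round_robin(nodes):
    """Return an endless iterator that visits the nodes in a fixed rotation."""
    return itertools.cycle(nodes)


def weighted_rotation(nodes):
    """Rotate over the nodes, repeating each node `weight` times per cycle."""
    expanded = [n for n in nodes for _ in range(n.weight)]
    return itertools.cycle(expanded)


def least_connections(nodes):
    """Pick the node that currently holds the fewest connections."""
    return min(nodes, key=lambda n: n.active_connections)
```

A reverse proxy built this way would call `next()` on one of the rotations, or `least_connections()`, once per incoming batch of log information to obtain the target node.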
The nodes of the core service cluster are used to perform preset processing on the received log information, and the nodes probe one another to monitor the node state of each node.
As an example, the multiple nodes may include at least three nodes. A node may be, for example, a server.
In addition, this embodiment does not limit the processing that a node performs on the received log information, that is, the specific content of the preset processing.
As an example, the preset processing may include parsing logs, indexing log content, adding labels, deduplication, compressed storage, and so on. For example, the date of a log can be extracted as an index so that logs can later be retrieved by date. As another example, the source of a log can be extracted as an index, for instance to distinguish logs of the Apache service from logs of a database. Logs can also be labelled to distinguish Metrics from Log, or to distinguish the Info, Warning, and Error levels, and so on.
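The preset processing listed above (parsing, indexing by date and source, labelling by level, deduplication, compression) might look roughly like the following Python sketch. The line layout matched by `LINE_RE`, the field names, and the choice of `zlib` are assumptions for illustration and are not taken from the application or from Loki.

```python
import re
import zlib
from datetime import datetime

# Hypothetical line layout: "<ISO timestamp> <LEVEL> <source>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARNING|ERROR)\s+"
    r"(?P<source>[\w.-]+):\s+(?P<msg>.*)$"
)


def preprocess(lines):
    """Parse, deduplicate, label, and compress raw log lines."""
    seen = set()
    index = []      # metadata only: date, source, level, payload position
    payloads = []   # compressed message bodies
    for line in lines:
        if line in seen:      # drop exact duplicates
            continue
        seen.add(line)
        m = LINE_RE.match(line)
        if m is None:         # this sketch simply skips unparseable lines
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
        index.append({
            "date": ts.date().isoformat(),   # index for retrieval by date
            "source": m["source"],           # e.g. apache vs. database
            "level": m["level"],             # Info / Warning / Error label
            "pos": len(payloads),
        })
        payloads.append(zlib.compress(m["msg"].encode("utf-8")))
    return index, payloads
```

Only the small `index` entries are searchable here, while the message bodies stay compressed, which mirrors the idea of indexing metadata rather than full log content.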
In this embodiment or in certain other embodiments of the present application, the node state may be one of three types: active node, abnormal node, and unavailable node;
The nodes probing one another to monitor the node state of each node may include:
For each node:
1) At intervals of a first preset duration, selecting one other node as a first node, and probing whether the first node is an active node.
For example, the first node may be selected from the nodes in the cluster other than the present node by round robin, at random, or in a similar manner.
This embodiment does not limit how exactly to probe whether the first node is an active node. As an example, probing whether the first node is an active node may include:
Sending a probe message to the first node (for example, by pinging the first node);
If no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to the present node, wherein the indirect probe request includes the address of the first node;
If a correct response from the first node is still not received, determining that the first node is not an active node.
2) If the first node is not an active node, marking the node state of the first node as abnormal node on the present node, and sending within the cluster a broadcast message stating that the first node is an abnormal node.
3) If a broadcast message stating that the first node is an active node is received within a second preset duration after the broadcast message stating that the first node is an abnormal node was sent, marking the node state of the first node as active node.
4) If, when the second preset duration expires, the node state of the first node on the present node is still abnormal node, and broadcast messages stating that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration, marking the node state of the first node as unavailable node, and sending within the cluster a broadcast message stating that the first node is an unavailable node.
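The probe in step 1), a direct probe followed by either a retry or an indirect probe through a randomly chosen second node, can be sketched as below. This is a minimal illustration: `send_probe` and `send_indirect_probe` are hypothetical transport callbacks supplied by the caller, and the `fallback` switch is an assumption used to show both alternatives.

```python
import random


def is_active(target, peers, send_probe, send_indirect_probe,
              fallback="indirect", rng=random):
    """Return True if `target` answers a direct probe or the chosen fallback.

    send_probe(addr) -> bool and send_indirect_probe(helper, addr) -> bool
    are hypothetical callbacks; they return True on a correct response.
    """
    if send_probe(target):
        return True
    if fallback == "retry":
        # First alternative: simply probe the first node again.
        return send_probe(target)
    # Second alternative: ask a randomly selected second node to probe it.
    helpers = [p for p in peers if p != target]
    if not helpers:
        return False
    helper = rng.choice(helpers)
    # The indirect request carries the first node's address.
    return send_indirect_probe(helper, target)
```

When `is_active` returns False, the caller would proceed to step 2) and mark the first node as abnormal. The indirect path helps separate a failure of the first node from a failure of the link between the present node and the first node.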
As an example, receiving broadcast messages from other nodes stating that the first node is an abnormal node multiple times within the second preset duration may include:
Starting a counter after sending the broadcast message stating that the first node is an abnormal node;
Within the second preset duration, incrementing the counter by 1 each time a broadcast message stating that the first node is an abnormal node is received from another node;
When the counter is greater than a preset value, determining that broadcast messages stating that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration.
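The escalation from abnormal node to unavailable node, including the counter, can be sketched as a small state holder. The class and parameter names (`Suspicion`, `window`, `threshold`) are assumptions for this sketch, and time is passed in explicitly so the logic stays independent of any clock. The text does not say what happens when the window expires with too few confirmations, so this sketch leaves the state as abnormal in that case.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


class Suspicion:
    """Tracks one suspected peer during the second preset duration."""

    def __init__(self, peer, started_at, window, threshold):
        self.peer = peer
        self.deadline = started_at + window  # end of the second preset duration
        self.threshold = threshold           # the "preset value"
        self.count = 0                       # started after our own broadcast
        self.state = State.ABNORMAL

    def on_broadcast(self, peer, state, now):
        """Handle a broadcast about `peer` that arrived at time `now`."""
        if peer != self.peer or now > self.deadline:
            return
        if self.state is not State.ABNORMAL:
            return
        if state is State.ACTIVE:
            self.state = State.ACTIVE        # cleared within the window
        elif state is State.ABNORMAL:
            self.count += 1                  # another node agrees

    def on_deadline(self):
        """Call once when the window expires; returns the resulting state."""
        if self.state is State.ABNORMAL and self.count > self.threshold:
            self.state = State.UNAVAILABLE   # caller would broadcast this
        return self.state
```

Requiring confirmations from other nodes before declaring a peer unavailable keeps a single node with a bad network link from removing a healthy peer on its own.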
This embodiment does not limit the specific values of the first preset duration, the second preset duration, "multiple times", the preset value, and so on. Those skilled in the art may choose and design them according to different requirements and scenarios, and any such choice or design that can be used here remains within the spirit and scope of protection of the present application.
The above mainly describes operations by which a node actively probes the state of other nodes. In practice, a node may also be probed by other nodes or receive messages broadcast by other nodes. Therefore, the nodes probing one another to monitor the node state of each node may further include:
For each node:
When a broadcast message stating that a third node is an unavailable node is received, if the third node is not yet marked as unavailable node on the present node, marking the third node as unavailable node on the present node, and sending within the cluster a broadcast message stating that the third node is an unavailable node, so that the message is propagated further;
When a broadcast message stating that the present node is an abnormal node, or that the present node is an unavailable node, is received, sending within the cluster a broadcast message stating that the present node is an active node, so as to correct the node state that other nodes have recorded for the present node.
In addition, when a node leaves the cluster, it can inform the other nodes in the cluster that it is now an unavailable node. Therefore, the nodes probing one another to monitor the node state of each node may further include:
For each node:
When the present node leaves the cluster, sending within the cluster a broadcast message stating that the present node is an unavailable node.
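The passive behaviours described above (re-propagating an unavailable notice, correcting a wrong report about the present node, and announcing a departure) can be sketched together. The `Member` class, the string state names, and the `broadcast` callback are hypothetical and only serve the illustration.

```python
class Member:
    """Passive side of the mutual monitoring for one node."""

    def __init__(self, name, broadcast):
        self.name = name
        self.broadcast = broadcast   # hypothetical callback: (subject, state)
        self.view = {}               # peer name -> last recorded state

    def on_broadcast(self, subject, state):
        """React to a broadcast that says `subject` is in `state`."""
        if subject == self.name:
            if state in ("abnormal", "unavailable"):
                # Correct the other nodes: this node is still active.
                self.broadcast(self.name, "active")
            return
        if state == "unavailable" and self.view.get(subject) != "unavailable":
            self.view[subject] = "unavailable"
            # Re-propagate once; the check above prevents endless re-sending.
            self.broadcast(subject, "unavailable")

    def leave(self):
        """Announce a voluntary departure from the cluster."""
        self.broadcast(self.name, "unavailable")
```

Because a node re-broadcasts an unavailable notice only the first time it learns of it, the notice spreads through the cluster without looping indefinitely.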
In addition, to facilitate storage, alarming, monitoring, and so on, the system may further include:
A storage component, used to store the data processed by the core service cluster;
An alarm component, used to send alarm information according to a second preset policy when the log information of a monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;
A data visualization component, used to display the log information and/or the alarm information.
This embodiment provides a log aggregation system that includes a reverse proxy component and a cluster. The cluster adopts a distributed design and may include multiple nodes, and the reverse proxy component and the cluster cooperate with each other. The system can therefore collect and process the logs of the monitored objects with load balancing, and the redundant deployment of multiple nodes together with mutual monitoring between nodes improves the stability of the system and ensures the high availability of the system itself. A problem on one or two nodes does not affect the services the system provides externally. In addition, detected problems can raise alarms, so that the operation and maintenance team or technical team can find and locate problems in time.
The solution of the present application is further described below, taking Loki as an example and referring to a specific application scenario. The Loki scenario is of course only illustrative. In practice the solution also applies to other scenarios; for example, the three-node deployment described in the present application can also be used to ensure the high availability of a Prometheus cluster.
Loki is a log aggregation system that compresses and stores unstructured log data and indexes only the metadata of the log data (including timestamps, labels, and so on). The Loki service receives the log data pushed by the Promtail component and distributes it to the internal Ingester component, which stores the index data and the log content data. Unlike Prometheus, Loki can only passively accept logs pushed by the Promtail component.
However, in the course of implementing the solution of the present application, the inventors found that the main Loki service cannot ensure its own high availability. When Loki itself fails and a monitored object also fails, Loki cannot detect and report the problem in time, and a large amount of log data backs up on the monitored object, degrading its performance and wasting disk space.
Therefore, in this embodiment, a Loki service core cluster is constructed. The cluster may consist of three nodes that monitor one another's state. If one or two nodes in the Loki service core cluster fail, the log collection and aggregation service that the cluster provides externally is not affected, and the fault can be reported to the operation and maintenance team or technical team through the alarm component to request manual intervention.
Fig. 2 is a schematic diagram of the scenario of this embodiment, and Fig. 3 is a schematic diagram of the inside of the Loki cluster in this embodiment. In Fig. 2, the Loki cluster is connected externally to components such as Promtail, Node Exporter, the alarm component, the data visualization component, and the storage pool (which holds the index, logs, and metrics). The Prometheus cluster is a cluster of tools for scraping metrics data from monitoring targets. Promtail, Node Exporter, and Prometheus are all prior art and are not described further here. In Fig. 3, the Loki cluster may internally include a reverse proxy component, a Loki core service cluster, and so on. Each is described below:
Loki core service cluster: connects to the monitoring targets and collects and stores their logs. It can also serve log queries from the visualization component. The cluster uses three nodes for redundant networking, all of which sit behind the reverse proxy component.
Reverse proxy component: clients (the visualization component and the Promtail component) send requests to the reverse proxy component, which forwards each request to a suitable back-end Loki node according to a preset policy. In addition, to ensure the high availability of the reverse proxy component itself, active/standby disaster recovery can be set up inside it, with the active and standby machines connected by a heartbeat link. When the active machine goes down, the standby machine starts working immediately and reports its own status to the back-end Loki nodes.
Promtail component: collects the logs of the monitored object and sends them to the reverse proxy component of the Loki cluster, so that the reverse proxy component forwards the log delivery request to a suitable Loki node according to the preset policy. After processing, the logs are stored in the storage pool.
Monitored object: also called the monitoring target. The Promtail and Node Exporter components must both be installed on the monitored object so that the Loki cluster and the Prometheus cluster can obtain data. The Promtail component actively pushes log data to the Loki cluster, whereas the Node Exporter component exposes an HTTP interface on the monitored object to Prometheus so that Prometheus can scrape metrics periodically.
Storage pool: stores the log information and metrics obtained by the Loki cluster. Considering the read and write performance of log and metrics data, a storage pool backed by an object storage service is recommended. To improve the retrieval efficiency of Loki logs, it is recommended that the index of the log data obtained by Loki be kept in a high-performance in-memory database.
Alarm component: receives alarm information and sends it to preset contacts. By creating different hooks, alarm information can be sent to various IM tools (such as WeChat, Slack, Teams, and Lync), as well as by email, phone, SMS, and so on.
Further, as an example, the alarm component may be implemented with the Alertmanager component, which supports querying logs and metrics and provides flexible alerting methods. According to rules set in advance, when an anomaly is detected in the logs or metrics of a monitored object (for example metrics such as CPU usage, memory usage, disk usage, NIC throughput, disk IOPS, API response performance, and network request error rate), alarm information is pushed to the Alertmanager component. On receiving an alert, Alertmanager can aggregate, deduplicate, and denoise alerts according to its configuration before finally sending them. The Alertmanager component can send alarm information to the operations team or technical team by email, SMS, phone, and similar channels so that they can find and locate problems promptly. It can also be linked to third-party operations monitoring platforms such as xMatters, Sumo Logic, and Splunk.
Data visualization component: aggregates and displays the monitoring data, the log data, and the alarm information. The monitoring data is, for example, system metrics such as the CPU usage, memory usage, disk usage, and NIC throughput of a server sampled every 5 minutes, which can be plotted as a line chart. When the visualization component needs to display log or alarm information, it sends a query request to the reverse proxy component. The reverse proxy component forwards the request to the Query-frontend component, which communicates with the Querier service on each Loki node to fetch the logs or alarm information held in the storage pool and returns them to the reverse proxy component. The results are finally shown in the visualization component's interface.
As described above, the Loki cluster uses a redundant three-node cluster internally, and externally provides reverse proxying through a reverse proxy component with active/standby disaster recovery, exposing only a single interface to users.
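The forwarding behaviour of the reverse proxy component can be illustrated with a minimal Python sketch. The text only says that requests are forwarded according to a preset policy, so the round-robin rule, the class name `ReverseProxy`, the node names, and the status strings below are all assumptions made for illustration, not the policy of the application.

```python
class ReverseProxy:
    """Hypothetical forwarding policy: round robin over nodes marked active."""

    def __init__(self, nodes):
        # Every node starts out as "active"; the status strings are illustrative.
        self.status = {node: "active" for node in nodes}
        self._order = list(nodes)
        self._next = 0

    def mark(self, node, status):
        """Record a status reported for a back-end node."""
        self.status[node] = status

    def pick_target(self):
        """Return the next active node, skipping abnormal or unavailable ones."""
        for _ in range(len(self._order)):
            node = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self.status[node] == "active":
                return node
        raise RuntimeError("no active Loki node available")


proxy = ReverseProxy(["loki-1", "loki-2", "loki-3"])
first_round = [proxy.pick_target() for _ in range(3)]
proxy.mark("loki-2", "unavailable")
second_round = [proxy.pick_target() for _ in range(3)]
```

With one of the three nodes marked unavailable, requests continue to be spread over the remaining two, which is the behaviour the three-node redundancy is meant to provide.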
In a specific implementation, a complete set of Loki core services (Distributor, Ingester, and Querier) can be deployed on every node. These three components are prior art and are not described further. In this embodiment, a node monitoring component is additionally deployed on each node so that the nodes probe one another and thereby monitor the status of every node.
In addition, because the node monitoring component runs in memory, a UPS can be added to the storage device, with a mandatory policy that when the UPS detects that the storage device has been without mains power continuously for more than a set time (for example 10 minutes), it performs a safe system shutdown and flushes in-memory data to disk.
A node can be in one of three states: active node, abnormal node, or unavailable node. The node failure detection principle of the node monitoring component may include:
i) After a node (for example node A) starts, at regular intervals it selects (for example by polling in turn) another node (for example node B) and sends it a PING message. If the PING fails, node A can either send a PING to node B again or randomly select another node (for example node C) and send it an indirect PING request. Node C, on receiving the indirect PING request, sends a PING to node B using the address in the request and returns the result to the source of the indirect request, node A. If the probe times out without node A having received any ACK from node B, node A marks node B as an abnormal node. Node A then starts a timer and broadcasts an abnormal warning about node B. While the timer runs, each time node A receives the same abnormal warning about node B from another node, it increments its local abnormal count for node B by 1. If, when the timer expires, node B is still not active and the abnormal count has reached the required value, node A locally marks node B as an unavailable node.
ii) When the suspected node (for example node B) receives an abnormal warning about itself from another node, it immediately broadcasts that it is an active node, thereby clearing the abnormal mark for it on the other nodes.
iii) When a node (for example node A) leaves the cluster, it broadcasts to the cluster that it is an unavailable node. When node A marks another node (for example node B) as unavailable, it likewise broadcasts to the cluster that node B is an unavailable node.
iv) When another node (for example node C) receives a broadcast that some node (for example node B) is unavailable, it compares the message with its local record. If node B is already recorded locally as unavailable, node C ignores the message. Otherwise, node C deletes its original local record, marks node B as unavailable, and rebroadcasts the message that node B is unavailable so that it propagates further.
v) If a node (for example node B) receives a broadcast saying that it is itself unavailable, this indicates that it has been network-partitioned from the other nodes. The node then broadcasts that it is an active node in order to correct the status recorded for it on the other nodes.
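Rules ii), iv), and v) above describe how a node reacts to a broadcast it receives. The following Python sketch shows one way these reactions could be expressed. The names `State`, `NodeView`, and `handle_broadcast` are hypothetical, and a broadcast is modelled simply as a (subject node, claimed state) pair, which is an assumption about the message format.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


class NodeView:
    """Hypothetical local record that one node keeps about its peers."""

    def __init__(self, self_id, peers):
        self.self_id = self_id
        self.states = {peer: State.ACTIVE for peer in peers}

    def handle_broadcast(self, subject, claimed):
        """Apply one received broadcast; return the broadcasts to send in reply."""
        if subject == self.self_id:
            # Rules ii) and v): refute any claim that this node is not active.
            if claimed in (State.ABNORMAL, State.UNAVAILABLE):
                return [(self.self_id, State.ACTIVE)]
            return []
        if claimed is State.ACTIVE:
            # The subject announced it is alive: clear the local mark.
            self.states[subject] = State.ACTIVE
            return []
        if claimed is State.UNAVAILABLE:
            # Rule iv): ignore if already known, otherwise record and rebroadcast.
            if self.states.get(subject) is State.UNAVAILABLE:
                return []
            self.states[subject] = State.UNAVAILABLE
            return [(subject, State.UNAVAILABLE)]
        # Abnormal warnings about other nodes are counted by the timer in
        # rule i), which this sketch does not model.
        return []


view = NodeView("C", ["A", "B"])
first = view.handle_broadcast("B", State.UNAVAILABLE)   # recorded and rebroadcast
second = view.handle_broadcast("B", State.UNAVAILABLE)  # already known, ignored
refute = view.handle_broadcast("C", State.UNAVAILABLE)  # node C refutes the claim
```

Ignoring a message whose content is already recorded locally is what stops the rebroadcast in rule iv) from circulating indefinitely.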
In this way, the nodes in the cluster are interconnected and can confirm node status through the node monitoring component. If another node or a node's own service fails, the alarm component can send alarm information to a third-party operations monitoring platform such as xMatters, Sumo Logic, or Splunk, or send it directly to the operations team or technical team by email, SMS, phone, and similar channels so that they can find and locate problems promptly.
In addition, to ensure data consistency, an arbitration mechanism can be added to the node monitoring component. Arbitration is implemented by the nodes monitoring one another, which avoids the risk of data split-brain in the cluster under adverse conditions.
This embodiment provides a log aggregation system that includes a reverse proxy component and a cluster. The cluster has a distributed design and may include multiple nodes. The reverse proxy component and the cluster work together not only to collect and process the logs of monitored objects with load balancing, but also, through redundant deployment of multiple nodes and mutual monitoring among them, to improve system stability and ensure the high availability of the system itself. If one or two nodes fail, the services the system provides externally are not affected. Detected problems can also be raised as alarms so that the operations team or technical team can find and locate them promptly.
The following are method embodiments of this application, which may be used to carry out the system embodiments of this application. For details not disclosed in the method embodiments, refer to the system embodiments of this application.
Fig. 4 is a schematic diagram of a method for improving the availability of a log aggregation system according to an embodiment of this application. The method is applied to a node in the log aggregation system. The log aggregation system is used to collect log information of monitored objects and includes a reverse proxy component and a core service cluster composed of a plurality of the nodes. The reverse proxy component is used to receive the log information of a monitored object, select one of the plurality of nodes as a target node according to a first preset policy, and send the log information to the target node. Each node is used to perform preset processing on the log information it receives, and the nodes probe one another to monitor the node status of every node, the node status being one of active node, abnormal node, and unavailable node.
Referring to Fig. 4, the method may include:
Step S401: every first preset duration, select one other node as a first node and probe whether the first node is an active node.
Step S402: if the first node is not an active node, mark the node status of the first node as abnormal node on this node, and send within the cluster a broadcast message that the first node is an abnormal node.
Step S403: if, within a second preset duration after sending the broadcast message that the first node is an abnormal node, a broadcast message that the first node is an active node is received, mark the node status of the first node as active node.
Step S404: if, when the second preset duration expires, the node status of the first node on this node is still abnormal node, and broadcast messages that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration, mark the node status of the first node as unavailable node and send within the cluster a broadcast message that the first node is an unavailable node.
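Steps S402 to S404 can be read as a small state machine that this node keeps for the first node. The Python sketch below is one possible rendering under stated assumptions: the class `SuspicionTracker` and its method names are hypothetical, time is passed in explicitly as a number of seconds so that the logic stays deterministic, and `window` and `min_reports` stand in for the second preset duration and the preset value that the count must exceed.

```python
class SuspicionTracker:
    """Hypothetical per-peer record kept by the probing node."""

    def __init__(self, window, min_reports):
        self.window = window            # second preset duration, in seconds
        self.min_reports = min_reports  # preset value the report count must exceed
        self.state = "active"
        self.deadline = None
        self.reports = 0

    def probe_failed(self, now):
        """S402: mark the peer abnormal, open the window, and broadcast it."""
        self.state = "abnormal"
        self.deadline = now + self.window
        self.reports = 0
        return "abnormal"  # broadcast to send within the cluster

    def peer_reported_abnormal(self, now):
        """Count an abnormal broadcast about the same peer from another node."""
        if self.state == "abnormal" and now <= self.deadline:
            self.reports += 1

    def alive_broadcast(self, now):
        """S403: the peer announced it is active within the window."""
        if self.state == "abnormal" and now <= self.deadline:
            self.state = "active"
            self.deadline = None

    def window_expired(self, now):
        """S404: promote to unavailable if still abnormal with enough reports."""
        if (self.state == "abnormal" and now >= self.deadline
                and self.reports > self.min_reports):
            self.state = "unavailable"
            return "unavailable"  # broadcast to send within the cluster
        # The text does not say what happens otherwise; this sketch leaves
        # the recorded state unchanged.
        return None


failed = SuspicionTracker(window=10, min_reports=1)
failed.probe_failed(now=0)
failed.peer_reported_abnormal(now=2)
failed.peer_reported_abnormal(now=5)
outcome = failed.window_expired(now=10)

recovered = SuspicionTracker(window=10, min_reports=1)
recovered.probe_failed(now=0)
recovered.alive_broadcast(now=3)
```

Requiring reports from other nodes before declaring a peer unavailable means that a single node with a faulty link cannot remove a healthy peer on its own.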
In this embodiment or certain other embodiments of this application, the plurality of nodes may specifically include at least three nodes.
In this embodiment or certain other embodiments of this application, probing whether the first node is an active node may specifically include:
sending a probe message to the first node;
if no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending the second node an indirect probe request, so that the second node sends a probe message to the first node and returns the probe result to this node, wherein the indirect probe request includes the address of the first node;
if a correct response from the first node still has not been received, determining that the first node is not an active node.
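The probing procedure above can be sketched in Python as follows. The function `is_active` and the caller-supplied callables `ping` and `indirect_ping` are hypothetical placeholders for the real network operations. The text allows either a second direct probe or an indirect probe after the first failure; this sketch arbitrarily prefers the indirect probe whenever another node is available and falls back to a direct retry otherwise.

```python
import random


def is_active(target, helpers, ping, indirect_ping, rng=random):
    """Return True if the target answers a direct or a follow-up probe.

    ping(target) and indirect_ping(helper, target) are supplied by the caller
    and return True when a correct response was received.
    """
    if ping(target):
        return True
    if helpers:
        # Ask a randomly chosen second node to probe the target on our behalf.
        helper = rng.choice(list(helpers))
        return bool(indirect_ping(helper, target))
    # No other node to ask: probe the target directly once more.
    return bool(ping(target))


calls = []


def failing_ping(target):
    calls.append(("direct", target))
    return False


def relayed_ping(helper, target):
    calls.append(("via", helper, target))
    return True


reachable_via_c = is_active("B", ["C"], failing_ping, relayed_ping)
```

An indirect probe helps distinguish a failed target from a failed link between the prober and the target, since the second node reaches the target over a different path.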
In this embodiment or certain other embodiments of this application, receiving, multiple times within the second preset duration, broadcast messages sent by other nodes that the first node is an abnormal node may specifically include:
starting a counter after sending the broadcast message that the first node is an abnormal node;
within the second preset duration, incrementing the counter by 1 each time a broadcast message that the first node is an abnormal node is received from another node;
when the counter is greater than a preset value, determining that broadcast messages that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration.
In this embodiment or certain other embodiments of this application, the method may further include:
when a broadcast message that a third node is an unavailable node is received, if the third node is not marked as an unavailable node on this node, marking the third node as an unavailable node on this node and sending within the cluster a broadcast message that the third node is an unavailable node, so that the message propagates further;
when a broadcast message that this node is an abnormal node, or that this node is an unavailable node, is received, sending within the cluster a broadcast message that this node is an active node, so as to correct the node status recorded for this node by the other nodes.
In this embodiment or certain other embodiments of this application, the method may further include:
when this node leaves the cluster, sending within the cluster a broadcast message that this node is an unavailable node.
In this embodiment or certain other embodiments of this application, the log aggregation system further includes:
a storage component for storing the data processed by the core service cluster;
an alarm component for sending alarm information according to a second preset policy when the log information of a monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;
a data visualization component for displaying the log information and/or the alarm information.
As for the method in the above embodiment, the specific manner in which each step operates has been described in detail in the embodiments of the corresponding system and is not repeated here.
This embodiment provides a method for improving the availability of a log aggregation system. The method is applied to a node in the log aggregation system, which includes a reverse proxy component and a core service cluster composed of a plurality of the nodes. The cluster has a distributed design and may include multiple nodes. The reverse proxy component and the cluster work together not only to collect and process the logs of monitored objects with load balancing, but also, through redundant deployment of multiple nodes and mutual monitoring among them, to improve system stability and ensure the high availability of the system itself. If one or two nodes fail, the services the system provides externally are not affected. Detected problems can also be raised as alarms so that the operations team or technical team can find and locate them promptly.
The above are merely preferred embodiments of this application and do not limit this application in any form. Although this application has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of this application, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent replacement, or improvement made to the above embodiments in accordance with the technical essence of the technical solution of this application, and within its spirit and principles, still falls within the scope of protection of the technical solution of this application.
Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the solution disclosed here. This application is intended to cover any variations, uses, or adaptations of this application that follow its general principles and include common general knowledge or customary technical means in the art that are not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of this application indicated by the appended claims.
It should be understood that this application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims (15)

  1. A log aggregation system, characterized in that the log aggregation system is used to collect log information of monitored objects, and the log aggregation system comprises a reverse proxy component and a core service cluster composed of a plurality of nodes;
    the reverse proxy component is used to receive the log information of a monitored object, select one node from the plurality of nodes of the core service cluster as a target node according to a first preset policy, and send the log information to the target node;
    the nodes of the core service cluster are used to perform preset processing on the received log information, and the nodes probe one another to monitor the node status of every node.
  2. The system according to claim 1, characterized in that the plurality of nodes comprises at least three nodes.
  3. The system according to claim 1, characterized in that the node status is one of active node, abnormal node, and unavailable node;
    the nodes probing one another to monitor the node status of every node comprises:
    for each node:
    every first preset duration, selecting one other node as a first node and probing whether the first node is an active node;
    if the first node is not an active node, marking the node status of the first node as abnormal node on this node, and sending within the cluster a broadcast message that the first node is an abnormal node;
    if, within a second preset duration after sending the broadcast message that the first node is an abnormal node, a broadcast message that the first node is an active node is received, marking the node status of the first node as active node;
    if, when the second preset duration expires, the node status of the first node on this node is still abnormal node, and broadcast messages that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration, marking the node status of the first node as unavailable node and sending within the cluster a broadcast message that the first node is an unavailable node.
  4. The system according to claim 3, characterized in that probing whether the first node is an active node comprises:
    sending a probe message to the first node;
    if no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending the second node an indirect probe request, so that the second node sends a probe message to the first node and returns the probe result to this node, wherein the indirect probe request includes the address of the first node;
    if a correct response from the first node still has not been received, determining that the first node is not an active node.
  5. The system according to claim 3, characterized in that receiving, multiple times within the second preset duration, broadcast messages sent by other nodes that the first node is an abnormal node comprises:
    starting a counter after sending the broadcast message that the first node is an abnormal node;
    within the second preset duration, incrementing the counter by 1 each time a broadcast message that the first node is an abnormal node is received from another node;
    when the counter is greater than a preset value, determining that broadcast messages that the first node is an abnormal node have been received from other nodes multiple times within the second preset duration.
  6. The system according to claim 3, characterized in that the nodes probing one another to monitor the node status of every node further comprises:
    for each node:
    when a broadcast message that a third node is an unavailable node is received, if the third node is not marked as an unavailable node on this node, marking the third node as an unavailable node on this node and sending within the cluster a broadcast message that the third node is an unavailable node, so that the message propagates further;
    when a broadcast message that this node is an abnormal node, or that this node is an unavailable node, is received, sending within the cluster a broadcast message that this node is an active node, so as to correct the node status recorded for this node by the other nodes.
  7. The system according to claim 3, characterized in that the nodes probing one another to monitor the node status of every node further comprises:
    for each node:
    when this node leaves the cluster, sending within the cluster a broadcast message that this node is an unavailable node.
  8. The system according to claim 1, characterized in that the system further comprises:
    a storage component for storing the data processed by the core service cluster;
    an alarm component for sending alarm information according to a second preset policy when the log information of a monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;
    a data visualization component for displaying the log information and/or the alarm information.
  9. A method for improving the availability of a log aggregation system, characterized in that the method is applied to a node in the log aggregation system; the log aggregation system is used to collect log information of a monitored object and comprises a reverse proxy component and a core service cluster composed of a plurality of the nodes; the reverse proxy component is configured to receive the log information of the monitored object, select one node from the plurality of nodes as a target node according to a first preset policy, and send the log information to the target node; each node is configured to perform preset processing on the log information it receives; the nodes probe one another to monitor the node state of every node, the node state being one of active node, abnormal node, and unavailable node;
    the method comprises:
    at every first preset duration, selecting one other node as a first node, and probing whether the first node is an active node;
    if the first node is not an active node, marking the node state of the first node as abnormal node in the local node, and sending, within the cluster, a broadcast message that the first node is an abnormal node;
    if a broadcast message that the first node is an active node is received within a second preset duration after the broadcast message that the first node is an abnormal node is sent, marking the node state of the first node as active node;
    if, when the second preset duration expires, the node state of the first node in the local node is still abnormal node, and broadcast messages that the first node is an abnormal node have been received multiple times from other nodes within the second preset duration, marking the node state of the first node as unavailable node, and sending, within the cluster, a broadcast message that the first node is an unavailable node.
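The state transitions recited in claim 9 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `NodeState`, `LocalView`, `confirm_threshold` and `outbox` are hypothetical, networking and timers are left out, and "multiple times" is assumed to mean "more than `confirm_threshold` reports".

```python
from enum import Enum


class NodeState(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


class LocalView:
    """One node's local record of peer states (hypothetical helper)."""

    def __init__(self, peers, confirm_threshold=1):
        self.states = {p: NodeState.ACTIVE for p in peers}
        self.reports = {}  # peer -> abnormal reports received from other nodes
        self.confirm_threshold = confirm_threshold  # assumed meaning of "multiple"
        self.outbox = []  # broadcasts this node would send: (peer, state)

    def on_probe_failed(self, peer):
        # Probe failed: mark the peer abnormal locally and broadcast it.
        self.states[peer] = NodeState.ABNORMAL
        self.reports[peer] = 0  # the second preset duration starts here
        self.outbox.append((peer, NodeState.ABNORMAL))

    def on_broadcast(self, peer, state):
        # Only peers with an open waiting window are handled in this sketch.
        if peer not in self.reports:
            return
        if state is NodeState.ACTIVE:
            self.states[peer] = NodeState.ACTIVE
            del self.reports[peer]
        elif state is NodeState.ABNORMAL:
            self.reports[peer] += 1

    def on_window_expired(self, peer):
        # Called by the caller's timer when the second preset duration ends.
        count = self.reports.pop(peer, 0)
        still_abnormal = self.states[peer] is NodeState.ABNORMAL
        if still_abnormal and count > self.confirm_threshold:
            self.states[peer] = NodeState.UNAVAILABLE
            self.outbox.append((peer, NodeState.UNAVAILABLE))
        # Otherwise the state is left unchanged; the claim does not say more.
```

In this sketch a peer moves to unavailable only when the local suspicion is corroborated by other nodes, so a single node with a faulty link cannot remove a healthy peer on its own.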
  10. The method according to claim 9, characterized in that the plurality of nodes comprises at least three nodes.
  11. The method according to claim 9, characterized in that probing whether the first node is an active node comprises:
    sending a probe message to the first node;
    if no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to the local node, wherein the indirect probe request includes the address of the first node;
    if a correct response from the first node has still not been received, determining that the first node is not an active node.
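The probing procedure of claim 11 might be sketched as follows. This is a hedged illustration only: `direct_probe` and `indirect_probe` are hypothetical callables supplied by the caller that return `True` on a correct response, `peers` is assumed to list the other nodes (excluding the local node), and the `use_indirect` flag is an invented way to choose between the two fallback options the claim allows.

```python
import random


def probe_is_active(target, peers, direct_probe, indirect_probe,
                    use_indirect=True, rng=random):
    """Return True if `target` responds correctly, directly or via a helper."""
    # Step 1: direct probe.
    if direct_probe(target):
        return True
    # Step 2a: ask a randomly chosen second node to probe on our behalf.
    helpers = [p for p in peers if p != target]
    if use_indirect and helpers:
        helper = rng.choice(helpers)
        # The request carries the target's address so the helper can reach it.
        return indirect_probe(helper, target)
    # Step 2b: otherwise retry the direct probe once.
    return direct_probe(target)
```

A `False` result corresponds to "the first node is not an active node". The indirect path helps distinguish a failed target from a failed link between the local node and the target.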
  12. The method according to claim 9, characterized in that receiving, multiple times within the second preset duration, broadcast messages sent by other nodes that the first node is an abnormal node comprises:
    starting a counter after sending the broadcast message that the first node is an abnormal node;
    within the second preset duration, incrementing the counter by 1 each time a broadcast message that the first node is an abnormal node is received from another node;
    when the counter is greater than a preset value, determining that broadcast messages that the first node is an abnormal node have been received multiple times from other nodes within the second preset duration.
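The counter of claim 12 can be sketched as a small class. The class name `AbnormalReportCounter` and its parameters are hypothetical; timestamps are passed in explicitly (in seconds) so the example stays deterministic, and the comparison is strictly "greater than", as the claim states.

```python
class AbnormalReportCounter:
    """Counts abnormal-node broadcasts received inside a time window."""

    def __init__(self, window_seconds, preset_value):
        self.window_seconds = window_seconds  # the second preset duration
        self.preset_value = preset_value      # the preset value of the claim
        self.started_at = None
        self.count = 0

    def start(self, now):
        # Called right after the local node broadcasts "first node is abnormal".
        self.started_at = now
        self.count = 0

    def record(self, now):
        # Count one broadcast from another node; ignore it outside the window.
        if self.started_at is None:
            return
        if 0 <= now - self.started_at <= self.window_seconds:
            self.count += 1

    def received_multiple(self):
        return self.count > self.preset_value
```

With `preset_value=1`, for example, at least two reports from other nodes are needed before the condition holds.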
  13. The method according to claim 9, characterized in that the method further comprises:
    when a broadcast message that a third node is an unavailable node is received, if the third node is not yet marked as an unavailable node on the local node, marking the third node as an unavailable node on the local node, and sending, within the cluster, a broadcast message that the third node is an unavailable node so that the message is propagated further;
    when a broadcast message that the local node is an abnormal node, or a broadcast message that the local node is an unavailable node, is received, sending, within the cluster, a broadcast message that the local node is an active node, so as to correct the node state that other nodes have recorded for the local node.
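The two reactions in claim 13 (re-propagation and self-correction) can be sketched as a single handler. The function name, the string state labels and the tuple message format are assumptions made for illustration; the function returns the broadcasts the local node would send rather than sending them.

```python
def handle_broadcast(self_id, states, subject, reported_state):
    """React to a received broadcast; return the broadcasts to send in reply.

    `states` maps peer id -> "active" | "abnormal" | "unavailable" and is
    updated in place.
    """
    if subject == self_id:
        # Another node believes the local node is failing: refute it.
        if reported_state in ("abnormal", "unavailable"):
            return [(self_id, "active")]
        return []
    if reported_state == "unavailable" and states.get(subject) != "unavailable":
        # First time we learn of this: record it and pass it on.
        states[subject] = "unavailable"
        return [(subject, "unavailable")]
    # Already known (or not handled by this sketch): do not rebroadcast.
    return []
```

Because a node rebroadcasts only when its own record changes, each node forwards a given "unavailable" notice at most once in this sketch, which keeps the propagation from looping indefinitely.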
  14. The method according to claim 9, characterized in that the method further comprises:
    when the local node leaves the cluster, sending, within the cluster, a broadcast message that the local node is an unavailable node.
  15. The method according to claim 9, characterized in that the log aggregation system further comprises:
    a storage component, configured to store data processed by the core service cluster;
    an alarm component, configured to send alarm information according to a second preset policy when the log information of the monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;
    a data visualization component, configured to display the log information and/or the alarm information.
PCT/CN2022/091817 2021-12-30 2022-05-09 Log aggregation system, and method for improving availability of log aggregation system WO2023123801A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111649527 2021-12-30
CN202111649527.3 2021-12-30

Publications (1)

Publication Number Publication Date
WO2023123801A1 (en) 2023-07-06

Family

ID=81549719

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091817 WO2023123801A1 (en) 2021-12-30 2022-05-09 Log aggregation system, and method for improving availability of log aggregation system

Country Status (2)

Country Link
CN (1) CN114513400B (en)
WO (1) WO2023123801A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910231A (en) * 2023-09-11 2023-10-20 社治无忧(成都)智慧科技有限公司 WeChat public opinion early warning method and system based on natural language processing
CN117194175A (en) * 2023-11-02 2023-12-08 广州嘉为科技有限公司 Log alarm monitoring method and device and computer storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109547271A (en) * 2019-01-06 2019-03-29 广州泳泳信息科技有限公司 A kind of network state real time monitoring warning system based on big data
CN109951323A (en) * 2019-02-27 2019-06-28 网宿科技股份有限公司 A kind of log analysis method and system
US20210037111A1 (en) * 2019-07-24 2021-02-04 Wangsu Science & Technology Co., Ltd. Access log processing method and device
CN112383573A (en) * 2021-01-18 2021-02-19 南京联成科技发展股份有限公司 Security intrusion playback equipment based on multiple attack stages
CN113268401A (en) * 2021-06-16 2021-08-17 中移(杭州)信息技术有限公司 Log information output method and device and computer readable storage medium
CN113590492A (en) * 2021-08-23 2021-11-02 宁畅信息产业(北京)有限公司 Information processing method, system, electronic device and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101437045B (en) * 2008-12-18 2012-04-25 腾讯科技(深圳)有限公司 Method for selecting transfer node of P2P system and P2P node
US11252119B2 (en) * 2018-06-04 2022-02-15 Salesforce.Com, Inc. Message logging using two-stage message logging mechanisms
CN111352806B (en) * 2020-03-31 2024-04-26 中国工商银行股份有限公司 Log data monitoring method and device
CN113496032A (en) * 2020-04-03 2021-10-12 中国信息安全测评中心 Big data operation abnormity monitoring system based on distributed computation and rule engine
CN112698915A (en) * 2020-12-31 2021-04-23 北京千方科技股份有限公司 Multi-cluster unified monitoring alarm method, system, equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910231A (en) * 2023-09-11 2023-10-20 社治无忧(成都)智慧科技有限公司 WeChat public opinion early warning method and system based on natural language processing
CN116910231B (en) * 2023-09-11 2023-11-17 社治无忧(成都)智慧科技有限公司 WeChat public opinion early warning method and system based on natural language processing
CN117194175A (en) * 2023-11-02 2023-12-08 广州嘉为科技有限公司 Log alarm monitoring method and device and computer storage medium

Also Published As

Publication number Publication date
CN114513400A (en) 2022-05-17
CN114513400B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
WO2023123801A1 (en) Log aggregation system, and method for improving availability of log aggregation system
WO2021128915A1 (en) Smart device monitoring method and apparatus
CN109714192B (en) Monitoring method and system for monitoring cloud platform
JP6686033B2 (en) Method and apparatus for pushing messages
CN110535713B (en) Monitoring management system and monitoring management method
US20020138785A1 (en) Power supply critical state monitoring system
CN101707632A (en) Method for dynamically monitoring performance of server cluster and alarming real-timely
EP2723017A1 (en) Method, apparatus and system for implementing distributed auto-incrementing counting
WO2011017955A1 (en) Method for analyzing alarm data and system thereof
CN105610648A (en) Operation and maintenance monitoring data collection method and server
CN108282355B (en) Equipment inspection device in cloud desktop system
CN110809060A (en) Monitoring system and monitoring method for application server cluster
CN110677304A (en) Distributed problem tracking system and equipment
CN105187554A (en) Method and system for monitoring server performance
CN104679623A (en) Server hard disk maintaining method, system and server monitoring equipment
CN114553747A (en) Method, device, terminal and storage medium for detecting abnormality of redis cluster
CN110545197B (en) Node state monitoring method and device
CN112954372B (en) Streaming media fault monitoring method and device
CN113381884A (en) Full link monitoring method and device for monitoring alarm system
CN113765717A (en) Operation and maintenance management system based on secret-related special computing platform
EP1622310A2 (en) Administration system for network management systems
CN110196787B (en) Data backup and recovery system and data backup and recovery method thereof
JP2009187230A (en) Monitoring device for server
CN114124646B (en) Websocket mode integrated network management system and method
CN113254245A (en) Fault detection method and system for storage cluster

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22913097

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE