CN114513400A - Log aggregation system and method for improving availability of log aggregation system - Google Patents


Publication number
CN114513400A
Authority
CN
China
Prior art keywords
node
nodes
broadcast message
abnormal
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210002511.1A
Other languages
Chinese (zh)
Inventor
陆玉平
邓瑞明
蔡攀龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Chuanyuan Information Technology Co ltd
Original Assignee
Shanghai Chuanyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Chuanyuan Information Technology Co ltd filed Critical Shanghai Chuanyuan Information Technology Co ltd
Publication of CN114513400A publication Critical patent/CN114513400A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/22 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0823 Errors, e.g. transmission errors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/12 Network monitoring probes

Abstract

The application provides a log aggregation system and a method for improving its availability. The system is used for collecting log information of a monitored object. A reverse proxy component receives the log information of the monitored object, selects one node from a plurality of nodes of a core service cluster as a target node according to a first preset strategy, and sends the log information to the target node. The nodes of the core service cluster perform preset processing on the received log information and monitor the node state of each node through mutual detection. The system adopts a distributed design in which the reverse proxy component and the cluster cooperate: the logs of the monitored object are collected and processed with load balancing, and the redundant deployment of a plurality of nodes together with mutual monitoring among the nodes improves the stability of the system and ensures its high availability. In addition, alarms can be raised for the problems detected, so that the operation and maintenance team or the technical team can find and locate them in time.

Description

Log aggregation system and method for improving availability of log aggregation system
Technical Field
The application relates to the technical field of cloud computing and storage, in particular to a log aggregation system and a method for improving the availability of the log aggregation system.
Background
A private cloud is a cloud built with technologies such as cloud computing for the exclusive use of a single client (e.g., a large enterprise). The enterprise private cloud integrates advanced technologies such as cloud computing and big data management and is a new service model. It can consolidate resources, improve resource utilization, and reduce resource consumption and enterprise cost, and has therefore developed rapidly in recent years.
In the enterprise private cloud, processing and management of log data is an important task. For example, a log aggregation monitoring system (e.g., Loki) may be used to compress and store unstructured log data and index only the metadata (timestamps, labels, etc.) of the log data.
However, in the process of implementing the scheme of the present application, the inventor found that existing log aggregation monitoring systems lack measures to guarantee their own high availability. When the log aggregation monitoring system itself fails, the problem cannot be found and reported in time, so a large amount of log data may accumulate on the monitored object, degrading its performance and wasting hard disk space.
Disclosure of Invention
The application provides a log aggregation system and a method for improving the availability of the log aggregation system, so as to solve the reliability problem of the log aggregation system.
According to a first aspect of an embodiment of the present application, a log aggregation system is provided, where the log aggregation system is configured to collect log information of a monitored object; the log aggregation system comprises a reverse proxy component and a core service cluster consisting of a plurality of nodes;
the reverse proxy component is used for receiving log information of a monitored object, selecting one node from the plurality of nodes of the core service cluster as a target node according to a first preset strategy, and sending the log information to the target node;
the nodes of the core service cluster are used for performing preset processing on the received log information, and the nodes monitor the node state of each node through mutual detection.
Optionally, the plurality of nodes includes at least three nodes.
Optionally, the node states are divided into active nodes, abnormal nodes and unavailable nodes;
the nodes monitoring the node state of each node through mutual detection comprises:
for each node:
selecting another node as a first node at intervals of a first preset duration, and detecting whether the first node is an active node;
if the first node is not an active node, marking the node state of the first node as an abnormal node on the present node, and sending, within the cluster, a broadcast message indicating that the first node is an abnormal node;
if a broadcast message indicating that the first node is an active node is received within a second preset duration after the broadcast message indicating that the first node is an abnormal node is sent, marking the node state of the first node as an active node;
if, when the second preset duration expires, the node state of the first node on the present node is still an abnormal node, and broadcast messages sent by other nodes indicating that the first node is an abnormal node have been received multiple times within the second preset duration, marking the node state of the first node as an unavailable node, and sending, within the cluster, a broadcast message indicating that the first node is an unavailable node.
Optionally, the detecting whether the first node is an active node includes:
sending a probe message to the first node;
if no correct response is received from the first node, sending the probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends the probe message to the first node and returns the probe result to the present node, wherein the indirect probe request comprises the address of the first node;
if a correct response from the first node has still not been received, determining that the first node is not an active node.
Optionally, receiving multiple times, within the second preset duration, broadcast messages sent by other nodes indicating that the first node is an abnormal node includes:
starting a counter after the broadcast message indicating that the first node is an abnormal node is sent;
within the second preset duration, adding 1 to the counter each time a broadcast message sent by another node indicating that the first node is an abnormal node is received;
and when the counter is greater than a preset value, determining that broadcast messages sent by other nodes indicating that the first node is an abnormal node have been received multiple times within the second preset duration.
Optionally, the nodes monitoring the node state of each node through mutual detection further includes:
for each node:
when a broadcast message indicating that a third node is an unavailable node is received, if the third node is not yet marked as an unavailable node on the present node, marking the third node as an unavailable node on the present node, and sending, within the cluster, a broadcast message indicating that the third node is an unavailable node, so that the message is propagated further;
and when a broadcast message indicating that the present node is an abnormal node or an unavailable node is received, sending, within the cluster, a broadcast message indicating that the present node is an active node, so as to correct the node state that other nodes have marked for the present node.
Optionally, the nodes monitoring the node state of each node through mutual detection further includes:
for each node:
when the present node leaves the cluster, sending, within the cluster, a broadcast message indicating that the present node is an unavailable node.
Optionally, the system further includes:
the storage component is used for storing the data processed by the core service cluster;
the alarm component is used for sending alarm information according to a second preset strategy when the log information of the monitored object is found to be abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
and the data visualization component is used for displaying the log information and/or the alarm information.
According to a second aspect of embodiments of the present application, there is provided a method for improving availability of a log aggregation system, the method being used for a node in the log aggregation system; the log aggregation system is used for collecting log information of a monitored object and comprises a reverse proxy component and a core service cluster consisting of a plurality of nodes; the reverse proxy component is used for receiving log information of a monitored object, selecting one node from the plurality of nodes as a target node according to a first preset strategy and sending the log information to the target node; the nodes are used for carrying out preset processing on the received log information, monitoring the node state of each node through mutual detection of the nodes, and dividing the node state into an active node, an abnormal node and an unavailable node;
the method comprises the following steps:
selecting another node as a first node at intervals of a first preset duration, and detecting whether the first node is an active node;
if the first node is not an active node, marking the node state of the first node as an abnormal node on the present node, and sending, within the cluster, a broadcast message indicating that the first node is an abnormal node;
if a broadcast message indicating that the first node is an active node is received within a second preset duration after the broadcast message indicating that the first node is an abnormal node is sent, marking the node state of the first node as an active node;
if, when the second preset duration expires, the node state of the first node on the present node is still an abnormal node, and broadcast messages sent by other nodes indicating that the first node is an abnormal node have been received multiple times within the second preset duration, marking the node state of the first node as an unavailable node, and sending, within the cluster, a broadcast message indicating that the first node is an unavailable node.
Optionally, the plurality of nodes includes at least three nodes.
Optionally, the detecting whether the first node is an active node includes:
sending a probe message to the first node;
if no correct response is received from the first node, sending the probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends the probe message to the first node and returns the probe result to the present node, wherein the indirect probe request comprises the address of the first node;
if a correct response from the first node has still not been received, determining that the first node is not an active node.
Optionally, receiving multiple times, within the second preset duration, broadcast messages sent by other nodes indicating that the first node is an abnormal node includes:
starting a counter after the broadcast message indicating that the first node is an abnormal node is sent;
within the second preset duration, adding 1 to the counter each time a broadcast message sent by another node indicating that the first node is an abnormal node is received;
and when the counter is greater than a preset value, determining that broadcast messages sent by other nodes indicating that the first node is an abnormal node have been received multiple times within the second preset duration.
Optionally, the method further includes:
when a broadcast message indicating that a third node is an unavailable node is received, if the third node is not yet marked as an unavailable node on the present node, marking the third node as an unavailable node on the present node, and sending, within the cluster, a broadcast message indicating that the third node is an unavailable node, so that the message is propagated further;
and when a broadcast message indicating that the present node is an abnormal node or an unavailable node is received, sending, within the cluster, a broadcast message indicating that the present node is an active node, so as to correct the node state that other nodes have marked for the present node.
Optionally, the method further includes:
when the present node leaves the cluster, sending, within the cluster, a broadcast message indicating that the present node is an unavailable node.
Optionally, the log aggregation system further includes:
the storage component is used for storing the data processed by the core service cluster;
the alarm component is used for sending alarm information according to a second preset strategy when the log information of the monitored object is found to be abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
and the data visualization component is used for displaying the log information and/or the alarm information.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the embodiment of the application, a log aggregation system comprising a reverse proxy component and a cluster is provided. The cluster adopts a distributed design and can comprise a plurality of nodes, and the reverse proxy component and the cluster cooperate with each other. Not only can the logs of the monitored object be collected and processed with load balancing, but the stability of the system is also improved through the redundant deployment of a plurality of nodes and the mutual monitoring among the nodes, ensuring the high availability of the system: if one or two nodes have problems, the service that the system provides externally is not affected. In addition, alarms can be raised for the problems detected, so that the operation and maintenance team or the technical team can find and locate them in time.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. Those skilled in the art can obtain other drawings from these drawings without inventive effort. Furthermore, these descriptions should not be construed as limiting the embodiments; elements having the same reference numerals in the figures denote similar elements, and the drawings are not to scale unless otherwise specified.
Fig. 1 is a schematic diagram of a log aggregation system provided in an embodiment of the present application;
Fig. 2 is a schematic diagram of a scenario of an embodiment of the present application;
Fig. 3 is a schematic diagram of the interior of a Loki cluster in an embodiment of the present application;
Fig. 4 is a schematic diagram of a method for improving availability of a log aggregation system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application. When referring to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise noted. It should be apparent that the examples described below are only a part of examples of the present application and not all examples, or that the embodiments described in the following exemplary examples do not represent all embodiments consistent with the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
When the terms "first", "second", "third", and the like appear in the description, the claims, and the above-described drawings of the embodiments of the present application, they are used to distinguish different objects, and not to limit a specific order. In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
Fig. 1 is a schematic diagram of a log aggregation system according to an embodiment of the present application. The log aggregation system can be used for collecting log information of the monitored object. The log aggregation system comprises a reverse proxy component and a core service cluster consisting of a plurality of nodes.
The reverse proxy component is used for receiving log information of a monitored object, selecting one node from the plurality of nodes of the core service cluster as a target node according to a first preset strategy, and sending the log information to the target node.
This embodiment does not limit the specific policy by which the reverse proxy component selects a node from the plurality of nodes of the core service cluster as the target node, that is, the specific content of the first preset strategy. Those skilled in the art may select and design it according to different needs and scenarios, and such selections and designs do not depart from the spirit and scope of the present application.
As an example, the first preset strategy may include one or more of round-robin polling, weight assignment per node, least connections (minimum links), best performance, and the like.
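To make the policy options concrete, the following is a minimal Python sketch of three of them: polling, weighting, and least links. The `Node` class, its fields, and the node names are hypothetical and only serve the illustration; the application does not prescribe any particular implementation.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Node:
    name: str               # hypothetical node identifier
    weight: int = 1         # used only by the weighted policy
    active_links: int = 0   # used only by the least-links policy


def round_robin(nodes):
    """Return a callable that cycles through the nodes in order (polling)."""
    cycle = itertools.cycle(nodes)
    return lambda: next(cycle)


def weighted(nodes, rng=random):
    """Return a callable that picks a node with probability proportional to its weight."""
    weights = [n.weight for n in nodes]
    return lambda: rng.choices(nodes, weights=weights, k=1)[0]


def least_links(nodes):
    """Return a callable that picks the node with the fewest active links."""
    return lambda: min(nodes, key=lambda n: n.active_links)


nodes = [Node("loki-1", weight=3), Node("loki-2"), Node("loki-3", active_links=5)]
pick = round_robin(nodes)
order = [pick().name for _ in range(4)]   # wraps around after the third node
```

A reverse proxy could hold one such callable and invoke it once per incoming log request; combining policies (for example, weighted choice among the nodes currently marked active) is equally possible.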
The nodes of the core service cluster are used for performing preset processing on the received log information, and the nodes monitor the node state of each node through mutual detection.
As an example, the plurality of nodes may specifically include at least three nodes. A node may be, for example, a server.
In addition, this embodiment also does not limit the specific content of the processing performed by the nodes on the received log information, that is, the preset processing.
By way of example, the preset processing may include parsing the log, indexing the log contents, tagging, de-duplicating, compressing for storage, and so forth. For example, the date of a log may be extracted and used as an index so that logs can be conveniently retrieved by date. Likewise, the source of a log may be extracted and indexed, for example to distinguish logs of the Apache service from logs of the database. Logs may also be tagged to distinguish Metrics from Log, or to distinguish the Info, Warning and Error levels, etc.
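As an illustration of such preset processing, the sketch below parses a log line, de-duplicates it, extracts date, source and level as labels, and compresses the body. The line format `<ISO timestamp> <source> <LEVEL> <message>` and the function name `preprocess` are assumptions made for this example only.

```python
import hashlib
import zlib
from typing import Optional

LEVELS = {"INFO", "WARNING", "ERROR"}


def preprocess(line: str, seen: set) -> Optional[dict]:
    """Parse one log line of the assumed form '<ISO timestamp> <source> <LEVEL> <message>'."""
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    if digest in seen:          # de-duplication: drop lines already processed
        return None
    seen.add(digest)

    timestamp, source, level, message = line.split(" ", 3)
    return {
        # Only the metadata below is indexed; the body is stored compressed.
        "labels": {
            "date": timestamp[:10],
            "source": source,
            "level": level if level in LEVELS else "UNKNOWN",
        },
        "body": zlib.compress(message.encode("utf-8")),
    }


seen = set()
line = "2022-01-04T10:15:30Z apache ERROR upstream timed out"
entry = preprocess(line, seen)
duplicate = preprocess(line, seen)   # same line again: dropped, returns None
```

Indexing only the small label dictionary while keeping the body compressed mirrors the idea of indexing metadata rather than full log content.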
In this embodiment or some other embodiments of the present application, the node states may be specifically classified into an active node, an abnormal node, and an unavailable node;
the nodes monitoring the node state of each node through mutual detection may specifically include:
for each node:
1) Selecting another node as a first node at intervals of a first preset duration, and detecting whether the first node is an active node.
For example, a polling method or a random method may be used to select one node as the first node from the nodes in the cluster other than the present node.
This embodiment does not limit how to detect whether the first node is an active node. As an example, detecting whether the first node is an active node may include:
sending a probe message to the first node (e.g., the first node may be pinged);
if no correct response is received from the first node, sending the probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends the probe message to the first node and returns the probe result to the present node, wherein the indirect probe request comprises the address of the first node;
if a correct response from the first node has still not been received, determining that the first node is not an active node.
2) If the first node is not an active node, marking the node state of the first node as an abnormal node on the present node, and sending, within the cluster, a broadcast message indicating that the first node is an abnormal node.
3) If a broadcast message indicating that the first node is an active node is received within a second preset duration after the broadcast message indicating that the first node is an abnormal node is sent, marking the node state of the first node as an active node.
4) If, when the second preset duration expires, the node state of the first node on the present node is still an abnormal node, and broadcast messages sent by other nodes indicating that the first node is an abnormal node have been received multiple times within the second preset duration, marking the node state of the first node as an unavailable node, and sending, within the cluster, a broadcast message indicating that the first node is an unavailable node.
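The probing part of step 1) can be sketched as follows. The two transport hooks `direct_probe` and `indirect_probe` are hypothetical placeholders for whatever messaging the cluster uses. The text offers the direct retry and the indirect probe as alternatives; this sketch simply tries them one after the other, which is an assumption of the example.

```python
import random


def is_active(target, peers, direct_probe, indirect_probe, rng=random):
    """Return True if `target` answers a direct, retried, or relayed probe.

    direct_probe(target_addr) and indirect_probe(helper_addr, target_addr) are
    hypothetical hooks that return True when a correct response arrives.
    """
    if direct_probe(target):
        return True
    # No correct response: send the probe message to the first node again ...
    if direct_probe(target):
        return True
    # ... then ask a randomly chosen second node to probe on our behalf.
    helpers = [p for p in peers if p != target]
    if helpers:
        helper = rng.choice(helpers)
        if indirect_probe(helper, target):   # the request carries the target's address
            return True
    # Still no correct response: the first node is not an active node.
    return False
```

The indirect probe helps distinguish a failed node from a failed link between the present node and the first node, since the second node reaches the first node over a different path.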
As an example, receiving multiple times, within the second preset duration, broadcast messages sent by other nodes indicating that the first node is an abnormal node may specifically include:
starting a counter after the broadcast message indicating that the first node is an abnormal node is sent;
within the second preset duration, adding 1 to the counter each time a broadcast message sent by another node indicating that the first node is an abnormal node is received;
and when the counter is greater than a preset value, determining that broadcast messages sent by other nodes indicating that the first node is an abnormal node have been received multiple times within the second preset duration.
This embodiment does not limit the specific values of the first preset duration, the second preset duration, the number of times, the preset value, etc. Those skilled in the art may select and design them according to different requirements and scenarios, and such selections and designs do not depart from the spirit and scope of the present application.
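Steps 2) to 4) together with the counter can be read as a small state machine kept by the present node for each first node. The sketch below is one possible reading; the class name `SuspicionTracker`, the explicit `now` timestamps, and the example values (a 5-second window, a preset value of 1) are assumptions chosen for illustration.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


class SuspicionTracker:
    """Tracks one first node from the point of view of the present node."""

    def __init__(self, second_preset_duration: float, preset_value: int):
        self.window = second_preset_duration
        self.preset_value = preset_value
        self.state = State.ACTIVE
        self.started_at = None
        self.counter = 0

    def mark_abnormal(self, now: float) -> None:
        """Call right after broadcasting that the first node is abnormal (step 2)."""
        self.state = State.ABNORMAL
        self.started_at = now
        self.counter = 0                    # the counter starts with the broadcast

    def on_broadcast(self, reported: State, now: float) -> None:
        """Handle a broadcast about the first node sent by another node."""
        if self.state is not State.ABNORMAL or now - self.started_at > self.window:
            return                          # outside the window: ignored
        if reported is State.ACTIVE:
            self.state = State.ACTIVE       # step 3: the suspicion is refuted
        elif reported is State.ABNORMAL:
            self.counter += 1               # another node reports the same anomaly

    def on_expiry(self) -> bool:
        """Call when the window expires; True means 'broadcast unavailable' (step 4)."""
        if self.state is State.ABNORMAL and self.counter > self.preset_value:
            self.state = State.UNAVAILABLE
            return True
        return False                        # otherwise the state is left unchanged


t = SuspicionTracker(second_preset_duration=5.0, preset_value=1)
t.mark_abnormal(now=0.0)
t.on_broadcast(State.ABNORMAL, now=1.0)
t.on_broadcast(State.ABNORMAL, now=2.0)
must_broadcast = t.on_expiry()              # counter 2 > preset value 1
```

Requiring confirmation from other nodes before declaring a node unavailable reduces the chance that a single node with a faulty network link removes a healthy peer from service.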
The foregoing describes the operations a node performs to actively probe the state of other nodes. In practice, a node may also be probed by other nodes or receive messages broadcast by other nodes, so the nodes monitoring the node state of each node through mutual detection may further include:
for each node:
when a broadcast message indicating that a third node is an unavailable node is received, if the third node is not yet marked as an unavailable node on the present node, marking the third node as an unavailable node on the present node, and sending, within the cluster, a broadcast message indicating that the third node is an unavailable node, so that the message is propagated further;
and when a broadcast message indicating that the present node is an abnormal node or an unavailable node is received, sending, within the cluster, a broadcast message indicating that the present node is an active node, so as to correct the node state that other nodes have marked for the present node.
In addition, when a node leaves the cluster, it may inform the other nodes in the cluster that it is unavailable. Therefore, the nodes monitoring the node state of each node through mutual detection may further include:
for each node:
when the present node leaves the cluster, sending, within the cluster, a broadcast message indicating that the present node is an unavailable node.
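The receive-side rules and the leave rule above can be sketched as a small handler. The class name `GossipNode`, the `send_broadcast(subject, state)` hook, and the plain string states are hypothetical and stand in for the real cluster transport.

```python
class GossipNode:
    def __init__(self, name, send_broadcast):
        self.name = name
        self.send_broadcast = send_broadcast   # hypothetical hook: send_broadcast(subject, state)
        self.states = {}                       # node name -> last known state

    def on_broadcast(self, subject, state):
        if subject == self.name:
            if state in ("abnormal", "unavailable"):
                # Others doubt this node: announce that it is active to correct their marks.
                self.send_broadcast(self.name, "active")
            return
        if state == "unavailable" and self.states.get(subject) != "unavailable":
            # First time this is heard: record it and re-propagate once.
            self.states[subject] = "unavailable"
            self.send_broadcast(subject, "unavailable")

    def leave(self):
        # A node that leaves the cluster announces itself as unavailable.
        self.send_broadcast(self.name, "unavailable")


sent = []
node = GossipNode("n1", lambda subject, state: sent.append((subject, state)))
node.on_broadcast("n3", "unavailable")   # recorded and re-propagated
node.on_broadcast("n3", "unavailable")   # already known: not re-sent
node.on_broadcast("n1", "abnormal")      # refuted with an 'active' broadcast
node.leave()
```

Re-propagating only on the first receipt keeps the message from circulating indefinitely while still reaching nodes that missed the original broadcast.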
In addition, for convenience of storage, alarming, monitoring, etc., the system may further include:
the storage component is used for storing the data processed by the core service cluster;
the alarm component is used for sending alarm information according to a second preset strategy when the log information of the monitored object is found to be abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
and the data visualization component is used for displaying the log information and/or the alarm information.
In this embodiment, a log aggregation system including a reverse proxy component and a cluster is provided. The cluster adopts a distributed design and may include a plurality of nodes, and the reverse proxy component and the cluster cooperate with each other. This not only enables the collection and load balancing of the logs of a monitored object, but also improves the stability of the system through the redundant deployment of a plurality of nodes and mutual monitoring among the nodes, ensuring the high availability of the system itself.
In the following, the scheme of the present application is further described by taking Loki as an example in combination with a specific application scenario. Of course, the Loki application scenario is only an example; in practice, the scheme may also be applied to other scenarios. For example, a three-node deployment may likewise be adopted to ensure high availability of a Prometheus cluster.
Loki is a log aggregation system that compresses and stores unstructured log data and indexes only the metadata (timestamps, labels, etc.) of the log data. The Loki service receives the log data pushed by the Promtail component and distributes it to the internal Ingester component, which stores the index data and the log content data. Unlike Prometheus, Loki can only passively accept log pushes from the Promtail component.
However, in the process of implementing the scheme of the present application, the inventor found that the main Loki service cannot ensure the high availability of Loki itself. If Loki fails and the monitored object also fails, Loki cannot find and report the problem in time, and a large amount of log data accumulates on the monitored object, which degrades its performance and wastes hard disk space.
Therefore, in this embodiment a Loki core service cluster is constructed. The cluster may consist of three nodes that monitor one another's states. If 1 or 2 nodes in the Loki core service cluster have problems, the cluster's external log collection and aggregation service is not affected, and the fault can be reported through the alarm component to an operation and maintenance team or a technical team to request manual intervention.
Fig. 2 is a schematic diagram of a scenario of an embodiment of the present application, and Fig. 3 is a schematic diagram of the interior of a Loki cluster in the embodiment. In Fig. 2, the Loki cluster is connected to components such as the proxy, the Node Exporter, the alarm component, the data visualization component, and the storage pool (holding Index, Log, and Metrics data). The Prometheus cluster is a tool cluster for scraping Metrics (indicator) data of a monitoring target. Promtail, Node Exporter, Prometheus and the like are prior art and are not described here again. In Fig. 3, the Loki cluster may include a reverse proxy component, a Loki core service cluster, and so on. A detailed description follows:
loki core service cluster: and connecting with the monitoring target, and collecting and storing various logs of the monitoring target. Queries to the log by the visualization component may also be provided. In the cluster, 3 nodes are adopted for redundant networking, and reverse proxy components are used for reverse proxy.
A reverse proxy component: the client (the visual component and the Promtail component) sends a request to the reverse proxy component, and then the reverse proxy component forwards the request to the Loki node suitable for the back end according to the preset strategy. In addition, in order to ensure the high availability of the reverse proxy component, a main backup disaster recovery can be performed inside the reverse proxy component, the main backup disaster recovery and the standby disaster recovery are connected by using a heartbeat line, and after the host is down, the backup disaster recovery and the standby disaster recovery start to work immediately and report the self condition to a Loki node at the back end.
The Promtail component: responsible for collecting the logs of the monitored object and sending them to the reverse proxy component of the Loki cluster. The reverse proxy component then sends the log transmission request to a suitable Loki node according to the preset strategy, where the logs are processed and stored in the storage pool.
Monitored object: the object being monitored, also called the monitoring target. The Promtail and Node Exporter components each need to be installed on the monitored object so that the Loki cluster and the Prometheus cluster can obtain its data. The Promtail component actively pushes log data to the Loki cluster, and the Node Exporter component exposes an HTTP interface on the monitored object so that Prometheus can scrape metrics periodically.
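To make the push direction concrete, the sketch below builds a JSON body in the shape accepted by Loki's HTTP push endpoint (`POST /loki/api/v1/push`). Promtail normally does this itself; the helper name `build_push_payload` and the label values are assumptions for illustration only, and in the system described here the request would be addressed to the reverse proxy component.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a JSON body for Loki's push endpoint from a label set and log lines."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": dict(labels), "values": values}]})


body = build_push_payload({"job": "app", "host": "server-1"}, ["started", "ready"])
```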
Storage pool: used to store the log information and metrics acquired by the Loki cluster. Considering the read/write performance of log and metric data, a storage pool backed by an object storage service is recommended. To improve the retrieval efficiency of Loki logs, it is recommended to place the index of the log data obtained by Loki in a high-performance in-memory database.
Alarm component: used to receive alarm information and send it to preset contacts. By configuring different hooks, alarm messages can be sent to various IM tools (e.g., WeChat, Slack, Teams, Lync), e-mail, telephone, text message, and so on.
Further, as an example, the alarm component may specifically be an Alertmanager component, which supports querying logs and metrics and provides flexible alarm methods. According to preset rules, when a log or metric of a monitored object (for example, metrics information such as CPU utilization, memory utilization, hard disk utilization, network card throughput, hard disk IOPS, API response performance, and network request error rate) is detected to be abnormal, alarm information is pushed to the Alertmanager component. On receiving an alarm, Alertmanager can aggregate, deduplicate, and reduce noise according to its configuration before sending the alarm out. The Alertmanager component can send the alarm information to the operation and maintenance team or the technical team by e-mail, short message, telephone, and the like, so that the team can find and locate problems in time. It can also be linked to a third-party operation and maintenance supervision platform, such as xMatters, Sumo Logic, or Splunk.
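The deduplication step can be pictured as collapsing alarms that share the same identifying labels. The sketch below is a simplified illustration and not Alertmanager's actual algorithm; the function name `deduplicate` and the choice of `alertname` and `instance` as identifying labels are assumptions.

```python
def deduplicate(alerts, keys=("alertname", "instance")):
    """Collapse alerts that share the same identifying labels, counting repeats."""
    groups = {}
    for alert in alerts:
        fingerprint = tuple(alert.get(k) for k in keys)
        group = groups.setdefault(fingerprint, {"alert": alert, "count": 0})
        group["count"] += 1
    return list(groups.values())


incoming = [
    {"alertname": "HighCPU", "instance": "server-1"},
    {"alertname": "HighCPU", "instance": "server-1"},
    {"alertname": "HighCPU", "instance": "server-2"},
]
outgoing = deduplicate(incoming)  # two notifications instead of three
```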
Data visualization component: used to aggregate and display monitoring data (for example, system metrics such as the CPU utilization, memory utilization, hard disk utilization, and network card throughput of a server collected every 5 minutes, which can be plotted as line charts), log data, and alarm information. When logs or alarm information need to be displayed, the visualization component sends a query request to the reverse proxy component, which forwards the request to the Query-frontend component. That component communicates with the Querier service on each Loki node, obtains the logs or alarm information stored in the storage pool, and returns them to the reverse proxy component; they are finally displayed in the interface of the visualization component.
Internally, the Loki cluster uses a redundant three-node cluster; externally, it provides reverse proxying through a reverse proxy component with primary/standby disaster recovery, so that only one interface is exposed to the user.
In a specific implementation, the full set of Loki core services (Distributor, Ingester, and Querier) can be deployed on each node; these three components are prior art and are not described in detail. In this embodiment, a node monitoring component is additionally deployed on each node, so that the nodes monitor one another's state through mutual probing.
In addition, because the node monitoring component runs in memory, a UPS can be added to the storage device, with a mandatory policy such as the following: when the UPS detects that the storage device has been without mains power for more than 10 minutes, the system is shut down safely and the in-memory data is flushed to disk.
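The shutdown policy reduces to a threshold check on the length of the outage. The following minimal sketch assumes timestamps in seconds; the function name `should_shut_down` and its parameters are hypothetical, and only the 10-minute limit comes from the description.

```python
def should_shut_down(outage_started_at, now, limit_seconds=600):
    """Return True once mains power has been off for longer than the limit."""
    if outage_started_at is None:
        return False  # mains power is present
    return now - outage_started_at > limit_seconds
```

A caller would flush the in-memory node state to disk and then trigger the operating system's shutdown when this returns True.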
The states of the nodes can be divided into 3 types: active nodes, abnormal nodes, and unavailable nodes. The node failure detection principle of the node monitoring component may include:
i) After a node (e.g., node A) starts, it selects another node (e.g., node B) at regular intervals (for example, by polling) and sends it a PING message. If the PING fails, node A can either send the PING to node B again or randomly select another node (e.g., node C) and send it an indirect PING request; node C then sends a PING to node B at the address given in the request and returns the result to the source of the indirect request, namely node A. If node A has not received any ACK from node B when the probe times out, it marks node B as an abnormal node. Node A then starts a timer and broadcasts an abnormal warning for node B. Each time node A receives the same abnormal warning for node B from another node during this period, it increments its local abnormal count for node B by 1. If, when the timer expires, node B's state is still not active and the abnormal count meets the requirement, node A marks node B as an unavailable node.
ii) When a suspected node (e.g., node B) receives an abnormal warning about itself from another node, it immediately sends a message stating that it is an active node, so that the other nodes clear their mark of it as an abnormal node.
iii) When a node (e.g., node A) leaves the cluster, it broadcasts to the cluster that it is an unavailable node; when node A marks another node (e.g., node B) as unavailable, it likewise broadcasts to the cluster that node B is an unavailable node.
iv) When another node (e.g., node C) receives a broadcast message stating that a node (e.g., node B) is an unavailable node, it compares the message with its local record. If node B is already recorded locally as an unavailable node, the message is ignored. Otherwise, node C deletes the original local record, marks node B as an unavailable node, and broadcasts the message that node B is an unavailable node again, forming a rebroadcast.
v) If a node (e.g., node B) receives a broadcast message stating that it is itself an unavailable node, this indicates a network partition between it and the other nodes. The node then broadcasts that it is an active node in order to correct the state mark that the other nodes hold for it.
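Rules ii) to v) describe how a node reacts to broadcasts it receives. They can be sketched as a small handler that updates the local records and returns the broadcasts to send in response. The class name `NodeView`, the `(kind, subject)` message shape, and the state constants are assumptions for illustration; the counting of abnormal warnings from rule i) is left out here.

```python
ACTIVE, ABNORMAL, UNAVAILABLE = "active", "abnormal", "unavailable"


class NodeView:
    """One node's local record of peer states (hypothetical structure)."""

    def __init__(self, name, peers):
        self.name = name
        self.states = {peer: ACTIVE for peer in peers}

    def on_broadcast(self, kind, subject):
        """Apply a received (kind, subject) broadcast; return broadcasts to send."""
        if subject == self.name:
            # Rules ii) and v): refute any claim that this node is not active.
            return [(ACTIVE, self.name)] if kind != ACTIVE else []
        if kind == ACTIVE:
            # A refutation clears an abnormal or unavailable mark for the subject.
            self.states[subject] = ACTIVE
            return []
        if kind == UNAVAILABLE:
            if self.states.get(subject) == UNAVAILABLE:
                return []                      # rule iv): already recorded, ignore
            self.states[subject] = UNAVAILABLE
            return [(UNAVAILABLE, subject)]    # rule iv): rebroadcast
        return []  # abnormal warnings about peers are counted elsewhere


view = NodeView("C", ["A", "B"])
replies = view.on_broadcast(UNAVAILABLE, "B")  # C records B and rebroadcasts
```

Because a node ignores an unavailable message it has already recorded, the rebroadcast in rule iv) terminates once every reachable node has seen it.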
In this way, the nodes in the cluster are interconnected, and each node's state can be confirmed through the node monitoring component. If another node or the node's own service fails, the alarm component can send alarm information to a third-party operation and maintenance supervision platform such as xMatters, Sumo Logic, or Splunk, or directly to the operation and maintenance team or the technical team by e-mail, short message, telephone, and the like, so that the team can find and locate the problem in time.
In addition, to ensure data consistency, an arbitration mechanism can be added to the node monitoring component. The arbitration mechanism is implemented on top of the monitoring of each node and helps the cluster avoid the risk of split-brain under adverse conditions.
In this embodiment, a log aggregation system including a reverse proxy component and a cluster is provided, where the cluster is designed in a distributed manner and may include a plurality of nodes, and the reverse proxy component and the cluster are matched with each other, which not only can realize collection and load balancing of logs of a monitored object, but also improves the stability of the system through redundant deployment of a plurality of nodes and mutual monitoring among nodes, and ensures high availability of the system itself.
The following are embodiments of methods of the present application that may be used to implement embodiments of systems of the present application. For details which are not disclosed in the method embodiments of the present application, reference is made to the system embodiments of the present application.
Fig. 4 is a schematic diagram of a method for improving availability of a log aggregation system according to an embodiment of the present application. The method may be used for a node in a log aggregation system; the log aggregation system is used for collecting log information of a monitored object and comprises a reverse proxy component and a core service cluster consisting of a plurality of nodes; the reverse proxy component is used for receiving log information of a monitored object, selecting one node from the plurality of nodes as a target node according to a first preset strategy and sending the log information to the target node; the nodes are used for carrying out preset processing on the received log information, monitoring the node state of each node through mutual detection of the nodes, and dividing the node state into an active node, an abnormal node and an unavailable node;
referring to fig. 4, the method may include:
Step S401: at intervals of a first preset time length, select another node as a first node and detect whether the first node is an active node.
Step S402: if the first node is not an active node, mark the node state of the first node as an abnormal node on this node, and send within the cluster a broadcast message stating that the first node is an abnormal node.
Step S403: if a broadcast message stating that the first node is an active node is received within a second preset time after the abnormal-node broadcast message was sent, mark the node state of the first node as an active node.
Step S404: if, when the second preset time expires, the node state of the first node on this node is still an abnormal node, and broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time, mark the node state of the first node as an unavailable node and send within the cluster a broadcast message stating that the first node is an unavailable node.
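Steps S402 to S404 amount to a small state machine kept for each suspected node. The sketch below tracks one such suspicion; the class name `Suspicion`, the default threshold, and the event-style methods are assumptions, and the timer itself is left to the caller.

```python
ACTIVE, ABNORMAL, UNAVAILABLE = "active", "abnormal", "unavailable"


class Suspicion:
    """Tracks one suspected peer from the abnormal broadcast until the timer expires."""

    def __init__(self, threshold=1):
        self.state = ABNORMAL       # step S402: peer marked abnormal locally
        self.confirmations = 0      # abnormal broadcasts received from other nodes
        self.threshold = threshold  # hypothetical "preset value"

    def on_broadcast(self, kind):
        if self.state != ABNORMAL:
            return                  # ignore messages once the suspicion is resolved
        if kind == ACTIVE:
            self.state = ACTIVE     # step S403: the peer refuted the suspicion
        elif kind == ABNORMAL:
            self.confirmations += 1

    def on_timeout(self):
        # Step S404: still abnormal and confirmed more often than the threshold.
        if self.state == ABNORMAL and self.confirmations > self.threshold:
            self.state = UNAVAILABLE
        return self.state
```

When `on_timeout` returns the unavailable state, the caller would send the unavailable-node broadcast for the first node. If the confirmations do not exceed the threshold, this sketch simply leaves the peer marked abnormal, a case the steps above do not spell out.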
In this embodiment or some other embodiments of the present application, the plurality of nodes may specifically include at least three nodes.
In this embodiment or some other embodiments of the present application, the detecting whether the first node is an active node may specifically include:
sending a probe message to the first node;
if the correct response of the first node is not received, sending the detection message to the first node again, or randomly selecting another node as a second node and sending an indirect detection request to the second node so that the second node sends the detection message to the first node and returns a detection result to the node, wherein the indirect detection request comprises the address of the first node;
if a correct response has not been received from the first node, it is determined that the first node is not an active node.
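The probing sequence above can be sketched as follows. The transport is injected as a callable so that the example stays self-contained; `ping(src, dst)`, the use of `None` for "this node", and the fallback to a direct retry when no helper exists are assumptions for illustration.

```python
import random


def is_active(target, ping, peers, rng=random):
    """Return True if `target` answers a direct or an indirect probe.

    `ping(src, dst)` is a hypothetical transport callable that returns True
    when `dst` acknowledges a probe sent from `src` (None means this node).
    """
    if ping(None, target):
        return True                    # direct probe answered
    helpers = [p for p in peers if p != target]
    if not helpers:
        return ping(None, target)      # no helper available: retry directly
    helper = rng.choice(helpers)       # randomly chosen second node
    return ping(helper, target)        # indirect probe via the second node
```

The indirect probe helps distinguish a failed first node from a broken link between this node and the first node: if the second node can still reach it, the first node is treated as active.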
In this embodiment or some other embodiments of the present application, the receiving, for multiple times within the second preset time period, a broadcast message that the first node is an abnormal node and is sent by another node may specifically include:
starting a counter after the broadcast message that the first node is an abnormal node is sent;
within the second preset time length, adding 1 to the counter every time a broadcast message which is sent by other nodes and of which the first node is an abnormal node is received;
and when the counter is greater than a preset value, determining that broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time.
In this embodiment or some other embodiments of the present application, the method may further include:
when a broadcast message that a third node is an unavailable node is received, if the third node is not marked as the unavailable node on the node, marking the third node as the unavailable node on the node, and sending the broadcast message that the third node is the unavailable node in the cluster to form re-propagation;
and when receiving the broadcast message of which the node is an abnormal node or the broadcast message of which the node is an unavailable node, sending the broadcast message of which the node is an active node in the cluster so as to correct the node state mark of other nodes to the node.
In this embodiment or some other embodiments of the present application, the method may further include:
and when the node leaves the cluster, sending a broadcast message of the node as an unavailable node in the cluster.
In this embodiment or some other embodiments of the present application, the log aggregation system further includes:
the storage component is used for storing the data processed by the core service cluster;
the alarm component is used for sending alarm information according to a second preset strategy when the log information of the monitored object is found to be abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
and the data visualization component is used for displaying the log information and/or the alarm information.
Regarding the method in the foregoing embodiments, the specific manner in which each step performs operations has been described in detail in the embodiments of the related system, and is not described herein again.
In this embodiment, a method for improving the availability of a log aggregation system is provided. The method is used for a node in the log aggregation system, which includes a reverse proxy component and a core service cluster composed of a plurality of nodes designed in a distributed manner. The reverse proxy component and the cluster cooperate with each other, so that the logs of a monitored object can be collected and processed with load balancing. The redundant deployment of the plurality of nodes and the mutual monitoring among them also improve the stability of the system and ensure its own high availability: if one or two nodes have a problem, the service the system provides to the outside is not affected. In addition, detected problems can be raised as alarms, so that an operation and maintenance team or a technical team can find and locate them in time.
Although the present application has been described with reference to a preferred embodiment, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application, and all changes, substitutions and alterations that fall within the spirit and scope of the application are to be understood as being covered by the following claims.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the aspects disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A log aggregation system, characterized in that it is used for collecting log information of a monitored object; the log aggregation system comprises a reverse proxy component and a core service cluster consisting of a plurality of nodes;
the reverse proxy component is used for receiving log information of a monitored object, selecting one node from the plurality of nodes of the core service cluster as a target node according to a first preset strategy, and sending the log information to the target node;
the nodes of the core service cluster are used for performing preset processing on the received log information, and the nodes monitor the node state of each node through mutual detection.
2. The system of claim 1, wherein the plurality of nodes comprises at least three nodes.
3. The system of claim 1, wherein the node states are classified as active nodes, abnormal nodes, unavailable nodes;
the nodes monitor the node state of each node through mutual detection, and the method comprises the following steps:
for each node:
selecting other nodes as first nodes every other first preset time length, and detecting whether the first nodes are active nodes or not;
if the first node is not an active node, marking the node state of the first node as an abnormal node in the node, and sending a broadcast message that the first node is the abnormal node in the cluster;
if the broadcast message of which the first node is an active node is received within a second preset time after the broadcast message of which the first node is an abnormal node is sent, marking the node state of the first node as the active node;
if the node state of the first node in the node is still an abnormal node when the second preset time expires, and a broadcast message that the first node is an abnormal node and is sent by other nodes is received for multiple times within the second preset time, marking the node state of the first node as an unavailable node, and sending the broadcast message that the first node is an unavailable node within the cluster.
4. The system of claim 3, wherein said detecting whether the first node is an active node comprises:
sending a probe message to the first node;
if the correct response of the first node is not received, sending the detection message to the first node again, or randomly selecting another node as a second node and sending an indirect detection request to the second node so that the second node sends the detection message to the first node and returns a detection result to the node, wherein the indirect detection request comprises the address of the first node;
if a correct response has not been received from the first node, it is determined that the first node is not an active node.
5. The system according to claim 3, wherein receiving the broadcast message sent by the other node that the first node is an abnormal node for a plurality of times within the second preset time period comprises:
starting a counter after the broadcast message that the first node is an abnormal node is sent;
within the second preset time length, adding 1 to the counter every time a broadcast message which is sent by other nodes and of which the first node is an abnormal node is received;
and when the counter is greater than a preset value, determining that broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time.
6. The system of claim 3, wherein the nodes monitor the node status of each node by probing each other, further comprising:
for each node:
when a broadcast message that a third node is an unavailable node is received, if the third node is not marked as the unavailable node on the node, marking the third node as the unavailable node on the node, and sending the broadcast message that the third node is the unavailable node in the cluster to form re-propagation;
and when receiving the broadcast message of which the node is an abnormal node or the broadcast message of which the node is an unavailable node, sending the broadcast message of which the node is an active node in the cluster so as to correct the node state mark of other nodes to the node.
7. The system of claim 3, wherein the nodes monitor the node status of each node by probing each other, further comprising:
for each node:
and when the node leaves the cluster, sending a broadcast message of the node as an unavailable node in the cluster.
8. The system of claim 1, further comprising:
the storage component is used for storing the data processed by the core service cluster;
the alarm component is used for sending alarm information according to a second preset strategy when the log information of the monitored object is found to be abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
and the data visualization component is used for displaying the log information and/or the alarm information.
9. A method for improving the availability of a log aggregation system is characterized in that the method is used for nodes in the log aggregation system; the log aggregation system is used for collecting log information of a monitored object and comprises a reverse proxy component and a core service cluster consisting of a plurality of nodes; the reverse proxy component is used for receiving log information of a monitored object, selecting one node from the plurality of nodes as a target node according to a first preset strategy and sending the log information to the target node; the nodes are used for carrying out preset processing on the received log information, monitoring the node state of each node through mutual detection of the nodes, and dividing the node state into an active node, an abnormal node and an unavailable node;
the method comprises the following steps:
selecting other nodes as first nodes every other first preset time length, and detecting whether the first nodes are active nodes or not;
if the first node is not an active node, marking the node state of the first node as an abnormal node in the node, and sending a broadcast message that the first node is the abnormal node in the cluster;
if the broadcast message of which the first node is an active node is received within a second preset time after the broadcast message of which the first node is an abnormal node is sent, marking the node state of the first node as the active node;
if the node state of the first node in the node is still an abnormal node when the second preset time expires, and a broadcast message that the first node is an abnormal node and is sent by other nodes is received for multiple times within the second preset time, marking the node state of the first node as an unavailable node, and sending the broadcast message that the first node is an unavailable node within the cluster.
10. The method of claim 9, wherein the plurality of nodes comprises at least three nodes.
11. The method of claim 9, wherein the detecting whether the first node is an active node comprises:
sending a probe message to the first node;
if the correct response of the first node is not received, sending the detection message to the first node again, or randomly selecting another node as a second node and sending an indirect detection request to the second node so that the second node sends the detection message to the first node and returns a detection result to the node, wherein the indirect detection request comprises the address of the first node;
if a correct response has not been received from the first node, it is determined that the first node is not an active node.
12. The method according to claim 9, wherein receiving the broadcast message sent by the other node that the first node is an abnormal node for a plurality of times within the second preset time period comprises:
starting a counter after the broadcast message that the first node is an abnormal node is sent;
within the second preset time length, adding 1 to the counter every time a broadcast message which is sent by other nodes and of which the first node is an abnormal node is received;
and when the counter is greater than a preset value, determining that broadcast messages sent by other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time.
13. The method of claim 9, further comprising:
when a broadcast message that a third node is an unavailable node is received, if the third node is not marked as the unavailable node on the node, marking the third node as the unavailable node on the node, and sending the broadcast message that the third node is the unavailable node in the cluster to form re-propagation;
and when receiving the broadcast message of which the node is an abnormal node or the broadcast message of which the node is an unavailable node, sending the broadcast message of which the node is an active node in the cluster so as to correct the node state mark of other nodes to the node.
14. The method of claim 9, further comprising:
and when the node leaves the cluster, sending a broadcast message of the node as an unavailable node in the cluster.
15. The method of claim 9, wherein the log aggregation system further comprises:
the storage component is used for storing the data processed by the core service cluster;
the alarm component is used for sending alarm information according to a second preset strategy when the log information of the monitored object is found to be abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;
and the data visualization component is used for displaying the log information and/or the alarm information.
CN202210002511.1A 2021-12-30 2022-01-04 Log aggregation system and method for improving availability of log aggregation system Pending CN114513400A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021116495273 2021-12-30
CN202111649527 2021-12-30

Publications (1)

Publication Number Publication Date
CN114513400A true CN114513400A (en) 2022-05-17

Family

ID=81549719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210002511.1A Pending CN114513400A (en) 2021-12-30 2022-01-04 Log aggregation system and method for improving availability of log aggregation system

Country Status (2)

Country Link
CN (1) CN114513400A (en)
WO (1) WO2023123801A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910231B (en) * 2023-09-11 2023-11-17 社治无忧(成都)智慧科技有限公司 WeChat public opinion early warning method and system based on natural language processing
CN117194175A (en) * 2023-11-02 2023-12-08 广州嘉为科技有限公司 Log alarm monitoring method and device and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010069229A1 (en) * 2008-12-18 2010-06-24 腾讯科技(深圳)有限公司 Method for selecting the transit node in p2p system and the p2p node thereof
US20190372924A1 (en) * 2018-06-04 2019-12-05 Salesforce.Com, Inc. Message logging using two-stage message logging mechanisms
CN111352806A (en) * 2020-03-31 2020-06-30 中国工商银行股份有限公司 Log data monitoring method and device
CN112383573A (en) * 2021-01-18 2021-02-19 南京联成科技发展股份有限公司 Security intrusion playback equipment based on multiple attack stages
CN112698915A (en) * 2020-12-31 2021-04-23 北京千方科技股份有限公司 Multi-cluster unified monitoring alarm method, system, equipment and storage medium
CN113496032A (en) * 2020-04-03 2021-10-12 中国信息安全测评中心 Big data operation abnormity monitoring system based on distributed computation and rule engine
CN113590492A (en) * 2021-08-23 2021-11-02 宁畅信息产业(北京)有限公司 Information processing method, system, electronic device and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109547271B (en) * 2019-01-06 2020-01-03 广州泳泳信息科技有限公司 Network state real-time monitoring alarm system based on big data
CN109951323B (en) * 2019-02-27 2022-11-08 网宿科技股份有限公司 Log analysis method and system
CN110401657B (en) * 2019-07-24 2020-09-25 网宿科技股份有限公司 Processing method and device for access log
CN113268401B (en) * 2021-06-16 2022-10-04 中移(杭州)信息技术有限公司 Log information output method and device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡鑫;姚宇;徐英杰;: "基于ElasticSearch的TEE病例库检索系统设计与实现", 计算机应用, no. 1 *

Also Published As

Publication number Publication date
WO2023123801A1 (en) 2023-07-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination