WO2023123801A1

WO2023123801A1 - Log aggregation system, and method for improving availability of log aggregation system

Info

Publication number: WO2023123801A1
Application number: PCT/CN2022/091817
Authority: WO
Inventors: 陆玉平; 邓瑞明; 蔡攀龙
Original assignee: 上海川源信息科技有限公司
Priority date: 2021-12-30
Filing date: 2022-05-09
Publication date: 2023-07-06
Also published as: CN114513400A; CN114513400B

Abstract

Provided in the present application are a log aggregation system and a method for improving the availability of the log aggregation system. The system is used for collecting log information of a monitored object. A reverse proxy component receives the log information of the monitored object, selects, according to a first preset policy, one node from a plurality of nodes of a core service cluster as a target node, and sends the log information to the target node. The nodes of the core service cluster perform preset processing on the received log information, and the nodes monitor one another's node state by means of mutual detection. The present application uses a distributed design in which the reverse proxy component and the cluster cooperate, so that not only can log collection, processing, and load balancing for the monitored object be realized, but the stability of the system is also improved by redundant deployment of a plurality of nodes and mutual monitoring between the nodes, thereby ensuring the high availability of the system itself. In addition, an alarm can be raised for a detected problem, making it easier for an operation and maintenance team or a technical team to discover and locate the problem in a timely manner.

Description

A log aggregation system and a method for improving the availability of the log aggregation system

Technical field

The present application relates to the technical field of cloud computing and storage, and in particular to a log aggregation system and a method for improving the availability of the log aggregation system.

Background art

A private cloud is a cloud built solely for one customer (such as a large enterprise) using technologies such as cloud computing. An enterprise private cloud integrates advanced technologies such as cloud computing and big data management and represents a new service model. It can integrate resources and improve resource utilization while also reducing resource consumption and enterprise costs, and it has therefore developed rapidly in recent years.

In an enterprise private cloud, the processing and management of log data is an important task. For example, a log aggregation monitoring system (such as Loki) can be used to compress and store unstructured log data while indexing only the metadata of the log data (timestamps, labels, etc.).

However, in the course of implementing the solution of this application, the inventors found that existing log aggregation monitoring systems lack measures to guarantee their own high availability. When the log aggregation monitoring system fails, it cannot detect and report the problem in time, which may cause a large amount of log data to accumulate on the monitored object, degrading the performance of the monitored object and wasting hard disk space.

Summary of the invention

The present application provides a log aggregation system and a method for improving the availability of the log aggregation system, so as to solve the reliability problem of the log aggregation system itself.

According to a first aspect of the embodiments of the present application, a log aggregation system is provided. The log aggregation system is used to collect log information of monitored objects, and includes a reverse proxy component and a core service cluster composed of multiple nodes;

The reverse proxy component is used to receive the log information of the monitored object, select, according to the first preset policy, a node from the multiple nodes of the core service cluster as the target node, and send the log information to the target node;

The nodes of the core service cluster are used to perform preset processing on the received log information, and the nodes monitor one another's node status through mutual detection.

Optionally, the multiple nodes include at least three nodes.

Optionally, the node states are divided into active nodes, abnormal nodes, and unavailable nodes;

The nodes monitor each node's node status by detecting each other, including:

For each node:

selecting another node as the first node every first preset time length, and detecting whether the first node is an active node;

If the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and send a broadcast message that the first node is an abnormal node in the cluster;

If a broadcast message stating that the first node is an active node is received within a second preset time period after sending the broadcast message that the first node is an abnormal node, mark the node status of the first node as an active node;

If, when the second preset time period expires, the node status of the first node recorded in this node is still abnormal, and broadcast messages from other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time period, mark the node status of the first node as an unavailable node, and send a broadcast message in the cluster stating that the first node is an unavailable node.
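The three-state logic described above can be sketched as a small set of transition functions. This is only an illustration under assumptions: the names NodeState, after_probe, during_window and at_window_expiry are hypothetical and do not appear in the application, and the case where the window expires without enough abnormal reports is not specified by the text, so the sketch simply leaves the node marked abnormal.

```python
from enum import Enum


class NodeState(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


def after_probe(state: NodeState, probe_ok: bool) -> NodeState:
    """A failed probe moves an active node to abnormal (a broadcast would follow)."""
    if state is NodeState.ACTIVE and not probe_ok:
        return NodeState.ABNORMAL
    return state


def during_window(state: NodeState, got_active_broadcast: bool) -> NodeState:
    """An 'active' broadcast received inside the second preset period clears the mark."""
    if state is NodeState.ABNORMAL and got_active_broadcast:
        return NodeState.ACTIVE
    return state


def at_window_expiry(state: NodeState, abnormal_reports: int, preset_value: int) -> NodeState:
    """At expiry, enough abnormal reports from other nodes confirm unavailability.

    Assumption: with too few reports the node stays abnormal; the text does
    not say what happens in that case.
    """
    if state is NodeState.ABNORMAL and abnormal_reports > preset_value:
        return NodeState.UNAVAILABLE
    return state
```

Keeping each transition as a pure function makes the rules easy to check in isolation; a real node would combine them with timers and network broadcasts.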

Optionally, the detecting whether the first node is an active node includes:

sending a probe message to the first node;

If no correct response is received from the first node, send a probe message to the first node again, or randomly select another node as the second node and send an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to this node, wherein the indirect probe request includes the address of the first node;

If the correct response from the first node is still not received, it is determined that the first node is not an active node.
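A minimal sketch of this direct-then-indirect probing follows. The function name is_active and the two callables are hypothetical stand-ins for the real network operations; the choice to prefer an indirect probe when a helper node exists, and to fall back to a direct retry otherwise, is an assumption, since the text allows either option.

```python
import random
from typing import Callable, Optional, Sequence


def is_active(
    target: str,
    peers: Sequence[str],
    direct_probe: Callable[[str], bool],
    indirect_probe: Callable[[str, str], bool],
    rng: Optional[random.Random] = None,
) -> bool:
    """Return True if `target` answers a direct, retried, or indirect probe.

    direct_probe(target) and indirect_probe(helper, target) are hypothetical
    callables that return True when a correct response is obtained.
    """
    if direct_probe(target):
        return True
    # Candidate "second nodes": any known peer other than the target itself.
    helpers = [p for p in peers if p != target]
    if helpers:
        helper = (rng or random).choice(helpers)
        # The indirect request carries the target's address to the helper.
        return indirect_probe(helper, target)
    # No helper available: send the probe message to the target once more.
    return direct_probe(target)
```

Passing the probes in as callables keeps the decision logic separate from the transport, so the same function could sit on top of ICMP, TCP, or an application-level health check.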

Optionally, receiving broadcast messages from other nodes that the first node is an abnormal node multiple times within the second preset duration includes:

Start a counter after sending the broadcast message that the first node is an abnormal node;

Within the second preset duration, whenever receiving a broadcast message from other nodes that the first node is an abnormal node, the counter is incremented by 1;

When the counter is greater than the preset value, it is determined that the broadcast message that the first node is an abnormal node is received multiple times from other nodes within the second preset time period.
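The counter described above could look like the following sketch. The class name AbnormalReportCounter and its parameters are hypothetical; timestamps are passed in explicitly (in seconds) so that the behaviour does not depend on a real clock.

```python
class AbnormalReportCounter:
    """Counts 'abnormal' broadcasts from other nodes inside a time window."""

    def __init__(self, started_at: float, window_seconds: float, preset_value: int) -> None:
        self.started_at = started_at          # when this node sent its own abnormal broadcast
        self.window_seconds = window_seconds  # the second preset duration
        self.preset_value = preset_value      # threshold the count must exceed
        self.count = 0

    def record(self, now: float) -> bool:
        """Record one abnormal broadcast; reports outside the window are ignored."""
        if self.started_at <= now <= self.started_at + self.window_seconds:
            self.count += 1
            return True
        return False

    def confirmed(self) -> bool:
        """True once the counter is greater than the preset value."""
        return self.count > self.preset_value
```

For example, with a preset value of 1, two reports received inside the window are enough for confirmed() to return True, while a report arriving after the window has closed does not count.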

Optionally, each node monitors the node status of each node through mutual detection, and further includes:

For each node:

When receiving a broadcast message that the third node is an unavailable node, if the third node is not marked as an unavailable node on the current node, marking the third node as an unavailable node on the current node, and sending a broadcast message that the third node is an unavailable node in the cluster to form re-propagation;

When receiving a broadcast message stating that this node is an abnormal node, or a broadcast message stating that this node is an unavailable node, send a broadcast message in the cluster stating that this node is an active node, so as to correct the node status that other nodes have recorded for this node.

For each node:

When the current node leaves the cluster, a broadcast message that the current node is an unavailable node is sent in the cluster.
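The passive rules above (re-propagating an unavailable notice, refuting a wrong notice about oneself, and announcing departure) can be sketched as a single message handler. The function names and the representation of a message as a (subject, state) tuple are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (node the message is about, claimed state)


def handle_broadcast(self_id: str, table: Dict[str, str], subject: str, claimed: str) -> List[Message]:
    """Update the local status table and return the messages to broadcast in reply."""
    if subject == self_id:
        # Someone believes this node is abnormal or unavailable: refute it.
        if claimed in ("abnormal", "unavailable"):
            return [(self_id, "active")]
        return []
    if claimed == "unavailable":
        if table.get(subject) != "unavailable":
            table[subject] = "unavailable"
            return [(subject, "unavailable")]  # re-propagate once
        return []  # already recorded: ignore
    if claimed == "active":
        table[subject] = "active"
    # "abnormal" notices about other nodes are counted elsewhere, not handled here.
    return []


def leave_message(self_id: str) -> Message:
    """Message a node broadcasts about itself when it leaves the cluster."""
    return (self_id, "unavailable")
```

Because a node re-broadcasts an unavailable notice only the first time it changes its own record, the re-propagation stops by itself once every node has recorded the state.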

Optionally, the system also includes:

a storage component, configured to store data processed by the core service cluster;

An alarm component, configured to send alarm information according to a second preset strategy when it is found that the log information of the monitored object is abnormal, abnormal nodes appear in the cluster and/or unavailable nodes appear in the cluster;

The data visualization component is used to display the log information and/or the alarm information.

According to a second aspect of the embodiments of the present application, a method for improving the availability of a log aggregation system is provided, the method being used by nodes in the log aggregation system. The log aggregation system is used to collect log information of monitored objects and includes a reverse proxy component and a core service cluster composed of a plurality of nodes. The reverse proxy component is used to receive the log information of the monitored object, select one node from the plurality of nodes as the target node according to a first preset policy, and send the log information to the target node. The nodes are used to perform preset processing on the received log information, and the nodes monitor one another's node status through mutual detection, the node status being divided into active nodes, abnormal nodes, and unavailable nodes;

The methods include:

Optionally, the multiple nodes include at least three nodes.

Optionally, the detecting whether the first node is an active node includes:

sending a probe message to the first node;

Optionally, the method also includes:

Optionally, the log aggregation system further includes:

The technical solutions provided by the embodiments of the present application may include the following beneficial effects:

In the embodiments of this application, a log aggregation system including a reverse proxy component and a cluster is provided. The cluster adopts a distributed design and can include multiple nodes. The reverse proxy component and the cluster cooperate with each other, so that not only can the collection and processing of monitored-object logs and load balancing be realized, but the stability of the system is also improved through the redundant deployment of multiple nodes and mutual monitoring between nodes, ensuring the high availability of the system itself. If a problem occurs in one or two nodes, the service functions that the system provides externally are not affected. In addition, alarms can be raised for detected problems, so that the operation and maintenance team or technical team can find and locate problems in time.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Brief description of the drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can also obtain other drawings from these drawings without creative effort. In addition, these descriptions do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements. Unless otherwise specified, the figures in the drawings do not constitute a limitation on scale.

FIG. 1 is a schematic diagram of a log aggregation system provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a scene of an embodiment of the present application;

Figure 3 is a schematic diagram of the interior of the Loki cluster in the embodiment of the present application;

Fig. 4 is a schematic diagram of a method for improving the availability of a log aggregation system provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application. When referring to the drawings, unless otherwise stated, the same numerals in different drawings identify the same or similar elements. Apparently, the embodiments described below are only some of the embodiments of the application, but not all of the embodiments, or the implementations described in the following exemplary embodiments do not represent all implementations consistent with the application. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

When the terms "first", "second", "third", etc. appear in the specification, claims and above-mentioned drawings of the embodiments of the present application, they are used to distinguish different objects rather than to define a specific order. In the embodiments of the present application, words such as "exemplary" or "for example" are used to introduce examples, illustrations, or explanations. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be construed as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete manner.

Fig. 1 is a schematic diagram of a log aggregation system provided by an embodiment of the present application. The log aggregation system can be used to collect log information of monitored objects. The log aggregation system includes a reverse proxy component and a core service cluster composed of multiple nodes.

The reverse proxy component is used to receive the log information of the monitored object, select, according to the first preset policy, a node from the multiple nodes of the core service cluster as the target node, and send the log information to the target node.

This embodiment does not limit the strategy by which the reverse proxy component selects a node from the multiple nodes of the core service cluster as the target node, that is, the specific content of the first preset policy. Those skilled in the art can make their own selection and design according to different requirements or scenarios, and any such selection or design usable here falls within the spirit and protection scope of the present application.

As an example, the first preset policy may include one or more strategies such as round robin, node weight distribution, least connections, and best performance.
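Two of the strategies named above, round robin and least connections (the "least link method"), can be sketched in a few lines. The function names and the way connection counts are represented are hypothetical; an actual reverse proxy would normally provide these policies through its own configuration rather than custom code.

```python
import itertools
from typing import Dict, Iterator, Sequence


def round_robin(nodes: Sequence[str]) -> Iterator[str]:
    """Yield nodes in a fixed rotation, one target per incoming request."""
    return itertools.cycle(nodes)


def least_connections(open_connections: Dict[str, int]) -> str:
    """Pick the node that currently holds the fewest open connections."""
    return min(open_connections, key=open_connections.get)


picker = round_robin(["node-1", "node-2", "node-3"])
first_four = [next(picker) for _ in range(4)]
# first_four == ["node-1", "node-2", "node-3", "node-1"]
```

Round robin spreads requests evenly without needing any state about the nodes, whereas least connections adapts to uneven request durations at the cost of tracking per-node load.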

As an example, the plurality of nodes may specifically include at least three nodes. A node can be, for example, a server.

In addition, this embodiment does not limit the processing of the received log information by the node, that is, the specific content of the preset processing.

As an example, the preset processing may include parsing logs, indexing log content, adding labels, removing duplicates, compressing and storing, and so on. For example, the date of the log is extracted for indexing to facilitate subsequent retrieval of logs by date. Another example is to extract the source of the log for indexing, such as distinguishing the logs of the Apache service from the logs of the database. The logs can also be labeled to distinguish between Metrics and Log, or between Info level, Warning level, and Error level, and so on.
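As a rough illustration of this kind of preset processing, the sketch below extracts a date and a level label from a log line, attaches a source label, and removes duplicate lines. The log line format, the regular expression, and the function names are assumptions chosen for the example; the application does not prescribe any particular format.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical line format: "<YYYY-MM-DD> <time> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})[T ](?P<time>\S+)\s+"
    r"(?P<level>INFO|WARNING|ERROR)\s+(?P<msg>.*)$"
)


def label_log(line: str, source: str) -> Dict[str, object]:
    """Attach index labels (date, level, source) to a raw log line."""
    labels = {"source": source}
    match = LINE_PATTERN.match(line)
    if match is not None:
        labels["date"] = match["date"]
        labels["level"] = match["level"].lower()
    return {"labels": labels, "line": line}


def remove_duplicates(lines: Iterable[str]) -> List[str]:
    """Drop repeated lines while keeping the first occurrence and the order."""
    return list(dict.fromkeys(lines))
```

Only the small label set is indexed while the line itself is stored unchanged, which mirrors the idea of indexing metadata rather than full log content.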

In this embodiment or some other embodiments of this application, the node states can be specifically divided into three types: active nodes, abnormal nodes, and unavailable nodes;

The nodes monitor the node status of each node through mutual detection, which may specifically include:

For each node:

1) Selecting another node as the first node every first preset time interval, and detecting whether the first node is an active node.

For example, a node may be selected as the first node among other nodes in the cluster other than the own node by polling or randomly.

This embodiment does not limit how to specifically detect whether the first node is an active node. As an example, the detecting whether the first node is an active node may include:

Send a probe message to the first node (for example, a PING message);

2) If the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and send a broadcast message that the first node is an abnormal node in the cluster.

3) If a broadcast message stating that the first node is an active node is received within a second preset time period after sending the broadcast message that the first node is an abnormal node, mark the node status of the first node as an active node.

4) If, when the second preset time period expires, the node status of the first node recorded in this node is still abnormal, and broadcast messages from other nodes stating that the first node is an abnormal node have been received multiple times within the second preset time period, mark the node status of the first node as an unavailable node, and send a broadcast message in the cluster stating that the first node is an unavailable node.

As an example, receiving multiple broadcast messages from other nodes that the first node is an abnormal node within the second preset time length may specifically include:

This embodiment does not limit the specific values of the first preset duration, the second preset duration, "multiple times", the preset value, and so on. Those skilled in the art can choose and design them according to different requirements or scenarios, and such choices and designs may be used here without departing from the spirit and scope of the application.

The above mainly describes the operations by which a node actively detects the status of other nodes. In practice, a node may also be detected by other nodes, or receive messages broadcast by other nodes, so the mutual detection by which the nodes monitor one another's node status may also include:

For each node:

In addition, when a node leaves the cluster, it can inform other nodes in the cluster that the node is unavailable. Therefore, each node monitors the node state of each node through mutual detection, and may also include:

For each node:

In addition, for the convenience of storage, alarm, monitoring, etc., the system may also include:

In this embodiment, a log aggregation system including a reverse proxy component and a cluster is provided. The cluster adopts a distributed design and may include multiple nodes. The reverse proxy component and the cluster cooperate with each other, so that not only can the collection and processing of monitored-object logs and load balancing be realized, but the redundant deployment of multiple nodes and mutual monitoring between nodes also improve the stability of the system and ensure the high availability of the system itself. If there is a problem with one or two nodes, the service functions that the system provides externally are not affected. In addition, alarms can be raised for detected problems, so that the operation and maintenance team or technical team can find and locate problems in time.

Let's take Loki as an example and further describe the solution of this application in combination with specific application scenarios. Of course, taking the Loki application scenario as an example is only exemplary, and it can also be applied to other application scenarios in actual applications. For example, the three-node deployment method of this application can also be adopted to ensure the high availability of the Prometheus cluster.

Loki is a log aggregation system, which mainly compresses and stores unstructured log data and only indexes the metadata of log data (including: timestamp, labels, etc.). The Loki service receives the log data pushed from the Promtail component, and distributes the log data to the internal Ingester component. The index data and log content data of the log data are stored by the Ingester component. Unlike Prometheus, Loki can only passively accept log pushes from Promtail components.

However, in the course of implementing the solution of this application, the inventors found that Loki's main service cannot guarantee its own high availability. When Loki itself fails, if the monitored object also fails, Loki cannot detect and report the problem in time, and a large amount of log data becomes backlogged on the monitored object, degrading the performance of the monitored object and wasting hard disk space.

Therefore, in this embodiment, a Loki service core cluster is constructed. It can be composed of three nodes that monitor one another's status. If a problem occurs in one or two nodes of the Loki service core cluster, the cluster's external log collection and aggregation services are not affected, and the alarm component alerts the operation and maintenance team or technical team to the fault and requests manual intervention.

Fig. 2 is a schematic diagram of a scenario of the embodiment of the present application, and Fig. 3 is a schematic diagram of the interior of the Loki cluster in the embodiment of the present application. In Figure 2, the Loki cluster is externally connected to components such as Promtail, Node Exporter, alarm component, data visualization component, and storage pool (including index Index, log Log, and indicator Metrics). The Prometheus cluster is a tool cluster used to capture the Metrics (indicators) data of the monitoring target. For Promtail, Node Exporter, Prometheus, etc., they are all existing technologies and will not be described here. In Figure 3, the Loki cluster may include reverse proxy components and Loki core service clusters. The following is a detailed introduction:

Loki core service cluster: connects to the monitoring target, and collects and stores the various logs of the monitoring target. It can also serve log queries from the visualization component. Within the cluster, three nodes are networked redundantly, and the reverse proxy component provides the reverse proxy.

Reverse proxy component: The client (visualization component and Promtail component) sends a request to the reverse proxy component, and then the reverse proxy component forwards the request to the appropriate Loki node in the backend according to the preset policy. In addition, in order to ensure the high availability of the reverse proxy component, the active and standby disaster recovery can be implemented inside the reverse proxy component. The active and standby are connected by a heartbeat line. For backend Loki nodes.

Promtail component: responsible for collecting logs of monitored objects, and sending the logs to the reverse proxy component of the Loki cluster, so that the reverse proxy component sends the log transmission request to the appropriate Loki node according to the preset policy, and then stores it after processing to the storage pool.

Monitoring object: that is, the monitored object, or monitoring target. The Promtail and Node Exporter components need to be installed on the monitoring object so that the Loki cluster and the Prometheus cluster can obtain data. The Promtail component actively pushes log data to the Loki cluster, while the Node Exporter component exposes an HTTP interface of the monitored object to Prometheus, so that Prometheus can scrape metrics periodically.

Storage pool: used to store log information and indicators obtained by the Loki cluster. Considering the read and write performance of log and indicator data, it is recommended to use the storage pool of the object storage service. In order to improve the retrieval efficiency of Loki logs, it is recommended that the Index of the log data retrieved by Loki be placed in a high-performance memory database.

Alarm component: used to receive alarm information and send the alarm information to preset contacts. By making different Hooks, the alarm information can be sent to various IM communication tools (such as WeChat, Slack, Teams, Lync, etc.), email, phone, SMS, etc.
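The idea of registering different hooks for different delivery channels can be sketched as follows. The class AlarmDispatcher and the channel names are hypothetical; each hook is simply a callable that accepts the alarm text, and the real integrations (IM tools, email, SMS, phone) are deliberately left out.

```python
from typing import Callable, Dict, List

Hook = Callable[[str], None]


class AlarmDispatcher:
    """Sends one alarm message to every registered channel hook."""

    def __init__(self) -> None:
        self._hooks: Dict[str, Hook] = {}

    def register(self, channel: str, hook: Hook) -> None:
        self._hooks[channel] = hook

    def send(self, message: str) -> List[str]:
        """Deliver the message and return the channels that accepted it."""
        delivered = []
        for channel, hook in self._hooks.items():
            try:
                hook(message)
            except Exception:
                # A failing channel must not prevent delivery on the others.
                continue
            delivered.append(channel)
        return delivered
```

Isolating each hook behind the same callable signature means a new channel can be added without touching the dispatch loop.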

Further, as an example, the alarm component may use the Alertmanager component, which supports querying logs and metrics and provides flexible alerting methods. When, according to preset rules, an anomaly is detected in the logs or metrics of a monitored object (for example, Metrics information such as CPU usage, memory usage, hard disk usage, network card throughput, hard disk IOPS, API response performance, or network request error rate), the alarm information is pushed to the Alertmanager component. When Alertmanager receives an alert, it can perform configuration, aggregation, deduplication, and noise reduction, and finally send the alert. The Alertmanager component can send alarm information to the operation and maintenance team or technical team by email, SMS, or phone, so that they can find and locate problems in time. It can also be linked to third-party operation and maintenance monitoring platforms, such as xMatters, Sumo Logic, or Splunk.

Data visualization component: used to aggregate and display monitoring data, log data, and alarm information. Monitoring data includes, for example, the system's Metrics data, such as the CPU usage, memory usage, hard disk usage, and network card throughput of a server collected every 5 minutes, from which a line chart is drawn and displayed. When the visualization component needs to display log or alarm information, it sends a query request for that information to the reverse proxy component. After receiving the request, the reverse proxy component forwards it to the Query-frontend component, which communicates with the Querier service on each Loki node to obtain the logs or alarm information stored in the storage pool and replies to the reverse proxy component; the result is finally displayed on the interface of the visualization component.

The above Loki cluster uses a redundant three-node cluster internally, and uses a reverse proxy component including active and standby disaster recovery externally to provide reverse proxy functions, and only exposes one interface to users.

In specific implementation, a complete set of Loki core services Distributor, Ingester, and Querier can be deployed on each node. Since these three components belong to the existing technology, they will not be described in detail. In this embodiment, a node monitoring component is newly deployed on each node, so that each node can monitor the node status of each node through mutual detection.

In addition, since the node monitoring component runs in memory, a UPS power supply can be installed for the storage device, together with a policy under which, when the storage device detects that mains power has been lost for more than 10 minutes, the system is forced to shut down safely on UPS power and the data in memory is dumped to disk.
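The shutdown policy could be expressed roughly as below. This is a sketch under assumptions: the function names, the use of JSON for the dump, and the explicit timestamps are all hypothetical, and detecting the mains failure itself is outside the scope of the example.

```python
import json
import os
from typing import Dict, Optional

POWER_LOSS_LIMIT_SECONDS = 10 * 60  # the 10-minute limit mentioned in the text


def should_shut_down(power_lost_at: Optional[float], now: float,
                     limit: float = POWER_LOSS_LIMIT_SECONDS) -> bool:
    """True once mains power has been lost for longer than the limit."""
    if power_lost_at is None:  # mains power is present
        return False
    return now - power_lost_at > limit


def dump_state(state: Dict[str, str], path: str) -> None:
    """Write the in-memory node status table to disk before shutting down."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)     # atomically replace any previous dump
```

Writing to a temporary file and renaming it avoids leaving a half-written dump behind if power is lost during the write.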

The status of nodes can be divided into three types: active nodes, abnormal nodes and unavailable nodes. The node failure detection principle of the node monitoring component may include:

i) After a node (for example, node A) starts, it selects another node (for example, node B), for instance by polling, at regular time intervals and sends it a PING message. If the PING fails, node A can send the PING message to node B again, or randomly select another node (such as node C) and send it an indirect PING request. Node C, on receiving the indirect PING request, sends a PING to node B according to the address in the request message and returns the PING result to the source of the indirect request, that is, node A. If node A has not received any ACK message from node B when the detection times out, it marks node B as an abnormal node. Node A then starts a timer and broadcasts an abnormal warning for node B. During the timer period, each time node A receives the same abnormal warning for node B from another node, it increments its local count of abnormal reports by 1. If, when the timer expires, node B's state is still not active and the count meets the requirement, node A marks node B locally as an unavailable node.

ii) When the suspected node (such as node B) receives an abnormal-warning message about itself sent by another node, it immediately broadcasts that it is an active node, thereby clearing the abnormal-node mark recorded for it on the other nodes.

iii) When a node (such as node A) leaves the cluster, it broadcasts to the cluster that it is an unavailable node; when node A marks another node (such as node B) as an unavailable node, it likewise broadcasts to the cluster that node B is an unavailable node.

iv) When another node (such as node C) receives a broadcast message stating that a node (such as node B) is an unavailable node, it compares the message with its local record. If node B is already an unavailable node in the local record, the message is ignored. If not, the original local record is deleted, node B is marked as an unavailable node, and the message that node B is an unavailable node is broadcast again, forming a further round of propagation.

v) If a node (for example, node B) receives a broadcast message stating that it is itself an unavailable node, this indicates that the node has been partitioned from the other nodes. The node then broadcasts that it is an active node, so as to correct the state mark stored for it on the other nodes.
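To see why the re-propagation in rule iv) terminates, the following sketch simulates an unavailable notice spreading through a small cluster. The function flood_unavailable is hypothetical, delivery is assumed to be reliable and instantaneous, and the node being reported is assumed to be unreachable, so it takes no part in the exchange.

```python
from collections import deque
from typing import Dict, List, Tuple


def flood_unavailable(nodes: List[str], origin: str,
                      subject: str) -> Tuple[Dict[str, Dict[str, str]], int]:
    """Simulate rule iv): each node rebroadcasts only when its record changes.

    Returns the per-node status tables and the total number of broadcasts.
    """
    tables: Dict[str, Dict[str, str]] = {n: {} for n in nodes if n != subject}
    tables[origin][subject] = "unavailable"
    pending = deque([origin])  # nodes that still have to send their broadcast
    broadcasts = 0
    while pending:
        sender = pending.popleft()
        broadcasts += 1
        for receiver, table in tables.items():
            if receiver == sender:
                continue
            if table.get(subject) == "unavailable":
                continue  # already recorded locally: ignore the message
            table[subject] = "unavailable"
            pending.append(receiver)  # record changed: propagate once more
    return tables, broadcasts


tables, total = flood_unavailable(["A", "B", "C"], origin="A", subject="B")
```

In this three-node example both remaining nodes end up with node B marked unavailable after two broadcasts, because every live node sends the notice at most once.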

In this way, the nodes in the cluster are connected to one another, and the status of each node can be confirmed through the node monitoring component. If another node or a node's own services fail, the alarm component can send the alarm information to third-party operation and maintenance monitoring platforms such as xMatters, Sumo Logic, or Splunk, or send it directly to the operation and maintenance team or technical team by email, SMS, or phone, so that they can discover and locate problems in a timely manner.

In addition, in order to ensure data consistency, an arbitration mechanism can also be added to the node monitoring component. The arbitration mechanism is realized by mutual monitoring of each node, so as to avoid the risk of data split brain in the cluster under severe conditions.

The following are method embodiments of the present application, which can be used to implement the system embodiments of the present application. For details not disclosed in the method embodiments of the present application, please refer to the system embodiments of the present application.

Fig. 4 is a schematic diagram of a method for improving the availability of a log aggregation system provided by an embodiment of the present application. The method can be used by nodes in a log aggregation system. The log aggregation system is used to collect log information of monitored objects and includes a reverse proxy component and a core service cluster composed of a plurality of nodes. The reverse proxy component is used to receive the log information of the monitored object, select one node from the plurality of nodes as a target node according to a first preset policy, and send the log information to the target node. The nodes are used to perform preset processing on the received log information, and the nodes monitor one another's node status through mutual detection, the node status being divided into active nodes, abnormal nodes, and unavailable nodes;

Referring to Figure 4, the method may include:

Step S401: at every first preset time interval, select another node as the first node, and detect whether the first node is an active node.

Step S402: if the first node is not an active node, mark the node status of the first node as an abnormal node in this node, and send in the cluster a broadcast message that the first node is an abnormal node.

Step S403: if a broadcast message that the first node is an active node is received within a second preset time period after sending the broadcast message that the first node is an abnormal node, mark the node status of the first node as an active node.

Step S404: if, when the second preset time period expires, the node status of the first node in this node is still an abnormal node, and this node has received, multiple times within the second preset time period, the broadcast message sent by other nodes that the first node is an abnormal node, mark the node status of the first node as an unavailable node, and send in the cluster a broadcast message that the first node is an unavailable node.
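Steps S401 to S404 can be illustrated by the following minimal sketch. It is not the implementation of the present application: the class `Monitor`, its method names, the `outbox` list standing in for cluster broadcasts, and the handling of repeated probe failures are all assumptions made for illustration, and time is passed in explicitly so that the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    ACTIVE = "active"
    ABNORMAL = "abnormal"
    UNAVAILABLE = "unavailable"


@dataclass
class Monitor:
    """One node's view of its peers (all names here are hypothetical)."""
    name: str
    second_preset: float   # the "second preset time period", in seconds
    threshold: int         # reports needed to count as "multiple times"
    states: dict = field(default_factory=dict)    # peer -> NodeState
    deadline: dict = field(default_factory=dict)  # peer -> end of open window
    reports: dict = field(default_factory=dict)   # peer -> reports from others
    outbox: list = field(default_factory=list)    # broadcasts to be sent

    def on_probe_result(self, peer, alive, now):
        # S401/S402: a failed probe marks the peer abnormal and broadcasts it.
        if alive:
            self.states[peer] = NodeState.ACTIVE
            self.deadline.pop(peer, None)
            return
        # Assumption: do nothing new while a window is open or once the peer
        # is already unavailable; the text does not specify this case.
        if peer in self.deadline or self.states.get(peer) is NodeState.UNAVAILABLE:
            return
        self.states[peer] = NodeState.ABNORMAL
        self.deadline[peer] = now + self.second_preset
        self.reports[peer] = 0
        self.outbox.append((peer, NodeState.ABNORMAL))

    def on_broadcast(self, peer, state, now):
        # Only broadcasts arriving inside the open window are considered.
        if peer not in self.deadline or now > self.deadline[peer]:
            return
        if state is NodeState.ACTIVE:
            # S403: an "active" broadcast restores the peer.
            self.states[peer] = NodeState.ACTIVE
            del self.deadline[peer]
        elif state is NodeState.ABNORMAL:
            self.reports[peer] += 1

    def on_tick(self, now):
        # S404: at expiry, enough reports from others make the peer unavailable.
        for peer, end in list(self.deadline.items()):
            if now < end:
                continue
            del self.deadline[peer]
            still_abnormal = self.states[peer] is NodeState.ABNORMAL
            if still_abnormal and self.reports[peer] > self.threshold:
                self.states[peer] = NodeState.UNAVAILABLE
                self.outbox.append((peer, NodeState.UNAVAILABLE))
```

In this sketch a peer becomes unavailable only when this node's own failed probe is corroborated by reports from other nodes, which reflects the intent of step S404 that a single node's view is not sufficient on its own.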

In this embodiment or some other embodiments of the present application, the multiple nodes may specifically include at least three nodes.

In this embodiment or some other embodiments of the present application, the detecting whether the first node is an active node may specifically include:

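sending a probe message to the first node;

if no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to this node, wherein the indirect probe request includes the address of the first node;

if a correct response from the first node is still not received, determining that the first node is not an active node.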
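The detection of whether the first node is an active node, using a direct probe followed by either a repeated probe or an indirect probe through a randomly selected second node as set out in claim 11, can be sketched as follows. The function `is_active`, the `send` callback, and the message dictionaries are hypothetical names chosen for illustration; the choice to prefer the indirect probe whenever a helper node exists is likewise an assumption.

```python
import random


def is_active(first, peers, send, rng=None):
    """Return True if the first node answers directly or via a second node.

    `send(target, message)` is a hypothetical transport hook that returns
    True when a correct response is received from `target`.
    """
    rng = rng or random.Random()

    # Direct probe of the first node.
    if send(first, {"type": "probe"}):
        return True

    # No correct response: either probe again or ask a second node.
    helpers = [p for p in peers if p != first]
    if helpers:
        second = rng.choice(helpers)  # randomly selected second node
        # The indirect request carries the address of the first node.
        return send(second, {"type": "indirect_probe", "target": first})

    # No helper is available, so probe the first node once more.
    return send(first, {"type": "probe"})
```

The indirect probe helps distinguish a failed first node from a failed link between this node and the first node, since the second node reaches the first node over a different path.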

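In this embodiment or some other embodiments of the present application, receiving, multiple times within the second preset time period, the broadcast message sent by other nodes that the first node is an abnormal node may specifically include:

starting a counter after sending the broadcast message that the first node is an abnormal node;

within the second preset time period, incrementing the counter by 1 whenever a broadcast message that the first node is an abnormal node is received from another node;

when the counter is greater than a preset value, determining that the broadcast message that the first node is an abnormal node has been received multiple times from other nodes within the second preset time period.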
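The counter used to decide whether the abnormal-node broadcast has been received multiple times, as set out in claim 12, can be sketched as follows. The class name `AbnormalReportCounter` and its parameters are hypothetical, and timestamps are supplied by the caller as an assumption made to keep the example self-contained.

```python
class AbnormalReportCounter:
    """Counts abnormal-node broadcasts received inside the preset window."""

    def __init__(self, preset_value, window_seconds, started_at):
        self.preset_value = preset_value
        # The window opens when this node sends its own abnormal broadcast.
        self.expires_at = started_at + window_seconds
        self.count = 0

    def record(self, received_at):
        # Broadcasts arriving after the window has closed are ignored.
        if received_at <= self.expires_at:
            self.count += 1

    def confirmed(self):
        # "Multiple times" means strictly more than the preset value.
        return self.count > self.preset_value
```

For example, with a preset value of 2, a third report received within the window is the one that makes `confirmed()` return true.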

In this embodiment or some other embodiments of the present application, the method may further include:

When receiving a broadcast message that this node is an abnormal node, or a broadcast message that this node is an unavailable node, send a broadcast message that this node is an active node in the cluster to modify other nodes' node status marks for this node.
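The self-correction described above, in which a node that is reported as abnormal or unavailable announces that it is active, can be sketched as follows. The function `handle_status_broadcast`, the string status values, and the `broadcast` callback are hypothetical and stand in for whatever messaging layer the cluster uses.

```python
def handle_status_broadcast(self_name, subject, status, broadcast):
    """Refute abnormal or unavailable reports that concern this node.

    `broadcast(message)` is a hypothetical hook that sends a message to
    the cluster. Returns True when a refutation was sent.
    """
    if subject == self_name and status in ("abnormal", "unavailable"):
        # Announce that this node is active so that the other nodes
        # correct their status marks for it.
        broadcast({"subject": self_name, "status": "active"})
        return True
    return False
```

A node that receives the resulting active broadcast within its second preset time period restores the status mark, as in step S403.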

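In this embodiment or some other embodiments of the present application, the log aggregation system further includes:

a storage component, configured to store data processed by the core service cluster;

an alarm component, configured to send alarm information according to a second preset policy when the log information of the monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;

a data visualization component, configured to display the log information and/or the alarm information.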

With regard to the methods in the above embodiments, the specific manner of performing operations in each step has been described in detail in the embodiments of related systems, and will not be repeated here.

This embodiment provides a method for improving the availability of a log aggregation system. The method is used for nodes in the log aggregation system, and the log aggregation system includes a reverse proxy component and a core service cluster composed of a plurality of nodes. The cluster adopts a distributed design and can include multiple nodes, and the reverse proxy component and the cluster cooperate with each other. This not only realizes the collection, processing, and load balancing of the logs of the monitored objects, but also, through the redundant deployment of multiple nodes and mutual monitoring between nodes, improves the stability of the system and ensures the high availability of the system itself. A problem with one or two nodes does not affect the service functions that the system provides externally. In addition, an alarm can be raised for detected problems, which helps the operation and maintenance team or technical team find and locate problems in time.
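The cooperation between the reverse proxy component and the cluster can be illustrated by the following sketch of target node selection. The first preset policy is not specified in this passage, so round-robin selection is used purely as an assumption; skipping nodes that are not active is also an assumption, and the names `RoundRobinProxy` and `pick_target` are hypothetical.

```python
from itertools import count


class RoundRobinProxy:
    """Selects a target node in turn, skipping nodes that are not active."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = count()  # monotonically increasing position

    def pick_target(self, is_active):
        # Try each node at most once per call, starting where the
        # previous call stopped.
        for _ in range(len(self.nodes)):
            node = self.nodes[next(self._next) % len(self.nodes)]
            if is_active(node):
                return node
        return None  # no active node is available
```

With three nodes, of which one is unavailable, log information in this sketch alternates between the two remaining nodes, so the service remains reachable from the outside.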

The above is only a preferred embodiment of the present application and does not limit the application in any form. Although the application has been disclosed above by way of the preferred embodiment, the embodiment is not intended to limit the application. Any person skilled in the art may, without departing from the scope of the technical solution of the present application, use the technical content disclosed above to make changes or modifications that result in equivalent embodiments. Any simple modification, equivalent replacement, or improvement made to the above embodiments in accordance with the technical essence of the present application, and within the spirit and principles of its technical solution, still falls within the protection scope of the technical solution of the present application.

Other embodiments of the present application will be readily apparent to those skilled in the art from consideration of the specification and practice of the approaches disclosed herein. This application is intended to cover any modifications, uses, or adaptations of the application that follow its general principles and include common knowledge or conventional technical means in the technical field that are not disclosed in the application. The specification and examples are to be considered exemplary only, with the true scope and spirit of the application indicated by the appended claims.

It should be understood that the present application is not limited to the precise constructions which have been described above and shown in the accompanying drawings, and various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A log aggregation system, characterized in that the log aggregation system is used to collect log information of monitored objects; the log aggregation system includes a reverse proxy component and a core service cluster composed of a plurality of nodes;

the reverse proxy component is used to receive the log information of the monitored object, select, according to a first preset policy, a node from the plurality of nodes of the core service cluster as a target node, and send the log information to the target node;

the nodes of the core service cluster are used to perform preset processing on the received log information, and each node monitors the node status of each node through mutual detection.
2. The system of claim 1, wherein the plurality of nodes comprises at least three nodes.
3. The system according to claim 1, wherein the node status is divided into active node, abnormal node, and unavailable node;

each node monitoring the node status of each node through mutual detection comprises:

for each node:

at every first preset time interval, selecting another node as the first node, and detecting whether the first node is an active node;

if the first node is not an active node, marking the node status of the first node as an abnormal node in this node, and sending in the cluster a broadcast message that the first node is an abnormal node;

if a broadcast message that the first node is an active node is received within a second preset time period after sending the broadcast message that the first node is an abnormal node, marking the node status of the first node as an active node;

if, when the second preset time period expires, the node status of the first node in this node is still an abnormal node, and the broadcast message sent by other nodes that the first node is an abnormal node has been received multiple times within the second preset time period, marking the node status of the first node as an unavailable node, and sending in the cluster a broadcast message that the first node is an unavailable node.
4. The system according to claim 3, wherein the detecting whether the first node is an active node comprises:

sending a probe message to the first node;

if no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to this node, wherein the indirect probe request includes the address of the first node;

if a correct response from the first node is still not received, determining that the first node is not an active node.
5. The system according to claim 3, wherein receiving, multiple times within the second preset time period, the broadcast message sent by other nodes that the first node is an abnormal node comprises:

starting a counter after sending the broadcast message that the first node is an abnormal node;

within the second preset time period, incrementing the counter by 1 whenever a broadcast message that the first node is an abnormal node is received from another node;

when the counter is greater than a preset value, determining that the broadcast message that the first node is an abnormal node has been received multiple times from other nodes within the second preset time period.
6. The system according to claim 3, wherein each node monitoring the node status of each node through mutual detection further comprises:

for each node:

when receiving a broadcast message that a third node is an unavailable node, if the third node is not marked as an unavailable node on this node, marking the third node as an unavailable node on this node, and sending in the cluster a broadcast message that the third node is an unavailable node, so as to propagate the message further;

when receiving a broadcast message that this node is an abnormal node, or a broadcast message that this node is an unavailable node, sending in the cluster a broadcast message that this node is an active node, so as to correct the node status marks that other nodes hold for this node.
7. The system according to claim 3, wherein each node monitoring the node status of each node through mutual detection further comprises:

for each node:

when this node leaves the cluster, sending in the cluster a broadcast message that this node is an unavailable node.
8. The system according to claim 1, further comprising:

a storage component, configured to store data processed by the core service cluster;

an alarm component, configured to send alarm information according to a second preset policy when the log information of the monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;

a data visualization component, configured to display the log information and/or the alarm information.
9. A method for improving the availability of a log aggregation system, characterized in that the method is used for nodes in the log aggregation system; the log aggregation system is used to collect log information of monitored objects and includes a reverse proxy component and a core service cluster composed of a plurality of the nodes; the reverse proxy component is used to receive the log information of the monitored object, select a node from the plurality of nodes as a target node according to a first preset policy, and send the log information to the target node; the node is used to perform preset processing on the received log information, and each node monitors the node status of each node through mutual detection, the node status being divided into active node, abnormal node, and unavailable node;

the method comprises:

at every first preset time interval, selecting another node as the first node, and detecting whether the first node is an active node;

if the first node is not an active node, marking the node status of the first node as an abnormal node in this node, and sending in the cluster a broadcast message that the first node is an abnormal node;

if a broadcast message that the first node is an active node is received within a second preset time period after sending the broadcast message that the first node is an abnormal node, marking the node status of the first node as an active node;

if, when the second preset time period expires, the node status of the first node in this node is still an abnormal node, and the broadcast message sent by other nodes that the first node is an abnormal node has been received multiple times within the second preset time period, marking the node status of the first node as an unavailable node, and sending in the cluster a broadcast message that the first node is an unavailable node.
10. The method of claim 9, wherein the plurality of nodes comprises at least three nodes.
11. The method according to claim 9, wherein the detecting whether the first node is an active node comprises:

sending a probe message to the first node;

if no correct response is received from the first node, sending a probe message to the first node again, or randomly selecting another node as a second node and sending an indirect probe request to the second node, so that the second node sends a probe message to the first node and returns the probe result to this node, wherein the indirect probe request includes the address of the first node;

if a correct response from the first node is still not received, determining that the first node is not an active node.
12. The method according to claim 9, wherein receiving, multiple times within the second preset time period, the broadcast message sent by other nodes that the first node is an abnormal node comprises:

starting a counter after sending the broadcast message that the first node is an abnormal node;

within the second preset time period, incrementing the counter by 1 whenever a broadcast message that the first node is an abnormal node is received from another node;

when the counter is greater than a preset value, determining that the broadcast message that the first node is an abnormal node has been received multiple times from other nodes within the second preset time period.
13. The method according to claim 9, characterized in that the method further comprises:

when receiving a broadcast message that a third node is an unavailable node, if the third node is not marked as an unavailable node on this node, marking the third node as an unavailable node on this node, and sending in the cluster a broadcast message that the third node is an unavailable node, so as to propagate the message further;

when receiving a broadcast message that this node is an abnormal node, or a broadcast message that this node is an unavailable node, sending in the cluster a broadcast message that this node is an active node, so as to correct the node status marks that other nodes hold for this node.
14. The method according to claim 9, characterized in that the method further comprises:

when this node leaves the cluster, sending in the cluster a broadcast message that this node is an unavailable node.
15. The method according to claim 9, wherein the log aggregation system further comprises:

a storage component, configured to store data processed by the core service cluster;

an alarm component, configured to send alarm information according to a second preset policy when the log information of the monitored object is found to be abnormal, an abnormal node appears in the cluster, and/or an unavailable node appears in the cluster;

a data visualization component, configured to display the log information and/or the alarm information.