CN114356710A

CN114356710A - Cluster data monitoring method and device, storage medium and electronic equipment

Info

Publication number: CN114356710A
Application number: CN202210002285.7A
Authority: CN
Inventors: 赵宇; 王东; 侯雪峰
Original assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Current assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-04-15

Abstract

The invention discloses a cluster data monitoring method, a cluster data monitoring device, a storage medium and electronic equipment. The method comprises the following steps: acquiring index data of each component in each cluster, wherein each cluster comprises a plurality of components, and each component corresponds to one index data; aggregating the index data in each cluster according to the service dimension and the cluster dimension to obtain cluster data of each cluster and service data of each service in each cluster; and monitoring each component in each cluster according to the index data in each cluster, monitoring each service in each cluster according to the service data in each cluster, and monitoring each cluster according to the cluster data of each cluster. The invention solves the technical problem of low cluster data monitoring alarm efficiency.

Description

Cluster data monitoring method and device, storage medium and electronic equipment

Technical Field

The invention relates to the field of computers, in particular to a cluster data monitoring method, a cluster data monitoring device, a storage medium and electronic equipment.

Background

In the prior art, servers often run a variety of components to provide services. Because the components differ in type and source, and the services they provide also differ, the data of the components, services and clusters cannot be accurately monitored, and alarms cannot be raised accurately.

Disclosure of Invention

The embodiment of the invention provides a cluster data monitoring method, a cluster data monitoring device, a storage medium and electronic equipment, and at least solves the technical problem of low cluster data monitoring alarm efficiency.

According to an aspect of an embodiment of the present invention, a cluster data monitoring method is provided, including: acquiring index data of each component in each cluster, wherein each cluster comprises a plurality of components, and each component corresponds to one index data; aggregating the index data in each cluster according to service dimensions and cluster dimensions to obtain cluster data of each cluster and service data of each service in each cluster; monitoring each of the components in each of the clusters according to the index data in each of the clusters, monitoring each of the services in each of the clusters according to the service data in each of the clusters, and monitoring each of the clusters according to the cluster data of each of the clusters.

According to another aspect of the embodiments of the present invention, there is provided a cluster data monitoring apparatus, including: the acquisition module is used for acquiring index data of each component in each cluster, wherein each cluster comprises a plurality of components, and each component corresponds to one index data; the aggregation module is used for aggregating the index data in each cluster according to service dimensions and cluster dimensions to obtain cluster data of each cluster and service data of each service in each cluster; a first monitoring module, configured to monitor each component in each cluster according to the index data in each cluster, monitor each service in each cluster according to the service data in each cluster, and monitor each cluster according to the cluster data of each cluster.

As an optional example, the apparatus further comprises: a registration module, configured to take each cluster as a current cluster and register the component processes of all components of the current cluster; a first detection module, configured to detect the component processes once every first duration; and a first reporting module, configured to report that a target process of a target component does not exist when acquisition of the target index data of the target component fails and the target process does not exist.

As an optional example, the apparatus further comprises: the second detection module is used for detecting the component service port corresponding to the component process every second time length; a second reporting module, configured to report that a target component service port is unavailable when a target process of the target component exists and a target component service port corresponding to the target component is unavailable under a condition that target index data of the target component fails to be obtained.

As an optional example, the apparatus further comprises: and a third reporting module, configured to report that a special exception exists in the target component when the target process of the target component exists and a service port of the target component corresponding to the target component is available.

As an optional example, the apparatus further comprises: the second monitoring module is used for monitoring the host parameter index of each host in each cluster by the monitoring script in each cluster; and the determining module is used for determining that the corresponding host has abnormity under the condition that the host parameter index exceeds a normal data range.

As an optional example, the obtaining module includes: a calling unit, configured to call Java Management Extensions (JMX) to acquire the CPU utilization rate, the process survival state, the read-write rate, the write-read delay and the load of each component.

As an optional example, the first monitoring module comprises: a first determination unit configured to determine that any one of the first components is abnormal when the index data of the first component is out of a normal range; a second determining unit, configured to determine that the first service is abnormal when the service data of any one first service exceeds a normal range; a third determining unit, configured to determine that any one of the first clusters is abnormal when the cluster data of the first cluster exceeds a normal range.

As an optional example, the apparatus further comprises: and the prompting module is used for sending a monitoring short message to a target number, or dialing the target number, or sending a monitoring mail to a target account to prompt that the cluster or the service or the component is abnormal under the condition that any one of the cluster is abnormal or any one of the service is abnormal or any one of the component is abnormal.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium, in which a computer program is stored, where the computer program is configured to execute the above cluster data monitoring method when running.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the cluster data monitoring method through the computer program.

In the embodiment of the invention, index data of each component in each cluster is acquired, wherein each cluster comprises a plurality of components, and each component corresponds to one index data; the index data in each cluster is aggregated according to service dimensions and cluster dimensions to obtain cluster data of each cluster and service data of each service in each cluster; and each component in each cluster is monitored according to the index data, each service in each cluster is monitored according to the service data, and each cluster is monitored according to the cluster data. Because the index data of each component can be acquired and then aggregated by service dimension and cluster dimension, each component, each service and each cluster can be monitored separately according to the index data, the service data and the cluster data. This achieves accurate monitoring and alarming for components of different sources or types, as well as for services and clusters, and thereby solves the technical problem of low cluster data monitoring and alarm efficiency.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of an alternative method of cluster data monitoring according to an embodiment of the present invention;

FIG. 2 is a system diagram of an alternative method of cluster data monitoring according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of monitoring alarms of an alternative cluster data monitoring method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of index data acquisition of an optional cluster data monitoring method according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an alternative cluster data monitoring apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to a first aspect of the embodiments of the present invention, a cluster data monitoring method is provided, optionally, as shown in fig. 1, the method includes:

S102, acquiring index data of each component in each cluster, wherein each cluster comprises a plurality of components, and each component corresponds to one index data;

S104, aggregating the index data in each cluster according to service dimensions and cluster dimensions to obtain cluster data of each cluster and service data of each service in each cluster;

S106, monitoring each component in each cluster according to the index data in each cluster, monitoring each service in each cluster according to the service data in each cluster, and monitoring each cluster according to the cluster data of each cluster.

Optionally, the components in this embodiment may be components for providing services in a server, and the type and number of the components are not limited. The server can be a cloud server or a big data server.

In this embodiment, in the case of deploying components on servers, the number, types, and sources of the components in different server clusters may be different. Index data for components in the cluster may be collected. After the index data of each component is collected, the data can be sent to a data center, the data center aggregates the index data into data of service dimensions and cluster dimensions, and the index data, the data of the service dimensions and the data of the cluster dimensions are monitored. If the data center monitors that the index data of a certain component, service or cluster exceeds the normal data range, the component, service or cluster is determined to be abnormal, and warning information is sent to the target account.

In this embodiment, the same or different normal data ranges may be configured for different components and services. Different components correspond to different component tags to uniquely label the component. Different services correspond to different service designations. When the warning information is sent to the target account, a mark of an abnormal component or service can be sent to the target account, so that the target account can process the abnormal component or service. The target account can determine the abnormal type of the abnormal component or service or cluster through the alarm information, thereby quickly positioning the abnormality and completing maintenance.

By the method, index data of each component in each cluster can be acquired, then the data of the components are aggregated according to the service dimension and the cluster dimension, and each component, each service and the cluster are respectively monitored according to the index data, the service data and the cluster data after aggregation, so that the purposes of accurately monitoring and alarming the components with different sources or types and monitoring and alarming the services and the clusters are achieved.
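The three steps S102 to S106 can be illustrated with a minimal Python sketch. It is not the implementation of this embodiment: the sample layout (cluster, service, component, value), the cluster, service and component names, the use of the mean as the aggregation, and the single shared normal range are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical index data, one value per component: (cluster, service, component, value).
samples = [
    ("cluster-1", "storage", "component-1", 40.0),
    ("cluster-1", "storage", "component-2", 60.0),
    ("cluster-1", "query", "component-3", 95.0),
    ("cluster-2", "storage", "component-4", 20.0),
]

NORMAL_RANGE = (0.0, 90.0)  # assumed range, shared by all three levels for brevity


def aggregate(samples):
    """Aggregate component values by service dimension and by cluster dimension."""
    by_service = defaultdict(list)
    by_cluster = defaultdict(list)
    for cluster, service, _component, value in samples:
        by_service[(cluster, service)].append(value)
        by_cluster[cluster].append(value)
    service_data = {key: mean(values) for key, values in by_service.items()}
    cluster_data = {key: mean(values) for key, values in by_cluster.items()}
    return service_data, cluster_data


def out_of_range(data, normal_range=NORMAL_RANGE):
    """Return the sorted keys whose value lies outside the normal range."""
    low, high = normal_range
    return sorted(key for key, value in data.items() if not low <= value <= high)


component_data = {(c, s, comp): v for c, s, comp, v in samples}
service_data, cluster_data = aggregate(samples)

abnormal_components = out_of_range(component_data)
abnormal_services = out_of_range(service_data)
abnormal_clusters = out_of_range(cluster_data)
```

With this sample data, component-3 and the query service of cluster-1 fall outside the assumed range, while the cluster-level average of cluster-1 stays inside it, which shows why the three levels are monitored separately.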

As an optional example, the method further includes:

taking each cluster as a current cluster, and registering component processes of all components of the current cluster;

detecting the component process every other first time length;

and reporting that the target process does not exist under the condition that the target index data of the target component is not obtained and the target process of the target component does not exist.

Optionally, the first duration in this embodiment may be flexibly configured, for example, once every minute or once every 10 seconds. In addition to static configuration, the first duration may be adjusted according to the number of times the component has been abnormal: the larger the number of abnormalities, the shorter the first duration.

The component process is detected once every first duration. If the data of the component cannot be acquired, whether the process exists is detected. If the process does not exist, the information that the target process does not exist is reported.

By detecting the component process once every first duration, this embodiment can monitor the state of the component more accurately and raise alarms in a more timely manner, which improves the monitoring and alarm efficiency for the component.
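The relation between the number of abnormalities and the first duration could look like the following Python sketch. The text only states that the duration becomes shorter as the count grows; the specific rule used here (dividing a base interval by one plus the count, with a lower bound) and the names next_detection_interval, base_seconds and min_seconds are assumptions.

```python
def next_detection_interval(base_seconds, anomaly_count, min_seconds=10.0):
    """Return a detection interval that shrinks as the anomaly count grows.

    The division rule and the lower bound are illustrative assumptions.
    """
    if base_seconds <= 0 or anomaly_count < 0:
        raise ValueError("base_seconds must be positive and anomaly_count non-negative")
    interval = base_seconds / (1 + anomaly_count)
    # Never poll faster than the configured lower bound.
    return max(float(min_seconds), interval)
```

For example, with a base interval of 60 seconds, a component with no recorded abnormality is checked every 60 seconds, one with a single abnormality every 30 seconds, and a frequently failing component never more often than every 10 seconds.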

As an optional example, the method further includes:

detecting the component service port corresponding to the component process every second time length;

reporting that the target component service port is unavailable under the condition that target index data of the target component is not obtained and the target process of the target component exists and the target component service port corresponding to the target component is unavailable.

Optionally, in this embodiment, the component service port of each component process is detected once every second duration. When the index data of a component cannot be acquired, whether the target process exists is detected first; if it exists, whether the target component service port is available is then detected. If the target component service port is unavailable, a message that the target component service port is unavailable is reported.

As an optional example, the method further includes:

reporting that the target component has special exception under the condition that the target process of the target component exists and the service port of the target component corresponding to the target component is available.

Optionally, in this embodiment, if the index data of the target component cannot be obtained even though the process of the target component exists and the service port of the target component is available, the target component has a special exception, and a special warning message needs to be sent to the target account.
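The three reporting cases described in the optional examples above form a short decision chain, sketched here in Python. The enum Report, its member names and the function classify are hypothetical; only the order of the checks (index data, then process, then port) comes from the description.

```python
from enum import Enum


class Report(Enum):
    """Hypothetical labels for the outcomes described in the text."""
    OK = "index data acquired"
    PROCESS_MISSING = "target process does not exist"
    PORT_UNAVAILABLE = "target component service port is unavailable"
    SPECIAL_EXCEPTION = "target component has a special exception"


def classify(data_acquired, process_exists, port_available):
    """Decide what to report for one target component."""
    if data_acquired:
        return Report.OK
    if not process_exists:
        return Report.PROCESS_MISSING
    if not port_available:
        return Report.PORT_UNAVAILABLE
    # Process and port both look healthy, yet no index data could be obtained.
    return Report.SPECIAL_EXCEPTION
```

Ordering the checks from the cheapest and most definite cause to the least specific one means that the special exception is reported only after the ordinary explanations have been ruled out.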

As an optional example, the method further includes:

monitoring a host parameter index of each host in each cluster by a monitoring script in each cluster;

and determining that the corresponding host has abnormity under the condition that the host parameter index exceeds a normal data range.

Optionally, in this embodiment, when determining whether the index data of the component can be acquired, a host parameter index of the host where the current component is located may also be acquired. The host parameter index may include the disk occupancy rate and the like. The host parameter index is reported to the data center, where it is monitored.
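As a rough illustration of a host parameter index, the following Python sketch reads the disk occupancy of the host with the standard library and compares it with a normal data range. The function names, the default path and the range of 0 to 85 percent are assumptions; the embodiment itself refers to a monitoring script rather than to this code.

```python
import shutil


def disk_occupancy_percent(path="/"):
    """Return used disk space as a percentage of total space for the given path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard for virtual file systems that report no capacity
    return 100.0 * usage.used / usage.total


def host_is_abnormal(value, normal_range=(0.0, 85.0)):
    """Return True when the host parameter index is outside its normal range."""
    low, high = normal_range
    return not low <= value <= high
```

A monitoring loop would call disk_occupancy_percent on each host, report the value to the data center, and treat the host as abnormal when host_is_abnormal returns True.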

As an optional example, the obtaining of the index data of each component in the cluster collected by the collection module includes:

calling Java Management Extensions (JMX) to collect the CPU utilization rate, the process survival state, the read-write rate, the write-read delay and the load of each component.

Optionally, in this embodiment, JMX may be used to collect the index data of each component. The index data may include the CPU utilization rate, process survival state, read-write rate, write-read delay and load of the component. Statistics may be computed separately for each type of data.

Optionally, in this embodiment, if acquisition of the index data of a component fails, the acquisition may be retried. While retries are in progress, the regular acquisition at each predetermined interval still takes place. For example, with a predetermined interval of 10 minutes, if acquisition of the index data of the current component fails at 12:00, acquisition is retried before 12:10 and the number of retries is recorded. The regular acquisition at 12:10 still takes place and is not counted as a retry. If the index data is still not acquired at 12:10, acquisition is retried before 12:20 and the number of retries is recorded.
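The retry bookkeeping in the 12:00 to 12:10 example can be sketched in Python as follows. The sketch leaves out real timers and models one window as a regular acquisition followed by a bounded number of retries; the retry interval of 120 seconds, the function names and the convention that a failed acquisition returns None are assumptions.

```python
def retries_in_window(window_seconds, retry_interval_seconds):
    """Number of retries that fit strictly before the next regular acquisition."""
    if window_seconds <= 0 or retry_interval_seconds <= 0:
        raise ValueError("both durations must be positive integers")
    return (window_seconds - 1) // retry_interval_seconds


def acquire_in_window(fetch, max_retries):
    """Run one regular acquisition, then retry while it keeps failing.

    fetch is a callable returning the index data, or None on failure.
    Returns (data, retry_count); the regular acquisition is not counted as a retry.
    """
    data = fetch()  # regular acquisition at the window boundary, e.g. 12:00
    retries = 0
    while data is None and retries < max_retries:
        retries += 1  # only the repeated attempts are recorded
        data = fetch()
    return data, retries
```

With a 600 second window and a 120 second retry interval, retries_in_window(600, 120) gives 4, which corresponds to retries at 12:02, 12:04, 12:06 and 12:08; the attempt at 12:10 starts a new window and is a regular acquisition again.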

As an optional example, the monitoring each component in each cluster according to the index data in each cluster, and monitoring each service in each cluster according to the service data in each cluster, and monitoring each cluster according to the cluster data of each cluster includes:

determining that the first component is abnormal when the index data of any one first component is beyond a normal range;

determining that the first service is abnormal when the service data of any one first service exceeds a normal range;

determining that the first cluster is abnormal when the cluster data of any one first cluster exceeds a normal range.

Optionally, in this embodiment, statistics may be performed on the obtained index data according to the dimension of the component, the service, or the cluster, and after the statistics, the data may be monitored. And if the data is beyond the normal range, determining that the component or the service or the cluster is abnormal.

As an optional example, the method further includes:

and under the condition that any one cluster is abnormal or any one service is abnormal or any one component is abnormal, sending a monitoring short message to a target number, or dialing the target number, or sending a monitoring mail to a target account to prompt that the cluster or the service or the component is abnormal.

Optionally, in this embodiment, multiple alarm modes may be used, such as short message alarms, telephone call alarms and mail alarms.
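One way to keep the alarm modes interchangeable is to treat each mode as a callable, as in the Python sketch below. The channels here are stubs that only record what would be sent; the names make_channel, raise_alarm, target-number and target-account are hypothetical, and no real short message, telephone or mail gateway is contacted.

```python
def make_channel(kind, target, outbox):
    """Return a stub sender that records the alarm instead of calling a real gateway."""
    def send(message):
        outbox.append((kind, target, message))
    return send


def raise_alarm(level, name, channels):
    """Send one alarm about an abnormal cluster, service or component to each channel."""
    message = f"{level} {name} is abnormal"
    for send in channels:
        send(message)
    return message


outbox = []
channels = [
    make_channel("sms", "target-number", outbox),
    make_channel("mail", "target-account", outbox),
]
raise_alarm("component", "component-3", channels)
```

Replacing a stub with a function that calls an actual gateway would not change raise_alarm, so the set of alarm modes can be configured per target.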

Optionally, the index data of the component in this embodiment may include at least one of the CPU utilization rate, process survival state, read-write rate, read-write delay and load of the current component. Each item of index data may correspond to a normal data range. When the index data of the current component is acquired, whether each item of index data is within its corresponding normal data range is judged; if any item is outside its range, the component is determined to be a target component. In that case, alarm information is sent to the target account, and the alarm information may include the component mark of the target component and the type of the index data that is beyond the normal data range. For example, the alarm information may state that the CPU utilization rate of component 3 is too high, or that the read-write rate of component 5 is too slow.
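A per-index range check that produces alarm text of this kind could be sketched in Python as follows. The index names cpu_usage and read_write_rate, the numeric ranges and the function check_component are assumptions chosen to mirror the two examples in the paragraph above.

```python
# Assumed normal data ranges per index type: (low, high).
NORMAL_RANGES = {
    "cpu_usage": (0.0, 80.0),
    "read_write_rate": (50.0, float("inf")),
}


def check_component(tag, index_data, ranges=NORMAL_RANGES):
    """Return one alarm string per index that is outside its normal range."""
    alarms = []
    for index_type, value in index_data.items():
        if index_type not in ranges:
            continue  # no range configured for this index type
        low, high = ranges[index_type]
        if value > high:
            alarms.append(f"{index_type} of {tag} is too high")
        elif value < low:
            alarms.append(f"{index_type} of {tag} is too low")
    return alarms
```

An empty result means the component is normal; a non-empty result identifies both the component mark and the index type, which is the information the alarm is described as carrying.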

As an optional example, the method further comprises:

after acquiring the index data of each component in each cluster, aggregating the index data to obtain a service index of each service in the cluster and a cluster index of each cluster;

determining that the corresponding service is abnormal under the condition that the service index exceeds the normal data range of the service index;

and determining that the corresponding cluster has abnormity under the condition that the cluster index exceeds the normal data range of the cluster index.

Optionally, in this embodiment, after the index data of the components is obtained, the index data may be aggregated by service or by cluster. For example, if one service includes a plurality of components, the index data of the plurality of components is aggregated by summing, averaging, or taking the maximum or minimum, and the aggregation result is the service index. Whether the aggregation result is within the normal data range corresponding to the service is then judged: if so, the service is normal; if the service index exceeds the normal data range, the service is abnormal and a warning message needs to be sent to the target account. Similarly, if a cluster includes a plurality of components, the index data of the plurality of components is aggregated by summing, averaging, or taking the maximum or minimum, and the aggregation result is the cluster index. Whether the aggregation result is within the normal data range corresponding to the cluster is then judged: if so, the cluster is normal; if the cluster index exceeds the normal data range, the cluster is abnormal and a warning message needs to be sent to the target account.
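The effect of the chosen aggregation on the result can be seen in a small Python sketch. The dictionary AGGREGATORS, the function names, the three component values and the normal range of 0 to 90 are assumptions used only to make the four aggregation modes concrete.

```python
from statistics import mean

# The four aggregation modes named in the text.
AGGREGATORS = {"sum": sum, "avg": mean, "max": max, "min": min}


def aggregate_index(component_values, how):
    """Aggregate the index data of several components into one service or cluster index."""
    return AGGREGATORS[how](component_values)


def is_abnormal(index, normal_range):
    """Return True when the aggregated index is outside the normal data range."""
    low, high = normal_range
    return not low <= index <= high


values = [30.0, 50.0, 100.0]  # hypothetical index data of three components
service_avg = aggregate_index(values, "avg")
service_max = aggregate_index(values, "max")
```

Here the average of 60.0 is inside the assumed range while the maximum of 100.0 is outside it, so whether the service counts as abnormal depends on which aggregation the normal data range was defined for.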

Optionally, in this embodiment, the acquisition of the index data of the components is divided into two parts, as shown in FIG. 2.

FIG. 3 is an architecture diagram of the monitoring alarm of the present embodiment. Cluster 1 and cluster 2 are decentralized es clusters that include the distributed system infrastructure Hadoop and use an HBase database. A big data monitoring and collection service module is deployed in each of the plurality of clusters. According to preset collection indexes, the collection service module collects the indexes of the big data components through Java Management Extensions (JMX) and uploads them to a central proxy service. The preset collection indexes are configurable and may include, for example, at least one of the Central Processing Unit (CPU) utilization rate, survival state, read/write rate, write/read delay and load.

The central proxy service provides domain name resolution and load balancing. Data is written into a message queue of a proxy server, and the consumer side of the message queue uses the big data computing module Spark Streaming or the computing module Flink to perform aggregation computation on the indexes, producing service-level and cluster-level indexes. For example, if a service includes three components, the respective indexes of the three components can be acquired; the sum, average, maximum or minimum of the indexes of the three components is computed, and the result is taken as the service-level index. Likewise, if a cluster includes three components, the respective indexes of the three components can be acquired; the sum, average, maximum or minimum of the indexes of the three components is computed, and the result is taken as the cluster-level index.

Finally, the index results are stored in the time-series database OpenTSDB. A monitoring index query service and an alarm monitoring service can be provided on the data in OpenTSDB. In this embodiment, the alarm monitoring module may monitor the data and forward messages such as short messages and telephone calls.
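For orientation, the Python sketch below builds a data point in the JSON shape accepted by the OpenTSDB HTTP put endpoint (metric, timestamp, value, tags). The description does not say how the embodiment writes to OpenTSDB, so the use of the HTTP API is an assumption, and the metric name and tag keys are hypothetical. The sketch only builds and serializes the payload; it does not contact a database.

```python
import json
import time


def to_opentsdb_point(metric, value, tags, timestamp=None):
    """Build one data point as a dict with metric, timestamp, value and tags."""
    if not tags:
        raise ValueError("OpenTSDB requires at least one tag per data point")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return {"metric": metric, "timestamp": ts, "value": value, "tags": dict(tags)}


# Hypothetical service-level index after aggregation.
point = to_opentsdb_point(
    "service.cpu_usage.avg",
    60.0,
    {"cluster": "cluster-1", "service": "storage"},
    timestamp=1641268800,
)
payload = json.dumps([point])
```

Storing the cluster and service as tags keeps component-level, service-level and cluster-level indexes queryable with the same metric name and different tag filters.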

In this embodiment, the big data component itself may provide the JMX interface to collect the corresponding monitoring index.

In the process of acquiring the index data, acquisition may fail. FIG. 4 is a schematic diagram of index data acquisition in the present embodiment. In this embodiment, a monitoring acquisition module is deployed in each cluster. The monitoring acquisition module registers all component processes of the current cluster, starts a timed task that detects whether each process exists (using a process search command: ps aux | grep service name), and starts a timed port check that detects whether the service port of each big data component is available. It also supports reporting NODATA, where NODATA means that monitoring data cannot be acquired from a big data component; this generally reflects a high-level fault such as the big data component being blocked (the process exists and the port exists, but the service is unavailable).
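The process check and the port check can be approximated in Python as shown below. This is a sketch under assumptions: process_listed filters the text of a ps aux listing in Python instead of piping it through grep, process_exists assumes that a ps command is present on the host, and port_available treats a successful TCP connection as "available". All function names are hypothetical.

```python
import socket
import subprocess


def process_listed(ps_output, service_name):
    """Return True if any non-header line of a `ps aux` listing mentions the service name."""
    lines = ps_output.splitlines()[1:]  # skip the column header line
    return any(service_name in line for line in lines)


def process_exists(service_name):
    """Run `ps aux` and look for the service name (requires ps on the host)."""
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True, check=True)
    return process_listed(result.stdout, service_name)


def port_available(host, port, timeout=1.0):
    """Return True if a TCP connection to the service port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Filtering the listing in Python avoids the well-known side effect of ps aux | grep, where the grep process itself can appear in the result.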

As shown in FIG. 4, monitoring index data acquisition is mainly divided into two parts. The first part acquires JMX monitoring indexes by traversing the registered big data components. The second part executes machine monitoring scripts (for example, df -lh to acquire the disk occupancy rate) to read machine-dimension indexes; the monitoring indexes of the second part can be reported directly to the center.

The first part, acquiring JMX indexes, has two outcomes. If the monitoring indexes are acquired normally, the result is uploaded to the center as usual. If acquisition fails and data still cannot be acquired after repeated retries, it is necessary to detect whether the big data component process exists, because the failure may be caused by the process not existing. If the process does not exist, this is reported directly. If the process exists, it is necessary to detect whether the service port exists; if the service port does not exist, the service is unavailable, and this also needs to be reported. In the last case, both the port and the process exist but monitoring data cannot be acquired from JMX, so a NODATA index is reported, and manual intervention is required. NODATA is one of the most common problems in monitoring and one of the hardest to detect.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

According to another aspect of the embodiments of the present application, there is also provided a cluster data monitoring apparatus, as shown in fig. 5, including:

an obtaining module 502, configured to obtain index data of each component in each cluster, where each cluster includes multiple components, and each component corresponds to one index data;

an aggregation module 504, configured to aggregate the index data in each cluster according to a service dimension and a cluster dimension, so as to obtain cluster data of each cluster and service data of each service in each cluster;

a first monitoring module 506, configured to monitor each component in each cluster according to the index data in each cluster, monitor each service in each cluster according to the service data in each cluster, and monitor each cluster according to the cluster data of each cluster.

For other examples of this embodiment, please refer to the above examples, which are not described herein.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the cluster data monitoring method, where the electronic device may include a memory and a processor, the memory stores a computer program, and the processor is configured to execute the steps in the cluster data monitoring method through the computer program.

According to a further aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the steps of the cluster data monitoring method when running.

Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing one or more computer devices (which may be personal computers, servers, or network devices) to execute all or part of the steps of the method according to the embodiments of the present invention.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A cluster data monitoring method, characterized by comprising the following steps:

acquiring index data of each component in each cluster, wherein each cluster comprises a plurality of components, and each component corresponds to one piece of index data;

aggregating the index data in each cluster according to service dimensions and cluster dimensions to obtain cluster data of each cluster and service data of each service in each cluster;

monitoring each of the components in each of the clusters according to the index data in each of the clusters, monitoring each of the services in each of the clusters according to the service data in each of the clusters, and monitoring each of the clusters according to the cluster data of each of the clusters.

2. The method of claim 1, further comprising:

detecting the component process at intervals of a first time length;

3. The method of claim 2, further comprising:

4. The method of claim 3, further comprising:

5. The method of claim 1, further comprising:

6. The method according to any one of claims 1 to 5, wherein the acquiring the index data of each component in the cluster acquired by the acquisition module comprises:

7. The method of any of claims 1 to 5, wherein the monitoring each of the components in each of the clusters according to the index data in each of the clusters, and each of the services in each of the clusters according to the service data in each of the clusters, and each of the clusters according to the cluster data of each of the clusters comprises:

8. The method according to any one of claims 1 to 5, further comprising:

9. A cluster data monitoring apparatus, comprising:

an acquisition module, configured to acquire index data of each component in each cluster, wherein each cluster comprises a plurality of components, and each component corresponds to one piece of index data;

an aggregation module, configured to aggregate the index data in each cluster according to service dimensions and cluster dimensions to obtain cluster data of each cluster and service data of each service in each cluster;

a first monitoring module, configured to monitor each component in each cluster according to the index data in each cluster, monitor each service in each cluster according to the service data in each cluster, and monitor each cluster according to the cluster data of each cluster.

10. A computer-readable storage medium, in which a computer program is stored, which computer program, when running, performs the method of any one of claims 1 to 8.

11. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 8 by means of the computer program.
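
For illustration only, the three steps of the method of claim 1 (acquiring index data per component, aggregating it by service dimension and by cluster dimension, and monitoring at all three levels) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation described in this application: the `IndexData` record, the use of summation as the aggregation function, the threshold-based alarm rule, and the names `aggregate` and `monitor` are all hypothetical and do not appear in the source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexData:
    """One piece of index data reported by one component (hypothetical shape)."""
    cluster: str
    service: str
    component: str
    value: float  # assumed numeric metric, e.g. an error count


def aggregate(
    samples: List[IndexData],
) -> Tuple[Dict[Tuple[str, str], float], Dict[str, float]]:
    """Aggregate index data along the service and cluster dimensions.

    Summation is an assumption; the source does not name the aggregation function.
    """
    service_data: Dict[Tuple[str, str], float] = defaultdict(float)
    cluster_data: Dict[str, float] = defaultdict(float)
    for s in samples:
        service_data[(s.cluster, s.service)] += s.value  # service dimension
        cluster_data[s.cluster] += s.value               # cluster dimension
    return dict(service_data), dict(cluster_data)


def monitor(
    samples: List[IndexData],
    component_limit: float,
    service_limit: float,
    cluster_limit: float,
) -> List[str]:
    """Return alarm labels for components, services and clusters.

    A value strictly above its limit raises an alarm (assumed rule).
    """
    service_data, cluster_data = aggregate(samples)
    alarms: List[str] = []
    # Component level: checked against the raw index data.
    for s in samples:
        if s.value > component_limit:
            alarms.append(f"component {s.cluster}/{s.service}/{s.component}")
    # Service level: checked against the per-service aggregate.
    for (cluster, service), total in service_data.items():
        if total > service_limit:
            alarms.append(f"service {cluster}/{service}")
    # Cluster level: checked against the per-cluster aggregate.
    for cluster, total in cluster_data.items():
        if total > cluster_limit:
            alarms.append(f"cluster {cluster}")
    return alarms
```

In this sketch, `aggregate` plays the role of the aggregation module of claim 9 and `monitor` the role of the first monitoring module; how the index data is actually collected from each component is left out because the claims above do not specify it.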