CN109614283B - Monitoring system of distributed database cluster - Google Patents

Info

Publication number
CN109614283B
Authority
CN
China
Prior art keywords
monitoring
monitoring system
primary
management platform
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811244012.3A
Other languages
Chinese (zh)
Other versions
CN109614283A (en
Inventor
成思敏
赖超宇
祝鹏
崔伟
潘浩
高保庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Digital Life Technology Co Ltd
Original Assignee
Tianyi Digital Life Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Digital Life Technology Co Ltd
Priority to CN201811244012.3A
Publication of CN109614283A
Application granted
Publication of CN109614283B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system

Landscapes

  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a monitoring system for a distributed database cluster, comprising an administrator client, a monitoring management platform and a plurality of monitoring nodes. Each monitoring node is provided with a primary monitoring system, a secondary monitoring system and a database to be monitored, and the primary monitoring system monitors the database. The monitoring management platform sends the setting information of a monitoring task to the corresponding monitoring node. In the monitoring node, the secondary monitoring system executes the monitoring task against the primary monitoring system, evaluates the state of the primary monitoring system according to the task execution result information, and sends the state evaluation result to the monitoring management platform. On receiving a state evaluation result indicating that the primary monitoring system is abnormal, the monitoring management platform triggers an alarm mechanism and sends alarm information to the associated terminal and/or the administrator client. The embodiment of the invention addresses the problem that existing database monitoring systems cannot guarantee effective monitoring of a distributed database cluster.

Description

Monitoring system of distributed database cluster
Technical Field
The invention relates to the technical field of databases, in particular to a monitoring system of a distributed database cluster.
Background
In the internet field, most enterprises adopt a distributed database cluster as their data storage scheme. To guarantee high availability of the databases, a monitoring system is also deployed to monitor the services and health of the distributed database cluster.
In the process of implementing the invention, the inventors found the following problem in the prior art: existing monitoring systems mostly adopt a distributed structure in which a plurality of monitoring systems form a distributed monitoring cluster. When the distributed database cluster is large, it is difficult to ensure that the local scripts, remote agents and scheduled tasks of every monitoring system in the monitoring cluster all run normally, so the effectiveness of the monitoring cluster's monitoring of the distributed database cluster cannot be guaranteed.
Disclosure of Invention
Therefore, it is necessary to provide a monitoring system for a distributed database cluster, to solve the problem that existing monitoring systems cannot guarantee effective monitoring of the distributed database cluster.
The embodiment of the invention provides a monitoring system of a distributed database cluster, which comprises an administrator client, a monitoring management platform and a plurality of monitoring nodes, wherein each monitoring node is provided with a primary monitoring system, a secondary monitoring system and a database to be monitored, and the primary monitoring system is used for monitoring the database;
the administrator client is used for receiving a monitoring setting instruction and sending the monitoring setting instruction to the monitoring management platform; the monitoring setting instruction carries setting information of a monitoring node and setting information of a monitoring task;
the monitoring management platform is used for sending the setting information of the monitoring task to the corresponding monitoring node according to the setting information of the monitoring node;
in the corresponding monitoring node, the secondary monitoring system is used for receiving the setting information of the monitoring task sent by the monitoring management platform, executing the corresponding monitoring task against the primary monitoring system, evaluating the state of the primary monitoring system according to the task execution result information, and sending the state evaluation result to the monitoring management platform;
and the monitoring management platform is also used for triggering an alarm mechanism when receiving a state evaluation result indicating that the primary monitoring system is abnormal, and sending alarm information to the associated terminal and/or the administrator client through the alarm mechanism.
In an embodiment, the monitoring management platform is specifically configured to perform, according to the task execution result information, state evaluation on the primary monitoring system, where the state evaluation includes at least: monitoring index coverage rate evaluation, monitoring index version evaluation, monitoring index execution state evaluation and monitoring index execution result accuracy evaluation.
In an embodiment, the monitoring management platform is further configured to, after receiving a state evaluation result indicating that the primary monitoring system is abnormal, perform the abnormality repair corresponding to the type of abnormality present in the primary monitoring system.
In one embodiment, the abnormalities present in the primary monitoring system include: monitoring index coverage rate abnormality, monitoring index version abnormality, monitoring index execution state abnormality and/or monitoring index execution result accuracy abnormality.
In one embodiment, in each monitoring node, the primary monitoring system, the secondary monitoring system and the database to be monitored correspond to each other one by one.
In one embodiment, in the secondary monitoring system, a plurality of monitoring scripts corresponding to monitoring tasks are deployed, and the monitoring tasks correspond to the monitoring scripts one to one.
In one embodiment, in the secondary monitoring system, the monitoring script is deployed in an administrator user directory, and non-administrator users have no right to access the monitoring script.
In one embodiment, in the secondary monitoring system, the monitoring script is deployed in the form of a hidden file.
In one embodiment, the monitoring management platform is specifically configured to identify an abnormality level when receiving a state evaluation result indicating that the primary monitoring system is abnormal;
for a first abnormality level, trigger a short message alarm mechanism and send alarm information to the corresponding terminal through the short message alarm mechanism;
for a second abnormality level, trigger a mail + short message alarm mechanism and send alarm information to the corresponding terminal through the mail + short message alarm mechanism;
for a third abnormality level, trigger a short message + APP alarm mechanism and send alarm information to the corresponding terminal and the administrator client through the short message + APP alarm mechanism;
the urgency of the first, second and third abnormality levels increases in that order.
In one embodiment, the monitoring management platform is further configured to send the status evaluation result to the administrator client;
the administrator client is further used for displaying the state evaluation result.
One of the above technical solutions has the following advantages or beneficial effects: the monitoring system comprises an administrator client, a monitoring management platform and a plurality of monitoring nodes, where each monitoring node is provided with a primary monitoring system, a secondary monitoring system and a database to be monitored, and the primary monitoring system monitors the database. The databases to be monitored in the plurality of monitoring nodes form a distributed database cluster; the primary monitoring systems in the plurality of monitoring nodes form a distributed primary monitoring cluster; and the secondary monitoring systems in the plurality of monitoring nodes form a distributed secondary monitoring cluster. An auxiliary distributed secondary monitoring scheme is thereby provided, which comprehensively monitors the working states of both the distributed database cluster and the distributed primary monitoring cluster and further guarantees high availability of the distributed database cluster.
Drawings
FIG. 1 is a schematic block diagram of a monitoring system for a distributed database cluster, according to an embodiment;
FIG. 2 is a schematic structural diagram of a monitoring system of a distributed database cluster according to another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In the embodiment of the invention, the distributed database cluster comprises databases deployed in a plurality of different machine rooms. To ensure high availability of each database in the distributed database cluster, a traditional monitoring system deploys a monitoring cluster for the distributed database cluster; the monitoring cluster comprises a plurality of monitoring services, which monitor the services and health of the databases in the distributed database cluster. Such a monitoring system has the following problems: (1) when the distributed database cluster is large, it is difficult to ensure that the local scripts, remote agents and scheduled tasks of every monitoring service in the monitoring cluster all run normally, so the monitoring of the distributed database cluster by the monitoring cluster cannot be guaranteed; (2) when the monitoring in a monitoring cluster suffers from omissions, incorrect deployment, outdated versions, task execution failures or agent failures, the monitoring cluster becomes a blind spot for operation and maintenance and poses a serious potential threat to the service; (3) when the monitoring cluster is too large, there are too many monitoring problems, which brings a heavy operation and maintenance burden and excessive operation and maintenance cost; (4) the monitoring cluster itself needs to remain highly available, but lacks the necessary means to ensure its own robustness. It can be seen that the conventional monitoring system for a distributed database cluster still needs to be improved.
According to an embodiment of the present invention, as shown in FIG. 1, a monitoring system for a distributed database cluster is provided, which includes an administrator client, a monitoring management platform and a plurality of monitoring nodes. Each monitoring node is provided with a primary monitoring system, a secondary monitoring system and a database to be monitored, and the primary monitoring system monitors the database. The databases to be monitored in the plurality of monitoring nodes form the distributed database cluster; the primary monitoring systems in the plurality of monitoring nodes form a distributed primary monitoring system cluster; and the secondary monitoring systems in the plurality of monitoring nodes form a distributed secondary monitoring system cluster. In the embodiment of the invention, the monitoring management platform can be a server or a cloud server cluster; the primary monitoring system and the secondary monitoring system can each be a server deployed in the machine room where the database is located, which helps avoid delay in sending and receiving information.
The administrator client can be a webpage or an APP and is used for receiving a monitoring setting instruction and sending the monitoring setting instruction to the monitoring management platform; the monitoring setting instruction carries setting information of a monitoring node and setting information of a monitoring task;
the monitoring management platform is used for sending the setting information of the monitoring task to the corresponding monitoring node according to the setting information of the monitoring node;
in the corresponding monitoring node, the secondary monitoring system is used for receiving the setting information of the monitoring task sent by the monitoring management platform, executing the corresponding monitoring task against the primary monitoring system, evaluating the state of the primary monitoring system according to the task execution result information, and sending the state evaluation result to the monitoring management platform;
and the monitoring management platform is also used for triggering an alarm mechanism when receiving a state evaluation result indicating that the primary monitoring system is abnormal, and sending alarm information to the associated terminal and/or the administrator client through the alarm mechanism.
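The flow described above (instruction from the administrator client, routing by the platform, evaluation by the secondary monitoring system, alarm on abnormality) can be sketched as follows. This is only a minimal illustration, not the patented implementation: the class names, the `probe` callback and the `"ok"` result key are all hypothetical, and transport between components is replaced by direct method calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MonitoringInstruction:
    node_ids: List[str]    # setting information of the monitoring nodes
    task_settings: dict    # setting information of the monitoring task


@dataclass
class SecondaryMonitor:
    node_id: str
    # Hypothetical callback that runs the task against the primary monitoring
    # system and returns task execution result information as a dict.
    probe: Callable[[dict], dict]

    def run(self, task_settings: dict) -> dict:
        result = self.probe(task_settings)
        # Assumption: the result carries an "ok" flag; anything else is abnormal.
        abnormal = not result.get("ok", False)
        return {"node_id": self.node_id, "abnormal": abnormal, "detail": result}


@dataclass
class ManagementPlatform:
    nodes: Dict[str, SecondaryMonitor]
    alarms: List[str] = field(default_factory=list)

    def handle(self, instruction: MonitoringInstruction) -> List[dict]:
        evaluations = []
        for node_id in instruction.node_ids:
            # Route the task settings to the node named in the instruction.
            evaluation = self.nodes[node_id].run(instruction.task_settings)
            if evaluation["abnormal"]:
                # Stand-in for the alarm mechanism.
                self.alarms.append(
                    f"primary monitoring system on {node_id} is abnormal"
                )
            evaluations.append(evaluation)
        return evaluations
```

In a real deployment the alarm list would be replaced by the short message, mail or APP channels, and the evaluations would also be forwarded to the administrator client for display.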
In the monitoring system of the distributed database cluster in the embodiment, the administrator client, the monitoring management platform and the plurality of monitoring nodes are arranged in the monitoring system, each monitoring node is provided with a primary monitoring system, a secondary monitoring system and a database to be monitored, and the primary monitoring system is used for monitoring the database. A database to be monitored in the plurality of monitoring nodes forms a distributed database cluster, and a primary monitoring system in the plurality of monitoring nodes forms a distributed primary monitoring cluster; and the secondary monitoring systems in the plurality of monitoring nodes form a distributed secondary monitoring cluster. Therefore, an auxiliary distributed secondary monitoring scheme is provided, the working states of the distributed database cluster and the distributed primary monitoring cluster are comprehensively monitored, and high availability of the distributed database cluster is further guaranteed.
Further, according to an embodiment of the present invention, the monitoring management platform is specifically configured to perform, according to the task execution result information, state evaluation on the primary monitoring system, where the state evaluation includes at least: monitoring index coverage rate evaluation, monitoring index version evaluation, monitoring index execution state evaluation and monitoring index execution result accuracy evaluation.
In a traditional monitoring system, when the monitoring indexes of a monitoring service are abnormal, for example through omission, incorrect deployment, an outdated version, task execution failure or agent failure, the distributed primary monitoring cluster becomes a blind spot for operation and maintenance and poses a serious potential threat to the service. In the embodiment of the invention, the monitoring indexes of every primary monitoring system in the primary monitoring cluster are monitored by the secondary monitoring systems in the distributed secondary monitoring cluster. Problems such as missing monitoring indexes, outdated monitoring index versions, monitoring indexes that are not running effectively and inaccurate monitoring index execution results can therefore be found in time, the affected primary monitoring system can be repaired in time, and the monitoring of the distributed database cluster is further improved.
Further, according to an embodiment of the present invention, the monitoring management platform is further configured to, after receiving a state evaluation result that the primary monitoring system has an abnormality, perform corresponding abnormality repair on the primary monitoring system according to an abnormality type of the primary monitoring system.
In some scenarios, the abnormalities present in the primary monitoring system include: monitoring index coverage rate abnormality, monitoring index version abnormality, monitoring index execution state abnormality and/or monitoring index execution result accuracy abnormality.
In some scenarios, the monitoring indexes in the primary monitoring system include at least one of: collection programs, scheduled tasks, database middleware services, process concurrency, monitoring strategies, permission configuration, alarm channels, database backup, database replication and service self-healing.
Depending on actual needs, the monitoring management platform can also perform the following state evaluations on the primary monitoring system according to the task execution result information: log evaluation, disk evaluation, time synchronization evaluation, data synchronization evaluation and scheduled task evaluation. Correspondingly, the abnormalities present in the primary monitoring system may also include other cases, such as log abnormality, disk abnormality, time synchronization abnormality, data synchronization abnormality and scheduled task abnormality, which are not listed exhaustively here.
In a traditional monitoring system, the distributed monitoring cluster cannot be repaired automatically when it becomes abnormal; when the distributed monitoring cluster is too large, repairing it places a heavy burden on operation and maintenance personnel and makes the operation and maintenance cost excessive.
Further, according to an embodiment of the present invention, in each monitoring node the primary monitoring system, the secondary monitoring system and the database to be monitored are in one-to-one correspondence. For a distributed database cluster that spans machine rooms and multiple data centers, this helps to quickly locate the primary monitoring system and/or database in which an abnormality has occurred.
Furthermore, a plurality of monitoring scripts corresponding to the monitoring tasks may be deployed in the secondary monitoring system, with the monitoring tasks and the monitoring scripts in one-to-one correspondence. A monitoring script can be started through the Timer mechanism of Java to execute the corresponding monitoring task.
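As an illustration of the one-to-one task-to-script mapping and timer-based start, the sketch below uses Python's standard `sched` module in place of the Java Timer named in the text; the substitution is purely for illustration. The class name `TaskScheduler` and the task names are hypothetical.

```python
import sched
import time


class TaskScheduler:
    """Keeps a one-to-one mapping from monitoring task to monitoring script."""

    def __init__(self, timefunc=time.monotonic, delayfunc=time.sleep):
        # The clock functions are injectable so the scheduler can be driven
        # by a simulated clock as well as by real time.
        self._scheduler = sched.scheduler(timefunc, delayfunc)
        self._scripts = {}

    def register(self, task_name, script):
        # Enforce the one-to-one correspondence: one script per task name.
        if task_name in self._scripts:
            raise ValueError(f"task {task_name!r} already has a script")
        self._scripts[task_name] = script

    def schedule(self, task_name, delay_seconds):
        # Start the script for this task after the given delay.
        self._scheduler.enter(delay_seconds, 1, self._scripts[task_name])

    def run_pending(self):
        # Blocks until every scheduled script has been executed.
        self._scheduler.run()
```

With the default arguments the scheduler uses the real monotonic clock and `time.sleep`; a recurring task could re-schedule itself at the end of its script.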
In some scenarios, the monitoring script is deployed in the directory of the administrator user, and non-administrator users have no permission to access it, which improves the security of the monitoring scripts in the secondary monitoring system. Furthermore, the monitoring script can be deployed as a hidden file, for example with a file name that starts with a dot; the monitoring log is likewise written to a hidden file whose name starts with a dot. This further improves the security of the monitoring scripts in the secondary monitoring system.
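A deployment step of this kind might look like the sketch below, which assumes a POSIX file system where a leading dot hides a file from default directory listings. The function name, the `.sh` suffix and the choice of mode `0o700` are illustrative assumptions; the access restriction comes from the permission bits, while the dot prefix only affects visibility.

```python
import os
import stat
from pathlib import Path


def deploy_hidden_script(directory, task_name, source):
    """Write a monitoring script as a dot-prefixed file only its owner can use."""
    path = Path(directory) / f".{task_name}.sh"   # leading dot: hidden file
    path.write_text(source, encoding="utf-8")
    # Owner may read, write and execute; group and others get no access.
    os.chmod(path, stat.S_IRWXU)
    return path
```

Calling this as the administrator user with that user's home directory as `directory` gives a script that other non-privileged users can neither read nor execute.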
According to an embodiment of the present invention, the monitoring management platform is specifically configured to identify an abnormality level when receiving a state evaluation result indicating that the primary monitoring system is abnormal. For a first abnormality level, it triggers a short message alarm mechanism and sends alarm information to the corresponding terminal through it. For a second abnormality level, it triggers a mail + short message alarm mechanism and sends alarm information to the corresponding terminal through it. For a third abnormality level, it triggers a short message + APP alarm mechanism and sends alarm information to the corresponding terminal and the administrator client through it. The urgency of the first, second and third abnormality levels increases in that order.
When the secondary monitoring system finds an abnormal condition in the primary monitoring system, the alarm notification, and the recovery notification once the problem is solved, can be sent through three channels: short message, mail and APP voice. For example, an abnormal condition of low urgency can be sent to the relevant responsible person by mail; one of medium urgency can be sent by mail + short message; and one of high urgency can be sent by short message + APP voice. In addition, different alarm modes can use different alarm frequencies; for example, a short message alarm can be sent once every 12 hours and an APP alarm once every 10 minutes, to remind the relevant responsible persons to handle the abnormal condition of the primary monitoring system in time.
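The example routing in the preceding paragraph can be written as a small lookup table. The sketch below uses that paragraph's example values (mail for low urgency, mail + short message for medium, short message + APP voice for high, with repeats every 12 hours for short messages and every 10 minutes for APP alarms). The names `Urgency`, `CHANNELS` and `plan_alarm` are hypothetical, and the text gives no repeat interval for mail, so none is assumed here.

```python
from enum import Enum


class Urgency(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Channels per urgency level, following the example in the text.
CHANNELS = {
    Urgency.LOW: ("mail",),
    Urgency.MEDIUM: ("mail", "sms"),
    Urgency.HIGH: ("sms", "app_voice"),
}

# Repeat intervals in seconds, from the example frequencies in the text.
REPEAT_SECONDS = {
    "sms": 12 * 60 * 60,    # once every 12 hours
    "app_voice": 10 * 60,   # once every 10 minutes
}


def plan_alarm(urgency):
    """Return (channel, repeat interval or None) pairs for an urgency level."""
    return [(channel, REPEAT_SECONDS.get(channel)) for channel in CHANNELS[urgency]]
```

Because the routing lives in plain dictionaries, a different assignment of channels to levels only requires changing the table, not the function.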
According to an embodiment of the present invention, the monitoring management platform is further configured to send the state evaluation result to the administrator client, and the administrator client is further used for displaying the state evaluation result. In some scenarios, the administrator client displays the real-time state and trend reports of the primary monitoring system through a GUI, so that the relevant responsible personnel can see the current state of the distributed primary monitoring cluster clearly and intuitively, prioritize, evaluate and analyze problems in time, and carry out self-service one-key registration, configuration, updating, deployment and the like through the GUI.
Further, as shown in FIG. 2, the structure of the monitoring system of the distributed database cluster according to the embodiment of the present invention may be divided into three parts: a secondary monitoring system cluster (comprising a plurality of secondary monitoring systems), a primary monitoring system cluster (comprising a plurality of primary monitoring systems) and a database cluster (comprising a plurality of databases). The three parts are deployed in different machine rooms, and the secondary monitoring systems, primary monitoring systems and databases are in one-to-one correspondence. Monitoring data updated in any machine room can be synchronized to the monitoring databases of the other machine rooms, so the data in every machine room remains consistent. With the monitoring system of this embodiment, the distributed database cluster can be monitored along four dimensions, namely monitoring effectiveness, automatic repair, robustness and monitoring quality evaluation, which indirectly guarantees the normal and reliable operation of the distributed database cluster.
Monitoring effectiveness means that the monitoring system of the distributed database cluster can monitor, in real time, whether the coverage rate, version freshness, monitoring task execution efficiency and accuracy of the monitoring indexes in the primary monitoring system reach the set thresholds.
Automatic repair means that, with the monitoring system of this embodiment, once a monitoring index in the primary monitoring system becomes abnormal, the specific problem point can be located quickly and the monitoring system can repair it automatically according to preset rules: it handles the problem of the primary monitoring system and raises an alarm, and then restores the monitoring function of the primary monitoring system by adding, updating or restarting a task.
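One way to express such preset rules is a table from abnormality type to repair action. The text names the three repair forms (adding, updating or restarting a task) but does not say which abnormality maps to which form, so the mapping below is an assumption made for illustration, as are all identifiers; abnormality types without a rule fall back to an alarm only.

```python
# Assumed preset rules: which repair form handles which abnormality type.
REPAIR_ACTIONS = {
    "coverage": "add_task",             # index missing: add the task
    "version": "update_task",           # index outdated: update the task
    "execution_state": "restart_task",  # index not running: restart the task
}


def plan_repairs(anomalies):
    """Map (index name, abnormality type) pairs to repair actions."""
    plan = []
    for index_name, anomaly_type in anomalies:
        # Types with no preset rule are only alarmed, not repaired.
        action = REPAIR_ACTIONS.get(anomaly_type, "alarm_only")
        plan.append({"index": index_name, "anomaly": anomaly_type, "action": action})
    return plan
```

For example, `plan_repairs([("backup", "coverage")])` proposes adding the missing backup monitoring task.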
Robustness means that, with the monitoring system of this embodiment, when the monitoring of a cross-machine-room, multi-center distributed database cluster is maliciously damaged, or tasks are deleted or modified, the secondary monitoring system can quickly repair the monitoring system as a whole, thereby ensuring the robustness of the entire monitoring system and indirectly ensuring reliable operation of the multi-data-center distributed database.
Monitoring quality evaluation means that, with the monitoring system of this embodiment, the coverage rate, version freshness, execution state and accuracy of the monitoring indexes in the primary monitoring system are evaluated quantitatively, and the result serves as a decision basis for upgrading the primary monitoring system.
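The text does not give formulas for these four quantities, so the sketch below assumes simple ratios: coverage is the share of required indexes that are deployed, and the other three are shares of the deployed indexes that are on the latest version, running, and reporting a value equal to the real state. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IndexStatus:
    name: str
    version: str          # version deployed on the primary monitoring system
    running: bool         # whether the monitoring index is executing
    reported: object      # value reported by the primary monitoring system
    actual: object        # real state observed by the secondary system


def _rate(count, total):
    return count / total if total else 0.0


def evaluate(required, statuses):
    """required maps index name to latest version; statuses lists deployed indexes."""
    deployed = [s for s in statuses if s.name in required]
    n = len(deployed)
    return {
        "coverage": _rate(n, len(required)),
        "version_freshness": _rate(
            sum(s.version == required[s.name] for s in deployed), n),
        "execution": _rate(sum(s.running for s in deployed), n),
        "accuracy": _rate(sum(s.reported == s.actual for s in deployed), n),
    }


def below_threshold(scores, thresholds):
    """Names of the dimensions whose score is under its configured threshold."""
    return sorted(k for k, v in scores.items() if v < thresholds.get(k, 0.0))
```

The dimensions returned by `below_threshold` are the ones that would feed an alarm or an upgrade decision for the primary monitoring system.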
The execution state of a monitoring index covers compliance, timeliness and consistency. Compliance means whether the monitoring method of the monitoring index conforms to the configured rules; because the statistical analysis function of the system is based on these rules, a mismatch during monitoring also affects the subsequent statistical analysis. Timeliness means whether the information acquired by the monitoring index is the latest information, which is very important for a monitoring system that safeguards a database. Consistency means whether the content and version of all monitoring methods are consistent; if the monitoring methods on some servers do not match the latest version, this affects not only the accuracy of data acquisition but also the diagnosis and localization of problems in database cluster operation.
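The three execution-state checks could be combined as in the sketch below. The concrete criteria are assumptions made for illustration: compliance is tested as membership in a set of configured methods, timeliness as a maximum sample age, and consistency as every server running the latest version of the monitoring method. The function and parameter names are hypothetical.

```python
def check_execution_state(method, allowed_methods, sample_age_s, max_age_s,
                          server_versions, latest_version):
    """Evaluate compliance, timeliness and consistency of one monitoring index."""
    # Servers whose monitoring method does not match the latest version.
    stale_servers = sorted(
        server for server, version in server_versions.items()
        if version != latest_version
    )
    return {
        "compliant": method in allowed_methods,   # matches configured rules
        "timely": sample_age_s <= max_age_s,      # data is recent enough
        "consistent": not stale_servers,          # all servers on latest version
        "stale_servers": stale_servers,
    }
```

Returning the list of stale servers alongside the flags helps locate which machine room needs an update.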
Accuracy means whether the monitoring result obtained by the primary monitoring system is consistent with the current real state of the database: if it is, the monitoring result of the primary monitoring system is accurate; otherwise it is inaccurate.
In some scenarios, the monitoring tasks to be performed by the secondary monitoring system include at least one of: a dashboard management task, an asset management task, a project management task, a program resource pool task, a quality evaluation task, a monitoring index library task, a matching rule base task, a monitoring report task, an automation management task and a system configuration task.
The dashboard displays the monitoring results of the primary monitoring system as multi-dimensional radar charts and trend sketches; the dashboard management task monitors the dashboard function of the secondary monitoring system.
Asset management (ITIL) covers the configuration information of the distributed database cluster and middleware IPs, service version configuration, operating system configuration, and personnel roles and permissions; the asset management task monitors the asset management function of the secondary monitoring system.
Project management refers to the configuration of the projects, sub-projects, affiliations and project permission roles of a database; the project management task monitors the project management function of the primary monitoring system.
The program resource pool is the set of scripts, programs and dictionaries required by the secondary monitoring system; the program resource pool task monitors the program resource pool function of the secondary monitoring system.
The quality evaluation task evaluates the trends of the primary monitoring system along the dimensions of monitoring index coverage rate, version freshness, execution efficiency and execution result accuracy.
The monitoring index library is the set of monitoring indexes that the primary monitoring system requires to monitor the database; the monitoring index library task monitors the state of the monitoring indexes of the primary monitoring system.
The matching rule base is the standard library for managing the monitoring indexes corresponding to the primary monitoring system; it is used to match all the monitoring indexes required by the primary monitoring system and output them to the primary monitoring system. In the matching rule base task, the secondary monitoring system monitors the state of the local matching rule base.
In the monitoring report task, the secondary monitoring system monitors the comprehensive analysis report function of the primary monitoring system.
Self-service management means that an administrator can perform self-service one-key registration, configuration, updating and deployment of the monitoring system through the functions of the secondary monitoring system.
Automation management covers the configuration of automation strategies, the self-healing of problematic monitoring indexes and the statistical analysis of historical logs.
System configuration automatically configures the system permissions, resources and permission roles of the secondary monitoring system.
In some scenarios, the monitoring indexes required by the primary monitoring system include: acquisition programs, timing tasks, agent services, process concurrency, monitoring strategies, authority configuration, alarm channels, database backup, heartbeat detection, database replication, service self-healing and performance self-healing, as well as the monitoring scripts of remote monitoring tasks, host survival monitoring tasks, host disk monitoring tasks and time synchronization monitoring tasks. The purpose of the secondary monitoring system is to judge whether these monitoring indexes are abnormal, whether their versions are new or old, their execution state, and the accuracy of their execution results, so that a strategy can be formulated to handle the abnormal monitoring indexes in the primary monitoring system.
The alarm and recovery notification rules of the secondary monitoring system define how an alarm notification is sent when the secondary monitoring system finds an abnormal condition in the primary monitoring system, and how a recovery notification is sent after the problem is solved. Three channels are included: short message, mail and APP voice, and different alarm frequencies can be adopted for different alarm modes. A system condition with a low emergency level can be sent to the relevant responsible person by mail; a system condition with a medium emergency level can be sent by mail + short message; a system condition with a high emergency level can be sent by short message + APP voice, and the sending frequency of the APP information varies with the degree of emergency. Meanwhile, the secondary monitoring system also records the sending status of the short messages, mails and APP messages, such as the sending time and whether the sending succeeded, in the form of a log file, whose file name may begin with a dot.
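The channel selection described above can be illustrated with a minimal Python sketch. The level names, channel identifiers, and the `send` callable are assumptions introduced only for illustration; the patent does not specify an implementation.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

# Channels per emergency level, following the description:
# low -> mail, medium -> mail + short message, high -> short message + APP voice.
# The string identifiers themselves are hypothetical.
CHANNELS_BY_LEVEL: Dict[str, Tuple[str, ...]] = {
    "low": ("mail",),
    "medium": ("mail", "sms"),
    "high": ("sms", "app_voice"),
}


def channels_for(level: str) -> Tuple[str, ...]:
    """Return the notification channels for an emergency level."""
    if level not in CHANNELS_BY_LEVEL:
        raise ValueError(f"unknown emergency level: {level!r}")
    return CHANNELS_BY_LEVEL[level]


def notify(level: str, message: str,
           send: Callable[[str, str], bool]) -> List[dict]:
    """Send `message` on every channel for `level` and return send records.

    `send(channel, message)` is a hypothetical transport hook that returns
    True on success. Each record holds the sending time, the channel and the
    outcome, mirroring the sending log mentioned in the text.
    """
    records = []
    for channel in channels_for(level):
        ok = send(channel, message)
        records.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "channel": channel,
            "success": ok,
        })
    return records
```

In this sketch the same function could serve both alarm and recovery notifications, since only the message text differs.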
The secondary monitoring system acquires abnormal information (uncovered indexes, inconsistent versions, abnormal cluster processes, and inaccurate data logs and data) by collecting data matched against the rules, and obtains a database cluster result for a given time range through its own operation. By comparing the database cluster result acquired by the secondary monitoring system with that acquired by the primary monitoring system, the detection accuracy, compliance rate, consistency, timeliness, coverage rate and the like of the primary monitoring system can be judged, and a visual monitoring analysis trend graph for the primary monitoring system is thereby obtained.
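As a rough illustration of this comparison, the following Python sketch computes a coverage rate and an accuracy figure from two result sets. The function name, the dictionary layout, and the choice of exact equality as the agreement test are assumptions; the patent does not define how these measures are calculated.

```python
from typing import Dict, Optional, Set


def evaluate_primary(required: Set[str],
                     primary: Dict[str, str],
                     secondary: Dict[str, str]) -> dict:
    """Compare results reported by the primary system with results that the
    secondary system collected independently for the same time range.

    required  : index names expected by the matching rule base (assumed input)
    primary   : index name -> result reported by the primary monitoring system
    secondary : index name -> result collected by the secondary monitoring system
    """
    covered = required & set(primary)
    coverage_rate = len(covered) / len(required) if required else 1.0

    # Only indexes seen by both systems can be checked for agreement.
    comparable = covered & set(secondary)
    agreeing = {name for name in comparable if primary[name] == secondary[name]}
    accuracy: Optional[float] = (
        len(agreeing) / len(comparable) if comparable else None
    )

    return {
        "coverage_rate": coverage_rate,
        "accuracy": accuracy,
        "uncovered": sorted(required - covered),
        "mismatched": sorted(comparable - agreeing),
    }
```

Storing one such result per evaluation run would give the series of points from which a trend graph can be drawn.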
The secondary monitoring report is based on monitoring indexes such as acquisition programs, timing tasks, agent services, process concurrency, monitoring strategies, authority configuration, alarm channels, database backup, heartbeat detection, database replication, service self-healing and performance self-healing, and on the monitoring scripts of remote monitoring tasks, host survival monitoring tasks, host disk monitoring tasks, time synchronization monitoring tasks and the like. It counts and summarizes monitoring index abnormalities and repair details (whether the monitoring exists, is the latest and is feasible; the specific time of the abnormality and of the repair, accurate to the second; and the abnormality repair logs and schemes) by time period (day, week, month, half year and year) and along the dimensions of database center, project, sub-project, server application, database and middleware.
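The grouping by time period and dimension can be sketched as follows. This is only an assumed illustration: the event fields `project` and `detected_at` are hypothetical, only one dimension (project) is shown, and only the day, month and year periods are covered.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple

# strftime patterns used to bucket a timestamp into a reporting period.
PERIOD_FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def summarize_anomalies(events: Iterable[dict],
                        period: str = "day") -> Dict[Tuple[str, str], int]:
    """Count anomaly events per (project, period bucket).

    Each event is a dict with a hypothetical layout:
      {"project": "...", "detected_at": "YYYY-MM-DDTHH:MM:SS"}
    """
    fmt = PERIOD_FORMATS[period]
    counts: Counter = Counter()
    for event in events:
        detected = datetime.fromisoformat(event["detected_at"])
        counts[(event["project"], detected.strftime(fmt))] += 1
    return dict(counts)
```

Other dimensions named in the text, such as sub-project or middleware, could be added to the grouping key in the same way.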
The monitoring tasks that the secondary monitoring system performs on the primary monitoring system cover: acquisition programs, timing tasks, agent services, process concurrency, monitoring strategies, authority configuration, alarm channels, database backup, heartbeat detection, database replication, service self-healing and performance self-healing, as well as the monitoring scripts of remote monitoring tasks, host survival monitoring tasks, host disk monitoring tasks and time synchronization monitoring tasks. The scripts of these monitoring tasks can be separated by service and deployed on the specific related servers or service programs. The execution frequency of the script for each monitoring task may be set individually, for example, every n minutes.
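A per-task execution frequency can be modelled with a simple due-check, sketched below in Python. The task names and the use of minutes as the time unit are assumptions for illustration; a real deployment would drive this from a timer or scheduler.

```python
from typing import Dict, List


def due_tasks(now_min: int,
              interval_min: Dict[str, int],
              last_run_min: Dict[str, int]) -> List[str]:
    """Return the names of tasks whose individual interval has elapsed.

    now_min      : current time in minutes (any monotonic minute counter)
    interval_min : task name -> execution interval n, in minutes
    last_run_min : task name -> minute of the last run (absent if never run)
    """
    due = []
    for name, interval in interval_min.items():
        last = last_run_min.get(name)
        # A task that has never run is always due.
        if last is None or now_min - last >= interval:
            due.append(name)
    return sorted(due)
```

For example, with intervals of 1, 5 and 60 minutes for hypothetical heartbeat, disk and backup tasks, each task becomes due independently of the others.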
The monitoring tasks of the secondary monitoring system are all deployed in the secondary monitoring system program and run on a schedule through a remote agent and a client: the agent or client receives a timed task instruction, executes it, and returns the result to the corresponding module of the secondary monitoring system.
The scripts used by the monitoring tasks of the secondary monitoring system are all deployed under a dedicated secondary monitoring user directory. Users other than the dedicated secondary monitoring user generally have no permission to switch to this directory, which improves the security of the secondary monitoring scripts to a certain extent. At the program level, the scripts are started by Timer timing tasks in Java. The files generated by these scripts exist in the form of hidden files, that is, their file names begin with a dot; for example, the monitoring log is a hidden file whose name begins with a dot. When a monitoring script or monitoring task is missing or is not the latest version, the agent supplements it promptly and, according to the rules, adds it to a temporary task list of the secondary monitoring system, so that staff can subsequently process it, match it and add it to the normal task list.
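The check for missing or outdated scripts, and the dot-prefixed naming of generated files, might look like the following Python sketch. The version strings, the equality test for "latest", and the `.log` suffix are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def build_temporary_tasks(expected: Dict[str, str],
                          deployed: Dict[str, str]) -> List[Tuple[str, str]]:
    """List scripts that need to be supplemented, with the reason.

    expected : script name -> version required by the rules (assumed input)
    deployed : script name -> version currently found on the node
    """
    temporary = []
    for name in sorted(expected):
        found = deployed.get(name)
        if found is None:
            temporary.append((name, "missing"))
        elif found != expected[name]:
            # Any version other than the expected one is treated as outdated.
            temporary.append((name, "outdated"))
    return temporary


def hidden_log_name(task_name: str) -> str:
    """Build a hidden (dot-prefixed) log file name for a monitoring task."""
    return "." + task_name.lstrip(".") + ".log"
```

The returned list plays the role of the temporary task list; moving an entry to the normal task list after review is left out of the sketch.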
The secondary monitoring system also has a machine learning function. By learning and analyzing the monitoring tasks of the secondary monitoring system in combination with the result information returned by monitoring, the machine learning function proposes reasonable optimizations of the existing monitoring tasks, such as adding new monitoring tasks and removing monitoring tasks that do not meet the specifications. It also learns and analyzes the timing tasks of the secondary monitoring system so that their execution frequency and execution time are reasonably optimized, in order to achieve the best monitoring effect.
It should be understood that, for the monitoring tasks executed by the aforementioned secondary monitoring system, the monitoring tasks may be executed according to an instruction of the administrator client, or may be executed at regular time according to a set time, and the execution sequence and the execution frequency of the multiple monitoring tasks may be set according to an actual situation and are automatically executed by the secondary monitoring system.
The monitoring system of the distributed database cluster in the embodiment can ensure the effectiveness, automatic repair, integrity and robustness of the original monitoring system of the distributed database cluster, and indirectly ensure the normal and reliable service of the distributed database cluster.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of the present specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The terms "comprises" and "comprising," as well as any variations thereof, in the embodiments herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) elements is not limited to only those steps or elements but may optionally include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
References to "first/second" herein merely distinguish between similar objects and do not denote a particular ordering of the objects. It should be understood that objects distinguished by "first/second" may, where permissible, be interchanged in order or sequence, so that the embodiments described herein may be practiced in sequences other than those illustrated or described herein.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A monitoring system of a distributed database cluster is characterized by comprising an administrator client, a monitoring management platform and a plurality of monitoring nodes, wherein each monitoring node is provided with a primary monitoring system, a secondary monitoring system and a database to be monitored, and the primary monitoring system is used for monitoring the database;
the administrator client is used for receiving a monitoring setting instruction and sending the monitoring setting instruction to the monitoring management platform; the monitoring setting instruction carries setting information of a monitoring node and setting information of a monitoring task;
the monitoring management platform is used for sending the setting information of the monitoring task to the corresponding monitoring node according to the setting information of the monitoring node;
in the corresponding monitoring node, the secondary monitoring system is used for receiving the setting information of the monitoring task sent by the monitoring management platform, executing the monitoring task corresponding to the setting information of the monitoring task and aiming at the primary monitoring system, performing state evaluation on the primary monitoring system according to the task execution result information, and sending the state evaluation result to the monitoring management platform;
the monitoring management platform is further used for triggering an alarm mechanism when receiving a state evaluation result that the primary monitoring system has abnormality, and sending alarm information to a related terminal and/or the administrator client through the alarm mechanism;
the monitoring management platform is specifically configured to, according to the task execution result information, perform state evaluation on the primary monitoring system at least including: monitoring index coverage rate evaluation, monitoring index version evaluation, monitoring index execution state evaluation and monitoring index execution result accuracy evaluation.
2. The monitoring system according to claim 1, wherein the monitoring management platform is further configured to, after receiving a state evaluation result that the primary monitoring system has an abnormality, perform corresponding abnormality repair on the primary monitoring system according to the type of the abnormality present in the primary monitoring system.
3. The monitoring system of claim 2, wherein the anomalies present in the primary monitoring system include: monitoring index coverage rate abnormity, monitoring index version abnormity, monitoring index execution state abnormity and/or monitoring index execution result accuracy abnormity.
4. The monitoring system of claim 3, wherein the monitoring indexes in the primary monitoring system include at least one of: an acquisition program, a timing task, a database middleware service, process concurrency, a monitoring strategy, authority configuration, an alarm channel, database backup, database replication and service self-healing.
5. The monitoring system according to claim 1, wherein in each monitoring node, the primary monitoring system, the secondary monitoring system and the database to be monitored are in one-to-one correspondence.
6. The monitoring system according to any one of claims 1 to 5, wherein a plurality of monitoring scripts corresponding to the monitoring tasks are deployed in the secondary monitoring system, and the monitoring tasks correspond to the monitoring scripts one by one.
7. The monitoring system of claim 6, wherein in the secondary monitoring system, the monitoring script is deployed under an administrator user directory, and non-administrator users have no authority to access the monitoring script.
8. The monitoring system according to claim 6 or 7, wherein in the secondary monitoring system, the monitoring script is deployed in the form of a hidden file.
9. The monitoring system according to any one of claims 1 to 5 and 7, wherein the monitoring management platform is specifically configured to, when a state evaluation result that the primary monitoring system is abnormal is received, identify an abnormality level;
triggering a short message alarm mechanism aiming at the first abnormal level, and sending alarm information to a corresponding terminal through the short message alarm mechanism;
triggering an alarm mechanism of the mail and the short message aiming at the second abnormal level, and sending alarm information to a corresponding terminal through the alarm mechanism of the mail and the short message;
triggering an alarm mechanism of short message + APP aiming at a third abnormal grade, and sending alarm information to a corresponding terminal and the administrator client through the alarm mechanism of short message + APP;
the emergency degree of the first abnormal grade, the second abnormal grade and the third abnormal grade is increased in sequence.
10. The monitoring system according to claim 1, wherein
the monitoring management platform is further used for sending the state evaluation result to the administrator client;
the administrator client is further used for displaying the state evaluation result.
CN201811244012.3A 2018-10-24 2018-10-24 Monitoring system of distributed database cluster Active CN109614283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811244012.3A CN109614283B (en) 2018-10-24 2018-10-24 Monitoring system of distributed database cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811244012.3A CN109614283B (en) 2018-10-24 2018-10-24 Monitoring system of distributed database cluster

Publications (2)

Publication Number Publication Date
CN109614283A CN109614283A (en) 2019-04-12
CN109614283B true CN109614283B (en) 2022-04-08

Family

ID=66001945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811244012.3A Active CN109614283B (en) 2018-10-24 2018-10-24 Monitoring system of distributed database cluster

Country Status (1)

Country Link
CN (1) CN109614283B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111190888A (en) * 2020-01-03 2020-05-22 中国建设银行股份有限公司 Method and device for managing graph database cluster
CN111586129A (en) * 2020-04-28 2020-08-25 北京奇艺世纪科技有限公司 Alarm method and device for data synchronization, electronic equipment and storage medium
CN114070858B (en) * 2020-07-31 2024-07-02 中移(苏州)软件技术有限公司 Data processing method and device, equipment and storage medium
CN112491858B (en) * 2020-11-20 2023-05-30 北京百度网讯科技有限公司 Method, device, equipment and storage medium for detecting abnormal information
CN112559519A (en) * 2020-12-09 2021-03-26 北京红山信息科技研究院有限公司 Big data cluster management system
CN112631297A (en) * 2020-12-18 2021-04-09 上海商汤临港智能科技有限公司 Monitoring system, monitoring method, intelligent driving device, computer device, and medium
CN113342418B (en) * 2021-06-24 2022-11-22 国网黑龙江省电力有限公司 Distributed machine learning task unloading method based on block chain
CN114461449A (en) * 2022-01-21 2022-05-10 浪潮卓数大数据产业发展有限公司 Multi-source data backup method and system based on big data platform
CN115759734B (en) * 2022-10-19 2024-01-12 国网物资有限公司 Index-based power service supply chain monitoring method, device, equipment and medium
CN115994044B (en) * 2023-01-09 2023-06-13 苏州浪潮智能科技有限公司 Database fault processing method and device based on monitoring service and distributed cluster

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101222650A (en) * 2007-01-10 2008-07-16 中兴通讯股份有限公司 Multi-redundancy monitoring method and system
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
CN105915405A (en) * 2016-03-29 2016-08-31 深圳市中博科创信息技术有限公司 Large-scale cluster node performance monitoring system
CN106100938A (en) * 2016-08-19 2016-11-09 浪潮(北京)电子信息产业有限公司 The monitoring of a kind of distributed cluster system and alarm method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962616B2 (en) * 2005-08-11 2011-06-14 Micro Focus (Us), Inc. Real-time activity monitoring and reporting

Also Published As

Publication number Publication date
CN109614283A (en) 2019-04-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220121

Address after: Room 1423, No. 1256 and 1258, Wanrong Road, Jing'an District, Shanghai 200040

Applicant after: Tianyi Digital Life Technology Co.,Ltd.

Address before: 1 / F and 2 / F, East Garden, Huatian International Plaza, 211 Longkou Middle Road, Tianhe District, Guangzhou, Guangdong 510630

Applicant before: Century Dragon Information Network Co.,Ltd.

GR01 Patent grant