CN110535713B - Monitoring management system and monitoring management method - Google Patents

Monitoring management system and monitoring management method

Info

Publication number
CN110535713B
Authority
CN
China
Prior art keywords
monitoring
queue
message queue
information
monitoring information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810509664.9A
Other languages
Chinese (zh)
Other versions
CN110535713A (en)
Inventor
杨猛
邵利铎
鹿慧
何栋
于灏
欧创新
王路远
王磊
刘松
王龙涛
刘皓
刘震
蔡雨佳
张娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Insurance Company of China
Original Assignee
Peoples Insurance Company of China
Application filed by Peoples Insurance Company of China
Priority to CN201810509664.9A
Publication of CN110535713A
Application granted
Publication of CN110535713B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/046 Network management architectures or arrangements comprising network management agents or mobile agents therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/04 Processing captured monitoring data, e.g. for logfile generation
    • H04L43/045 Processing captured monitoring data, e.g. for logfile generation for graphical visualisation of monitoring data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/06 Generation of reports
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1044 Group management mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a monitoring management system comprising a message queue cluster, a monitoring management platform and a database. The message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes. A gateway monitoring data acquisition agent program acquires monitoring information of the message queue gateway and reports it to the monitoring management platform, and a node monitoring data acquisition agent program acquires monitoring information of the message queue node and reports it to the monitoring management platform. The monitoring management platform analyzes the monitoring information, obtains a monitoring result and stores the monitoring information in the database. The database stores the monitoring information provided by the monitoring management platform. With this system, the running condition and health condition of the current message queue cluster are summarized through mechanisms such as information acquisition and analysis, which greatly improves the operation and maintenance assurance of the system.

Description

Monitoring management system and monitoring management method
Technical Field
The application relates to the technical field of computers, in particular to a monitoring management system. The application also relates to a monitoring management method.
Background
Message queue middleware is a general-purpose product that is widely used in scenarios such as data distribution and message interaction.
In the prior art, message middleware products (such as IBM MQ) are often used for message interaction, but a complete monitoring management system for monitoring the operation of the message queue cluster is lacking. A main problem in using message queue technology is that a failure of the message queue cluster cannot be detected in time for the related operation and maintenance work to be carried out. For example, when a queue is full or an MQ cluster fails, the failure is not detected in time, which results in low reliability and low maintainability of the system.
The prior art therefore has the problems of low reliability and low maintainability when a message queue middleware product is adopted.
Disclosure of Invention
The application provides a monitoring management system and a monitoring management method, which aim to solve the problems of low reliability and low maintainability existing in the prior art when a message queue middleware product is adopted.
The application provides a monitoring management system, comprising: a message queue cluster, a monitoring management platform and a database;
the message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes;
the message queue gateway is used for distributing messages of a docking application system to the message queue nodes according to the load of the message queue nodes; the message queue gateway runs a gateway monitoring data acquisition agent program, which is used for acquiring monitoring information of the message queue gateway and reporting the acquired monitoring information of the message queue gateway to the monitoring management platform;
the message queue node is used for receiving messages of the docking application system provided by the message queue gateway, and for processing the messages of the docking application system or storing them in a message queue; the message queue node runs a node monitoring data acquisition agent program, which is used for acquiring monitoring information of the message queue node and reporting the acquired monitoring information of the message queue node to the monitoring management platform;
the monitoring management platform is used for acquiring monitoring information reported by a gateway monitoring data acquisition agent program and a node monitoring data acquisition agent program, analyzing the monitoring information, acquiring a monitoring result and storing the monitoring information into the database;
and the database is used for storing the monitoring information provided by the monitoring management platform.
Optionally, the gateway monitoring data acquisition agent program is specifically configured to acquire monitoring information of the message queue gateway at regular intervals, and report the acquired monitoring information of the message queue gateway to the monitoring management platform.
Optionally, the node monitoring data acquisition agent program is specifically configured to acquire monitoring information of the message queue node at regular intervals, and report the acquired monitoring information of the message queue node to the monitoring management platform.
Optionally, the monitoring management platform includes:
the early warning submodule is used for determining whether early warning is needed or not according to the monitoring result of the monitoring information, and if so, performing early warning processing;
and the statistics and display submodule is used for carrying out multi-dimensional statistics and display on the monitoring information in the database.
Optionally, the early warning sub-module is specifically configured to:
notify a system administrator through a mobile phone short message or a mail when the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information; or, issue an image or sound warning or alarm.
Optionally, the early warning sub-module is further configured to set an early warning level of the monitoring information, where different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
Optionally, the statistics and display sub-module includes:
the MQ cluster system operation condition statistics and display submodule is used for carrying out statistics and display on the MQ cluster system operation condition; or,
the hardware condition counting and displaying submodule is used for counting and displaying the health condition of the hardware; or,
and the queue manager counting and displaying submodule is used for counting and displaying the queue manager.
Optionally, the system operation condition statistics and display submodule is specifically configured to display a topological graph of the docking application service system.
Optionally, the queue manager statistics and display sub-module is specifically configured to:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager; or,
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
Optionally, the counting and displaying of the queue manager includes at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
Optionally, the monitoring information includes at least one of the following information:
queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics.
Optionally, the queue statistics include: queue data flow information and/or data traffic information.
The present application further provides a monitoring management method, which is applied to the monitoring management system, and the method includes:
the message queue gateway reports own monitoring information to a monitoring management platform through a gateway monitoring data acquisition agent program running on the message queue gateway;
the message queue node reports own monitoring information to a monitoring management platform through a node monitoring data acquisition agent program running on the message queue node;
and the monitoring management platform analyzes the monitoring information, obtains a monitoring result of the monitoring information, and stores the monitoring information in a database.
Optionally, the reporting, by the message queue gateway, of its own monitoring information to the monitoring management platform through the gateway monitoring data acquisition agent program running on the message queue gateway includes:
the message queue gateway reports its own monitoring information to the monitoring management platform at regular intervals through the gateway monitoring data acquisition agent program running on the message queue gateway.
Optionally, the reporting, by the message queue node, of its own monitoring information to the monitoring management platform through the node monitoring data acquisition agent program running on the message queue node includes:
the message queue node reports its own monitoring information to the monitoring management platform at regular intervals through the node monitoring data acquisition agent program running on the message queue node.
Optionally, the method further includes:
and determining whether to alarm or pre-warn according to the monitoring result of the monitoring information.
Optionally, determining whether to alarm or perform early warning according to the monitoring result of the monitoring information includes:
judging whether the monitoring result of the monitoring information reaches an alarm condition set for the monitoring information, and if so, carrying out alarm processing; or,
judging whether the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information, and if so, carrying out early warning processing.
Optionally, the method further includes:
and setting early warning levels of the monitoring information, wherein different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
Optionally, the method further includes:
and carrying out multidimensional statistics and display on the monitoring information in the database.
Optionally, the performing multidimensional statistics and display on the monitoring information in the database includes:
counting and displaying the operation condition of the MQ cluster system; or,
counting and displaying the health condition of the hardware; or,
and counting and displaying the queue manager.
Optionally, the counting and displaying the operation condition of the MQ cluster system includes:
and displaying a topological graph of the docking application service system.
Optionally, the counting and displaying the queue manager includes:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager; or,
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
Optionally, the counting and displaying of the queue manager includes at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
Optionally, the monitoring information includes at least one of the following information:
queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics.
Optionally, the queue statistics include: queue data flow information and/or data traffic information.
Compared with the prior art, the method has the following advantages:
according to the monitoring management system and the monitoring management method, the message queue gateway and the message queue nodes report their own monitoring information to the monitoring management platform, and the monitoring management platform analyzes the monitoring information to obtain a monitoring result, so that problems in the message queue cluster can be found and handled in time. The running state and the health state of the current message queue cluster are summarized through mechanisms such as information acquisition and analysis, which greatly improves the operation and maintenance assurance of the system.
Drawings
Fig. 1 is a schematic diagram of a monitoring management system according to a first embodiment of the present application.
Fig. 2 is a schematic diagram of a monitoring agent program acquiring monitoring information and sending the acquired monitoring information to a monitoring management platform according to a first embodiment of the present application.
Fig. 3 is a functional schematic diagram of a monitoring management platform according to a first embodiment of the present application.
Fig. 4 is a schematic diagram illustrating that a monitoring management platform sends warning information to a mailbox of a system administrator according to a first embodiment of the present application.
Fig. 5 is a schematic diagram illustrating statistics and display of the number of data volumes of the queue manager according to the first embodiment of the present application.
Fig. 6 is a schematic diagram for counting and presenting queue messages in a queue manager according to the first embodiment of the present application.
Fig. 7 is a schematic diagram of counting the number of data pieces of the queue manager that are put in success, put in failure, taken out success, and taken out failure according to the first embodiment of the present application.
Fig. 8 is a schematic diagram of data of successful put, failed put, successful take, and failed take of each queue in the queue manager according to the first embodiment of the present application.
Fig. 9 is a schematic diagram showing a topology diagram of a docking application service system according to a first embodiment of the present application.
Fig. 10 is a flowchart of a monitoring management method according to a second embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
A first embodiment of the present application provides a monitoring management system. The message queue cluster in the first embodiment is described by taking an MQ (IBM MQ) cluster as an example. The following description is made in detail with reference to fig. 1, fig. 2, fig. 3, fig. 4, fig. 5, fig. 6, fig. 7, and fig. 8.
The system comprises: a message queue cluster 101, a monitoring management platform 102 and a database 103.
The message queue cluster 101 is described as follows:
the message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes;
the message queue gateway is used for distributing messages of a docking application system to the message queue nodes according to the load of the message queue nodes; the message queue gateway runs a gateway monitoring data acquisition agent program, which is used for acquiring monitoring information of the message queue gateway and reporting the acquired monitoring information of the message queue gateway to the monitoring management platform;
the message queue node is used for receiving messages of an application system provided by a message queue gateway, processing the messages of the application system, or storing the messages of the application system in a message queue mode, the message queue node runs a node monitoring data acquisition agent program, and the node monitoring data acquisition agent program is used for acquiring monitoring information of the message queue node and reporting the acquired monitoring information of the message queue node to the monitoring management platform.
It should be noted that the gateway monitoring data collection agent and the node monitoring data collection agent may adopt the same program or different programs.
As shown in fig. 1, MQ gateway server 1 and MQ gateway server 2 are message queue gateways.
An MQ gateway server is a message queue gateway server in the MQ cluster. It serves as the gateway of the whole MQ cluster, mainly handles application connection requests, and distributes the message data of the docking application system to the MQ nodes through a load balancing mechanism.
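The load-based distribution described above can be sketched as follows. This is a minimal illustration in Python, not the implementation disclosed in the application: the node names, the use of queue depth as the load metric, and the function `pick_node` are assumptions made for the example.

```python
from typing import Dict


def pick_node(node_loads: Dict[str, int]) -> str:
    """Return the name of the node with the lowest reported load.

    Load is assumed here to be the current queue depth; ties are broken
    by node name so that the choice is deterministic.
    """
    if not node_loads:
        raise ValueError("no message queue nodes available")
    return min(node_loads, key=lambda name: (node_loads[name], name))


# Hypothetical loads reported for four MQ cluster nodes.
loads = {"node1": 120, "node2": 45, "node3": 45, "node4": 300}
target = pick_node(loads)
```

A real gateway could weigh several metrics (CPU, memory, queue depth); a single number keeps the sketch short.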
The monitoring information comprises hardware information, such as CPU utilization, disk usage, file size, process and network information. It also includes the following information: queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics, etc. The queue statistics include: queue data flow information and/or data traffic information, etc.
As shown in fig. 1, the gateway monitoring data acquisition agent program (Agent) running on MQ gateway server 1 and MQ gateway server 2 collects monitoring information and reports the collected monitoring information of the message queue gateway to the monitoring management platform. Preferably, the agent program collects and reports the monitoring information at regular intervals, so that the monitoring management platform can handle abnormal conditions promptly according to the monitoring information and monitor the system in real time. The gateway monitoring data acquisition agent program is a software program running on the message queue gateway; it collects various kinds of monitoring information from the message queue gateway and sends the collected monitoring information to the monitoring management platform. As shown in fig. 2, the agent program collects monitoring information by calling APIs and sends it to the monitoring management platform server. System hardware information (memory, CPU, disk, etc.) is mainly obtained by querying operating system APIs, and MQ cluster information is obtained through the SDK API provided by IBM MQ. The collected MQ data mainly includes: MQ cluster, MQ queue manager, queue depth, message channel, IP, port, number of messages put and taken, time and other information.
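The collect-and-report cycle of such an agent can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the report layout, the function names, and the host name `mq-gateway-1` are hypothetical, the MQ query is a stand-in (the IBM MQ SDK calls are not shown), and a plain list replaces the network transport to the platform.

```python
import shutil
import time
from typing import Callable, Dict


def collect_hardware_info(path: str = ".") -> Dict[str, float]:
    """Collect disk usage for `path` through the operating system API."""
    usage = shutil.disk_usage(path)
    return {"disk_used_percent": round(100.0 * usage.used / usage.total, 2)}


def build_report(host: str,
                 hardware: Callable[[], Dict[str, float]],
                 mq: Callable[[], Dict[str, object]]) -> Dict[str, object]:
    """Merge hardware and MQ data into one timestamped report."""
    return {"host": host, "time": time.time(),
            "hardware": hardware(), "mq": mq()}


def run_agent(host: str, hardware, mq, send,
              interval_s: float, rounds: int) -> None:
    """Collect and report `rounds` times, sleeping `interval_s` in between."""
    for i in range(rounds):
        send(build_report(host, hardware, mq))
        if i < rounds - 1:
            time.sleep(interval_s)


# Stand-in MQ query (assumption) and an in-memory "transport".
sent = []
run_agent("mq-gateway-1",
          collect_hardware_info,
          lambda: {"queue_manager": "QMGWC", "queue_depth": 12},
          sent.append, interval_s=0.0, rounds=2)
```

Passing the collectors and the sender as callables keeps the timing loop independent of how the data is obtained or transmitted.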
As shown in fig. 1, MQ cluster node 1, MQ cluster node 2, MQ cluster node 3, and MQ cluster node 4 are message queue nodes.
The node monitoring data acquisition agent program (Agent) running on MQ cluster node 1, MQ cluster node 2, MQ cluster node 3 and MQ cluster node 4 collects monitoring information of the message queue node and reports the collected monitoring information of the message queue node to the monitoring management platform. Preferably, the agent program collects and reports the monitoring information of the message queue nodes at regular intervals, so that the monitoring management platform can handle abnormal conditions promptly according to the monitoring information and monitor the message queue nodes in real time. The node monitoring data acquisition agent program is a software program running on the message queue node; it collects various kinds of monitoring information from the message queue node and sends the collected monitoring information to the monitoring management platform.
The monitoring management platform 102 is configured to obtain monitoring information reported by the gateway monitoring data acquisition agent program and the node monitoring data acquisition agent program, analyze the monitoring information, obtain a monitoring result, and store the monitoring information in the database. Fig. 3 shows a functional schematic of the monitoring management platform.
The monitoring management platform may include:
the alarm submodule is used for alarming according to the monitoring result of the monitoring information;
the early warning submodule is used for determining whether early warning is needed or not according to the monitoring result of the monitoring information, and if so, performing early warning processing;
and the statistics and display submodule is used for carrying out multi-dimensional statistics and display on the monitoring information in the database.
The alarm submodule raises an alarm when the monitoring information is abnormal, that is, when the monitoring result of the monitoring information meets an alarm condition. For example, an alarm may be raised when the monitoring result satisfies one of the following conditions: the memory usage rate exceeds 95 percent; the CPU utilization rate exceeds 95 percent; the hard disk utilization rate exceeds 95 percent; data exists in the deadlock queue; the "message channel unavailable" information appears.
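The example alarm conditions above can be expressed as a small check function. This is a sketch only: the dictionary keys and the function name `check_alarms` are assumptions for illustration, and the 95 percent limit is taken from the examples in the text.

```python
from typing import Dict, List


def check_alarms(info: Dict[str, object], limit: float = 95.0) -> List[str]:
    """Return the alarm conditions met by one monitoring report."""
    alarms = []
    # Usage rates alarm only when they exceed the limit (strictly greater).
    for key in ("memory_percent", "cpu_percent", "disk_percent"):
        if float(info.get(key, 0.0)) > limit:
            alarms.append(f"{key} exceeds {limit:g}%")
    if int(info.get("deadlock_queue_depth", 0)) > 0:
        alarms.append("data exists in the deadlock queue")
    if info.get("channel_available") is False:
        alarms.append("message channel unavailable")
    return alarms


# Hypothetical report: memory is over the limit, disk is exactly at it.
report = {"memory_percent": 96.5, "cpu_percent": 40.0, "disk_percent": 95.0,
          "deadlock_queue_depth": 3, "channel_available": True}
alarms = check_alarms(report)
```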
The early warning submodule can perform early warning. When the monitoring result of the monitoring information reaches the early warning condition threshold value set for the monitoring information, a system administrator can be notified through a mobile phone short message or a mail; or, an image or sound warning or alarm is issued.
Because the monitoring items corresponding to different monitoring information are different, different early warning condition thresholds can be set for different monitoring information. And when the preset early warning condition threshold is reached, early warning or alarm processing is carried out.
For example, if the early warning condition threshold set for a certain message queue is 90,000 messages, an early warning is needed when the number of messages in the message queue is greater than or equal to 90,000; if the memory early warning condition threshold set for a certain MQ node is 85% memory occupation, an early warning is needed when memory occupation reaches 85%.
It should be noted that the early warning condition thresholds set for different message queues may differ. For example, for a message queue that is sensitive to real-time performance, the early warning condition threshold may be set to 10 messages; for a message queue with a large data size, the early warning condition threshold may be set to 2000 messages.
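Per-queue thresholds of this kind can be kept in a simple lookup table. The sketch below is illustrative: the queue names and the function `needs_warning` are hypothetical, while the threshold values follow the examples in the text.

```python
from typing import Dict

# Hypothetical queue names; values follow the examples given above.
QUEUE_THRESHOLDS: Dict[str, int] = {
    "realtime.queue": 10,    # sensitive to real-time performance
    "bulk.queue": 2000,      # large data size
}


def needs_warning(queue: str, depth: int, thresholds: Dict[str, int]) -> bool:
    """True when the queue depth is at or above that queue's threshold."""
    return depth >= thresholds[queue]


realtime_warn = needs_warning("realtime.queue", 10, QUEUE_THRESHOLDS)
bulk_warn = needs_warning("bulk.queue", 1500, QUEUE_THRESHOLDS)
```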
Preferably, the early warning sub-module is further configured to set an early warning level of the monitoring information, where different early warning levels correspond to different early warning condition thresholds set for the monitoring information. For example, different early warning condition thresholds can be set for the CPU utilization, and the early warning condition thresholds for the CPU utilization are respectively set to 70%, 80%, and 90%, and respectively correspond to the first-stage early warning, the second-stage early warning, and the third-stage early warning.
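The mapping from a measured value to an early warning level can be sketched as follows. The function name `warning_level` and the convention that level 0 means "no warning" are assumptions; the default thresholds are the CPU utilization examples from the text.

```python
from typing import Sequence


def warning_level(value: float,
                  thresholds: Sequence[float] = (70.0, 80.0, 90.0)) -> int:
    """Return 0 for no warning, else the highest level whose threshold is reached."""
    level = 0
    for i, threshold in enumerate(sorted(thresholds), start=1):
        if value >= threshold:
            level = i
    return level


levels = [warning_level(v) for v in (65.0, 70.0, 85.0, 99.0)]
```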
Preferably, in order to enable a system administrator (including the system administrator of the monitoring management system and each application system administrator) to know the early warning or alarm information in real time, the monitoring management platform may bind a mobile phone number and/or a mailbox address of the system administrator. Fig. 4 is a schematic diagram illustrating that the monitoring management platform sends the warning information to the mailbox of the system administrator.
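Building the mail for a bound administrator address can be sketched with Python's standard `email` package. This is an illustration only: the addresses, the subject format and the function name are hypothetical, and the actual delivery step (for example through an SMTP server) is left out.

```python
from email.message import EmailMessage


def build_warning_mail(sender: str, recipient: str, item: str,
                       value: float, threshold: float) -> EmailMessage:
    """Compose an early warning mail for one monitoring item."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = (f"[MQ warning] {item} reached {value:g} "
                      f"(threshold {threshold:g})")
    msg.set_content(f"Monitoring item: {item}\n"
                    f"Current value: {value:g}\n"
                    f"Threshold: {threshold:g}\n")
    return msg


# Placeholder addresses on the reserved example.com domain.
mail = build_warning_mail("monitor@example.com", "admin@example.com",
                          "cpu_percent", 91.0, 90.0)
```

The resulting message object could be handed to `smtplib.SMTP.send_message` once a mail server is configured.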
Preferably, suggestion information is carried in the early warning or alarm message.
The importance of performing early warning or alarm processing when a set early warning condition threshold is reached is described below with reference to a scene.
For example, a data distribution platform in the insurance industry uses an MQ cluster to receive and forward messages. Suppose a message queue is set up for underwriting policies and its early warning condition threshold is set to 10 thousand messages. If policies pending claim settlement are sent in one concentrated batch every day, the batch is likely to put a heavy load on the underwriting policy message queue. An early warning is issued when the number of messages in the queue is greater than or equal to the threshold, and if the early warning recurs continuously for a period of time (for example, 5 days), the early warning information can carry a capacity expansion suggestion, so that the system administrator can expand capacity in time, which keeps the system running normally and improves its reliability.
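The rule "attach a capacity expansion suggestion after several consecutive days of warnings" can be sketched as below. The function name, the message wording and the sample numbers are hypothetical; only the 5-day streak comes from the example in the text.

```python
from typing import List


def warning_text(daily_peak_depths: List[int], threshold: int,
                 streak_days: int = 5) -> str:
    """Build the warning text for the latest day; add advice after a long streak.

    Returns an empty string when the latest day is below the threshold.
    """
    if not daily_peak_depths or daily_peak_depths[-1] < threshold:
        return ""
    # Count how many of the most recent days reached the threshold in a row.
    streak = 0
    for depth in reversed(daily_peak_depths):
        if depth < threshold:
            break
        streak += 1
    text = f"queue depth {daily_peak_depths[-1]} reached threshold {threshold}"
    if streak >= streak_days:
        text += f"; warned {streak} days in a row, consider expanding capacity"
    return text


# Made-up daily peaks: one quiet day followed by five days over the threshold.
message = warning_text([5, 12, 12, 12, 12, 12], threshold=10)
```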
The statistics and display submodule comprises:
the MQ cluster system operation condition statistics and display submodule is used for carrying out statistics and display on the MQ cluster system operation condition; or,
the hardware condition counting and displaying submodule is used for counting and displaying the health condition of the hardware; or,
and the queue manager counting and displaying submodule is used for counting and displaying the queue manager.
The queue manager statistics and display submodule is specifically configured to:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue messages in the queue manager; or,
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager; or,
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
As shown in FIG. 5, the number of queue manager data volumes is counted and shown, for example, the number of newly held queue manager data volumes is 420573.
The statistics and display of the queue manager comprises at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
As shown in fig. 6, the queue messages in the queue manager are counted and shown, for example, the queue messages of the queues QAREINS001, and QAREINS001 managed by the queue manager QMGWC may be counted and shown by day or by minute.
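Counting messages by month, day, hour or minute amounts to grouping timestamps by a truncated key. The sketch below is illustrative: the function name `count_by`, the format table and the sample timestamps are made up for the example.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

# strftime patterns that truncate a timestamp to the chosen granularity.
FORMATS = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H",
    "minute": "%Y-%m-%d %H:%M",
}


def count_by(timestamps: Iterable[datetime], granularity: str) -> Dict[str, int]:
    """Count events per time bucket of the given granularity."""
    fmt = FORMATS[granularity]
    return dict(Counter(ts.strftime(fmt) for ts in timestamps))


# Made-up message arrival times.
events = [
    datetime(2018, 5, 24, 9, 15),
    datetime(2018, 5, 24, 9, 15),
    datetime(2018, 5, 24, 10, 0),
    datetime(2018, 5, 25, 8, 30),
]
per_day = count_by(events, "day")
per_hour = count_by(events, "hour")
```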
FIG. 7 is a schematic diagram of counting the number of data items of the queue manager that were put in successfully, failed to be put in, taken out successfully, and failed to be taken out. As shown in FIG. 7, the amount of data the head office sent to the different branch companies on a given day, and the amount of data received from the branch companies, can be counted and displayed. If the count recorded by the data sender is inconsistent with the count recorded by the data receiver, a business inconsistency can be inferred.
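The sender/receiver comparison described above can be sketched as follows. This is only an illustration under assumptions, not the implementation of the application: the function name `find_mismatches` and the dictionary layout (branch name mapped to a message count for one statistics period) are hypothetical.

```python
def find_mismatches(sent_counts, received_counts):
    """Return {branch: (sent, received)} for branches whose counts differ.

    Both arguments map a branch name to a message count for the same
    statistics period (for example, one day). A branch missing from one
    side is treated as having a count of 0 on that side.
    """
    mismatches = {}
    # Check every branch that appears on either side.
    for branch in sorted(set(sent_counts) | set(received_counts)):
        sent = sent_counts.get(branch, 0)
        received = received_counts.get(branch, 0)
        if sent != received:
            mismatches[branch] = (sent, received)
    return mismatches
```

An empty result means the two sides agree for the period; any entry points to a branch whose traffic should be investigated.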
FIG. 8 shows the put-in-success, put-in-failure, take-out-success, and take-out-failure data for each queue in the queue manager. From the statistics of a given queue in FIG. 8, it can be determined whether the put-in rate and the take-out rate of the data are comparable, and therefore whether data is accumulating in the MQ cluster.
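One way to turn the put-in/take-out comparison into a check is sketched below. The function names, the per-interval sample layout, and the `tolerance` parameter are assumptions made for illustration; the application does not specify how the comparison is computed.

```python
def backlog_growth(put_success, get_success):
    """Messages put in minus messages taken out over the same interval."""
    return put_success - get_success


def is_accumulating(samples, tolerance=0):
    """Return True if every interval in `samples` shows net backlog growth.

    samples: list of (put_success, get_success) tuples, one per interval.
    tolerance: hypothetical slack, in messages, below which growth is ignored.
    """
    if not samples:
        return False
    return all(backlog_growth(p, g) > tolerance for p, g in samples)
```

Requiring growth in every interval is a deliberately strict choice: it avoids flagging a queue whose consumers briefly lag and then catch up.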
The MQ cluster system operation condition statistics and display submodule is specifically used for displaying a topology graph of the docking application service system. FIG. 9 is a schematic diagram of such a topology graph.
It should be noted that, because the monitoring information reported by the message queue gateway to the monitoring management platform is not in the data format required by the monitoring management platform, the platform can first parse and process the monitoring information and then store the parsed and processed monitoring information in the database.
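The parsing step can be illustrated with a small sketch. The application does not state the format in which the agents report data, so the `key=value;key=value` text format, the field names, and the function name `parse_report` below are all assumptions.

```python
def parse_report(raw):
    """Convert 'key=value;key=value' text into a dict with typed values."""
    record = {}
    for field in raw.strip().split(";"):
        if not field:
            continue  # skip empty segments such as a trailing ';'
        key, _, value = field.partition("=")
        key = key.strip()
        value = value.strip()
        # Store numeric values as numbers so they can be compared with thresholds.
        try:
            record[key] = int(value)
        except ValueError:
            try:
                record[key] = float(value)
            except ValueError:
                record[key] = value
    return record
```

For instance, `parse_report("host=gw1;cpu=42.5;queue_depth=120")` yields a dictionary whose `cpu` and `queue_depth` values are numeric and can be stored or checked directly.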
The database 104 is configured to store the monitoring information provided by the monitoring management platform.
A database may refer to a repository that organizes, stores, and manages data according to a data structure. The database may be a relational database, such as Oracle or SQL Server, from which information can be retrieved using database query statements. The database and the monitoring management platform can be deployed on the same physical server; to protect the stored data, they can also be deployed on different physical servers.
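Storing and querying monitoring information in a relational database can be sketched as below. The text names Oracle and SQL Server; this sketch substitutes Python's built-in `sqlite3` module purely so that it runs without a server. The table name `monitoring`, its columns, and the helper names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monitoring ("
        " host TEXT NOT NULL,"
        " metric TEXT NOT NULL,"
        " value REAL NOT NULL,"
        " collected_at TEXT NOT NULL)"
    )
    return conn


def save_sample(conn, host, metric, value, collected_at):
    """Insert one monitoring sample; collected_at is an ISO 8601 string."""
    conn.execute(
        "INSERT INTO monitoring (host, metric, value, collected_at)"
        " VALUES (?, ?, ?, ?)",
        (host, metric, value, collected_at),
    )
    conn.commit()


def latest_value(conn, host, metric):
    """Return the most recent value for (host, metric), or None."""
    row = conn.execute(
        "SELECT value FROM monitoring WHERE host = ? AND metric = ?"
        " ORDER BY collected_at DESC LIMIT 1",
        (host, metric),
    ).fetchone()
    return None if row is None else row[0]
```

ISO 8601 timestamps sort correctly as text, which is why `ORDER BY collected_at` works here without a dedicated date type.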
This concludes the detailed description of an implementation of the monitoring management system according to the first embodiment of the present application. In the first embodiment, monitoring information of the message queue gateway and the message queue nodes is collected and reported to the monitoring management platform, and the monitoring management platform analyzes the monitoring information to obtain a monitoring result, so that problems in the message queue cluster can be found in time and an early warning or alarm can be issued; the monitoring management platform can also count and display the monitoring information. Through mechanisms such as monitoring information collection, statistics and display, and early warning, the running state and health state of the current message queue cluster are collected and summarized in real time, which greatly strengthens operation and maintenance support for the system.
A second embodiment of the present application provides a monitoring management method, which is applied to the monitoring management system of the first embodiment of the present application. The following description will be made in detail with reference to fig. 2, 3, 4, 5, 6, 7, 8, 9 and 10.
As shown in fig. 10, in step S1001, the message queue gateway reports its own monitoring information to the monitoring management platform through the gateway monitoring data collection agent running on the message queue gateway.
The message queue gateway refers to a message queue gateway server in a message queue cluster (e.g., an MQ cluster). It is the gateway of the whole message queue cluster and mainly handles application connection requests: message data from an application system passes through the message queue gateway server and is distributed to the message queue nodes (e.g., MQ nodes) by a load balancing mechanism. As shown in FIG. 1, MQ gateway server 1 and MQ gateway server 2 are message queue gateways.
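The text does not say which load balancing mechanism the gateway uses, only that messages are distributed according to the load of the nodes. A minimal sketch, assuming a simple least-loaded policy and hypothetical names (`pick_node`, `distribute`, and a load measured as a message count), could look like this:

```python
def pick_node(node_loads):
    """Return the name of the node with the lowest load.

    Ties are broken by node name so the choice is deterministic.
    """
    if not node_loads:
        raise ValueError("no message queue nodes available")
    return min(node_loads, key=lambda name: (node_loads[name], name))


def distribute(messages, node_loads):
    """Assign each message to the currently least-loaded node.

    Returns a dict mapping node name to the list of messages assigned to it.
    The load of a node is increased by one for every message it receives.
    """
    loads = dict(node_loads)  # work on a copy; the caller's dict is unchanged
    assignment = {name: [] for name in loads}
    for message in messages:
        node = pick_node(loads)
        assignment[node].append(message)
        loads[node] += 1
    return assignment
```

Other policies (round robin, weighted distribution) would fit the same description; least-loaded is used here only because it is easy to verify.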
The monitoring information comprises hardware information, such as CPU utilization, disk usage, file size, processes, and network. It also includes the following information: queue manager state; message channel state; message queue information; error queue information; deadlock queue information; queue statistics; and so on. The queue statistics include queue data flow information and/or data traffic information.
The monitoring management platform is a software system, and is used for acquiring monitoring information reported by the message queue gateway and the message queue node to the monitoring management platform, analyzing the monitoring information, acquiring a monitoring result of the monitoring information, and storing the monitoring information in the database.
As shown in FIG. 1, a gateway monitoring data collection agent (Agent) running on MQ gateway server 1 and MQ gateway server 2 can collect monitoring information and report the collected monitoring information of the message queue gateway to the monitoring management platform. Preferably, the gateway monitoring data collection agent collects monitoring information at regular intervals and reports it to the monitoring management platform, so that the platform can handle abnormal conditions promptly according to the monitoring information and monitor the system in real time. The gateway monitoring data collection agent is a software program running on the message queue gateway; it can collect various kinds of monitoring information from the message queue gateway and send the collected monitoring information to the monitoring management platform. As shown in FIG. 2, the agent collects monitoring information by calling APIs and sends it to the monitoring management platform server. System hardware information (such as memory, CPU, and disk) is obtained mainly by querying the operating system's API, and information about the MQ cluster is collected through the SDK API provided by IBM MQ. The collected MQ data mainly includes: MQ cluster, MQ queue manager, queue depth, message channel, IP, port, number of messages put in and taken out, time, and other information.
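The collect-and-report loop of such an agent can be sketched as follows. This is an illustration under assumptions: only disk usage is collected (through Python's standard library), the IBM MQ SDK calls are not reproduced and are represented by a caller-supplied `collect_mq` function, and the transport to the platform is represented by a caller-supplied `send` function. All names and the JSON payload layout are hypothetical.

```python
import json
import shutil
import socket
import time


def collect_hardware(path="/"):
    """Collect a small hardware snapshot using the operating system's API."""
    usage = shutil.disk_usage(path)
    percent = round(100.0 * usage.used / usage.total, 2) if usage.total else 0.0
    return {
        "host": socket.gethostname(),
        "collected_at": time.time(),
        "disk_total_bytes": usage.total,
        "disk_used_bytes": usage.used,
        "disk_used_percent": percent,
    }


def run_agent(send, collect_mq=lambda: {}, interval_seconds=60, cycles=1, path="/"):
    """Collect and report `cycles` times, sleeping between cycles.

    send: callable that delivers one JSON string to the platform (assumed).
    collect_mq: callable returning MQ data such as queue depth (assumed);
                a real agent would query the MQ SDK here.
    """
    for i in range(cycles):
        report = collect_hardware(path)
        report["mq"] = collect_mq()
        send(json.dumps(report))
        if i < cycles - 1:
            time.sleep(interval_seconds)
```

Passing `send` and `collect_mq` in as functions keeps the timing logic independent of the transport and of the MQ SDK, which also makes the loop easy to test.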
As shown in fig. 10, in step S1002, the message queue node reports its own monitoring information to the monitoring management platform through the node monitoring data collection agent running on the message queue node.
As shown in FIG. 1, MQ cluster node 1, MQ cluster node 2, MQ cluster node 3, and MQ cluster node 4 are message queue nodes.
The node monitoring data collection Agent (Agent) running on the MQ cluster node 1, the MQ cluster node 2, the MQ cluster node 3 and the MQ cluster node 4 can collect monitoring information of the message queue node and report the collected monitoring information of the message queue node to the monitoring management platform. Preferably, the node monitoring data acquisition agent program can acquire the monitoring information of the message queue nodes at regular time and report the acquired monitoring information of the message queue nodes to the monitoring management platform, so that the monitoring management platform can process abnormal conditions in time according to the monitoring information and realize real-time monitoring of the message queue nodes. The node monitoring data acquisition agent program is a software program running on the message queue node, and can acquire various monitoring information from the message queue node and send the acquired monitoring information to the monitoring management platform.
As shown in fig. 10, in step S1003, the monitoring management platform analyzes the monitoring information, obtains a monitoring result of the monitoring information, and stores the monitoring information in a database.
After the monitoring management platform analyzes the monitoring information and obtains the monitoring result of the monitoring information, the monitoring management platform can also determine whether to alarm or give an early warning according to the monitoring result of the monitoring information.
Determining whether to alarm or pre-warn according to the monitoring result of the monitoring information, comprising:
judging whether the monitoring result of the monitoring information reaches an alarm condition set for the monitoring information, and if so, carrying out alarm processing; or
judging whether the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information, and if so, carrying out early warning processing.
When the monitoring result of the monitoring information meets the alarm condition set for the monitoring information, an alarm is raised. For example, an alarm may be raised when the monitoring result satisfies any of the following conditions: the memory usage exceeds 95%; the CPU utilization exceeds 95%; the hard disk utilization exceeds 95%; data exists in the deadlock queue; or "message channel unavailable" information appears. The monitoring management platform can notify the system administrator by mobile phone short message or by mail, and can also display an alarm or sound an alarm through the monitoring management platform interface.
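The example alarm conditions listed above can be expressed as a single check. The sketch below assumes the monitoring result is available as a dictionary; the key names (`memory_percent`, `cpu_percent`, `disk_percent`, `deadlock_queue_depth`, `channel_available`) and the function name are hypothetical.

```python
def alarm_reasons(info):
    """Return the list of alarm conditions met by one monitoring result."""
    reasons = []
    # Resource usage alarms: the example limit in the text is 95%.
    for key, label in (
        ("memory_percent", "memory usage"),
        ("cpu_percent", "CPU utilization"),
        ("disk_percent", "hard disk utilization"),
    ):
        if info.get(key, 0) > 95:
            reasons.append(f"{label} exceeds 95%")
    # Any data in the deadlock queue triggers an alarm.
    if info.get("deadlock_queue_depth", 0) > 0:
        reasons.append("data exists in the deadlock queue")
    # A channel explicitly reported as unavailable triggers an alarm.
    if info.get("channel_available", True) is False:
        reasons.append("message channel unavailable")
    return reasons
```

An empty list means no alarm; a non-empty list can be used directly as the body of the short message or mail sent to the administrator.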
When the monitoring result of the monitoring information reaches the early warning condition threshold value set for the monitoring information, a system administrator can be notified by mobile phone short message or by mail; alternatively, a visual or audible early warning or alarm can be issued.
Because the monitoring items corresponding to different monitoring information are different, different early warning condition thresholds can be set for different monitoring information. And when the preset early warning condition threshold is reached, early warning or alarm processing is carried out.
For example, if the early warning condition threshold set for a certain message queue is 90,000 messages, an early warning is needed when the number of messages in that message queue is greater than or equal to 90,000; if the memory early warning condition threshold set for a certain MQ node is 85% memory occupancy, an early warning is needed when memory occupancy reaches 85%.
It should be noted that the early warning condition thresholds set for different message queues may differ. For example, for a message queue that is sensitive to real-time performance, the early warning condition threshold may be set to 10 messages; for a message queue with a large data volume, the early warning condition threshold may be set to 2000 messages.
Preferably, early warning levels of monitoring information can be set, and different early warning levels correspond to different early warning condition thresholds set for the monitoring information. For example, different early warning condition thresholds can be set for the CPU utilization, and the early warning condition thresholds for the CPU utilization are respectively set to 70%, 80%, and 90%, and respectively correspond to the first-stage early warning, the second-stage early warning, and the third-stage early warning.
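The mapping from a measured value to an early warning level can be sketched as below, using the 70%, 80%, and 90% CPU thresholds from the example. The function and constant names are hypothetical, and treating "reaches" as greater than or equal to the threshold is an assumption.

```python
# Example thresholds from the text: first-, second-, and third-stage early warning.
CPU_WARNING_THRESHOLDS = (70, 80, 90)


def warning_level(value, thresholds=CPU_WARNING_THRESHOLDS):
    """Return 0 if no threshold is reached, else the highest level reached."""
    level = 0
    for index, threshold in enumerate(sorted(thresholds), start=1):
        if value >= threshold:
            level = index
    return level
```

With these thresholds, a CPU utilization of 85% maps to the second-stage early warning, and a value below 70% produces no early warning.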
Preferably, in order to enable a system administrator (including the system administrator of the monitoring management system and each application system administrator) to know the early warning or alarm information in real time, the monitoring management platform may bind a mobile phone number and/or a mailbox address of the system administrator. Fig. 4 is a schematic diagram illustrating that the monitoring management platform sends the warning information to the mailbox of the system administrator.
Preferably, the advice information is carried during the early warning or alarm processing.
The importance of performing early warning or alarm processing when a set early warning condition threshold is reached is described below with reference to a scene.
For example, in a data distribution platform in the insurance industry, an MQ cluster is used to receive and forward messages. Suppose a message queue is set up for underwriting policies and its early warning condition threshold is set to 10 thousand messages. If policies pending claim settlement are sent in one concentrated batch every day, that batch is likely to put pressure on the underwriting policy message queue. An early warning is issued when the number of messages in the queue is greater than or equal to 10 thousand. If the early warning persists for a period of time (for example, 5 days), a capacity expansion suggestion can be carried in the early warning information, so that the system administrator can expand capacity in time, which keeps the system running normally and improves its reliability.
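The scenario above, where a sustained early warning leads to a capacity expansion suggestion, can be sketched as follows. The function name, the use of one peak queue depth per day, and the returned dictionary layout are assumptions made for illustration.

```python
def build_warning(daily_peak_depths, threshold, sustained_days=5):
    """Build early warning information for the most recent day.

    daily_peak_depths: peak queue depth per day, oldest first.
    Returns None if the latest day does not reach the threshold; otherwise
    a dict with the current streak length and an optional suggestion.
    """
    if not daily_peak_depths or daily_peak_depths[-1] < threshold:
        return None
    # Count how many consecutive days, ending today, reached the threshold.
    streak = 0
    for depth in reversed(daily_peak_depths):
        if depth >= threshold:
            streak += 1
        else:
            break
    warning = {"consecutive_days": streak, "suggestion": None}
    if streak >= sustained_days:
        warning["suggestion"] = "consider expanding capacity"
    return warning
```

Counting only an unbroken run of days keeps a single quiet day from being ignored: the suggestion appears only when the pressure on the queue is persistent.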
The monitoring management platform can perform multi-dimensional statistics and display on monitoring information in the database besides performing early warning and alarming.
The multidimensional statistics and display of the monitoring information in the database comprises the following steps:
counting and displaying the operation condition of the MQ cluster system; or,
counting and displaying the health condition of the hardware; or,
and counting and displaying the queue manager.
The counting and displaying of the queue manager comprises the following steps:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager;
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager;
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
As shown in FIG. 5, the data volume of the queue manager is counted and displayed; for example, the newly held data volume of the queue manager is 420573.
The statistics and display of the queue manager comprises at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics. As shown in fig. 6, the queue messages in the queue manager are counted and shown, for example, the queue messages of the queues QAREINS001, and QAREINS001 managed by the queue manager QMGWC may be counted and shown by day or by minute.
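Counting queue messages at different granularities can be sketched as below. It assumes each message carries a timestamp; the names `FORMATS` and `count_by` are hypothetical, and only the month, day, hour, and minute modes are covered.

```python
from collections import Counter
from datetime import datetime

# One strftime pattern per statistics mode; each pattern defines the bucket key.
FORMATS = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d %H",
    "minute": "%Y-%m-%d %H:%M",
}


def count_by(timestamps, granularity):
    """Count timestamps per bucket and return the buckets in time order."""
    fmt = FORMATS[granularity]
    counts = Counter(ts.strftime(fmt) for ts in timestamps)
    return dict(sorted(counts.items()))
```

Historical and custom statistics could reuse the same function by first filtering the timestamps to the desired time range.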
FIG. 7 is a schematic diagram of counting the number of data items of the queue manager that were put in successfully, failed to be put in, taken out successfully, and failed to be taken out. As shown in FIG. 7, the amount of data the head office sent to the different branch companies on a given day, and the amount of data received from the branch companies, can be counted and displayed. If the put-in data is inconsistent with the taken-out data, it can be determined that the data of the docking service systems is inconsistent.
FIG. 8 shows the put-in-success, put-in-failure, take-out-success, and take-out-failure data for each queue in the queue manager. From the statistics of a given queue in FIG. 8, it can be determined whether the put-in rate and the take-out rate of the data are comparable, and therefore whether data is accumulating in the system.
The counting and displaying of the operation condition of the MQ cluster system comprises: displaying a topology graph of the docking application service system. FIG. 9 is a schematic diagram of such a topology graph.
It should be noted that, because the monitoring information reported by the message queue gateway to the monitoring management platform is not in the data format required by the monitoring management platform, the platform can first parse and process the monitoring information and then store the parsed and processed monitoring information in the database.
This concludes the detailed description of an implementation of the monitoring management method according to the second embodiment of the present application. In the second embodiment, monitoring information of the message queue gateway and the message queue nodes is collected and reported to the monitoring management platform, and the monitoring management platform analyzes the monitoring information to obtain a monitoring result, so that problems in the message queue cluster can be found in time and an early warning or alarm can be issued; the monitoring management platform can also count and display the monitoring information. Through mechanisms such as monitoring information collection, statistics and display, and early warning, the running state and health state of the current message queue cluster are collected and summarized in real time, which greatly strengthens operation and maintenance support for the system.
Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto, and variations and modifications may be made by those skilled in the art without departing from the spirit and scope of the present invention.

Claims (23)

1. A monitoring management system, comprising: the system comprises a message queue cluster, a monitoring management platform and a database;
the message queue cluster comprises at least one message queue gateway and a plurality of message queue nodes;
the message queue gateway is used for distributing messages of a docking application system to the message queue nodes according to the load of the message queue nodes, the message queue gateway runs a gateway monitoring data acquisition agent program, and the gateway monitoring data acquisition agent program is used for acquiring monitoring information of the message queue gateway and reporting the acquired monitoring information of the message queue gateway to the monitoring management platform;
the message queue node is used for receiving messages of the docking application system provided by the message queue gateway, processing the messages of the docking application system, or storing the messages of the docking application system in a message queue mode, the message queue node runs a node monitoring data acquisition agent program, and the node monitoring data acquisition agent program is used for acquiring monitoring information of the message queue node and reporting the acquired monitoring information of the message queue node to the monitoring management platform;
the monitoring management platform is used for acquiring monitoring information reported by a gateway monitoring data acquisition agent program and a node monitoring data acquisition agent program, analyzing the monitoring information, acquiring a monitoring result and storing the monitoring information into the database;
the database is used for storing the monitoring information provided by the monitoring management platform; the monitoring information comprises at least one of the following information: queue manager state; a message channel state; message queue information; error queue information; a deadlock queue information; queue statistics.
2. The monitoring management system according to claim 1, wherein the gateway monitoring data collection agent is specifically configured to collect monitoring information of the message queue gateway at regular time, and report the collected monitoring information of the message queue gateway to the monitoring management platform.
3. The monitoring management system according to claim 1, wherein the node monitoring data collection agent is specifically configured to collect monitoring information of the message queue node at regular time, and report the collected monitoring information of the message queue node to the monitoring management platform.
4. The monitoring management system according to claim 1, wherein the monitoring management platform comprises:
the early warning submodule is used for determining whether early warning is needed or not according to the monitoring result of the monitoring information, and if so, performing early warning processing;
and the statistics and display submodule is used for carrying out multi-dimensional statistics and display on the monitoring information in the database.
5. The monitoring management system of claim 4, wherein the early warning sub-module is specifically configured to:
the system is used for notifying a system administrator through a mobile phone short message or a mail when the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information; or, an image or sound warning or alarm is issued.
6. The monitoring management system of claim 4, wherein the early warning sub-module is further configured to set early warning levels of monitoring information, and different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
7. The monitoring management system of claim 4, wherein the statistics and presentation submodule comprises:
the MQ cluster system operation condition statistics and display submodule is used for carrying out statistics and display on the MQ cluster system operation condition; or,
the hardware condition counting and displaying submodule is used for counting and displaying the health condition of the hardware; or,
and the queue manager counting and displaying submodule is used for counting and displaying the queue manager.
8. The monitoring management system according to claim 7, wherein the MQ cluster system operation condition statistics and presentation submodule is specifically configured to present a topology map of the docking application service system.
9. The monitoring management system of claim 7, wherein the queue manager statistics and presentation sub-module is specifically configured to:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager; or,
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager; or,
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
10. The monitoring management system of claim 7, wherein the statistics and presentation of the queue manager includes at least one of the following modes:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
11. The monitoring management system of claim 1, wherein the queue statistics comprise: queue data flow information and/or data traffic information.
12. A monitoring management method applied to the monitoring management system of claim 1, the method comprising:
the message queue gateway reports own monitoring information to a monitoring management platform through a gateway monitoring data acquisition agent program running on the message queue gateway;
the message queue node reports own monitoring information to a monitoring management platform through a node monitoring data acquisition agent program running on the message queue node;
the monitoring management platform analyzes the monitoring information, obtains a monitoring result of the monitoring information, and stores the monitoring information into a database; the monitoring information comprises at least one of the following information: queue manager state; a message channel state; message queue information; error queue information; a deadlock queue information; queue statistics.
13. The method of claim 12, wherein the reporting, by the message queue gateway, of the monitoring information of the message queue gateway to the monitoring management platform through a gateway monitoring data collection agent running on the message queue gateway includes:
and the message queue gateway reports the monitoring information of the message queue gateway to a monitoring management platform at regular time through a gateway monitoring data acquisition agent program running on the message queue gateway.
14. The method of claim 12, wherein the reporting of the monitoring information of the message queue node to the monitoring management platform by the node monitoring data collection agent running on the message queue node comprises:
and the message queue node reports the monitoring information of the message queue node to a monitoring management platform at regular time through a node monitoring data acquisition agent program running on the message queue node.
15. The method of claim 12, further comprising:
and determining whether to alarm or pre-warn according to the monitoring result of the monitoring information.
16. The method of claim 15, wherein determining whether an alarm or pre-warning is required based on the monitoring of the monitoring information comprises:
judging whether the monitoring result of the monitoring information reaches an alarm condition set for the monitoring information, and if so, carrying out alarm processing; or
And judging whether the monitoring result of the monitoring information reaches an early warning condition threshold value set for the monitoring information, and if so, carrying out early warning processing.
17. The method of claim 16, further comprising:
and setting early warning levels of the monitoring information, wherein different early warning levels correspond to different early warning condition thresholds set for the monitoring information.
18. The method of claim 12, further comprising:
and carrying out multidimensional statistics and display on the monitoring information in the database.
19. The method of claim 18, wherein the performing multidimensional statistics and presentation on the monitoring information in the database comprises:
counting and displaying the operation condition of the MQ cluster system; or,
counting and displaying the health condition of the hardware; or,
and counting and displaying the queue manager.
20. The method as claimed in claim 19, wherein the counting and exposing the MQ cluster system operation condition comprises:
and displaying a topological graph of the docking application service system.
21. The method of claim 19, wherein said counting and exposing the queue manager comprises:
counting and displaying the data quantity of the queue manager; or,
counting and displaying the queue information in the queue manager;
counting the number of data which are successfully put in, failed to put in, successfully taken out and failed to take out of the queue manager;
and counting and displaying the data conditions of successful putting, failed putting, successful taking and failed taking of the queue in the queue manager.
22. The method of claim 21, wherein the statistics and presentation of the queue manager comprises at least one of:
monthly statistics, daily statistics, hourly statistics, minute statistics, historical statistics, custom statistics.
23. The method of claim 12, wherein the queue statistics comprise: queue data flow information and/or data traffic information.
CN201810509664.9A 2018-05-24 2018-05-24 Monitoring management system and monitoring management method Active CN110535713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810509664.9A CN110535713B (en) 2018-05-24 2018-05-24 Monitoring management system and monitoring management method

Publications (2)

Publication Number Publication Date
CN110535713A CN110535713A (en) 2019-12-03
CN110535713B true CN110535713B (en) 2021-08-03

Family

ID=68657435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810509664.9A Active CN110535713B (en) 2018-05-24 2018-05-24 Monitoring management system and monitoring management method

Country Status (1)

Country Link
CN (1) CN110535713B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111556019B (en) * 2020-03-27 2022-06-14 天津市普迅电力信息技术有限公司 Vehicle-mounted machine data encryption transmission and processing method under distributed environment
CN113630284B (en) * 2020-05-08 2023-07-07 网联清算有限公司 Message middleware monitoring method, device and equipment
CN111626870B (en) * 2020-05-25 2023-07-21 泰康保险集团股份有限公司 Nuclear data processing method, device and equipment for cleaning physical examination piece
CN111638981A (en) * 2020-05-27 2020-09-08 南京犀六智能科技有限公司 Safety management system
CN112333042A (en) * 2020-10-27 2021-02-05 广州助蜂网络科技有限公司 Monitoring management method and device for Internet of things card middleware
CN112291254B (en) * 2020-11-05 2023-05-05 中国人民银行清算总中心 Message processing method and device for reliable transaction
CN115776435B (en) * 2022-10-24 2024-03-01 华能信息技术有限公司 Early warning method based on API gateway
CN116170385A (en) * 2023-04-21 2023-05-26 四川汉科计算机信息技术有限公司 Gateway information forwarding system, method, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101965005A (en) * 2009-07-21 2011-02-02 中兴通讯股份有限公司 Distributed access gateway system
CN102801585A (en) * 2012-08-24 2012-11-28 上海和辰信息技术有限公司 Information monitoring system and method based on cloud computing network environment
CN107766207A (en) * 2017-10-20 2018-03-06 中国人民财产保险股份有限公司 Distributed automatic monitoring method, system, computer-readable recording medium and terminal device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9240937B2 (en) * 2011-03-31 2016-01-19 Microsoft Technology Licensing, Llc Fault detection and recovery as a service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant