CN114356722A - Monitoring alarm method, system, equipment and storage medium for server cluster - Google Patents


Info

Publication number
CN114356722A
Authority
CN
China
Prior art keywords
alarm
alarm messages
target
monitoring
messages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210033756.0A
Other languages
Chinese (zh)
Inventor
张豪杰
陈文俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An E Wallet Electronic Commerce Co Ltd
Original Assignee
Ping An E Wallet Electronic Commerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the field of software monitoring and discloses a monitoring alarm method for a server cluster, comprising: monitoring the operating condition of a server cluster and acquiring a plurality of alarm messages provided by the server cluster; recording the plurality of alarm messages, with the monitoring node associated with each alarm message as the unit of recording; determining a plurality of normal alarm messages from the plurality of alarm messages based on actual fault information of the server cluster; determining a plurality of false alarm messages among the plurality of alarm messages according to the plurality of normal alarm messages; analyzing the records corresponding to the false alarm messages to determine one or more target monitoring nodes; and adjusting the number of alarm messages of each target monitoring node. The technical solution provided by the invention can automatically filter false alarm messages.

Description

Monitoring alarm method, system, equipment and storage medium for server cluster
Technical Field
The present invention relates to the field of software monitoring technologies, and in particular, to a monitoring alarm method and system for a server cluster, a computer device, and a computer-readable storage medium.
Background
With the development of computer science and technology, big data storage and processing have become a current focus. For example, more and more companies and users are migrating various types of files and computing power to server clusters, thereby implementing cloud storage and cloud computing.
The stability of server clusters, cloud storage, and cloud computing is very important. For some serious faults in a server cluster, such as a database crash, a link disconnection, a service abnormality, or a production line fault, IT system administrators and the corresponding developers and monitoring personnel must learn of the fault immediately and repair it as soon as possible to reduce the impact on normal production line services. Therefore, after a server cluster is deployed, a monitoring system is usually built in to monitor abnormal events, and alarm messages based on those events are sent to designated persons or groups by mail or short message.
Existing monitoring alarm systems configure alarms according to various preset service requirements in order to monitor production line services. The alarm configurations are varied, for example a service drop by a given percentage over several consecutive minutes, a service drop below a given percentage of a baseline, or a service drop to zero. As described above, alarms are usually delivered by mail or short message.
Although such a monitoring alarm system can send notifications, when the number of servers is large the servers face many kinds of problems, so the alarms are both numerous and varied. This easily leads to alarm flooding and even false alarms.
Disclosure of Invention
In view of the above, an object of the embodiments of the present invention is to provide a monitoring alarm method, system, computer device and computer readable storage medium for a server cluster, which can solve the above problems.
One aspect of the embodiments of the present invention provides a monitoring alarm method for a server cluster, including:
monitoring the operating condition of a server cluster, and acquiring a plurality of alarm messages provided by the server cluster;
recording the plurality of alarm messages, with the monitoring node associated with each alarm message as the unit of recording;
determining a plurality of normal alarm messages from the plurality of alarm messages based on actual fault information of the server cluster;
determining a plurality of false alarm messages among the plurality of alarm messages according to the plurality of normal alarm messages;
analyzing the records corresponding to the false alarm messages to determine one or more target monitoring nodes; and
adjusting the number of alarm messages of each target monitoring node.
Preferably, the determining, based on the actual fault information of the server cluster, a plurality of normal alarm messages from the plurality of alarm messages includes:
obtaining the actual fault information based on a fault report, wherein the actual fault information comprises a fault time period and a monitoring node associated with the fault;
and determining the plurality of normal alarm messages from the plurality of alarm messages based on the fault time period and the monitoring node associated with the fault, wherein each normal alarm message is generated within the fault time period and carries the monitoring node associated with the fault.
Preferably, the analyzing the record corresponding to each false alarm message to determine one or more target monitoring nodes includes:
analyzing a target dimension of a plurality of records corresponding to a plurality of false alarm messages, wherein the target dimension comprises a time dimension;
determining each target record conforming to a target rule based on the time dimension;
marking a first label for the false alarm message corresponding to each target record, wherein the first label represents regular false alarm messages;
and determining the monitoring node associated with each false alarm message carrying the first label as a target monitoring node.
Preferably, the adjusting the number of the alarm messages of each target monitoring node includes:
adjusting the alarm message quantity threshold of the target monitoring node;
monitoring the alarm quantity of the alarm messages associated with the target monitoring node;
when the alarm quantity is larger than a preset threshold value, marking each subsequent alarm message associated with the target monitoring node with a second label, wherein the second label is used for indicating an adjusted monitoring node;
recording the plurality of subsequent alarm messages marked with the second label to obtain a plurality of subsequent records;
determining whether a plurality of the subsequent records conform to a target rule based on the time dimension;
and if the plurality of subsequent records conform to the target rule based on the time dimension, marking a third label on the plurality of subsequent alarm messages corresponding to the plurality of subsequent records, wherein the third label is used for indicating that the alarm messages are not sent.
Preferably, the target rule comprises that, on a daily cycle, alarm messages whose number exceeds a preset threshold value are generated in a target time period of each day;
the determining whether the plurality of subsequent records conform to a target rule based on the time dimension includes:
comparing the similarity of the subsequent alarm messages generated in the target time period of the same day to obtain a plurality of first similarities;
comparing the similarity of the subsequent alarm messages generated in the target time period of different days to obtain a plurality of second similarities;
and if each first similarity and each second similarity is greater than a similarity threshold, determining that the plurality of subsequent records conform to the target rule based on the time dimension.
Preferably, the comparing the similarity of the subsequent alarm messages generated in the target time period of different days to obtain a plurality of second similarities includes:
performing word segmentation on each subsequent alarm message in the target time period of each day to obtain a plurality of word segmentation sets in units of days;
sorting the word segments in each word segmentation set from high to low by frequency of occurrence to obtain a word segment sequence in one-to-one correspondence with each word segmentation set;
performing similarity calculation on different word segment sequences to obtain the plurality of second similarities;
wherein each second similarity is the result of applying a similarity algorithm to one word segment sequence and another word segment sequence.
Preferably, the method further comprises: marking each subsequent alarm message generated in the target time period with the third label.
An aspect of an embodiment of the present invention further provides a monitoring alarm system for a server cluster, including:
an acquisition module, configured to monitor the operating condition of the server cluster and acquire a plurality of alarm messages provided by the server cluster;
a recording module, configured to record the plurality of alarm messages, with the monitoring node associated with each alarm message as the unit of recording;
a first determining module, configured to determine a plurality of normal alarm messages from the plurality of alarm messages based on actual fault information of the server cluster;
a second determining module, configured to determine a plurality of false alarm messages among the plurality of alarm messages according to the plurality of normal alarm messages;
a third determining module, configured to analyze the record corresponding to each false alarm message and determine one or more target monitoring nodes; and
an adjusting module, configured to adjust the number of alarm messages of each target monitoring node.
An aspect of the embodiments of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor is configured to implement the steps of the monitoring alarm method for a server cluster as described above.
An aspect of embodiments of the present invention further provides a computer-readable storage medium having stored therein a computer program, which is executable by at least one processor to cause the at least one processor to perform the steps of the monitoring alarm method for a server cluster as described above.
The monitoring alarm method, the system, the equipment and the computer readable storage medium for the server cluster provided by the embodiment of the invention have the following technical advantages:
When a large number of monitoring nodes (for example, monitoring nodes whose minimum unit is the combination of service domain, service index, and alarm identifier) generate a large number of alarm messages, the target monitoring nodes that are likely to produce false alarms can be identified among them, and the number of alarm messages of each target monitoring node is automatically adjusted downward. That is, false alarm messages are filtered automatically, target monitoring nodes are adjusted with less manual intervention, and the volume of false alarm messages is greatly reduced.
By automatically adjusting the number of false alarm messages that are sent, IT managers and developers can remain sensitive to alarm messages; that is, when an alarm message is received, they can promptly confirm whether the production line really has a problem.
Drawings
Fig. 1 schematically shows a flowchart of a monitoring alarm method for a server cluster according to a first embodiment of the present invention;
Fig. 2 schematically shows a block diagram of a monitoring alarm system for a server cluster according to a second embodiment of the present invention; and
Fig. 3 schematically shows a hardware architecture diagram of a computer device suitable for implementing the monitoring alarm method for a server cluster according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", etc. in the embodiments of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that the combination can be realized by a person skilled in the art; when technical solutions are contradictory or a combination cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present invention and to distinguish each step, and thus should not be construed as limiting the present invention.
Example one
Fig. 1 schematically shows a flowchart of a monitoring alarm method for a server cluster according to a first embodiment of the present invention.
As shown in fig. 1, the monitoring alarm method for a server cluster may include steps S100 to S105, where:
S100, monitoring the operating condition of the server cluster, and acquiring a plurality of alarm messages provided by the server cluster.
In this embodiment, the server cluster may be a distributed cluster, that is, a large number of servers provided for mass storage and large-scale computation. The servers can be connected through routers, repeaters, hubs, and switches. The server cluster can be used to process a large number of services, such as life insurance services, financing services, and production line services. It should be noted that these services are only examples and do not limit the scope of the present invention.
With a large number of different service types, alarm monitoring is needed to track the current state of each monitoring node. In particular, for serious faults such as database downtime, link disconnection, service abnormality, or production line failure, the relevant personnel must be notified so that they learn of the abnormality immediately and repair it as soon as possible, reducing the impact on normal production line services.
Therefore, the data in the server cluster needs to be monitored in real time, and an alarm message is obtained when an abnormal condition is found.
S101, recording a plurality of alarm messages, and recording by taking a monitoring node associated with each alarm message as a unit.
A server cluster typically handles a huge volume of traffic and may therefore generate a huge number of alarm messages.
Such a volume causes alarm flooding, which increases the load on the monitoring system, reduces the sensitivity of the relevant personnel to important alarm content, and can prevent urgent faults in the server cluster from being handled in time. Alarm messages are usually delivered to the clients of the relevant personnel by mail or short message. Taking mail as an example, a large number of such mails undoubtedly increases both the burden on the monitoring system and the workload of the relevant personnel.
The inventors have found that, among a large number of alarm messages, some are often false alarm messages. For example, a normal drop in production line traffic, a baseline change caused by production line stress testing, a production code release, or system change and maintenance may each trigger alarm messages. When such alarm messages do not need to be handled, they are false alarm messages.
Therefore, in this embodiment, the large number of acquired alarm messages (i.e., the plurality of alarm messages) is recorded, and the records accumulated over a period of time are then analyzed to determine whether the server cluster is generating false alarm messages.
A monitoring node can be identified by a service domain, a service index, and an alarm index.
In this embodiment, each alarm message is recorded, with the combination of service domain, service index, and alarm index as the minimum unit of recording. The likelihood of false alarm messages is then analyzed along each of these dimensions: the service domain, the service index, and the alarm index.
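The recording step S101 can be illustrated with a minimal Python sketch. It assumes that a monitoring node is the triple (service domain, service index, alarm index) and that each alarm message carries its node, a generation time, and a text. The names MonitoringNode, AlarmMessage, and record_by_node, and all sample values, are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MonitoringNode:
    # Assumed minimum unit: service domain + service index + alarm index.
    service_domain: str
    service_index: str
    alarm_index: str


@dataclass(frozen=True)
class AlarmMessage:
    node: MonitoringNode
    generated_at: datetime
    text: str


def record_by_node(messages):
    """Group alarm messages by their associated monitoring node."""
    records = defaultdict(list)
    for msg in messages:
        records[msg.node].append(msg)
    return dict(records)


# Illustrative data only.
node_a = MonitoringNode("payments", "order_volume", "drop_below_baseline")
node_b = MonitoringNode("payments", "order_volume", "drop_to_zero")
messages = [
    AlarmMessage(node_a, datetime(2022, 1, 10, 2, 5), "order volume below baseline"),
    AlarmMessage(node_b, datetime(2022, 1, 10, 2, 6), "order volume is zero"),
    AlarmMessage(node_a, datetime(2022, 1, 11, 2, 4), "order volume below baseline"),
]
records = record_by_node(messages)
```

Keying the records on the full triple keeps two alarm indexes of the same service index apart, so each can later be analyzed and adjusted separately.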
S102, determining a plurality of normal alarm messages from the plurality of alarm messages based on the actual fault information of the server cluster.
In this embodiment, the normal alarm messages, i.e., the alarm messages that are not false alarms, are determined first.
Further, in order to effectively determine the normal alarm messages and thereby identify the possible false alarm messages, S102 may be implemented by the following steps:
S102A, obtaining actual fault information based on the fault report, wherein the actual fault information comprises a fault time period and a monitoring node associated with the fault;
S102B, determining a plurality of normal alarm messages from the plurality of alarm messages based on the fault time period and the monitoring node associated with the fault, wherein each normal alarm message is generated within the fault time period and carries the monitoring node associated with the fault.
Taking alarm messages for a production line fault as an example: the previously recorded alarm messages that match the time period, service domain, service index, and alarm index of the production line fault report are marked as normal alarm messages, and the remaining alarm messages are false alarm messages.
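As a rough illustration of S102A and S102B, the following Python sketch splits recorded alarms into normal and false alarms by matching them against fault reports. It assumes that a fault report exposes a monitoring node plus a closed time interval, and that an alarm is normal when both match; the Alarm and FaultReport types and the split_alarms helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alarm:
    node: tuple  # assumed (service_domain, service_index, alarm_index)
    generated_at: datetime


@dataclass(frozen=True)
class FaultReport:
    node: tuple
    start: datetime
    end: datetime


def split_alarms(alarms, reports):
    """Return (normal, false_alarms): an alarm is normal when some fault
    report names the same node and covers the alarm's generation time."""
    normal, false_alarms = [], []
    for alarm in alarms:
        matched = any(
            r.node == alarm.node and r.start <= alarm.generated_at <= r.end
            for r in reports
        )
        (normal if matched else false_alarms).append(alarm)
    return normal, false_alarms


# Illustrative data only.
node = ("payments", "order_volume", "drop_below_baseline")
other = ("payments", "order_volume", "drop_to_zero")
reports = [FaultReport(node, datetime(2022, 1, 10, 14, 0), datetime(2022, 1, 10, 15, 0))]
alarms = [
    Alarm(node, datetime(2022, 1, 10, 14, 20)),   # inside the fault window
    Alarm(node, datetime(2022, 1, 10, 2, 5)),     # same node, outside the window
    Alarm(other, datetime(2022, 1, 10, 14, 30)),  # inside the window, other node
]
normal, false_alarms = split_alarms(alarms, reports)
```

Everything that does not match a fault report lands in the false alarm list, which mirrors the "the remaining alarm messages are false alarm messages" rule of the example above.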
S103, determining a plurality of false alarm messages in the plurality of alarm messages according to the plurality of normal alarm messages.
In this embodiment, the plurality of alarm messages consists of the plurality of normal alarm messages and the plurality of false alarm messages.
S104, analyzing the records corresponding to the false alarm messages, and determining one or more target monitoring nodes.
Based on the plurality of false alarm messages, the target monitoring nodes that are likely to generate false alarm messages in the future are identified.
Further, in order to effectively and accurately determine the target monitoring nodes that will continue to generate false alarm messages, S104 may be implemented by the following steps:
S104A, analyzing the target dimension of the plurality of records corresponding to the plurality of false alarm messages, wherein the target dimension comprises a time dimension;
S104B, determining each target record conforming to the target rule based on the time dimension;
S104C, marking a first label for the false alarm message corresponding to each target record, wherein the first label represents that the false alarm message is regular;
S104D, determining the monitoring node associated with each false alarm message carrying the first label as a target monitoring node.
For example, once the records of false alarm messages have accumulated for a period of time (such as a week), the number of alarms generated each day is counted per hour according to the alarm time period. If alarm messages are recorded multiple times in the same time period on different dates, those alarm messages are automatically marked with the first label, which identifies them as regular false alarm messages.
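The hourly analysis in this example can be sketched in Python as follows. The sketch buckets false alarms by (node, hour of day) and treats a node as a target monitoring node when the same hour recurs on at least min_days distinct dates. The min_days parameter, the find_target_nodes name, and the one-hour bucket size are assumptions made for illustration; the patent only requires that the records conform to a target rule in the time dimension.

```python
from collections import defaultdict
from datetime import datetime


def find_target_nodes(false_alarms, min_days=3):
    """false_alarms: iterable of (node, generated_at) pairs.

    Returns {node: set of hours} for nodes whose false alarms recur in the
    same hour of day on at least `min_days` distinct dates."""
    days_seen = defaultdict(set)  # (node, hour) -> dates on which it fired
    for node, ts in false_alarms:
        days_seen[(node, ts.hour)].add(ts.date())

    targets = defaultdict(set)
    for (node, hour), dates in days_seen.items():
        if len(dates) >= min_days:
            targets[node].add(hour)  # corresponds to applying the first label
    return dict(targets)


# Illustrative data only: node "A" fires around 02:00 on three different days.
false_alarm_log = [
    ("A", datetime(2022, 1, 10, 2, 5)),
    ("A", datetime(2022, 1, 11, 2, 10)),
    ("A", datetime(2022, 1, 12, 2, 40)),
    ("A", datetime(2022, 1, 12, 15, 0)),
    ("B", datetime(2022, 1, 10, 2, 0)),
]
targets = find_target_nodes(false_alarm_log, min_days=3)
```

Returning the recurring hours together with the node keeps the information needed later to mask alarms only inside the target time period.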
S105, adjusting the number of alarm messages of each target monitoring node.
After a target monitoring node is determined, its number of alarm messages can be automatically adjusted downward. False alarm messages are thereby filtered automatically, alarm indexes are adjusted with less manual intervention, and the volume of false alarm messages is greatly reduced.
Therefore, by automatically adjusting the number of false alarm messages that are sent, IT managers and developers can remain sensitive to alarm messages; that is, when an alarm message is received, they can promptly confirm whether a problem has really occurred.
Further, S105 may be implemented by:
S105A, adjusting the alarm message quantity threshold of the target monitoring node;
S105B, monitoring the alarm quantity of the alarm messages related to the target monitoring node;
S105C, when it is monitored that the alarm number is greater than the preset threshold, marking subsequent alarm messages associated with the target monitoring node with second labels, respectively, where the second labels are used to indicate adjusted monitoring nodes;
S105D, recording the plurality of subsequent alarm messages marked with the second label to obtain a plurality of subsequent records;
S105E, judging whether the subsequent records conform to a target rule based on the time dimension;
S105F, if the plurality of subsequent records conform to the target rule based on the time dimension, marking a third label on a plurality of subsequent alarm messages corresponding to the plurality of subsequent records, where the third label is used to indicate that no alarm message is sent.
In this scheme: (1) For a target monitoring node marked as producing regular false alarm messages, the alarm message quantity threshold can be adjusted to reduce the number of false alarm messages, and alarm messages that exceed the adjusted threshold are marked with a label (the second label) identifying the alarm index as adjusted. (2) Alarm messages for alarm indexes marked as adjusted are recorded again. If the records show that regular alarm messages are still being generated, a label indicating that no alarm is to be sent (the third label) is applied, the alarm is no longer sent, and false alarm messages are reduced further.
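One possible reading of steps S105A to S105F is sketched below in Python. It assumes that the alarm that pushes the per-node count above the threshold is itself the first one to receive the second label (the source says only "subsequent alarm messages"), and it takes the target rule check as a caller-supplied function. The label strings and the names Alarm, label_excess, suppress_if_regular, and should_send are hypothetical.

```python
from dataclasses import dataclass, field

ADJUSTED = "adjusted"        # stands in for the second label
SUPPRESSED = "do_not_send"   # stands in for the third label


@dataclass
class Alarm:
    node: str
    text: str
    labels: set = field(default_factory=set)


def label_excess(alarms, threshold):
    """Apply the second label to every alarm beyond `threshold` per node
    and return the labelled alarms (the subsequent records)."""
    counts = {}
    excess = []
    for alarm in alarms:
        counts[alarm.node] = counts.get(alarm.node, 0) + 1
        if counts[alarm.node] > threshold:
            alarm.labels.add(ADJUSTED)
            excess.append(alarm)
    return excess


def suppress_if_regular(excess, conforms_to_rule):
    """Apply the third label when the subsequent records still conform to
    the target rule, as decided by the supplied predicate."""
    if excess and conforms_to_rule(excess):
        for alarm in excess:
            alarm.labels.add(SUPPRESSED)


def should_send(alarm):
    return SUPPRESSED not in alarm.labels


# Illustrative run: five alarms from one target node, threshold of two.
alarms = [Alarm("node-a", f"order volume below baseline #{i}") for i in range(5)]
excess = label_excess(alarms, threshold=2)
suppress_if_regular(excess, conforms_to_rule=lambda records: True)
sent = [a for a in alarms if should_send(a)]
```

Keeping the second and third labels separate means an adjusted node is only silenced after its later alarms have been re-examined, which is the two-stage behaviour described in points (1) and (2) above.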
Further, the target rule comprises that, on a daily cycle, alarm messages whose number exceeds a preset threshold value are generated in a target time period of each day.
In order to improve the accuracy of the target rule determination and prevent alarm messages from being wrongly classified as false alarm messages, S105E may be implemented by the following steps:
S105E1, carrying out similarity comparison on each subsequent alarm message generated in the target time period in the same day to obtain a plurality of first similarities;
S105E2, carrying out similarity comparison on a plurality of subsequent alarm messages generated in a target time period every day to obtain a plurality of second similarities;
S105E3, if each of the first similarity degrees and each of the second similarity degrees are greater than the similarity threshold, determining that the plurality of subsequent records conform to the target rule based on the time dimension.
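Steps S105E1 to S105E3 can be sketched as follows. The patent does not name a similarity algorithm, so this example substitutes the ratio from Python's standard difflib.SequenceMatcher; the 0.8 threshold, the requirement of at least two days of data, and the conforms_to_rule name are likewise assumptions made for illustration.

```python
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in similarity measure; the source does not specify one.
    return SequenceMatcher(None, a, b).ratio()


def conforms_to_rule(by_day, threshold=0.8):
    """by_day: {date: [alarm text, ...]}, restricted to the target time period.

    First similarities compare messages within the same day; second
    similarities compare the messages of different days."""
    if len(by_day) < 2:
        return False  # assumption: one day of data cannot show a daily pattern
    first = [
        similarity(a, b)
        for texts in by_day.values()
        for a, b in combinations(texts, 2)
    ]
    second = [
        similarity(" ".join(x), " ".join(y))
        for x, y in combinations(by_day.values(), 2)
    ]
    return all(s > threshold for s in first + second)


# Illustrative data only.
regular = {
    date(2022, 1, 10): ["order volume below baseline", "order volume below baseline"],
    date(2022, 1, 11): ["order volume below baseline", "order volume below baseline"],
}
noisy = {
    date(2022, 1, 10): ["disk full on db-1", "order volume below baseline"],
    date(2022, 1, 11): ["order volume below baseline"],
}
regular_result = conforms_to_rule(regular)
noisy_result = conforms_to_rule(noisy)
```

Requiring every first and second similarity to exceed the threshold is deliberately strict: a single dissimilar message in the target time period is enough to keep the alarms from being treated as regular false alarms.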
Further, to further improve the determination accuracy, step S105E2 may be implemented by:
performing word segmentation on each subsequent alarm message in the target time period of each day to obtain a plurality of word segmentation sets in units of days;
sorting the word segments in each word segmentation set from high to low by frequency of occurrence to obtain a word segment sequence in one-to-one correspondence with each word segmentation set;
performing similarity calculation on different word segment sequences to obtain the plurality of second similarities;
wherein each second similarity is the result of applying a similarity algorithm to one word segment sequence and another word segment sequence.
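The word segmentation variant of S105E2 can be sketched in Python as below. It assumes one word segmentation set per day, uses whitespace splitting as a placeholder for a real word segmenter (Chinese alarm text would need a dedicated one), breaks frequency ties alphabetically so the order is deterministic, and again substitutes difflib.SequenceMatcher for the unspecified similarity algorithm. The function names are hypothetical.

```python
from collections import Counter
from datetime import date
from difflib import SequenceMatcher
from itertools import combinations


def token_sequence(texts):
    """Segment one day's messages and order the distinct word segments from
    most to least frequent (ties broken alphabetically)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def second_similarities(by_day):
    """by_day: {date: [alarm text, ...]} within the target time period.
    Returns one similarity per pair of days."""
    sequences = [token_sequence(texts) for texts in by_day.values()]
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(sequences, 2)
    ]


# Illustrative data only.
sequence = token_sequence(["cpu high", "cpu high on db"])
same_pattern = second_similarities({
    date(2022, 1, 10): ["cpu high on db", "cpu high"],
    date(2022, 1, 11): ["cpu high", "cpu high on db"],
})
different = second_similarities({
    date(2022, 1, 10): ["cpu high on db"],
    date(2022, 1, 11): ["link down between racks"],
})
```

Comparing frequency-ordered sequences rather than raw text makes the result insensitive to the order in which the messages arrived within a day, since only the vocabulary and its ranking are compared.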
Further, the method may further include: marking each subsequent alarm message generated in the target time period with the third label. In this embodiment, the same target monitoring node (with the combination of service domain, service index, and alarm index as the minimum unit) may generate false alarm messages in a certain time period and normal alarm messages in other time periods. To avoid masking normal alarm messages, only alarm messages within the target time period are masked.
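Masking only within the target time period can be sketched as a small Python check. The sketch assumes that each target monitoring node maps to a list of half-open daily time windows that do not cross midnight; the target_periods mapping, the node names, and the should_send helper are hypothetical.

```python
from datetime import datetime, time


def in_target_period(ts, start, end):
    # Half-open window [start, end); windows crossing midnight are not handled.
    return start <= ts.time() < end


def should_send(node, ts, target_periods):
    """Suppress an alarm only when its node has a target time period that
    contains the alarm's generation time."""
    return not any(
        in_target_period(ts, start, end)
        for start, end in target_periods.get(node, [])
    )


# Illustrative configuration: node-a produces regular false alarms at 02:00-03:00.
target_periods = {"node-a": [(time(2, 0), time(3, 0))]}

masked = should_send("node-a", datetime(2022, 1, 12, 2, 30), target_periods)
daytime = should_send("node-a", datetime(2022, 1, 12, 14, 30), target_periods)
other_node = should_send("node-b", datetime(2022, 1, 12, 2, 30), target_periods)
```

With this check, an alarm from the same node at 14:30 is still delivered, so a real fault outside the recurring window is not hidden.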
The monitoring alarm method for a server cluster in this embodiment of the present invention has the following advantages:
(1) When a large number of monitoring nodes (for example, monitoring nodes whose minimum unit is the combination of service domain, service index, and alarm identifier) generate a large number of alarm messages, the target monitoring nodes that are likely to produce false alarms can be identified among them, and the number of alarm messages of each target monitoring node is automatically adjusted downward. That is, false alarm messages are filtered automatically, target monitoring nodes are adjusted with less manual intervention, and the volume of false alarm messages is greatly reduced.
By automatically adjusting the number of false alarm messages that are sent, IT managers and developers can remain sensitive to alarm messages; that is, when an alarm message is received, they can promptly confirm whether the production line really has a problem.
Taking production line services as an example: the method reflects more faithfully what actually needs to be monitored on a production line. In practice, most alarms are caused by service fluctuation and only a few by real production line problems, so false alarm messages caused by service fluctuation are endless and seriously reduce the sensitivity of IT managers and developers to real production line problems. This embodiment largely solves the false alarm problem and feeds real, valid alarm messages back to the monitoring personnel, so that they can handle production line problems more effectively.
(2) The method does not require developing a large number of new plug-ins or manually updating the handling of different alarm messages, and therefore offers good flexibility.
Example two
Fig. 2 schematically shows a block diagram of a monitoring alarm system for a server cluster according to a second embodiment of the present invention. The monitoring and alarm system for a server cluster may be partitioned into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement embodiments of the present invention. The program modules referred to in the embodiments of the present invention refer to a series of computer program instruction segments that can perform specific functions, and the following description will specifically describe the functions of the program modules in the embodiments.
As shown in fig. 2, the monitoring alarm system 200 for a server cluster may include an obtaining module 210, a recording module 220, a first determining module 230, a second determining module 240, a third determining module 250, and an adjusting module 260, wherein:
an obtaining module 210, configured to monitor the operating condition of a server cluster and obtain a plurality of alarm messages provided by the server cluster;
a recording module 220, configured to record the plurality of alarm messages, with the monitoring node associated with each alarm message as the unit of recording;
a first determining module 230, configured to determine a plurality of normal alarm messages from the plurality of alarm messages based on actual fault information of the server cluster;
a second determining module 240, configured to determine a plurality of false alarm messages among the plurality of alarm messages according to the plurality of normal alarm messages;
a third determining module 250, configured to analyze the record corresponding to each false alarm message and determine one or more target monitoring nodes; and
an adjusting module 260, configured to adjust the number of alarm messages of each target monitoring node.
Preferably, the first determining module 230 is further configured to:
obtaining actual fault information based on the fault report, wherein the actual fault information comprises a fault time period and a monitoring node associated with the fault;
and determining a plurality of normal alarm messages from the plurality of alarm messages based on a fault time period and the monitoring node associated with the fault, wherein the generation time of the normal alarm messages is in the fault time period and carries the monitoring node associated with the fault.
Preferably, the third determining module 250 is further configured to:
analyzing a target dimension of a plurality of records corresponding to a plurality of false alarm messages, wherein the target dimension comprises a time dimension;
determining each target record conforming to a target rule based on the time dimension;
marking a first label for the false alarm message corresponding to each target record, wherein the first label represents regular false alarm messages;
and determining the monitoring node associated with each false alarm message carrying the first label as a target monitoring node.
Preferably, the adjusting module 260 is further configured to:
adjusting the alarm message quantity threshold of the target monitoring node;
monitoring the alarm quantity of the alarm messages associated with the target monitoring node;
when the alarm quantity is larger than a preset threshold value, marking each subsequent alarm message associated with the target monitoring node with a second label, wherein the second label is used for indicating an adjusted monitoring node;
recording the plurality of subsequent alarm messages marked with the second label to obtain a plurality of subsequent records;
determining whether a plurality of the subsequent records conform to a target rule based on the time dimension;
and if the plurality of subsequent records conform to the target rule based on the time dimension, marking a third label on the plurality of subsequent alarm messages corresponding to the plurality of subsequent records, wherein the third label is used for indicating that the alarm messages are not sent.
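The labelling flow of the adjusting module can be sketched as follows. This is a simplified batch version under stated assumptions: the alarm messages of one target monitoring node are given in arrival order, the messages arriving after the first `threshold` messages are the "subsequent" ones, and the rule check is passed in as a caller-supplied predicate. The names `regulate`, `SECOND_LABEL`, and `THIRD_LABEL` are hypothetical.

```python
SECOND_LABEL = "adjusted_node"  # hypothetical: node has been adjusted
THIRD_LABEL = "do_not_send"     # hypothetical: alarm message is not sent


def regulate(alarm_texts, threshold, conforms_to_rule):
    """Label the alarm messages of one target monitoring node.

    alarm_texts: alarm messages in arrival order.
    threshold: alarm quantity threshold for the node.
    conforms_to_rule: callable taking the subsequent records and returning
        True when they conform to the target rule.
    Returns one set of labels per alarm message.
    """
    labels = [set() for _ in alarm_texts]

    # Messages beyond the threshold are the subsequent alarm messages.
    subsequent = range(threshold, len(alarm_texts))
    for i in subsequent:
        labels[i].add(SECOND_LABEL)

    # Record the subsequent messages and check them against the rule.
    records = [alarm_texts[i] for i in subsequent]
    if records and conforms_to_rule(records):
        for i in subsequent:
            labels[i].add(THIRD_LABEL)
    return labels
```

A sender would then skip every message whose label set contains `THIRD_LABEL`, while messages carrying only the second label would still be delivered.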
Preferably, the target rule comprises: with one day as the cycle, a number of alarm messages exceeding a preset threshold value is generated within a target time period of each day;
the adjusting module 260 is further configured to:
comparing the similarity of each subsequent alarm message generated in a target time period in the same day to obtain a plurality of first similarities;
carrying out similarity comparison on a plurality of subsequent alarm messages generated in a target time period every day to obtain a plurality of second similarities;
and if the first similarity and the second similarity are larger than the similarity threshold, judging that the subsequent records conform to the target rule based on the time dimension.
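The rule check based on the first and second similarities might look like the sketch below. The description does not name a similarity algorithm, so the standard-library `difflib.SequenceMatcher` ratio is used here purely as a stand-in. The function names, the default threshold of 0.8, and the choice to join each day's messages into one text for the cross-day comparison are assumptions of this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the actual algorithm is not specified."""
    return SequenceMatcher(None, a, b).ratio()


def conforms_to_rule(messages_by_day, sim_threshold=0.8):
    """messages_by_day: dict mapping a day to the alarm texts generated in
    the target time period of that day."""
    # First similarities: pairs of messages generated on the same day.
    first = [
        similarity(a, b)
        for msgs in messages_by_day.values()
        for a, b in combinations(msgs, 2)
    ]
    # Second similarities: the messages of one day against another day.
    day_texts = [" ".join(msgs) for msgs in messages_by_day.values()]
    second = [similarity(a, b) for a, b in combinations(day_texts, 2)]

    # Require at least one comparison of each kind (a conservative choice),
    # and every similarity must exceed the threshold.
    if not first or not second:
        return False
    return all(s > sim_threshold for s in first + second)
```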
Preferably, the adjusting module 260 is further configured to:
performing word segmentation on each subsequent alarm message generated in the target time period of each day, to obtain a plurality of word segmentation sets, one for each day;
sorting the segmented words in each word segmentation set from high to low according to their occurrence frequency, to obtain a word segmentation sequence corresponding one-to-one to each word segmentation set;
carrying out similarity calculation on different word segmentation sequences to obtain a plurality of second similarities;
wherein each second similarity is a result obtained by comparing one word segmentation sequence with another word segmentation sequence based on a similarity algorithm.
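The word segmentation steps above can be sketched as follows. The sketch assumes whitespace-style tokenisation with a regular expression, which would need to be replaced by a proper segmenter for Chinese text, and again uses `difflib.SequenceMatcher` as a stand-in for the unspecified similarity algorithm. The function names are hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def token_sequence(messages):
    """Segment one day's alarm messages and return the distinct tokens
    ordered from most to least frequent (ties broken alphabetically)."""
    tokens = [t for m in messages for t in re.findall(r"\w+", m.lower())]
    freq = Counter(tokens)
    return sorted(freq, key=lambda t: (-freq[t], t))


def second_similarities(messages_by_day):
    """Compare the token sequences of every pair of days."""
    seqs = [token_sequence(msgs) for msgs in messages_by_day.values()]
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(seqs, 2)
    ]
```

Ordering by frequency means that two days dominated by the same recurring alarm text yield sequences with the same leading tokens, even if a few incidental messages differ.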
Preferably, the adjusting module 260 is further configured to: mark the third label on each subsequent alarm message generated in the target time period.
Example three
Fig. 3 schematically shows a hardware architecture diagram of a computer device suitable for implementing the monitoring alarm method for the server cluster according to a third embodiment of the present invention. In this embodiment, the computer device 10000 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and may be, for example, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster including a plurality of servers). As shown in fig. 3, the computer device 10000 includes at least, but is not limited to, a memory 10010, a processor 10020, and a network interface 10030, which may be communicatively linked to one another via a system bus. Wherein:
The memory 10010 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 10010 may be an internal storage module of the computer device 10000, such as a hard disk or a memory of the computer device 10000. In other embodiments, the memory 10010 may also be an external storage device of the computer device 10000, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 10000. Of course, the memory 10010 may also include both internal and external storage modules of the computer device 10000. In this embodiment, the memory 10010 is generally configured to store an operating system installed on the computer device 10000 and various application software, such as the program code of the monitoring alarm method for the server cluster. In addition, the memory 10010 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 10020, in some embodiments, can be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip. The processor 10020 is generally configured to control overall operations of the computer device 10000, such as performing control and processing related to data interaction or communication with the computer device 10000. In this embodiment, the processor 10020 is configured to execute program codes stored in the memory 10010 or process data.
The network interface 10030 may comprise a wireless network interface or a wired network interface, and is generally used to establish a communication link between the computer device 10000 and other computer devices. For example, the network interface 10030 is used to connect the computer device 10000 to an external terminal through a network, and to establish a data transmission channel and a communication link between the computer device 10000 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It should be noted that fig. 3 only illustrates a computer device having the components 10010-10030, but it is to be understood that not all illustrated components are required, and that more or fewer components may be implemented instead.
In this embodiment, the program code of the monitoring alarm method for the server cluster stored in the memory 10010 can be further divided into one or more program modules, which are executed by one or more processors (in this embodiment, the processor 10020) to implement the present invention.
Example four
The present embodiment also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the monitoring alarm method for a server cluster in the embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device. Of course, the computer-readable storage medium may also include both internal and external storage devices of the computer device. In this embodiment, the computer-readable storage medium is generally used for storing an operating system and various types of application software installed in a computer device, for example, the program code of the monitoring alarm method for the server cluster in the embodiment, and the like. Further, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A monitoring and alarming method for a server cluster is characterized by comprising the following steps:
monitoring the operation condition of a server cluster, and acquiring a plurality of alarm messages provided by the server cluster;
recording a plurality of alarm messages, and recording by taking a monitoring node associated with each alarm message as a unit;
determining a plurality of normal alarm messages from the plurality of alarm messages based on actual fault information of the server cluster;
determining a plurality of false alarm messages in the plurality of alarm messages according to the plurality of normal alarm messages;
analyzing the records corresponding to the false alarm messages to determine one or more target monitoring nodes; and
adjusting the quantity of the alarm messages of each target monitoring node.
2. The monitoring alarm method for server cluster according to claim 1, wherein said determining a plurality of normal alarm messages from a plurality of said alarm messages based on actual failure information of said server cluster comprises:
obtaining actual fault information based on the fault report, wherein the actual fault information comprises a fault time period and a monitoring node associated with the fault;
and determining a plurality of normal alarm messages from the plurality of alarm messages based on the fault time period and the monitoring node associated with the fault, wherein each normal alarm message is generated within the fault time period and carries the monitoring node associated with the fault.
3. The monitoring alarm method for the server cluster according to claim 2, wherein the analyzing the record corresponding to each false alarm message to determine one or more target monitoring nodes comprises:
analyzing a target dimension of a plurality of records corresponding to a plurality of false alarm messages, wherein the target dimension comprises a time dimension;
determining each target record conforming to a target rule based on the time dimension;
marking a first label for the false alarm message corresponding to each target record, wherein the first label indicates a false alarm message that recurs in a regular pattern;
and determining the monitoring node associated with each false alarm message carrying the first label as a target monitoring node.
4. The monitoring alarm method for server cluster according to claim 3, wherein said adjusting the number of alarm messages of each of said target monitoring nodes comprises:
adjusting the alarm message quantity threshold of the target monitoring node;
monitoring the alarm quantity of the alarm messages associated with the target monitoring node;
when the alarm quantity is larger than a preset threshold value, marking each subsequent alarm message associated with the target monitoring node with a second label, wherein the second label indicates that the associated monitoring node has been adjusted;
recording the plurality of subsequent alarm messages marked with the second label to obtain a plurality of subsequent records;
determining whether a plurality of the subsequent records conform to a target rule based on the time dimension;
and if the plurality of subsequent records conform to the target rule based on the time dimension, marking a third label on the plurality of subsequent alarm messages corresponding to the plurality of subsequent records, wherein the third label is used for indicating that the alarm messages are not sent.
5. The monitoring alarm method for server cluster according to claim 4, wherein the target rule comprises: with one day as the cycle, a number of alarm messages exceeding a preset threshold value is generated within a target time period of each day;
the determining whether the plurality of subsequent records conform to a target rule based on the time dimension includes:
comparing the similarity of each subsequent alarm message generated in a target time period in the same day to obtain a plurality of first similarities;
carrying out similarity comparison on a plurality of subsequent alarm messages generated in a target time period every day to obtain a plurality of second similarities;
and if the first similarity and the second similarity are larger than the similarity threshold, judging that the subsequent records conform to the target rule based on the time dimension.
6. The monitoring alarm method for server cluster according to claim 5, wherein said comparing the similarity of a plurality of subsequent alarm messages generated in a target time period of each day to obtain a plurality of second similarities comprises:
performing word segmentation on each subsequent alarm message generated in the target time period of each day, to obtain a plurality of word segmentation sets, one for each day;
sorting the segmented words in each word segmentation set from high to low according to their occurrence frequency, to obtain a word segmentation sequence corresponding one-to-one to each word segmentation set;
carrying out similarity calculation on different word segmentation sequences to obtain a plurality of second similarities;
wherein each second similarity is a result obtained by comparing one word segmentation sequence with another word segmentation sequence based on a similarity algorithm.
7. The monitoring alarm method for server cluster according to claim 5, characterized in that the method further comprises: marking the third label on each subsequent alarm message generated in the target time period.
8. A monitoring alarm system for a cluster of servers, comprising:
the acquisition module is used for monitoring the operation condition of the server cluster and acquiring a plurality of alarm messages provided by the server cluster;
the recording module is used for recording the plurality of alarm messages and recording the alarm messages by taking the monitoring node associated with each alarm message as a unit;
the first determining module is used for determining a plurality of normal alarm messages from the plurality of alarm messages based on the actual fault information of the server cluster;
the second determining module is used for determining a plurality of false alarm messages in the plurality of alarm messages according to the plurality of normal alarm messages;
the third determining module is used for analyzing the record corresponding to each false alarm message and determining one or more target monitoring nodes; and
the adjusting module is used for adjusting the quantity of the alarm messages of each target monitoring node.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor is adapted to carry out the steps of the monitoring alarm method for a server cluster according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which is executable by at least one processor to cause the at least one processor to perform the steps of the monitoring alarm method for a server cluster according to any of claims 1 to 7.
CN202210033756.0A 2022-01-12 2022-01-12 Monitoring alarm method, system, equipment and storage medium for server cluster Pending CN114356722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210033756.0A CN114356722A (en) 2022-01-12 2022-01-12 Monitoring alarm method, system, equipment and storage medium for server cluster

Publications (1)

Publication Number Publication Date
CN114356722A true CN114356722A (en) 2022-04-15

Family

ID=81109912

Country Status (1)

Country Link
CN (1) CN114356722A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115801203A (en) * 2023-01-19 2023-03-14 苏州浪潮智能科技有限公司 Distributed cluster reliability management method, device and equipment
CN115801203B (en) * 2023-01-19 2023-04-25 苏州浪潮智能科技有限公司 Distributed cluster reliability management method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination