CN112650642A - Alarm processing method and device, equipment and storage medium - Google Patents


Info

Publication number
CN112650642A
Authority
CN
China
Prior art keywords
alarm
self
healing
activation factor
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011430102.9A
Other languages
Chinese (zh)
Inventor
龚治文
饶俊明
卢道和
郑晓腾
龚洵峰
刘生庆
吴立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202011430102.9A
Publication of CN112650642A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the application discloses an alarm processing method, an alarm processing device, equipment and a storage medium, wherein the method comprises the following steps: acquiring alarm information of each cluster in at least one file system cluster; determining the alarm type and cluster identification of each alarm message; matching a corresponding self-healing instruction according to the alarm type of each alarm message; determining the state of an activation factor corresponding to each alarm information according to the cluster identifier and the alarm type of each alarm information; and executing the self-healing instruction according to the state of the activation factor so as to realize the processing of the alarm information.

Description

Alarm processing method and device, equipment and storage medium
Technical Field
The embodiments of the present application relate to, but are not limited to, information technology in the field of financial technology (Fintech), and in particular to an alarm processing method and device, equipment, and a storage medium.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting to financial technology (Fintech). Because the financial industry demands security and real-time performance, financial technology also places higher requirements on these technologies. In the related art, when an alarm information collection platform receives alarm information related to a file system, the alarm information is displayed on a display interface, and the personnel on duty contact the staff of the corresponding system to process the alarm manually. A safe and robust alarm processing method is therefore needed to avoid the labor cost, processing delay, and misoperation associated with manual alarm handling.
Disclosure of Invention
In view of this, embodiments of the present application provide an alarm processing method and apparatus, a device, and a storage medium to solve at least one problem in the related art.
The technical scheme of the embodiment of the application is realized as follows:
in one aspect, an embodiment of the present application provides an alarm processing method, where the method includes:
acquiring alarm information of each cluster in at least one file system cluster;
determining the alarm type and cluster identification of each alarm message;
matching a corresponding self-healing instruction according to the alarm type of each alarm message;
determining the state of an activation factor corresponding to each alarm information according to the cluster identifier and the alarm type of each alarm information;
and executing the self-healing instruction according to the state of the activation factor so as to realize the processing of the alarm information.
In another aspect, an embodiment of the present application provides an alarm processing apparatus, where the apparatus includes:
the first acquisition module is used for acquiring the alarm information of each cluster in at least one file system cluster;
the first determining module is used for determining the alarm type and the cluster identifier of each alarm message;
the matching module is used for matching the corresponding self-healing instruction according to the alarm type of each alarm message;
the second determining module is used for determining the state of the activation factor corresponding to the alarm information according to the cluster identifier and the alarm type of each alarm information;
and the execution module is used for executing the self-healing instruction according to the state of the activation factor so as to realize the processing of the alarm information.
In another aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps in the method when executing the program.
In a further aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the method.
In the alarm processing method provided by the embodiments of the present application, the alarm type and the cluster identifier of each piece of alarm information are determined first; a corresponding self-healing instruction is then matched according to the alarm type; the state of the activation factor corresponding to each piece of alarm information is determined according to its cluster identifier and alarm type; and finally the self-healing instruction is executed according to the state of the activation factor to process the alarm information. Whether a piece of alarm information is processed automatically by the self-healing instruction is thus decided by the state of the activation factor. On the one hand, this allows self-healing to be triggered safely and robustly and avoids the labor cost of manual processing; on the other hand, it avoids the processing delay and misoperation of manual processing.
Drawings
FIG. 1 is a schematic diagram illustrating an implementation flow of an alarm processing method according to an embodiment of the present application;
FIG. 2A is a schematic diagram of an implementation flow of an alarm processing method according to an embodiment of the present application;
FIG. 2B is a schematic diagram illustrating an implementation flow of an alarm processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an implementation flow of an alarm processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating an implementation flow of an alarm processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a component structure of an alarm processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a hardware entity of a computer device according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions are described in further detail below with reference to the drawings and the embodiments. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where terms such as "first/second/third" appear in the specification, they merely distinguish between similar items and do not imply a particular ordering. Where permitted, the items so labeled may be interchanged in a specific sequence or order, so that the embodiments of the application described herein can be performed in an order other than that illustrated or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
The technical solution of the present application is further elaborated below with reference to the drawings and the embodiments.
An embodiment of the present application provides an alarm processing method, and fig. 1 is a schematic diagram illustrating an implementation flow of the alarm processing method according to the embodiment of the present application, where as shown in fig. 1, the method includes:
step S101, acquiring alarm information of each cluster in at least one file system cluster;
Here, a file system cluster generates cluster information characterizing its operating state while it is running.
In the implementation process, when the cluster information satisfies a preset alarm rule, the file system cluster reports an error and generates the alarm information. Here, the preset alarm rule may be defined in terms of abnormal values of the indexes of each cluster.
Step S102, determining the alarm type and the cluster identification of each alarm message;
Here, the alarm information includes at least an alarm type and a cluster identifier. The cluster identifier may be a cluster name, a serial number, or the like, and is used to locate a particular cluster among the file system clusters.
For example, since a file system is an input/output-intensive application, the alarm type may be the ceph osd down type, where ceph denotes the file system, osd denotes a node, and down denotes that the node is currently unavailable. An alarm of the ceph osd down type may have several causes: for example, the underlying storage medium is abnormal; an osd carrying heavy traffic stays under high load for a long time and fails to send its heartbeat in time; or the clock offset of the osd node is too large.
Step S103, matching corresponding self-healing instructions according to the alarm type of each alarm message;
Here, the self-healing instruction is stored in a self-healing instruction set used for alarm processing. Each self-healing instruction is a packaged set of instructions that implements a method of processing one type of alarm.
In the implementation process, the self-healing instruction is issued to a server, and the server executes the instruction set to complete the alarm processing. When alarm information is received, the system monitoring alarm framework collects the alarm type of the reported alarm information and tries to match it to a self-healing instruction. If a self-healing instruction is matched, the alarm type can be self-healed; otherwise self-healing cannot be executed and manual processing is required. The matching process comprises: searching the self-healing instruction set for the alarm type of the alarm information; and, if the alarm type exists in the self-healing instruction set, determining the self-healing instruction corresponding to the alarm type as the self-healing instruction of the alarm information.
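The matching process can be illustrated with a minimal sketch. It assumes that the self-healing instruction set is held as a mapping from alarm type to a packaged list of commands; the names SELF_HEALING_INSTRUCTIONS and match_self_healing_instruction and the command names in the mapping are hypothetical and are not taken from the application.

```python
from typing import Dict, List, Optional

# Hypothetical self-healing instruction set: alarm type -> packaged commands.
# The command names are placeholders for illustration only.
SELF_HEALING_INSTRUCTIONS: Dict[str, List[str]] = {
    "ceph osd down": ["check_storage_medium", "correct_clock", "restart_osd"],
}


def match_self_healing_instruction(alarm_type: str) -> Optional[List[str]]:
    """Look up the alarm type in the instruction set.

    Returns the packaged instruction when the alarm type is present, or
    None when no instruction matches and manual processing is required.
    """
    return SELF_HEALING_INSTRUCTIONS.get(alarm_type)
```

A return value of None corresponds to the case in which self-healing cannot be executed and the alarm is left to manual processing.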
Step S104, determining the state of an activation factor corresponding to each alarm information according to the cluster identifier and the alarm type of each alarm information;
Here, the activation factor is used to determine whether to execute the self-healing instruction. The activation factor may be provided by a cloud management platform and includes at least: an alarm maintenance state, an alarm change state, a repeated self-healing state, and the like.
In some embodiments, the activation factor may further include the alarm IP address, the time of the last alarm, the time of the current alarm, the alarm level, and the like.
Here, activation factors are divided by alarm type: different alarm types on the same cluster correspond to different activation factors.
In some embodiments, the step S104, determining, according to the cluster identifier and the alarm type of each alarm information, a state of an activation factor corresponding to the alarm information, includes: searching for an activation factor consistent with the cluster identifier and the alarm type of the alarm information; and reading the state field of the activation factor to determine the state of the activation factor corresponding to the alarm information.
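The lookup described above can be sketched as follows, assuming the activation factors are kept in an in-memory mapping keyed by the pair (cluster identifier, alarm type). The store layout and the name activation_factor_state are hypothetical; the application does not specify how activation factors are stored.

```python
from typing import Dict, Optional, Tuple

# Hypothetical store: (cluster identifier, alarm type) -> activation factor.
FactorStore = Dict[Tuple[str, str], Dict[str, str]]


def activation_factor_state(store: FactorStore, cluster_id: str,
                            alarm_type: str) -> Optional[str]:
    """Return the status field of the matching activation factor.

    None means that no activation factor exists for this cluster
    identifier and alarm type.
    """
    factor = store.get((cluster_id, alarm_type))
    if factor is None:
        return None
    return factor["status"]
```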
And step S105, executing the self-healing instruction according to the state of the activation factor so as to realize the processing of the alarm information.
Here, the activation factor has two states: the activation factor exists, or the activation factor does not exist. When the activation factor exists, it includes a self-healing factor state, which indicates whether the activation factor is available.
In an implementation process, different processes for executing the self-healing instruction may be determined according to different states of the activation factor, so as to implement processing of the alarm information.
In this embodiment of the present application, the alarm type and the cluster identifier of each piece of alarm information are determined first; a corresponding self-healing instruction is then matched according to the alarm type; the state of the activation factor corresponding to each piece of alarm information is determined according to its cluster identifier and alarm type; and finally the self-healing instruction is executed according to the state of the activation factor to process the alarm information. Whether a piece of alarm information is processed automatically by the self-healing instruction is thus decided by the state of the activation factor. On the one hand, this allows self-healing to be triggered safely and robustly and avoids the labor cost of manual processing; on the other hand, it avoids the processing delay and misoperation of manual processing.
An alarm processing method according to an embodiment of the present application is provided, and fig. 2A is a schematic flow chart illustrating an implementation of the alarm processing method according to the embodiment of the present application, where as shown in fig. 2A, the method includes:
step S201, acquiring alarm information of each cluster in at least one file system cluster;
step S202, determining the alarm type and the cluster identification of each alarm message;
step S203, matching corresponding self-healing instructions according to the alarm type of each alarm message;
step S204, determining the state of an activation factor corresponding to the alarm information according to the cluster identifier and the alarm type of each alarm information;
step S205, under the condition that the state of the activation factor is that the activation factor does not exist, creating a corresponding activation factor according to the cluster identifier and the alarm type, wherein the activation factor comprises a self-healing factor state field; the self-healing factor state field is a field used for indicating whether a self-healing instruction is available or not in an activation factor;
For example, the data structure of the activation factor is:
{
cluster_name: rbd_arm_ft01
warn_name: ceph osd.12 down
warn_ip: 10.110.3.8
last_time: 2020-09-16:15:58:23  # time when the last alarm occurred
now_time: 2020-09-16:15:58:23  # time when the current alarm occurred
ttl_time: 120s
warn_level: Major
……………
krill_result: "ok"  # ok: the self-healing instruction executed successfully; wait: waiting for the execution result of the self-healing instruction; fail: the self-healing instruction failed to execute
count: 1
ims_send_time: 2020-09-16:15:58:23
status: "active"
}
Here, status is the self-healing factor status field, and active indicates that the activation factor is available; cluster_name is the cluster identifier, and warn_name is the alarm type. When alarm information with the alarm type ceph osd.12 down is received, the server invoked for alarm processing uses the cluster_name and warn_name in the alarm information to check whether a matching activation factor exists.
Here, the krill result field is the execution result field of the command execution platform (krill). When the activation factor does not exist, a corresponding activation factor is created and the command execution platform is called to execute the corresponding self-healing instruction; the result field of the newly created activation factor is then updated according to the execution result. If the self-healing instruction fails to execute, an alarm is sent to the alarm information collection platform to report the self-healing failure, and the self-healing factor state is set to unavailable (inactive).
In some embodiments, the server invoked for alarm processing checks whether the activation factor exists as follows: searching for an activation factor consistent with the cluster_name and warn_name of the alarm information to obtain a search result; when the search result shows that an activation factor with that cluster_name and warn_name currently exists, determining that the alarm information has an activation factor; and when the search result shows that no such activation factor currently exists, determining that the alarm information has no activation factor.
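The handling of a missing activation factor can be sketched as below. This is an illustration under stated assumptions, not the implementation of the application: the ActivationFactor class keeps only a few of the fields shown above, the initial status value "active" is assumed, the field name result is a simplification, and execute is a hypothetical callable standing in for the command execution platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ActivationFactor:
    cluster_name: str        # cluster identifier
    warn_name: str           # alarm type
    result: str = "wait"     # execution result: ok / wait / fail
    status: str = "active"   # self-healing factor status (assumed default)


Store = Dict[Tuple[str, str], ActivationFactor]


def create_and_heal(store: Store, cluster_name: str, warn_name: str,
                    execute: Callable[[], bool]) -> ActivationFactor:
    """Create the missing activation factor, run the self-healing
    instruction once, and record the outcome on the new factor."""
    factor = ActivationFactor(cluster_name, warn_name)
    store[(cluster_name, warn_name)] = factor
    # `execute` is a stand-in for calling the command execution platform.
    factor.result = "ok" if execute() else "fail"
    if factor.result == "fail":
        # A failed self-healing marks the factor as unavailable.
        factor.status = "inactive"
    return factor
```

Reporting the failure to the alarm information collection platform is omitted from the sketch; it would follow the assignment of the inactive status.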
Step S206, under the condition that the self-healing factor state field is available, executing the self-healing instruction to obtain an execution result so as to realize the processing of the alarm information;
For example, when the value of the status field is active, the self-healing factor state field is available, and the self-healing instruction may be executed.
Here, step S205 and step S206 provide a way to implement step S105.
Step S207, under the condition that the execution result is self-healing successful, putting the activation factor corresponding to the execution result into a waiting queue;
Here, the waiting queue stores the activation factors of alarms whose self-healing instruction has been executed successfully. After waiting for N collection periods of the alarm index, it is determined whether the self-healing has really succeeded.
Step S208, after the self-healing instruction is executed for N cycles, obtaining an execution result in the N cycles;
step S209, determining that the alarm processing is successful when the execution result in each period is that the self-healing is successful.
Here, the execution result will be assigned to the result field of the activation factor.
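The check in steps S208 and S209 can be sketched as follows, assuming the per-period results are available as a sequence of the strings "ok" and "fail". The function name alarm_processing_succeeded and this representation of the results are hypothetical.

```python
from typing import Sequence


def alarm_processing_succeeded(period_results: Sequence[str], n: int) -> bool:
    """Return True only when results for at least n periods are available
    and each of the first n periods reports a successful self-healing."""
    if len(period_results) < n:
        return False  # not all N periods have been observed yet
    return all(result == "ok" for result in period_results[:n])
```

Requiring every period to succeed means that a single recurrence of the alarm within the N periods is enough to treat the alarm processing as unsuccessful.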
In some embodiments, the method further comprises:
setting the state field of the self-healing factor as an unavailable state under the condition that the execution result is self-healing failure, and reporting the alarm information to a manual alarm processing platform; and the unavailable state is used for indicating that the alarm information of the same alarm type is reported to the manual alarm processing platform.
In some embodiments, the method further comprises:
and under the condition of creating the activation factor, determining the state value of the self-healing factor state field according to the cluster maintenance state, the cluster change state, the activation factor creating timestamp, the last healing timestamp and the self-healing time interval to which the cluster identifier belongs.
For example, when the cloud management platform generates the activation factor, the state value of the self-healing factor state field is calculated by formula (1):
$$\text{state value of the self-healing factor state field} = (\text{cluster maintenance state} \;\&\; \text{cluster change state}) \;\&\; \big((\text{current timestamp} - \text{last healing timestamp} - k \times \text{maximum self-healing time interval}) > 0\big) \tag{1}$$
where & denotes the logical AND operation; a computed state value of 1 indicates active, and a state value of 0 indicates inactive. The cluster maintenance state, the cluster change state, and the last healing timestamp in the formula can be obtained by the cloud management platform through database queries. The maximum self-healing time interval has an initial value of 3600 s and can be replaced by the maximum self-healing time interval of cluster anomalies obtained through dynamic calculation. k is a constant set according to the actual situation; in this embodiment, testing showed k = 2 to be a preferred setting.
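Formula (1) can be written out as a short function. The sketch assumes that the cluster maintenance state and the cluster change state are each encoded as 1 when they permit self-healing and 0 otherwise; the application does not state the encoding, and the function and parameter names are hypothetical.

```python
def self_healing_status(maintenance_state: int, change_state: int,
                        current_ts: float, last_healing_ts: float,
                        max_interval: float = 3600.0, k: float = 2.0) -> int:
    """Evaluate formula (1): 1 means active, 0 means inactive.

    Assumption: each state argument is 1 when it permits self-healing.
    """
    # Enough time must have passed since the last healing.
    interval_elapsed = (current_ts - last_healing_ts - k * max_interval) > 0
    return int(bool(maintenance_state) and bool(change_state) and interval_elapsed)
```

With the default values of 3600 s and k = 2, the status becomes active only when more than 7200 s have passed since the last healing timestamp and both cluster states are 1.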
In the embodiment of the present application, on one hand, when the state of the activation factor is that the activation factor does not exist, a corresponding activation factor is created according to the cluster identifier and the alarm type, so that an alarm processor is prevented from reporting an error and the robustness of the alarm processing method is increased when the activation factor does not exist; on the other hand, when the execution result is that the self-healing is successful, the activation factor corresponding to the execution result is put into a waiting queue; and after waiting for the self-healing instruction to execute for N periods, obtaining the execution result in the N periods, so that the self-healing instruction can be waited for executing for multiple times under the condition of determining that the activation factor is available, and the reliability of the alarm processor is improved.
An embodiment of the present application provides an alarm processing method, and fig. 2B is a schematic diagram illustrating an implementation flow of the alarm processing method according to the embodiment of the present application, where as shown in fig. 2B, the method includes:
step S210, acquiring alarm information of each cluster in at least one file system cluster;
step S220, determining the alarm type and the cluster identification of each alarm message;
step S230, matching corresponding self-healing instructions according to the alarm type of each alarm message; the self-healing instruction comprises processing steps required to be executed for processing the alarm information;
step S240, determining the state of an activation factor corresponding to each alarm information according to the cluster identifier and the alarm type of each alarm information;
step S250, determining whether the activation factor is in a life cycle according to an update time field of the activation factor under the condition that the state of the activation factor is that the activation factor exists and a self-healing factor state field in the activation factor is available;
wherein the activation factor comprises a self-healing factor status field and an update time field; the self-healing factor state field is a field used for indicating whether a self-healing instruction is available or not in an activation factor, and the life cycle is the sum of the time for executing the self-healing instruction and the cycle for acquiring alarm information;
In some embodiments, the life cycle of the activation factor is the time taken by the command execution platform to execute the self-healing instruction plus the period at which the system monitoring alarm framework collects alarm information. Because issuing the self-healing instruction is subject to network delay, the value set in production is generally more than twice the theoretical value.
Step S260, under the condition that the activation factor is in the life cycle, silencing the alarm information;
Here, the activation factor being within its life cycle indicates that a self-healing instruction is being executed for the alarm information.
In some embodiments, the alarm processor that processes the alarm information has a silencing function. When the alarm processor is called and detects that the activation factor corresponding to the alarm information is within its life cycle, it calls the silencing function and does not process the alarm.
In the implementation process, the self-healing operation for an alarm may be executed only once within the life cycle of the activation factor. Therefore, to avoid executing the self-healing instruction repeatedly, when it is detected that a self-healing instruction is already being executed for the alarm information, the alarm information is silenced and not processed a second time.
And step S270, executing the self-healing instruction under the condition that the activation factor is not in the life cycle to obtain an execution result so as to realize the processing of the alarm information.
In the embodiment of the application, under the condition that the activation factor is in a life cycle, silencing the alarm information; and under the condition that the activation factor is not in the life cycle, executing the self-healing instruction to obtain an execution result so as to realize the processing of the alarm information. Therefore, the self-healing alarm operation can be executed only once in the life cycle of the activation factor, and repeated execution of the self-healing instruction is avoided.
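The decision between silencing and executing can be sketched as below. The sketch assumes that the life cycle starts at the time recorded in the update time field and that all times are expressed in seconds; neither detail is stated in the application, and the function name handle_alarm is hypothetical.

```python
def handle_alarm(now: float, update_time: float,
                 execution_time: float, collection_period: float) -> str:
    """Return "silence" while the activation factor is within its life
    cycle and "execute" once the life cycle has elapsed."""
    # Life cycle = time to execute the instruction + collection period.
    life_cycle = execution_time + collection_period
    in_life_cycle = (now - update_time) < life_cycle
    return "silence" if in_life_cycle else "execute"
```

For example, with an execution time of 30 s and a collection period of 60 s, an alarm arriving 50 s after the update time is silenced, while one arriving 150 s after it triggers the self-healing instruction. In production the life cycle would be scaled up to allow for network delay.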
An alarm processing method according to an embodiment of the present application is provided, and fig. 3 is a schematic flow chart illustrating an implementation of the alarm processing method according to the embodiment of the present application, where as shown in fig. 3, the method includes:
step S301, acquiring alarm information of each cluster in at least one file system cluster;
step S302, determining the alarm type and the cluster identification of each alarm message;
step S303, matching a corresponding self-healing instruction according to the alarm type of each alarm message; the self-healing instruction comprises processing steps required to be executed for processing the alarm information;
step S304, determining the state of an activation factor corresponding to the alarm information according to the cluster identifier and the alarm type of each alarm information;
step S305, determining whether the activation factor is in a life cycle according to an update time field of the activation factor under the condition that the state of the activation factor is that the activation factor exists and a self-healing factor state field of the activation factor is available;
wherein the activation factor comprises a self-healing factor status field and an update time field; the self-healing factor state field is a field used for indicating whether a self-healing instruction is available or not in an activation factor, and the life cycle is the sum of the time for executing the self-healing instruction and the cycle for acquiring alarm information;
step S306, under the condition that the activation factor is in the life cycle, silencing the alarm information;
step S307, under the condition that the activation factor is not in the life cycle, acquiring a state topology tree of a corresponding cluster according to the cluster identifier in each alarm message;
here, the fact that the activation factor is not in the life cycle indicates that the self-healing instruction is not executed by the current alarm, and the self-healing instruction can be executed.
Here, the state topology tree is used to represent the distribution of OSDs (nodes) in the file system cluster.
Here, the cluster identifier may be used to determine the cluster that issued the alarm information.
Step S308, locating the storage medium of the abnormal node that raised the alarm according to the number of the abnormal node in each piece of alarm information and the state topology tree of the corresponding cluster;
for example, for a ceph osd down type alarm, the cluster state topology tree is used to look up, by OSD number, the IP address of the abnormal node and the drive letter of the corresponding storage medium, thereby determining the storage medium that raised the alarm.
Step S309, evaluating the storage medium to obtain an evaluation result;
here, the evaluation services include at least one of: a storage medium checking service, a clock correction service, and a file system cluster quality evaluation service. The storage medium checking service checks whether the storage medium is damaged; the clock correction service determines whether the clock of the storage medium needs to be corrected; and the file system cluster quality evaluation service determines, when the clock is normal, whether the utilization rate of the storage medium exceeds a set upper limit value.
And step S310, executing the self-healing instruction according to the evaluation result to obtain an execution result so as to realize the processing of the alarm information.
For example, the self-healing instruction is executed according to the evaluation results of the storage medium check service, the clock correction service and the file system cluster quality evaluation service.
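The life-cycle check of steps S305 to S307 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the use of seconds as the unit, and the strict comparison at the boundary are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def in_life_cycle(update_time: datetime, now: datetime,
                  exec_seconds: float, collect_period_seconds: float) -> bool:
    """Return True while `now` still falls inside the activation factor's life cycle.

    Life cycle = time to execute the self-healing instruction
                 + period for acquiring alarm information.
    """
    life_cycle = timedelta(seconds=exec_seconds + collect_period_seconds)
    return now - update_time < life_cycle  # strict boundary is an assumption


def decide(factor_exists: bool, factor_available: bool,
           update_time: Optional[datetime], now: datetime,
           exec_seconds: float, collect_period_seconds: float) -> str:
    """Return 'silence', 'self_heal' or 'other' for one alarm."""
    if factor_exists and factor_available and update_time is not None:
        if in_life_cycle(update_time, now, exec_seconds, collect_period_seconds):
            return "silence"      # step S306: self-healing already ran recently
        return "self_heal"        # step S307 onward: evaluate and execute
    return "other"                # missing or unavailable factor, handled elsewhere
```

With an execution time of 60 s and a collection period of 60 s, an alarm arriving 30 s after the last update is silenced, while one arriving 121 s later proceeds to self-healing.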
In some embodiments, the step S310, according to the evaluation result, executes the self-healing instruction to obtain an execution result, so as to implement processing on the alarm information, and includes:
step S311A, migrating the data in the abnormal node to a normal node in the cluster if the evaluation result is that the storage medium is damaged;
step S312A, removing the abnormal node from the cluster and masking the alarm, so as to implement processing of the alarm information.
For example, if the storage medium is detected to be damaged, the command execution platform actively triggers an out operation to remove the OSD from the cluster, the data is automatically migrated to OSDs whose service is normal, and the damage information of the storage medium is reported to the alarm information collection platform so that the alarm is masked.
In some embodiments, the step S310, according to the evaluation result, executes the self-healing instruction to obtain an execution result, so as to implement processing on the alarm information, and includes:
step S311B, in the case that the evaluation result is that the clock of the storage medium needs to be corrected, correcting the clock to implement the processing of the alarm information;
for example, if the detected clock offset is too large, clock correction is automatically triggered to adjust the time on the storage medium.
Step S311C, in the case that the evaluation result is that the utilization rate of the storage medium exceeds a set upper limit value and the clock of the storage medium is normal, restarting the abnormal node to implement the processing of the alarm information.
For example, when the clock of the storage medium is normal, the utilization rate of the storage medium is compared with 95%; if the utilization rate is higher than 95%, the OSD service is restarted.
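The three branches of step S310 can be pictured as a small dispatch on the evaluation result. The sketch below is illustrative only: the `Evaluation` fields and the action names are hypothetical, and the final fallback of reporting the alarm when no branch applies is assumed here for completeness.

```python
from dataclasses import dataclass
from typing import List

USAGE_UPPER_LIMIT = 0.95  # the 95% example threshold from the text


@dataclass
class Evaluation:
    medium_damaged: bool          # result of the storage medium checking service
    clock_needs_correction: bool  # result of the clock correction service
    usage_ratio: float            # utilization rate of the storage medium, 0.0 to 1.0


def plan_actions(ev: Evaluation) -> List[str]:
    """Map an evaluation result to the self-healing actions to execute."""
    if ev.medium_damaged:
        # steps S311A and S312A
        return ["migrate_data", "remove_node", "mask_alarm"]
    if ev.clock_needs_correction:
        # step S311B
        return ["correct_clock"]
    if ev.usage_ratio > USAGE_UPPER_LIMIT:
        # step S311C: reached only when the clock is normal
        return ["restart_node"]
    return ["report_alarm"]  # assumed fallback when nothing matches
```

Checking the branches in this order makes the restart branch reachable only when the medium is intact and the clock is normal, which mirrors the condition stated for step S311C.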
In some embodiments, before step S301, that is, before acquiring the alarm information of each cluster in the at least one file system cluster, the method further includes: acquiring the monitoring indexes of each file system cluster in the at least one file system cluster; and determining the monitoring indexes that do not satisfy a preset alarm rule as the alarm information of each cluster.
Here, the monitoring index may be cluster information of the file system cluster. For example, it may be the access rights of a certain file system cluster and the file owner of the files in the cluster. Generally speaking, the file system cluster starts a specific component of the file system to obtain a monitoring index of the file system cluster, wherein the specific component may be a ceph-mgr component, the ceph-mgr component supports a system monitoring alarm framework, the ceph-mgr component can expose information of the file system cluster, and in implementation, the system monitoring alarm framework sends a request to the ceph-mgr component at regular time to obtain the monitoring index of the file system cluster.
Here, the preset alarm rule may be an abnormal value of each index of each cluster.
In the implementation process, the system monitoring alarm framework obtains the information of the current cluster through ceph-mgr, finds out that the osd state is down through rule comparison, and triggers the rule to report to the alarm processor.
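The rule comparison described above can be sketched as follows. The metric layout (a dictionary with `cluster`, `name`, `osd` and `value` keys) and the metric name `osd_up` are hypothetical and chosen only for illustration; they are not taken from the embodiment.

```python
from typing import Callable, Dict, List

Metric = Dict[str, object]
Rule = Callable[[Metric], bool]  # returns True when the metric violates the rule


def osd_down(metric: Metric) -> bool:
    # Hypothetical metric: "osd_up" is 1 while the OSD is up and 0 when it is down.
    return metric["name"] == "osd_up" and metric["value"] == 0


def find_alarms(metrics: List[Metric], rules: Dict[str, Rule]) -> List[Dict[str, object]]:
    """Turn every metric that violates a rule into one alarm record."""
    alarms = []
    for metric in metrics:
        for alarm_type, violated in rules.items():
            if violated(metric):
                alarms.append({
                    "cluster": metric["cluster"],   # cluster identifier
                    "alarm_type": alarm_type,       # used later to match instructions
                    "osd": metric["osd"],           # number of the abnormal node
                })
    return alarms


metrics = [
    {"cluster": "rbd_arm_ft01", "name": "osd_up", "osd": 12, "value": 0},
    {"cluster": "rbd_arm_ft01", "name": "osd_up", "osd": 13, "value": 1},
]
alarms = find_alarms(metrics, {"ceph osd down": osd_down})
```

Each alarm record carries the cluster identifier, the alarm type and the OSD number, which are the fields the later steps rely on.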
In the embodiment of the present application, on the one hand, when the evaluation result is that the storage medium has been damaged, the data in the abnormal node is migrated to a normal node in the cluster, the abnormal node is removed from the cluster and the alarm is masked, so as to process the alarm information; in this way, damage to the storage medium can be self-healed. On the other hand, when the evaluation result is that the clock of the storage medium needs to be corrected, the clock is corrected so as to process the alarm information; and when the evaluation result is that the utilization rate of the storage medium exceeds the set upper limit value while the clock of the storage medium is normal, the abnormal node is restarted so as to process the alarm information. In this way, problems such as a storage medium clock that needs correction or a utilization rate that exceeds the set upper limit value can be self-healed, which improves the effectiveness of alarm processing and avoids the delay of manual processing.
In the related art, when the warning information collection platform receives warning information related to a file system, the related warning information is displayed on a display interface, and related personnel contact staff of a corresponding system to manually process the warning information.
In the related art, there are the following disadvantages: 1) file system alarms are processed manually, which consumes manpower; 2) an abnormal file system alarm requires relevant personnel to respond and process it, and if they cannot do so in time, the service is affected for a long time; 3) to process an abnormal file system alarm, the relevant personnel need to be thoroughly familiar with the system that generated it, so there is a risk of misoperation when they are not.
The embodiment of the application provides an alarm processing method, which comprises the following steps:
1) Self-healing process of the monitoring system
The file system starts a specific component, which may be the ceph-mgr component. The ceph-mgr component supports a system monitoring alarm framework (Prometheus) and can expose information about the file system cluster, so the system monitoring alarm framework can obtain the monitoring indexes of the file system cluster by sending requests to the ceph-mgr component at regular intervals. Alarm rules are preset for the various indexes in the system monitoring alarm framework; when an index triggers an alarm rule, the alarm is sent directly to the alarm processor (Alertmanager). The alarm processor is responsible for uniformly processing the alarms generated by the system monitoring alarm framework. Alarm rules can be set in the alarm processor to screen self-healing alarms: self-healing conditions are set, alarms that meet the self-healing conditions are sent to the command execution platform for self-healing processing, and alarms that do not meet them are sent to the manual alarm processing platform, which notifies staff to process them manually.
The file system is an input/output-intensive application, so the alarm type may be the ceph osd down type. A ceph osd down alarm may be caused by a number of conditions, for example: the underlying storage medium is abnormal; an OSD with heavy traffic is under high load for a long time, so its heartbeat is not sent in time; or the clock offset of the OSD node is too large.
The system monitoring alarm framework obtains the information of the current cluster through ceph-mgr, finds through rule comparison that the OSD state is down, and triggers the rule to report to the alarm processor. The alarm processor queries the activation factor of the corresponding cluster according to the reported ceph alarm and judges from the activation factor whether to perform self-healing processing.
If the activation factor meets the activation condition, self-healing processing is performed. According to the request protocol, the cluster identifier and the OSD number of the alarm are passed to the command execution platform, which uses the cluster state topology tree to find the IP address of the abnormal node and the storage medium that raised the ceph osd down type alarm. The alarm processor then calls, in sequence, the storage medium checking service, the NTP clock correction service and the ceph cluster quality evaluation service encapsulated by the command execution platform to evaluate the storage medium. If the storage medium is detected to be damaged, the command execution platform actively triggers an out operation to remove the OSD from the cluster, the data on that OSD is automatically migrated to OSDs whose service is normal, the medium damage information is reported to the alarm information collection platform, and the alarm is masked. If the clock offset is detected to be too large, the clock correction service is triggered automatically. The ceph cluster quality is evaluated and calculated, and if the utilization rate of the storage medium is detected to exceed 95% while the storage medium check and the clock check are both normal, the OSD service is restarted. If the alarm information matches none of the above conditions, it is reported directly to the alarm information collection platform.
If the activation factor does not meet the condition, the alarm is passed directly to the alarm information collection platform. After processing is completed, and before the alarm is sent to the alarm information collection platform, the activation factor of the cluster is stored so that it is available to subsequent alarms.
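The lookup of the abnormal node through the cluster state topology tree can be sketched as below. The tree layout (host IP address mapped to the OSD numbers and storage devices on that host) and the device paths are assumptions made for the example; the embodiment does not specify how the tree is stored.

```python
from typing import Dict, Optional, Tuple

# Hypothetical layout: host IP -> {OSD number -> storage device on that host}
TopologyTree = Dict[str, Dict[int, str]]


def locate_osd(tree: TopologyTree, osd_number: int) -> Optional[Tuple[str, str]]:
    """Return (host IP, storage device) for the given OSD number, or None."""
    for host_ip, osds in tree.items():
        if osd_number in osds:
            return host_ip, osds[osd_number]
    return None


# Illustrative tree for one cluster; device paths are made up for the example.
tree: TopologyTree = {
    "10.110.3.7": {10: "/dev/sdb", 11: "/dev/sdc"},
    "10.110.3.8": {12: "/dev/sdb", 13: "/dev/sdc"},
}
location = locate_osd(tree, 12)
```

The returned pair identifies where the evaluation services should run and which storage medium they should inspect.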
2) The alarm processor judges whether to execute self-healing
Whether the alarm processor can perform self-healing depends on two things: the self-healing instruction set and the activation factor. The self-healing instruction set is a collection of alarm processing methods converted into encapsulated instruction sets; under normal conditions, alarm processing can be completed simply by sending the instruction set to a server for execution. When the alarm processor receives alarm information, it uses the alarm type field collected and reported by the system monitoring alarm framework to check whether a corresponding self-healing instruction can be found for the alarm. If an entry in the self-healing instruction set is matched, the alarm is capable of self-healing; otherwise it is not. The activation factor is provided by the cloud management platform. It effectively prevents erroneous self-healing processing of alarms submitted for self-healing under abnormal conditions, such as a cluster that is under maintenance or being changed, or repeated self-healing.
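The two-part decision (instruction set match plus activation factor) can be sketched as a small routing function. The contents of the instruction table and the destination labels are hypothetical; only the routing logic follows the description above.

```python
from typing import Dict, List, Tuple

# Hypothetical self-healing instruction set, keyed by alarm type.
SELF_HEALING_INSTRUCTIONS: Dict[str, List[str]] = {
    "ceph osd down": [
        "check_storage_medium",
        "correct_clock",
        "evaluate_cluster_quality",
    ],
}


def route_alarm(alarm_type: str, factor_active: bool) -> Tuple[str, List[str]]:
    """Decide where an alarm goes and which instructions, if any, to run."""
    instructions = SELF_HEALING_INSTRUCTIONS.get(alarm_type)
    if instructions is None:
        # No matching instruction: the alarm cannot self-heal.
        return "manual", []
    if not factor_active:
        # Instruction exists, but the activation factor blocks self-healing.
        return "report", []
    return "self_heal", instructions
```

Keeping the instruction lookup separate from the activation factor check means a new alarm type can be made self-healing by adding one table entry, without touching the guard logic.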
3) Activation factor
The activation factors are generated by the cloud management platform and are divided by alarm type: on the same cluster, different alarm types have different activation factors. The life cycle of an activation factor is the sum of the time the command execution platform takes to execute the self-healing instruction and the period at which the system monitoring alarm framework collects alarm information. In implementation, the life cycle of the activation factor is set to twice this value, and the self-healing operation for an alarm can be executed only once within the life cycle of the activation factor, which avoids repeated execution of the self-healing instruction. Taking ceph osd down as an example, the data structure of the activation factor is:
{
cluster_name:rbd_arm_ft01
warn_name:ceph osd.12down
warn_ip:10.110.3.8
last_time:2020-09-16:15:58:23  # time when the last alarm occurred
now_time:2020-09-16:15:58:23  # time when this alarm occurred
ttl_time:120s
warn_level:Major
……………
command execution platform_result:ok  # ok: the self-healing instruction executed successfully; wait: the result of executing the self-healing instruction is pending; fail: the self-healing instruction failed
count:1
alarm information acquisition platform_send_time:2020-09-16:15:58:23
status:active  # active: the activation factor is available; inactive: the activation factor is unavailable and the self-healing instruction cannot be executed in this state
}
When the alarm processor receives alarm information whose alarm type is ceph osd.12down, it uses the cluster_name and warn_name in the alarm information to check whether a matching activation factor exists:
If the self-healing instruction fails to execute, an alarm is sent to the alarm information collection platform to report the self-healing failure, and the state of the activation factor is set to unavailable (inactive).
If the activation factor exists and its state is active, whether the activation factor is within its life cycle is calculated from update_time. If it is, the silence function of the alarm processor is called to actively silence the alarm; otherwise, the command execution platform is called to send the self-healing instruction for processing.
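One way to hold activation factors so that they can be matched by cluster_name and warn_name is sketched below. The class names, the reduced set of fields and the `record_failure` helper are hypothetical; the timestamp format string is inferred from the example values in the data structure.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional, Tuple

TIME_FORMAT = "%Y-%m-%d:%H:%M:%S"  # matches values such as 2020-09-16:15:58:23


@dataclass
class ActivationFactor:
    cluster_name: str
    warn_name: str
    update_time: datetime
    status: str = "active"  # "active" or "inactive"
    result: str = "wait"    # "ok", "wait" or "fail"
    count: int = 0


class FactorStore:
    """In-memory store keyed by (cluster_name, warn_name)."""

    def __init__(self) -> None:
        self._factors: Dict[Tuple[str, str], ActivationFactor] = {}

    def put(self, factor: ActivationFactor) -> None:
        self._factors[(factor.cluster_name, factor.warn_name)] = factor

    def find(self, cluster_name: str, warn_name: str) -> Optional[ActivationFactor]:
        return self._factors.get((cluster_name, warn_name))


def record_failure(factor: ActivationFactor) -> None:
    """A failed self-healing run makes the factor unavailable."""
    factor.result = "fail"
    factor.status = "inactive"


store = FactorStore()
store.put(ActivationFactor(
    cluster_name="rbd_arm_ft01",
    warn_name="ceph osd.12down",
    update_time=datetime.strptime("2020-09-16:15:58:23", TIME_FORMAT),
))
found = store.find("rbd_arm_ft01", "ceph osd.12down")
```

Keying on the pair rather than on the cluster alone reflects the statement that different alarm types on the same cluster have different activation factors.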
4) Activation factor state calculation
To judge whether self-healing is possible, the alarm processor needs to check the state of the activation factor: when the state of the activation factor is active, the self-healing operation can be performed and the self-healing instruction set is executed automatically; when the state of the activation factor is inactive, the self-healing operation cannot be executed.
The cloud management platform generates the activation factors and calculates the state values corresponding to the activation factors, and the formula is as follows:
activation factor state value = (cluster maintenance state & cluster change state) & ((current timestamp - last healing timestamp - k × maximum self-healing time interval) > 0).
The activation factor state value is the value of the self-healing factor state field of the activation factor, and & denotes the AND operation. A calculated state value of 1 indicates active, and a state value of 0 indicates inactive. The cluster maintenance state, the cluster change state and the last healing timestamp in the formula can be obtained by the cloud management platform through database queries. The initial value of the maximum self-healing time interval is 3600 s, and it can be replaced by the maximum self-healing time interval for cluster anomalies obtained through dynamic calculation. k is a constant set according to the actual situation; in the embodiment of the present application, testing shows that k is preferably set to 2.
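The state formula translates directly into a few lines of code. In this sketch the parameter names are hypothetical, timestamps are in seconds, and it is assumed that a maintenance or change state of 1 means the cluster is eligible for self-healing (not under maintenance, not being changed); the text does not state this encoding explicitly.

```python
def activation_state(cluster_maintenance_state: int,
                     cluster_change_state: int,
                     current_ts: float,
                     last_healing_ts: float,
                     max_self_healing_interval: float = 3600.0,
                     k: int = 2) -> int:
    """Return 1 (active) or 0 (inactive) for the self-healing factor state field."""
    # The time term is 1 only when more than k maximum intervals have passed
    # since the last healing.
    elapsed_ok = int(
        (current_ts - last_healing_ts - k * max_self_healing_interval) > 0
    )
    return cluster_maintenance_state & cluster_change_state & elapsed_ok
```

With the default values, a cluster last healed at timestamp 0 becomes eligible again only after timestamp 7200, so `activation_state(1, 1, 10000, 0)` is 1 while `activation_state(1, 1, 7200, 0)` is 0.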
5) Updating an activation factor
Two tasks need to be completed in the post-self-healing processing stage. On the one hand, the self-healing execution status is obtained and backfilled into the activation factor, for example into the count field and the command execution platform_result field. On the other hand, the activation factors corresponding to the alarms whose self-healing executed successfully are placed in a waiting queue, and after waiting T collection cycles of the alarm indexes, whether the self-healing succeeded is judged. If a collected index is abnormal, the state of the activation factor is set to inactive, that is, unavailable, and the alarm information is reported directly to the alarm information collection platform.
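The waiting-queue check can be sketched as follows. This is a simplified illustration: activation factors are plain dictionaries, and the hypothetical callable `index_is_abnormal(factor, cycle)` stands in for one collection cycle of the alarm index, so the sketch does not actually wait between cycles.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Factor = Dict[str, str]


def verify_queue(waiting: Deque[Factor],
                 index_is_abnormal: Callable[[Factor, int], bool],
                 cycles: int) -> Tuple[List[Factor], List[Factor]]:
    """Drain the waiting queue and split factors into healed and reported."""
    healed: List[Factor] = []
    reported: List[Factor] = []
    while waiting:
        factor = waiting.popleft()
        # Self-healing counts as successful only if every cycle is normal.
        if any(index_is_abnormal(factor, cycle) for cycle in range(cycles)):
            factor["status"] = "inactive"  # factor becomes unavailable
            reported.append(factor)        # to be reported to the collection platform
        else:
            healed.append(factor)
    return healed, reported
```

A factor whose index turns abnormal in any of the T cycles is marked inactive, which keeps later alarms of the same type on the same cluster from triggering self-healing again.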
In the embodiment of the application, the alarm information acquisition platform is used for acquiring and storing alarm information; the alarm processor is used for processing the alarm information meeting the self-healing condition in the alarm information; and the manual alarm processing platform is used for informing a worker to process the alarm which does not meet the self-healing condition.
In the embodiment of the present application, an alarm processing method is provided by taking a self-healing rule of a ceph osd down alarm type as an example in combination with the alarm processing system provided in fig. 4, where the method includes:
step S401, the system monitoring alarm framework 41 collects monitoring indexes of all the file system clusters 42;
here, the system monitoring alarm framework obtains the current cluster state through ceph-mgr.
In some embodiments, the method further comprises: the system monitoring alarm framework 41 determines alarm information according to the monitoring indexes and the preset alarm rules of the indexes; the system monitoring alarm framework 41 sends the alarm information to the alarm processor 43.
Step S402, the alarm processor 43 (Alertmanager) performs alarm matching according to the alarm information;
here, alarm matching means that the alarm processor determines whether self-healing can be performed for the alarm information.
In implementation, when receiving the alarm information, the alarm processor 43 uses the alarm type field collected and reported by the system monitoring alarm framework 41 to actively check whether a corresponding self-healing instruction can be found for the alarm information. If an entry in the self-healing instruction set is matched, the alarm is capable of self-healing; otherwise it is not.
Step S404 is executed when the alarm meets the self-healing condition and can be self-healed; step S405 is executed when the alarm does not meet the self-healing condition;
step S403, the alarm processor 43 acquires the activation factor from the cloud management platform 44;
step S404, the alarm processor 43 calls the command execution platform 46 to process the alarm when the alarm satisfies the self-healing condition and the state of the activation factor is available;
here, the state of the activation factor is available, which means that the self-healing operation can be performed, and the self-healing instruction set is automatically executed.
In step S405, the alarm processor 43 reports the alarm self-healing failure to the alarm information collection platform 45 when the alarm does not satisfy the self-healing condition.
Based on the foregoing embodiments, the present application provides an alarm processing apparatus, where the apparatus includes modules and units included in the modules, and may be implemented by a processor in a computer device; of course, the implementation can also be realized through a specific logic circuit; in implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 5 is a schematic structural diagram of an alarm processing apparatus according to an embodiment of the present application, and as shown in fig. 5, the apparatus 500 includes a first obtaining module 501, a first determining module 502, a matching module 503, a second determining module 504, and an executing module 505, where:
a first obtaining module 501, configured to obtain alarm information of each cluster in at least one file system cluster;
a first determining module 502, configured to determine an alarm type and a cluster identifier of each alarm information;
the matching module 503 is configured to match a corresponding self-healing instruction according to the alarm type of each piece of alarm information; the self-healing instruction comprises processing steps required to be executed for processing the alarm information;
a second determining module 504, configured to determine, according to the cluster identifier and the alarm type of each alarm information, a state of an activation factor corresponding to the alarm information;
and the execution module 505 is configured to execute the self-healing instruction according to the state of the activation factor, so as to implement processing on the alarm information.
In some embodiments, the execution module 505 includes: a generating unit and a first executing unit, wherein: a generating unit, configured to create a corresponding activation factor according to the cluster identifier and the alarm type when the state of the activation factor is that the activation factor does not exist, where the activation factor includes a self-healing factor state field; the self-healing factor state field is a field used for indicating whether a self-healing instruction is available or not in an activation factor; and the first execution unit is used for executing the self-healing instruction under the condition that the self-healing factor state field is available to obtain an execution result so as to realize the processing of the alarm information.
In some embodiments, the apparatus 500 further comprises: the device comprises a storage module, a time delay module and a third determination module, wherein: the storage module is used for placing the activation factor corresponding to the execution result into a waiting queue under the condition that the execution result is self-healing success; the time delay module is used for waiting for the self-healing instruction to execute N periods and then obtaining the execution result in the N periods; and the third determining module is used for determining that the alarm processing is successful under the condition that the execution result in each period is self-healing success.
In some embodiments, the apparatus 500 further comprises a setup module, wherein: the setting module is used for setting the state field of the self-healing factor into an unavailable state under the condition that the execution result is self-healing failure, and reporting the alarm information to a manual alarm processing platform; and the unavailable state is used for indicating that the alarm information of the same alarm type is reported to the manual alarm processing platform.
In some embodiments, the execution module 505 includes: a determining unit, a silencing unit, and a second execution unit, wherein: the determining unit is used for determining whether the activation factor is in a life cycle according to an update time field of the activation factor when the state of the activation factor is that the activation factor exists and the self-healing factor state field is available, wherein the life cycle is the sum of the time for executing the self-healing instruction and the period for acquiring the alarm information; the silencing unit is used for silencing the alarm information under the condition that the activation factor is in the life cycle; and the second execution unit is used for executing the self-healing instruction under the condition that the activation factor is not in the life cycle to obtain an execution result so as to realize the processing of the alarm information.
In some embodiments, the apparatus 500 further comprises: a second obtaining module, a searching module and an evaluation module, wherein: the second obtaining module is used for acquiring a state topology tree of a corresponding cluster according to the cluster identifier in each piece of alarm information; the searching module is used for locating the storage medium of the abnormal node that raised the alarm according to the number of the abnormal node in each piece of alarm information and the state topology tree of the corresponding cluster; the evaluation module is used for evaluating the storage medium to obtain an evaluation result; the first execution unit or the second execution unit is further configured to execute the self-healing instruction according to the evaluation result to obtain an execution result, so as to implement processing on the alarm information.
In some embodiments, the first execution unit or the second execution unit is further configured to migrate data in the abnormal node to a normal node in the cluster if the evaluation result is that the storage medium is damaged; and removing the abnormal node from the cluster and shielding the alarm so as to realize the processing of the alarm information.
In some embodiments, the first execution unit or the second execution unit is further configured to, in a case that the evaluation result is that a clock of the storage medium needs to be corrected, correct the clock to implement processing of the alarm information; and restarting the abnormal node to realize the processing of the alarm information under the condition that the evaluation result is that the utilization rate of the storage medium exceeds a set upper limit value and the clock of the storage medium is normal.
In some embodiments, the apparatus 500 further comprises: a third obtaining module and a fourth determining module, wherein: the third acquisition module is used for acquiring the monitoring index of each file system cluster in at least one file system cluster; and the fourth determining module is used for determining the monitoring index which does not meet the preset alarm rule as the alarm information of each cluster.
In some embodiments, the apparatus 500 further comprises a fifth determining module, wherein: and a fifth determining module, configured to determine a state value of the self-healing factor state field according to a cluster maintenance state, a cluster change state, a timestamp for creating the activation factor, a last healing timestamp, and a self-healing time interval to which the cluster identifier belongs, when the activation factor is created.
The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the alarm processing method is implemented in the form of a software functional module and is sold or used as a standalone product, the alarm processing method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including instructions for causing a computer device (which may be a personal computer, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the present application provides a computer device, which includes a memory and a processor, where the memory stores a computer program that can be executed on the processor, and the processor implements the steps in the above method when executing the program.
Correspondingly, the embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program realizes the steps of the above method when being executed by a processor.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that fig. 6 is a schematic diagram of a hardware entity of a computer device in an embodiment of the present application. As shown in fig. 6, the hardware entity of the computer device 600 includes a processor 601, a communication interface 602, and a memory 603, wherein:
The processor 601 generally controls the overall operation of the computer device 600.
The communication interface 602 may enable the computer device to communicate with other terminals or servers via a network.
The Memory 603 is configured to store instructions and applications executable by the processor 601, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 601 and modules in the computer apparatus 600, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the related art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (12)

1. An alarm processing method, characterized in that the method comprises:
acquiring alarm information of each cluster in at least one file system cluster;
determining the alarm type and cluster identifier of each piece of alarm information;
matching a corresponding self-healing instruction according to the alarm type of each piece of alarm information;
determining the state of the activation factor corresponding to each piece of alarm information according to the cluster identifier and the alarm type of that alarm information;
and executing the self-healing instruction according to the state of the activation factor, so as to process the alarm information.
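As a rough illustration of how the five steps of claim 1 fit together, the following Python sketch keys each activation factor by the pair (cluster identifier, alarm type). The `Alarm` record, the `SELF_HEALING_INSTRUCTIONS` table, the alarm type names, and the string states are hypothetical names chosen for this example; the claim does not prescribe any of them, and reporting unmatched alarm types is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Alarm:
    """One piece of alarm information (field names are assumptions)."""
    cluster_id: str
    alarm_type: str
    node_id: str


FactorKey = Tuple[str, str]  # (cluster identifier, alarm type)

# Hypothetical table: alarm type -> self-healing instruction.
# Each instruction returns True on success; real ones would act on the cluster.
SELF_HEALING_INSTRUCTIONS: Dict[str, Callable[[Alarm], bool]] = {
    "disk_usage_high": lambda alarm: True,
    "clock_skew": lambda alarm: False,
}


def process_alarms(alarms: List[Alarm],
                   factors: Dict[FactorKey, str]) -> List[Tuple[Alarm, str]]:
    """Run the claim-1 steps over a batch of already acquired alarms."""
    outcomes = []
    for alarm in alarms:
        # Determine the alarm type and cluster identifier.
        key = (alarm.cluster_id, alarm.alarm_type)
        # Match the self-healing instruction by alarm type.
        instruction = SELF_HEALING_INSTRUCTIONS.get(alarm.alarm_type)
        # Determine the activation factor state for this key.
        state = factors.get(key, "absent")
        # Execute (or not) according to that state.
        if instruction is None or state == "unavailable":
            outcomes.append((alarm, "reported"))
            continue
        healed = instruction(alarm)
        factors[key] = "available" if healed else "unavailable"
        outcomes.append((alarm, "healed" if healed else "reported"))
    return outcomes
```

Keying the factor table by both values means a failed self-healing attempt for one alarm type in one cluster does not block self-healing of other alarm types or other clusters.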
2. The method according to claim 1, wherein the executing the self-healing instruction according to the state of the activation factor comprises:
when the state of the activation factor indicates that no activation factor exists, creating a corresponding activation factor according to the cluster identifier and the alarm type, wherein the activation factor comprises a self-healing factor state field; the self-healing factor state field is a field in the activation factor indicating whether the self-healing instruction is available;
and when the self-healing factor state field is available, executing the self-healing instruction to obtain an execution result, so as to process the alarm information.
3. The method of claim 2, further comprising:
under the condition that the execution result is self-healing success, putting an activation factor corresponding to the execution result into a waiting queue;
after the self-healing instruction is executed for N periods, obtaining an execution result in the N periods;
and determining that the alarm processing is successful under the condition that the execution result in each period is self-healing success.
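The confirmation step of claim 3 can be sketched as follows. This is a minimal illustration that assumes the per-period execution results are collected elsewhere into a `history` mapping; `confirm_healed`, `history`, and the key layout are hypothetical names, not terms from the claims.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

FactorKey = Tuple[str, str]  # (cluster identifier, alarm type)


def confirm_healed(waiting: Deque[FactorKey],
                   history: Dict[FactorKey, List[bool]],
                   n: int) -> List[FactorKey]:
    """Return the keys whose alarm processing is confirmed successful."""
    confirmed = []
    for _ in range(len(waiting)):
        key = waiting.popleft()
        results = history.get(key, [])
        if len(results) < n:
            waiting.append(key)  # fewer than N periods observed: keep waiting
        elif all(results[:n]):
            confirmed.append(key)  # success in every one of the N periods
        # Otherwise at least one period failed; the key is dropped here and
        # would be handled by the failure path instead.
    return confirmed
```

Requiring success in every one of the N periods guards against an alarm that disappears briefly after self-healing and then recurs.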
4. The method of claim 2, further comprising:
setting the self-healing factor state field to an unavailable state when the execution result is self-healing failure, and reporting the alarm information to a manual alarm processing platform; wherein the unavailable state indicates that alarm information of the same alarm type is to be reported to the manual alarm processing platform.
5. The method according to claim 1, wherein the activation factor includes a self-healing factor state field and an update time field; the self-healing factor state field is a field in the activation factor indicating whether the self-healing instruction is available;
the executing the self-healing instruction according to the state of the activation factor includes:
when the state of the activation factor indicates that the activation factor exists and the self-healing factor state field is available, determining whether the activation factor is within its life cycle according to the update time field of the activation factor, wherein the life cycle is the sum of the time taken to execute the self-healing instruction and the period at which the alarm information is acquired;
silencing the alarm information when the activation factor is within the life cycle;
and when the activation factor is not within the life cycle, executing the self-healing instruction to obtain an execution result, so as to process the alarm information.
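The life cycle test of claim 5 reduces to a comparison against the update time field. The sketch below assumes times are expressed in seconds and treats the boundary as exclusive; `ActivationFactor`, `decide`, and the returned action names are hypothetical, and the `"create"` and `"report"` branches only mirror claims 2 and 4 for completeness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivationFactor:
    available: bool    # self-healing factor state field
    updated_at: float  # update time field, in seconds


def decide(factor: Optional[ActivationFactor], now: float,
           exec_seconds: float, poll_seconds: float) -> str:
    """Choose what to do with an incoming alarm for this activation factor."""
    if factor is None:
        return "create"   # no activation factor exists yet
    if not factor.available:
        return "report"   # self-healing is marked unavailable
    # Life cycle = time to execute the instruction + alarm acquisition period.
    life_cycle = exec_seconds + poll_seconds
    if now - factor.updated_at < life_cycle:
        return "silence"  # a self-healing run may still be taking effect
    return "execute"
```

For example, with a 30-second instruction and a 60-second acquisition period, an alarm arriving 50 seconds after the last update is silenced, while one arriving 90 seconds or more after it triggers the self-healing instruction again.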
6. The method according to claim 5, wherein the executing the self-healing instruction to obtain an execution result, so as to process the alarm information, when the activation factor is not within the life cycle comprises:
acquiring the state topology tree of the corresponding cluster according to the cluster identifier in each piece of alarm information;
locating the storage medium of the alarming abnormal node according to the serial number of the abnormal node in each piece of alarm information and the state topology tree of the corresponding cluster;
evaluating the storage medium to obtain an evaluation result;
and when the activation factor is not within the life cycle, executing the self-healing instruction according to the evaluation result to obtain an execution result, so as to process the alarm information.
7. The method according to claim 6, wherein the executing the self-healing instruction according to the evaluation result to obtain an execution result, so as to process the alarm information, comprises:
when the evaluation result indicates that the storage medium is damaged, migrating the data in the abnormal node to a normal node in the cluster;
and removing the abnormal node from the cluster and shielding the alarm, so as to process the alarm information.
8. The method according to claim 6, wherein the executing the self-healing instruction according to the evaluation result to obtain an execution result, so as to process the alarm information, further comprises:
when the evaluation result indicates that the clock of the storage medium needs to be corrected, correcting the clock, so as to process the alarm information;
and when the evaluation result indicates that the utilization rate of the storage medium exceeds a set upper limit value and the clock of the storage medium is normal, restarting the abnormal node, so as to process the alarm information.
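The evaluation-driven branches of claims 7 and 8 can be pictured as a small dispatch table. The sketch below is illustrative only: the `Evaluation` members, the action strings, and in particular the order in which the checks are applied (damage before clock before utilization) are assumptions, since the claims do not state a priority among them.

```python
from enum import Enum
from typing import List


class Evaluation(Enum):
    MEDIUM_DAMAGED = "medium_damaged"
    CLOCK_NEEDS_CORRECTION = "clock_needs_correction"
    USAGE_OVER_LIMIT = "usage_over_limit"
    NO_ACTION = "no_action"


def evaluate(damaged: bool, clock_ok: bool,
             usage: float, upper_limit: float) -> Evaluation:
    """Classify the storage medium of the abnormal node (assumed ordering)."""
    if damaged:
        return Evaluation.MEDIUM_DAMAGED
    if not clock_ok:
        return Evaluation.CLOCK_NEEDS_CORRECTION
    if usage > upper_limit:
        # Reached only when the clock is normal, as claim 8 requires.
        return Evaluation.USAGE_OVER_LIMIT
    return Evaluation.NO_ACTION


def plan_actions(result: Evaluation) -> List[str]:
    """Map an evaluation result to the ordered self-healing actions."""
    plans = {
        Evaluation.MEDIUM_DAMAGED: ["migrate_data", "remove_node", "shield_alarm"],
        Evaluation.CLOCK_NEEDS_CORRECTION: ["correct_clock"],
        Evaluation.USAGE_OVER_LIMIT: ["restart_node"],
        Evaluation.NO_ACTION: [],
    }
    return list(plans[result])
```

Data migration is listed before node removal so that the data on the abnormal node is moved to a normal node before the abnormal node leaves the cluster.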
9. The method of any of claims 2 to 4, further comprising:
and when creating the activation factor, determining the state value of the self-healing factor state field according to the maintenance state and the change state of the cluster to which the cluster identifier belongs, the creation timestamp of the activation factor, the last self-healing timestamp, and the self-healing time interval.
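Claim 9 names the inputs used to set the state field but not how they are combined. One plausible combination, shown purely as an assumption, is to mark self-healing unavailable while the cluster is under maintenance or being changed, or when the previous self-healing run finished more recently than the self-healing time interval. The function and parameter names are hypothetical.

```python
from typing import Optional


def initial_state_available(in_maintenance: bool,
                            in_change: bool,
                            created_at: float,
                            last_healed_at: Optional[float],
                            min_interval: float) -> bool:
    """Decide the state field for a newly created activation factor.

    Assumed rule (not stated in the claim): self-healing is available only
    when the cluster is neither under maintenance nor being changed, and at
    least `min_interval` seconds separate the last self-healing run from
    the creation of this activation factor.
    """
    if in_maintenance or in_change:
        return False
    if last_healed_at is None:
        return True  # the cluster has never been self-healed before
    return created_at - last_healed_at >= min_interval
```

Under this rule, an alarm raised during a planned change window is not acted on automatically, which avoids self-healing interfering with operator work.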
10. An alarm processing apparatus, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire alarm information of each cluster in at least one file system cluster;
a first determining module, configured to determine the alarm type and cluster identifier of each piece of alarm information;
a matching module, configured to match a corresponding self-healing instruction according to the alarm type of each piece of alarm information;
a second determining module, configured to determine the state of the activation factor corresponding to each piece of alarm information according to the cluster identifier and the alarm type of that alarm information;
and an execution module, configured to execute the self-healing instruction according to the state of the activation factor, so as to process the alarm information.
11. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 9 when executing the program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011430102.9A CN112650642A (en) 2020-12-07 2020-12-07 Alarm processing method and device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112650642A true CN112650642A (en) 2021-04-13

Family

ID=75350713

Country Status (1)

Country Link
CN (1) CN112650642A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105337765A (en) * 2015-10-10 2016-02-17 上海新炬网络信息技术有限公司 Distributed hadoop cluster fault automatic diagnosis and restoration system
WO2017107656A1 (en) * 2015-12-25 2017-06-29 中兴通讯股份有限公司 Virtualized network element failure self-healing method and device
CN107832200A (en) * 2017-10-24 2018-03-23 平安科技(深圳)有限公司 Alert processing method, device, computer equipment and storage medium
CN110232010A (en) * 2019-06-18 2019-09-13 深圳前海微众银行股份有限公司 A kind of alarm method, alarm server and monitoring server
CN110660198A (en) * 2019-09-29 2020-01-07 广东美的制冷设备有限公司 Alarm information processing method and device and household appliance
CN111049705A (en) * 2019-12-23 2020-04-21 深圳前海微众银行股份有限公司 Method and device for monitoring distributed storage system
CN111343009A (en) * 2020-02-14 2020-06-26 腾讯科技(深圳)有限公司 Service alarm notification method and device, storage medium and electronic equipment
CN111522859A (en) * 2020-03-23 2020-08-11 深圳奇迹智慧网络有限公司 Alarm analysis method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LU FANG ET AL.: "Self-Healing Scheme in Alert Operating State for Smart Distribution Systems", 《JOURNAL OF ELECTRICAL ENGINEERING & TECHNOLOGY》, vol. 14, no. 3, 31 March 2019 (2019-03-31) *
吴凯兴;: "集群通信告警监控系统设计与实现", 信息通信, no. 04, 15 April 2020 (2020-04-15) *
姚浩: "基于大数据的告警信息处理和故障设备定位技术研究", 《电网与清洁能源》, vol. 30, no. 12, 31 December 2014 (2014-12-31) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434327A (en) * 2021-07-13 2021-09-24 上海浦东发展银行股份有限公司 Fault processing system, method, equipment and storage medium
CN115190008A (en) * 2022-07-08 2022-10-14 中国建设银行股份有限公司 Fault processing method, fault processing device, electronic device and storage medium
CN115190008B (en) * 2022-07-08 2024-05-03 中国建设银行股份有限公司 Fault processing method, fault processing device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109726072B (en) WebLogic server monitoring and alarming method, device and system and computer storage medium
CN113238913B (en) Intelligent pushing method, device, equipment and storage medium for server faults
CN109039740B (en) Method and equipment for processing operation and maintenance monitoring alarm
CN109308252B (en) Fault positioning processing method and device
CN110661659A (en) Alarm method, device and system and electronic equipment
US9176798B2 (en) Computer-readable recording medium, failure prediction device and applicability determination method
CN112311617A (en) Configured data monitoring and alarming method and system
CN112650642A (en) Alarm processing method and device, equipment and storage medium
CN113434327B (en) Fault processing system, method, equipment and storage medium
CN109218102A (en) A kind of alarm monitoring method and system
CN110445650B (en) Detection alarm method, equipment and server
CN110109741B (en) Method and device for managing circular tasks, electronic equipment and storage medium
CN110417586A (en) Service monitoring method, service node, server and computer readable storage medium
CN112765161B (en) Alarm rule matching method and device, electronic equipment and storage medium
CN113656252B (en) Fault positioning method, device, electronic equipment and storage medium
CN112860504A (en) Monitoring method and device, computer storage medium and electronic equipment
CN113342608A (en) Method and device for monitoring streaming computing engine task
CN115102838B (en) Emergency processing method and device for server downtime risk and electronic equipment
CN115037653B (en) Service flow monitoring method, device, electronic equipment and storage medium
CN108255704B (en) Abnormal response method of script calling event and terminal thereof
CN115712521A (en) Cluster node fault processing method, system and medium
CN115580522A (en) Method and device for monitoring running state of container cloud platform
WO2014040470A1 (en) Alarm message processing method and device
CN113391983A (en) Alarm information generation method, device, server and storage medium
CN115705259A (en) Fault processing method, related device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination