WO2020124721A1 - Downtime notification method and device - Google Patents

Downtime notification method and device

Info

Publication number
WO2020124721A1
WO2020124721A1 PCT/CN2019/071879 CN2019071879W WO2020124721A1 WO 2020124721 A1 WO2020124721 A1 WO 2020124721A1 CN 2019071879 W CN2019071879 W CN 2019071879W WO 2020124721 A1 WO2020124721 A1 WO 2020124721A1
Authority
WO
WIPO (PCT)
Prior art keywords
notification
server
target
computer room
preset
Prior art date
Application number
PCT/CN2019/071879
Other languages
English (en)
French (fr)
Inventor
孙云云
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to US17/042,908 (US20210021460A1)
Priority to EP19897731.6A (EP3896904A4)
Publication of WO2020124721A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H04L12/00 Data switching networks
    • H04L12/02 Details
    • H04L12/12 Arrangements for remote connection or disconnection of substations or of equipment thereof
    • H04L12/16 Arrangements for providing special services to substations
    • H04L12/18 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1895 Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for short real-time information, e.g. alarms, notifications, alerts, updates
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0604 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
    • H04L41/0609 Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on severity or priority
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities
    • H04L41/0661 Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities by reconfiguring faulty entities
    • H04L41/0677 Localisation of faults
    • H04L41/0686 Additional information in the notification, e.g. enhancement of specific meta-data
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/226 Delivery according to priorities
    • H04L51/214 Monitoring or handling of messages using selective forwarding
    • H04L51/56 Unified messaging, e.g. interactions between e-mail, instant messaging or converged IP messaging [CPM]

Definitions

  • The present invention relates to the field of computer technology, and in particular to a method and device for notifying downtime.
  • CDN: Content Delivery Network
  • The management server of the above automatic notification system can monitor the running status of the servers in each computer room. When it detects that a server in a computer room is down, the management server can use the location information of the down server, such as its IP address and the number of the computer room where it is located, to obtain all pre-recorded notification methods of that computer room, such as mail, telephone, and instant messaging software. The management server may then determine which of these notification methods has the highest preset priority.
  • The management server can send a restart message recording the location information of the down server to the technicians in the computer room where the down server is located, using the notification method with the highest priority, so that the technicians can find the down server based on its location information and restart it.
  • the management server sends the restart message again using a notification method with a lower priority, until it detects that the down server has restarted successfully.
  • The notification method with the highest preset priority may notify poorly, so that the technicians in the computer room do not see the restart message in time and the restart message has to be sent repeatedly through other notification methods. This not only takes up more system resources but also prevents the down server from being restarted in time, so the service quality of the above automatic notification system is poor.
  • embodiments of the present invention provide a method and device for notifying downtime.
  • the technical solution is as follows:
  • a method for notifying downtime includes:
  • the adjusting the priority of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications of the target computer room includes:
  • the priority of all the notification methods is adjusted.
  • the adjusting the priority of all the notification methods according to the weight values of all the notification methods includes:
  • the notification method corresponding to the highest restart success rate among the at least two notification methods is adjusted to be the notification method with the highest priority
  • the notification method corresponding to the shortest notification time among the at least two notification methods is adjusted to be the notification method with the highest priority.
  • the method further includes:
  • the priority of all the notification methods corresponding to the target computer room is adjusted.
  • the method further includes:
  • the target server is marked as a failed server, a feedback message recording the location information of the failed server is generated, and the feedback message is sent to the target computer room.
  • the method further includes:
  • the method further includes:
  • every preset period, the downtime change value for the current preset period is obtained, and it is determined whether the downtime change value is greater than the preset change value, where the downtime change value includes the number of servers that went down during the current preset period and the number of servers that were successfully restarted;
  • a device for notifying downtime includes:
  • the data recording module is used to determine the target computer room where the target server is located when it is monitored that the target server is down;
  • the data processing module is used to adjust the priority of all notification modes corresponding to the target computer room based on the statistical parameters of previous notifications of the target computer room;
  • the automatic notification module is configured to send a restart message for the target server to the target computer room according to the notification method with the highest priority.
  • data processing module is specifically used for:
  • the priority of all the notification methods is adjusted.
  • data processing module is also specifically used for:
  • the notification method corresponding to the highest restart success rate among the at least two notification methods is adjusted to be the notification method with the highest priority
  • the notification method corresponding to the shortest notification time among the at least two notification methods is adjusted to be the notification method with the highest priority.
  • data processing module is also used to:
  • the automatic notification module is also used to:
  • data processing module is also used to:
  • the target server is marked as a failed server, and a feedback message in which the positioning information of the failed server is recorded is generated;
  • the automatic notification module is also used to:
  • data processing module is also used to:
  • the data recording module is also used to:
  • the data processing module is also used to:
  • the downtime change value includes the number of servers that went down during the current preset period and the number of servers that were successfully restarted;
  • In a third aspect, a management server is provided, which includes a processor and a memory.
  • The memory stores at least one instruction, at least one program, a code set, or an instruction set.
  • The at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the downtime notification method described in the first aspect.
  • A computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set.
  • The at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the downtime notification method described in the first aspect.
  • the target computer room where the target server is located is determined; based on the statistical parameters of previous notifications to the target computer room, the priority of all notification methods corresponding to the target computer room is adjusted, where the statistical parameters include at least the notification time and the restart success rate; and a restart message for the target server is sent to the target computer room according to the notification method with the highest priority.
  • In this way, the priority of all notification methods corresponding to a computer room can be flexibly adjusted according to the statistical parameters of previous notifications to the computer room where the server is located, and the server restart message can then be sent to the computer room using the notification method with the highest priority after adjustment. Each restart message is thus sent through the notification method best suited to the current situation, that is, the one with a shorter notification time and a higher restart success rate, so the first notification is likely to succeed without the need to repeatedly send the restart message through other notification methods. This not only saves system resources but also allows the down server to be restarted in time, effectively improving the service quality of the automatic notification system.
  • FIG. 1 is a flowchart of a method for notifying downtime according to an embodiment of the disclosure
  • FIG. 2 is a schematic structural diagram of a device for notifying downtime according to an embodiment of the disclosure
  • FIG. 3 is a schematic structural diagram of an apparatus for notifying downtime according to an embodiment of the disclosure
  • FIG. 4 is a schematic structural diagram of a management server according to an embodiment of the present invention.
  • An embodiment of the present invention provides a method for notifying downtime.
  • the execution subject of the method may be a management server of any manufacturer, and the management server may be any server or a server cluster composed of multiple servers.
  • The management server can monitor the running state of the servers in a computer room and, after detecting that a server in the computer room is down, send a restart message for restarting the down server to the computer room through the notification method with the highest priority for that computer room.
  • the above-mentioned management server may include a processor, a memory, and a transceiver.
  • The processor may be used to carry out the downtime notification processing described below, the memory may be used to store the data required and the data generated during that processing, and the transceiver may be used to receive and send the relevant data during that processing.
  • the monitoring function of the foregoing management server may also be implemented by other servers.
  • a management server with a monitoring function is used for description, and other situations are similar, and will not be described one by one.
  • Step 101 When it is detected that the target server is down, determine the target computer room where the target server is located.
  • The management personnel of the manufacturer can compile, in advance, the location information of the servers in each computer room and the notification methods of each computer room.
  • The location information can be the computer room number, cabinet number, IP address, etc.
  • The notification method can be mail, telephone, instant messaging software, etc.
  • The management personnel of the manufacturer may store this information in a dedicated storage device or on the management server, and may update or modify it. In this way, when the management server detects that a server (which may be called the target server) is down, it can determine the computer room where the target server is located, that is, the target computer room, based on the location information of the servers in each computer room, obtained from the storage device or locally.
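  • As a minimal sketch of Step 101, the lookup from a down server to its computer room and that room's notification methods could look like the following. The dictionaries, the `ServerLocation` type, the room and cabinet identifiers, and the function name are all hypothetical stand-ins for the pre-recorded information; the source does not specify a storage format.

```python
from typing import NamedTuple, Optional


class ServerLocation(NamedTuple):
    """Location information of one server (hypothetical layout)."""
    room_number: str
    cabinet_number: str


# Hypothetical pre-recorded data: server IP -> location, room -> notification methods.
SERVER_LOCATIONS = {
    "1.1.1.1": ServerLocation(room_number="room-01", cabinet_number="cab-07"),
}
ROOM_NOTIFICATION_METHODS = {
    "room-01": ["mail", "telephone", "instant_messaging"],
}


def find_target_room(server_ip: str) -> Optional[str]:
    """Return the room number of the down server, or None if it is not recorded."""
    location = SERVER_LOCATIONS.get(server_ip)
    return location.room_number if location else None


target_room = find_target_room("1.1.1.1")
methods = ROOM_NOTIFICATION_METHODS.get(target_room, [])
```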
  • Step 102 Based on the statistical parameters of previous notifications of the target computer room, adjust the priority of all notification methods corresponding to the target computer room.
  • The management server may compile in advance the statistical parameters of previous notifications to the target computer room. The statistical parameters may include at least the notification time and the restart success rate. The notification time may be the average time, over previous notifications sent by a given notification method, from sending the restart message until the server restarts successfully.
  • The restart success rate may be the proportion of notifications sent by a given notification method that led to a successful restart of the corresponding down server, out of the total number of notifications sent by that method.
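  • The two statistical parameters can be illustrated with a short sketch. The record layout and function name are hypothetical, and the code assumes that the notification time is averaged over successful notifications only, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationRecord:
    """One previous notification to a computer room (hypothetical record layout)."""
    method: str                          # e.g. "mail", "telephone"
    seconds_to_restart: Optional[float]  # None if the restart never succeeded


def notification_stats(records, method):
    """Return (notification_time, restart_success_rate) for one notification method.

    Assumption: the notification time is averaged over successful notifications only.
    """
    sent = [r for r in records if r.method == method]
    if not sent:
        return None, None
    succeeded = [r.seconds_to_restart for r in sent if r.seconds_to_restart is not None]
    success_rate = len(succeeded) / len(sent)
    avg_time = sum(succeeded) / len(succeeded) if succeeded else None
    return avg_time, success_rate


# Hypothetical history for one computer room.
history = [
    NotificationRecord("mail", 600.0),
    NotificationRecord("mail", None),
    NotificationRecord("telephone", 120.0),
    NotificationRecord("telephone", 180.0),
]
mail_time, mail_rate = notification_stats(history, "mail")
phone_time, phone_rate = notification_stats(history, "telephone")
```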
  • the management server can adjust the priority of all notification methods corresponding to the target computer room in real time based on the statistical parameters of previous notifications of the target computer room to obtain the current priority of each notification method.
  • the processing in step 102 may be as follows: based on the preset weight ratios of the notification time and the restart success rate, calculate the weight values of all notification methods corresponding to the target computer room; adjust all notification methods according to the weight values of all notification methods Priority.
  • the preset weight ratio of the notification time and the restart success rate may be set, for example, the preset weight ratio of the notification time is set to 40%, and the preset weight ratio of the restart success rate is 60%.
  • the management server can calculate the weight value of each notification method corresponding to the target computer room based on the preset weight ratios of the notification time and the restart success rate, and then can adjust the priority of each notification method according to the weight value of each notification method.
  • the calculation formula of the notification time of each notification mode corresponding to the target computer room may be:
  • the calculation formula of the restart success rate of each notification mode corresponding to the above target computer room may be:
  • X₁ represents the restart success rate of notification method 1
  • n₁ represents the number of times a restart message was sent by notification method 1.
  • the weight value of each notification method can be calculated.
  • the calculation formula of the weight value can be:
  • Y₁ represents the weight value of notification method 1.
  • the management server may adjust the priority of each notification method according to the size of each weight value.
  • The process of adjusting the priority of all notification methods according to their weight values may be as follows: when the weight values of all notification methods are different, the notification method with the smallest weight value is adjusted to be the notification method with the highest priority; or, when at least two notification methods share the smallest weight value, the one among them with the highest restart success rate is adjusted to be the notification method with the highest priority; or, when at least two notification methods share the smallest weight value and also have the same restart success rate, the one among them with the shortest notification time is adjusted to be the notification method with the highest priority.
  • the management server may adjust the priority of all notification methods corresponding to the target computer room according to the weight values of the respective notification methods and the values of different statistical parameters. Take the three notification methods corresponding to the target computer room as an example.
  • The notification time, restart success rate, and weight value corresponding to the three notification methods can be written as T₁, X₁, Y₁; T₂, X₂, Y₂; and T₃, X₃, Y₃. Assuming that Y₁ < Y₂ < Y₃, the management server can adjust the notification method corresponding to Y₁ to be the notification method with the highest priority.
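  • The selection rule described above (smallest weight value first, then highest restart success rate, then shortest notification time) can be sketched as a single sort key. The weight values are taken as given inputs here, and the type name, function name, and sample numbers are hypothetical.

```python
from typing import NamedTuple


class MethodStats(NamedTuple):
    name: str
    weight: float        # weight value Y (the smallest value wins, per the text)
    success_rate: float  # restart success rate X
    notify_time: float   # notification time T


def highest_priority(methods):
    """Smallest weight, then highest success rate, then shortest notification time."""
    return min(methods, key=lambda m: (m.weight, -m.success_rate, m.notify_time))


# Hypothetical values: all weights differ, so the smallest weight wins.
distinct = [
    MethodStats("mail", 0.30, 0.80, 600.0),
    MethodStats("telephone", 0.20, 0.90, 120.0),
    MethodStats("instant_messaging", 0.25, 0.85, 300.0),
]
# Equal smallest weights: the higher restart success rate breaks the tie.
tied_weight = [
    MethodStats("mail", 0.20, 0.80, 100.0),
    MethodStats("telephone", 0.20, 0.90, 120.0),
]
# Equal weights and success rates: the shorter notification time breaks the tie.
tied_both = [
    MethodStats("mail", 0.20, 0.90, 600.0),
    MethodStats("instant_messaging", 0.20, 0.90, 300.0),
]
```

  • Encoding the three rules as one tuple key keeps the tie-breaking order explicit; negating the success rate makes a higher rate sort first.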
  • Step 103 Send a restart message for the target server to the target computer room according to the notification method with the highest priority.
  • Using the notification method that has the highest priority after the real-time adjustment, the management server may send the restart message carrying the positioning information of the target server to the external communication equipment of the target computer room, such as the telephones, computers, and smartphones used by the technicians of the computer room.
  • The technician in the computer room can find the target server based on the positioning information of the target server and restart the target server.
  • In this way, the management server sends the restart message for the target server through the notification method best suited to the current situation, that is, the notification method with the highest priority after adjustment based on the statistical parameters of previous notifications to the target computer room. The first notification is therefore likely to succeed, without the need to repeatedly send restart messages through other notification methods, which not only saves system resources but also allows the down server to be restarted in time, effectively improving the service quality of the automatic notification system.
  • the following processing may also be performed: acquiring and displaying the positioning information of the target server and the restart progress of the target server.
  • the management server may display the restart progress of the target server.
  • the restart progress displayed by the management server may be "downtime server IP : 1.1.1.1, current progress: a restart message has been sent, and a return receipt message will be returned".
  • the following processing may be performed: determine whether the target server is the first server in the target computer room to go down; if so, obtain the default notification method of the target computer room and send a restart message for the target server to the target computer room according to the default notification method; if not, adjust the priority of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room.
  • the management server may pre-mark one of the multiple notification methods as the default notification method based on the setting requirements of the administrator, and mark the remaining notification methods among the multiple notification methods as candidate notification methods. In this way, when the management server monitors that the target server is down, it can determine whether the target server is the first server in the target computer room to be down. If the target server is the first server in the target computer room to go down, the management server can send a restart message to the target computer room according to the default notification method of the target computer room. If the target server is not the first server in the target computer room to go down, the management server can adjust the priority of all notification methods corresponding to the target computer room according to the statistical parameters of previous notifications in the target computer room.
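  • A minimal sketch of this branch follows. It assumes that "first server in the room to go down" can be detected as an empty notification history for the room; the function names and the stand-in ranking function are hypothetical.

```python
def pick_notification_method(previous_notifications, default_method, rank_methods):
    """Choose the notification method for a newly detected down server.

    previous_notifications: past notifications for the room (empty on first downtime).
    rank_methods: callable that returns the methods ordered by adjusted priority.
    """
    if not previous_notifications:
        # No history to compute statistics from: fall back to the default method.
        return default_method
    return rank_methods(previous_notifications)[0]


def demo_rank(history):
    """Stand-in ranking for the demo only: pretend "telephone" currently ranks first."""
    return ["telephone", "mail"]


first = pick_notification_method([], "mail", demo_rank)
later = pick_notification_method(["earlier notification"], "mail", demo_rank)
```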
  • the following processing can also be performed: if the number of times the target server has gone down within the preset duration exceeds the preset number of times, mark the target server as a failed server, generate a feedback message recording the location information of the failed server, and send the feedback message to the target computer room.
  • the management server may set the highest frequency (which may be referred to as a preset number of times) that the server allows for downtime within a preset time period, for example, the maximum number of downtimes within 15 days is 3 or 5 times. In this way, when the target server is monitored for downtime, the management server can obtain the previous downtime information of the target server, and determine whether the number of downtimes of the target server within the preset time period exceeds the preset number of times.
  • the management server can mark the target server as a failed server, generate a feedback message recording the location information of the failed server, and then send the feedback message to the target computer room, for example by mail or instant messaging software, to notify the corresponding technical personnel to carry out a focused investigation of the target server.
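  • The failed-server check can be sketched as a count of downtimes inside a sliding window. The function and parameter names are hypothetical; the defaults of 15 days and 3 times follow the example values in the text, and the sample dates are invented for illustration.

```python
from datetime import datetime, timedelta


def is_failed_server(downtimes, now, preset_duration=timedelta(days=15), preset_count=3):
    """Return True if the server went down more than preset_count times in the window."""
    recent = [t for t in downtimes if now - preset_duration <= t <= now]
    return len(recent) > preset_count


# Hypothetical downtime history for one server.
now = datetime(2019, 1, 20)
downtimes = [datetime(2019, 1, day) for day in (2, 8, 12, 16, 19)]

exceeds_three = is_failed_server(downtimes, now)                 # 4 downtimes in window
exceeds_five = is_failed_server(downtimes, now, preset_count=5)
```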
  • the following processing can also be performed: if the number of down servers exceeds the preset value, calculate the proportion of each preset device attribute among all the down servers and determine whether there is a target device attribute whose proportion is greater than its rated ratio; if so, generate a feedback message recording the target device attribute and its proportion, and send the feedback message to the management personnel.
  • The management server can set the total number of servers allowed to be down at the same time (which may be called the preset value) and the maximum proportion allowed for each device attribute among the down servers (which may be called the rated ratio). When the management server detects that the number of servers down at a given time exceeds the preset value, it can obtain the device attributes of all down servers (which may be called preset device attributes), where the preset device attributes can include the hardware attributes and software attributes of the servers.
  • the management server can calculate in turn that the proportions of the above preset device attributes are 15%, 20%, 50%, and 15%, respectively. Assuming that the rated ratios of the above preset device attributes are 30%, 25%, 40%, and 20%, respectively, the management server can determine that the preset device attribute whose proportion is greater than its rated ratio (which may be called the target device attribute) is software 1.
  • The management server can generate a notification message including the above target device attribute and its proportion, such as "Hello, XX year XX month XX hour XX minutes, the number of downtime servers is 100, and software 1 accounts for 50%. The rated ratio has been exceeded, please deal with it in time, thank you!", and send the notification message to the manufacturer's management personnel, for example by mail or instant messaging software.
  • the management personnel of the manufacturer can find the corresponding software 1 based on the content of the feedback message and perform corresponding processing.
  • the rated ratios of the above preset device attributes can be set and adjusted according to specific downtime conditions, which is not limited in this embodiment.
  • The above process of calculating the proportion of each preset device attribute among all the down servers and determining whether there is a target device attribute whose proportion is greater than its rated ratio can be as follows: calculate the proportion of a preset device attribute and determine whether it is greater than the rated ratio corresponding to that preset device attribute; if so, determine that preset device attribute to be a target device attribute; otherwise, calculate the proportion of the next preset device attribute.
  • The management server may calculate the proportion of each preset device attribute in turn and determine whether that preset device attribute is a target device attribute. Specifically, still taking the device attributes CPU model 1, CPU model 2, software 1, and software 2 as an example, the management server may first calculate the proportion of CPU model 1 and determine whether that proportion is greater than the rated ratio. If the proportion is greater than the rated ratio corresponding to the preset device attribute, the management server may determine that preset device attribute to be a target device attribute and store the target device attribute and its proportion.
  • Otherwise, the management server may determine that the preset device attribute is not a target device attribute, skip it, calculate the proportion of the next preset device attribute, CPU model 2, and repeat the above process until every preset device attribute has been checked.
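  • The attribute-by-attribute check can be sketched as a simple loop. The function name is hypothetical, and the per-attribute counts are an assumption derived from the proportions (15%, 20%, 50%, 15%) and the total of 100 down servers used in the text's example.

```python
def find_target_attributes(attribute_counts, total_down, rated_ratios):
    """Return {attribute: proportion} for attributes whose proportion exceeds its rated ratio."""
    targets = {}
    for attribute, count in attribute_counts.items():
        proportion = count / total_down
        if proportion > rated_ratios[attribute]:
            targets[attribute] = proportion  # store the target attribute and its proportion
        # otherwise skip this attribute and move on to the next one
    return targets


# Assumed counts reproducing the example proportions with 100 down servers.
counts = {"CPU model 1": 15, "CPU model 2": 20, "software 1": 50, "software 2": 15}
rated = {"CPU model 1": 0.30, "CPU model 2": 0.25, "software 1": 0.40, "software 2": 0.20}

targets = find_target_attributes(counts, total_down=100, rated_ratios=rated)
```

  • With these inputs only software 1 exceeds its rated ratio, which matches the example in the text.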
  • This embodiment also provides a notification method, and the specific processing may be as follows: every preset period, obtain the downtime change value for the current preset period and determine whether the downtime change value is greater than the preset change value, where the downtime change value includes the number of servers that went down during the current preset period and the number of servers that were successfully restarted; if so, calculate the proportion of each preset device attribute among all the servers that went down during the current preset period, generate a feedback message recording the current preset period, each preset device attribute, and the proportion of each preset device attribute, and send the feedback message to the management personnel; if not, obtain the proportion of each preset device attribute among all the down servers in the above preset period, generate a feedback message recording the current preset period, each preset device attribute, and the proportion of each preset device attribute, and send the feedback message to the management personnel.
  • In order to avoid large-scale server downtime, the management server can obtain the sum of the number of servers that newly went down in the current preset period and the number of servers that were successfully restarted (which may be called the downtime change value), and set an upper limit for the downtime change value (which may be called the preset change value). Specifically, taking a preset period of 24 hours and a preset change value of 50 as an example, and considering that fewer users use network services in the early hours of each day, the management server can be set to obtain the downtime change value for the past 24 hours at 24:00 every day.
  • the management server can calculate that the change in downtime in the past 24 hours is 70, greater than the preset change of 50. Then, the management server can calculate the proportion of the device attributes of the server that has been down within the past 24 hours, determine the target device attributes, and generate the corresponding period including the above-mentioned preset period, each preset device attribute, and each preset device attribute The feedback message of the ratio is sent to the management personnel of the manufacturer.
  • the management server can obtain the downtime change value in the past 24 hours every fixed time period, such as every 3 hours or 4 hours. Specifically, the management server may calculate the proportion of the preset device attributes of the server that is down during each time period, determine the target device attributes, and generate the preset period, each preset device attribute, and each preset device attribute
  • the corresponding proportion of the feedback message is sent to the management personnel of the manufacturer.
  • the feedback message can be sent by mail or instant messaging software.
  • the content of the feedback message can be in the form of text or chart.
  • the management personnel can find the corresponding server according to the content of the feedback message, and carry out maintenance and shelf constraints to ensure the quality of network services provided by the manufacturer. In this way, the frequency of server downtime can be fundamentally reduced, and the service quality of each server can be improved.
  • In this embodiment, when the target server is detected to be down, the target computer room where the target server is located is determined; based on the statistical parameters of previous notifications to the target computer room, the priorities of all notification methods corresponding to the target computer room are adjusted, where the statistical parameters include at least a notification time and a restart success rate; and a restart message for the target server is sent to the target computer room using the notification method with the highest priority.
  • In this way, the priorities of all notification methods corresponding to a computer room can be adjusted flexibly according to the statistical parameters of previous notifications to that computer room, and the restart message can then be sent using the notification method that has the highest priority after adjustment. Each restart message is thus sent by the notification method best suited to the current situation, that is, one with a shorter notification time and a higher restart success rate, so the notification is very likely to succeed on the first attempt without being resent by other notification methods. This saves system resources, allows the down server to be restarted in time, and effectively improves the service quality of the automatic notification system.
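  • Based on the same technical concept, an embodiment of the present invention also provides a downtime notification device. As shown in FIG. 2, the device includes:
  • a data recording module 201, configured to determine, when a target server is detected to be down, the target computer room in which the target server is located;
  • a data processing module 202, configured to adjust the priorities of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room;
  • an automatic notification module 203, configured to send a restart message for the target server to the target computer room using the notification method with the highest priority.
  • Further, the data processing module 202 is specifically configured to:
  • calculate the weight values of all the notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate;
  • adjust the priorities of all the notification methods according to their weight values.
  • Further, the data processing module 202 is also specifically configured to:
  • when the weight values of all the notification methods differ, set the notification method with the smallest weight value as the notification method with the highest priority;
  • or, when at least two notification methods share the smallest weight value, set, among those notification methods, the one with the highest restart success rate as the notification method with the highest priority;
  • or, when at least two notification methods share the smallest weight value and have the same restart success rate, set, among those notification methods, the one with the shortest notification time as the notification method with the highest priority.
  • Further, the data processing module 202 is also configured to: determine whether the target server is the first server in the target computer room to go down and, if so, obtain the default notification method of the target computer room;
  • the automatic notification module 203 is also configured to: send a restart message for the target server to the target computer room using the default notification method.
  • Further, the data processing module 202 is also configured to:
  • if the number of times the target server has gone down within a preset duration exceeds a preset count, mark the target server as a faulty server and generate a feedback message recording the location information of the faulty server;
  • the automatic notification module 203 is also configured to: send the feedback message to the target computer room.
  • Further, the data processing module 202 is also configured to: if the number of down servers exceeds a preset number, calculate the proportion of each preset device attribute among all down servers and determine whether there is a target device attribute whose proportion is greater than its rated proportion; if so, generate a feedback message recording the target device attribute and its proportion, and send the feedback message to the management personnel.
  • Further, the data recording module 201 is also configured to: every preset period, obtain the downtime change value of the servers that went down in the current preset period;
  • the data processing module 202 is also configured to: determine whether the downtime change value is greater than the preset change value and generate the corresponding feedback message for the management personnel,
  • where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully.
  • Further, as shown in FIG. 3, the device also includes a data display module 204, configured to: obtain and display the location information of the target server and the restart progress of the target server.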
  • The management server 400 may vary considerably with configuration or performance, and may include one or more central processing units 422 (for example, one or more processors), a memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444.
  • The memory 432 and the storage medium 430 may provide transient or persistent storage.
  • The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the scheduling device.
  • The central processing unit 422 may be configured to communicate with the storage medium 430 and to execute, on the management server 400, the series of instruction operations in the storage medium 430.
  • The management server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, and/or one or more operating systems 431, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • The management server 400 may include a memory and one or more programs, where the one or more programs are stored in the memory, are configured to be executed by one or more processors, and include instructions for performing the above downtime notification.
  • A person of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium.
  • The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

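The present invention discloses a downtime notification method and belongs to the field of computer technology. The method comprises: when a target server is detected to be down, determining the target computer room in which the target server is located; adjusting, based on statistical parameters of previous notifications to the target computer room, the priorities of all notification methods corresponding to the target computer room, where the statistical parameters include at least a notification time and a restart success rate; and sending a restart message for the target server to the target computer room using the notification method with the highest priority. The invention can save system resources and effectively improve the service quality of an automatic notification system.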

Description

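Method and device for notifying downtime
Technical Field
The present invention relates to the field of computer technology, and in particular to a downtime notification method and device.
Background
With the rapid growth of Internet services, more and more vendors, such as CDN (Content Delivery Network) providers and cloud computing vendors, deploy computer rooms across the country. These vendors typically use an automatic notification system which, after a server in a computer room goes down, notifies the technicians of that computer room to restart the down server.
The management server of such an automatic notification system can monitor the running state of the servers in each computer room. When it detects that a server in some computer room is down, the management server can use the location information of the down server, such as its IP address and the number of the computer room it is in, to obtain all pre-recorded notification methods for that computer room, such as email, telephone, and instant messaging software. The management server then determines, among these notification methods, the one with the highest preset priority, and uses it to deliver a restart message recording the down server's location information to the technicians of that computer room, so that they can locate the down server from that information and restart it. In addition, if, after the restart message has been sent by the highest-priority notification method, the down server is found not to have restarted successfully within a preset duration, the management server sends the restart message again by a lower-priority notification method, until the down server is observed to have restarted successfully.
In the course of implementing the present invention, the inventor found that the prior art has at least the following problem:
The notification method with the highest preset priority may perform poorly, so that the computer room technicians do not notice the restart message in time and the message has to be resent by other notification methods. This not only consumes more system resources but also prevents the down server from being restarted in time, so the service quality of the automatic notification system is poor.
Summary
To solve the problems of the prior art, embodiments of the present invention provide a downtime notification method and device. The technical solutions are as follows: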
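In a first aspect, a downtime notification method is provided, the method comprising:
when a target server is detected to be down, determining the target computer room in which the target server is located;
adjusting, based on statistical parameters of previous notifications to the target computer room, the priorities of all notification methods corresponding to the target computer room, where the statistical parameters include at least a notification time and a restart success rate;
sending a restart message for the target server to the target computer room using the notification method with the highest priority.
Further, the adjusting, based on the statistical parameters of previous notifications to the target computer room, of the priorities of all notification methods corresponding to the target computer room comprises:
calculating the weight values of all the notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate;
adjusting the priorities of all the notification methods according to their weight values.
Further, the adjusting of the priorities of all the notification methods according to their weight values comprises:
when the weight values of all the notification methods differ, setting the notification method with the smallest weight value as the notification method with the highest priority;
or, when at least two notification methods share the smallest weight value, setting, among those notification methods, the one with the highest restart success rate as the notification method with the highest priority;
or, when at least two notification methods share the smallest weight value and have the same restart success rate, setting, among those notification methods, the one with the shortest notification time as the notification method with the highest priority.
Further, after the determining of the target computer room in which the target server is located when the target server is detected to be down, the method further comprises:
determining whether the target server is the first server in the target computer room to go down;
if so, obtaining the default notification method of the target computer room and sending a restart message for the target server to the target computer room using the default notification method;
if not, adjusting the priorities of all the notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room.
Further, after the determining of the target computer room in which the target server is located when the target server is detected to be down, the method further comprises:
if the number of times the target server has gone down within a preset duration exceeds a preset count, marking the target server as a faulty server, generating a feedback message recording the location information of the faulty server, and sending the feedback message to the target computer room.
Further, after the target server is detected to be down, the method further comprises:
if the number of down servers exceeds a preset number, calculating the proportion of each preset device attribute among all down servers, and determining whether there is a target device attribute whose proportion is greater than its rated proportion;
if so, generating a feedback message recording the target device attribute and its proportion, and sending the feedback message to the management personnel.
Further, the method also comprises:
every preset period, obtaining the downtime change value of the servers that went down in the current preset period and determining whether the downtime change value is greater than a preset change value, where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully;
if so, calculating the proportion of each preset device attribute among all servers that went down in the current preset period, generating a feedback message recording the current preset period, each preset device attribute and its proportion, and sending the feedback message to the management personnel;
if not, obtaining the proportion of each preset device attribute among all servers that went down in the previous preset period, generating a feedback message recording the current preset period, each preset device attribute and its proportion, and sending the feedback message to the management personnel.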
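In a second aspect, a downtime notification device is provided, the device comprising:
a data recording module, configured to determine, when a target server is detected to be down, the target computer room in which the target server is located;
a data processing module, configured to adjust the priorities of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room;
an automatic notification module, configured to send a restart message for the target server to the target computer room using the notification method with the highest priority.
Further, the data processing module is specifically configured to:
calculate the weight values of all the notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate;
adjust the priorities of all the notification methods according to their weight values.
Further, the data processing module is also specifically configured to:
when the weight values of all the notification methods differ, set the notification method with the smallest weight value as the notification method with the highest priority;
or, when at least two notification methods share the smallest weight value, set, among those notification methods, the one with the highest restart success rate as the notification method with the highest priority;
or, when at least two notification methods share the smallest weight value and have the same restart success rate, set, among those notification methods, the one with the shortest notification time as the notification method with the highest priority.
Further, the data processing module is also configured to:
determine whether the target server is the first server in the target computer room to go down;
if so, obtain the default notification method of the target computer room;
the automatic notification module is also configured to:
send a restart message for the target server to the target computer room using the default notification method.
Further, the data processing module is also configured to:
if the number of times the target server has gone down within a preset duration exceeds a preset count, mark the target server as a faulty server and generate a feedback message recording the location information of the faulty server;
the automatic notification module is also configured to:
send the feedback message to the target computer room.
Further, the data processing module is also configured to:
if the number of down servers exceeds a preset number, calculate the proportion of each preset device attribute among all down servers, and determine whether there is a target device attribute whose proportion is greater than its rated proportion;
if so, generate a feedback message recording the target device attribute and its proportion, and send the feedback message to the management personnel.
Further, the data recording module is also configured to:
every preset period, obtain the downtime change value of the servers that went down in the current preset period;
the data processing module is also configured to:
determine whether the downtime change value is greater than a preset change value, where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully;
if so, calculate the proportion of each preset device attribute among all servers that went down in the current preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel;
if not, obtain the proportion of each preset device attribute among all servers that went down in the previous preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel.
In a third aspect, a management server is provided. The management server includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the downtime notification method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the downtime notification method of the first aspect.
In this embodiment, when the target server is detected to be down, the target computer room in which it is located is determined; based on the statistical parameters of previous notifications to the target computer room, the priorities of all notification methods corresponding to the target computer room are adjusted, where the statistical parameters include at least a notification time and a restart success rate; and a restart message for the target server is sent to the target computer room using the notification method with the highest priority. In this way, the priorities of all notification methods corresponding to a computer room can be adjusted flexibly according to the statistical parameters of previous notifications to that computer room, and the restart message can then be sent using the notification method that has the highest priority after adjustment. Each restart message is thus sent by the notification method best suited to the current situation, that is, one with a shorter notification time and a higher restart success rate, so the notification is very likely to succeed on the first attempt without being resent by other notification methods. This saves system resources, allows the down server to be restarted in time, and effectively improves the service quality of the automatic notification system. In addition, notifying the management personnel of the device attributes, downtime times, and downtime frequency of the down servers in all computer rooms makes it easier for them to analyze the causes and trends of server downtime, so that they can maintain the servers and apply corresponding deployment (racking) constraints in a targeted way, reducing the frequency of server downtime at its source and improving the service quality of each server.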
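Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a downtime notification method provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a downtime notification device provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a downtime notification device provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a management server provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.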
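An embodiment of the present invention provides a downtime notification method. The method may be executed by the management server of any vendor; the management server may be a single server or a server cluster composed of multiple servers. The management server can monitor the running state of the servers in the computer rooms and, after detecting that a server in some computer room is down, send that computer room a restart message for the down server using the computer room's highest-priority notification method. The management server may include a processor, a memory, and a transceiver: the processor may perform the downtime notification processing in the flow below, the memory may store the data required and generated during that processing, and the transceiver may receive and send the relevant data. It can be understood that the monitoring function of the management server may also be implemented by another server. This embodiment is described using a management server that has the monitoring function; other cases are similar and are not described one by one.
The processing flow of the downtime notification method shown in FIG. 1 is described in detail below with reference to specific implementations. The content may be as follows:
Step 101: when the target server is detected to be down, determine the target computer room in which the target server is located.
In implementation, the vendor's management personnel can compile in advance the location information of the servers in each computer room and the notification methods of each computer room. The location information may be a computer room number, a cabinet number, an IP address, and so on; the notification methods may be email, telephone, instant messaging software, and so on. The management personnel can then store this information in a dedicated storage device or in the management server, and can update or modify it. In this way, when the management server detects that a server (the target server) has gone down, it can determine the computer room in which the target server is located (the target computer room) from the server location information obtained from the storage device or locally.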
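The lookup in step 101 can be sketched as follows. This is a minimal illustration only: the `ServerLocation` type, the `INVENTORY` and `ROOM_METHODS` tables, the room identifiers, and the function name are all hypothetical and stand in for whatever pre-recorded location data the management server actually holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerLocation:
    room_id: str   # computer room number
    cabinet: str   # cabinet number
    ip: str        # server IP address


# Hypothetical pre-recorded inventory, keyed by server IP.
INVENTORY = {
    "1.1.1.1": ServerLocation(room_id="room-01", cabinet="A-03", ip="1.1.1.1"),
    "1.1.1.2": ServerLocation(room_id="room-02", cabinet="B-11", ip="1.1.1.2"),
}

# Hypothetical pre-recorded notification methods for each computer room.
ROOM_METHODS = {
    "room-01": ["email", "phone", "im"],
    "room-02": ["phone", "email"],
}


def find_target_room(server_ip):
    """Return (room_id, notification methods) for a down server, or None if unknown."""
    loc = INVENTORY.get(server_ip)
    if loc is None:
        return None
    return loc.room_id, ROOM_METHODS.get(loc.room_id, [])
```

Keying the inventory by IP address is one possible choice; the text equally allows a lookup by computer room number or cabinet number.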
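Step 102: based on the statistical parameters of previous notifications to the target computer room, adjust the priorities of all notification methods corresponding to the target computer room.
In implementation, the management server can compile in advance the statistical parameters of previous notifications to the target computer room. The statistical parameters may include at least a notification time and a restart success rate. The notification time may be the average, over previous notifications, of the time from sending a restart message by a given notification method to the target server restarting successfully. The restart success rate may be the number of notifications by a given notification method that resulted in the corresponding down server restarting successfully, as a proportion of the total number of notifications sent by that method. After determining the target computer room, the management server can thus adjust the priorities of all notification methods corresponding to the target computer room in real time based on these statistical parameters, obtaining the current priority of each notification method.
Optionally, step 102 may be processed as follows: calculate the weight values of all notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate; then adjust the priorities of all notification methods according to their weight values.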
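In implementation, the preset weight ratios of the notification time and the restart success rate can be configured, for example 40% for the notification time and 60% for the restart success rate. The management server can then calculate the weight value of each notification method corresponding to the target computer room from these preset weight ratios, and adjust the priority of each notification method according to its weight value. Specifically, the notification time of each notification method corresponding to the target computer room may be calculated as:
$$T_1 = \frac{t_1 + t_2 + \cdots + t_a}{a}$$
where $T_1$ is the notification time of notification method 1; $a$ is the number of times the target server restarted successfully after a restart message was sent by notification method 1; and $t_a$ is the time from the $a$-th sending of a restart message by notification method 1 to the target server restarting successfully.
The restart success rate of each notification method corresponding to the target computer room may be calculated as:
$$X_1 = \frac{a}{n_1}$$
where $X_1$ is the restart success rate of notification method 1, and $n_1$ is the number of times a restart message was sent by notification method 1.
From the notification time and restart success rate of each notification method corresponding to the target computer room, the weight value of each notification method can be calculated as:
$$Y_1 = T_1 \times 40\% + (1 - X_1) \times 60\%$$
where $Y_1$ is the weight value of notification method 1.
After calculating the weight value of each notification method corresponding to the target computer room, the management server can adjust the priority of each notification method according to the magnitude of the weight values.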
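The three quantities above can be worked through in a short sketch. It assumes the notification time is an arithmetic mean over successful notifications and that times are measured in hours; the unit, the function names, and the sample values are illustrative assumptions and are not given in the text.

```python
def notification_time(durations):
    """Average time from sending a restart message to a successful restart.

    `durations` holds one entry per successful notification (t_1 .. t_a).
    """
    if not durations:
        raise ValueError("no successful notifications recorded")
    return sum(durations) / len(durations)


def restart_success_rate(successes, sent):
    """Share of notifications that led to a successful restart (a / n)."""
    if sent <= 0:
        raise ValueError("no notifications sent")
    return successes / sent


def weight_value(t, x, time_ratio=0.4, success_ratio=0.6):
    """Weight value Y = T * 40% + (1 - X) * 60%; a smaller value ranks higher."""
    return t * time_ratio + (1 - x) * success_ratio


# Illustrative values only; hours are an assumed unit.
durations_hours = [0.5, 1.0, 1.5]
t1 = notification_time(durations_hours)
x1 = restart_success_rate(successes=3, sent=4)
y1 = weight_value(t1, x1)
```

Because a shorter notification time and a higher success rate both lower the weight value, the method with the smallest value is the preferred one, which matches the priority rule that follows.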
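Optionally, the adjusting of the priorities of all notification methods according to their weight values may be processed as follows: when the weight values of all notification methods differ, set the notification method with the smallest weight value as the highest-priority notification method; or, when at least two notification methods share the smallest weight value, set, among them, the one with the highest restart success rate as the highest-priority notification method; or, when at least two notification methods share the smallest weight value and have the same restart success rate, set, among them, the one with the shortest notification time as the highest-priority notification method.
In implementation, after obtaining the weight values of all notification methods corresponding to the target computer room, the management server can adjust their priorities according to the weight value of each notification method and the values of the individual statistical parameters. Take a target computer room with three notification methods as an example, whose notification times, restart success rates, and weight values are $T_1$, $X_1$, $Y_1$; $T_2$, $X_2$, $Y_2$; and $T_3$, $X_3$, $Y_3$ respectively. If $Y_1 < Y_2 < Y_3$, the management server can set the notification method corresponding to $Y_1$ as the highest-priority notification method. If $Y_1 = Y_2 < Y_3$, the management server can compare $X_1$ and $X_2$; if $X_1 > X_2$, it can set the notification method corresponding to $X_1$ as the highest-priority one. If $Y_1 = Y_2 < Y_3$ and $X_1 = X_2$, the management server can compare $T_1$ and $T_2$; if $T_1 < T_2$, it can set the notification method corresponding to $T_1$ as the highest-priority one.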
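The selection rule can be expressed as a single sort key. The sketch below is one possible reading: the text only specifies how the highest-priority method is chosen, so ordering the remaining methods by the same key is an assumption, as are the `MethodStats` type and the sample numbers (the weight values are given directly to force a tie rather than derived from the formula).

```python
from typing import NamedTuple


class MethodStats(NamedTuple):
    name: str
    t: float  # notification time T
    x: float  # restart success rate X
    y: float  # weight value Y


def rank_methods(stats):
    """Order methods from highest to lowest priority.

    Smallest Y first; ties on Y go to the higher X; ties on both go to the shorter T.
    """
    return sorted(stats, key=lambda m: (m.y, -m.x, m.t))


# Illustrative values only, chosen so that two methods tie on Y.
methods = [
    MethodStats("email", t=1.2, x=0.70, y=0.5),
    MethodStats("phone", t=0.8, x=0.90, y=0.5),
    MethodStats("im", t=0.6, x=0.60, y=0.7),
]
ranked = rank_methods(methods)
```

Here "email" and "phone" share the smallest weight value, so the higher restart success rate decides and "phone" is ranked first.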
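Step 103: send a restart message for the target server to the target computer room using the notification method with the highest priority.
In implementation, after adjusting the priorities of the notification methods corresponding to the target computer room, the management server can use the notification method that currently has the highest priority to deliver a restart message carrying the target server's location information to the external communication devices of the target computer room, such as the telephones, computers, and smartphones used by its technicians. The technicians can then locate the target server from its location information and restart it. In this way, the management server sends each restart message by the notification method best suited to the current situation, that is, the highest-priority method as adjusted from the statistical parameters of previous notifications to the target computer room. The notification is therefore very likely to succeed on the first attempt, without resending the restart message by other notification methods, which saves system resources, allows the down server to be restarted in time, and effectively improves the service quality of the automatic notification system.
Optionally, after step 103, the following processing may also be performed: obtain and display the location information of the target server and the restart progress of the target server.
In implementation, after sending the restart message to the computer room, the management server can display the restart progress of the target server. Taking a target server with IP address 1.1.1.1 as an example, if the management server has sent the restart message to the target computer room but has not yet received an acknowledgment, the displayed restart progress may be "Down server IP: 1.1.1.1, current progress: restart message sent, awaiting acknowledgment".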
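Optionally, after the target computer room is determined when the target server is detected to be down, the following processing may also be performed: determine whether the target server is the first server in the target computer room to go down; if so, obtain the default notification method of the target computer room and send a restart message for the target server to the target computer room using the default notification method; if not, adjust the priorities of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room.
In implementation, the management server can, as configured by the management personnel, mark one of the notification methods in advance as the default notification method and mark the remaining ones as candidate notification methods. When the management server detects that the target server is down, it can determine whether the target server is the first server in the target computer room to go down. If it is, the management server can send a restart message for the target server to the target computer room using the computer room's default notification method. If it is not, the management server can adjust the priorities of all notification methods corresponding to the target computer room according to the statistical parameters of previous notifications to the target computer room.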
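A minimal sketch of this branch follows. It assumes that "first server to go down" can be detected by an empty downtime history for the computer room; that interpretation, the function name, and the parameter names are hypothetical.

```python
def choose_method(room_history, default_method, ranked_methods):
    """Pick the notification method for a newly detected down server.

    room_history: earlier downtime events recorded for the room (empty if none).
    default_method: the method marked as default by the management personnel.
    ranked_methods: methods already ordered by adjusted priority, best first.
    """
    if not room_history:
        # First server to go down in this room: no notification statistics exist yet.
        return default_method
    return ranked_methods[0]
```

The default method is needed because the statistical parameters used for ranking only exist once at least one earlier notification has been sent to the room.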
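Optionally, after the target computer room is determined when the target server is detected to be down, the following processing may also be performed: if the number of times the target server has gone down within a preset duration exceeds a preset count, mark the target server as a faulty server, generate a feedback message recording the faulty server's location information, and send the feedback message to the target computer room.
In implementation, servers that go down frequently can be marked as faulty to prompt the technicians of the corresponding computer room to examine the marked servers closely and analyze the cause of the downtime. Specifically, the management server can set the maximum number of times a server is allowed to go down within a preset duration (the preset count), for example at most 3 or 5 times within 15 days. When the target server is detected to be down, the management server can obtain the target server's downtime history and determine whether the number of times it has gone down within the preset duration exceeds the preset count. If so, the management server can mark the target server as a faulty server, generate a feedback message recording the faulty server's location information, and send the feedback message to the target computer room, for example by email or instant messaging software, to notify the relevant technicians to examine the target server closely.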
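The frequency check can be sketched as a sliding-window count. The 15-day window and the limit of 3 come from the example in the text; the function names, the message wording, and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta


def is_faulty(downtime_events, now, window=timedelta(days=15), max_count=3):
    """Return True if the server went down more than max_count times in the window ending at now."""
    recent = [t for t in downtime_events if now - window <= t <= now]
    return len(recent) > max_count


def build_feedback(location):
    """Feedback message recording the faulty server's location information."""
    return (
        f"Faulty server: room {location['room']}, cabinet {location['cabinet']}, "
        f"IP {location['ip']}. Please investigate."
    )


now = datetime(2019, 1, 16, 12, 0)
# Downtime events 1, 3, 7, 10 and 20 days ago; four of them fall inside the window.
events = [now - timedelta(days=d) for d in (1, 3, 7, 10, 20)]
flagged = is_faulty(events, now)
```

With four events inside the 15-day window and a preset count of 3, the server is marked as faulty; the event from 20 days ago is ignored.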
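Optionally, after the target server is detected to be down, the following processing may also be performed: if the number of down servers exceeds a preset number, calculate the proportion of each preset device attribute among all down servers and determine whether there is a target device attribute whose proportion is greater than its rated proportion; if so, generate a feedback message recording the target device attribute and its proportion, and send the feedback message to the management personnel.
In implementation, considering that large-scale server downtime may affect the quality of the network services the vendor provides, the management server can set the total number of servers allowed to be down at the same time (the preset number) and the maximum proportion allowed for each device attribute of the down servers (the rated proportion). When the management server detects that the number of servers down at some moment exceeds the preset number, it can obtain the device attributes of all down servers (the preset device attributes), which may include the hardware attributes, software attributes, and so on of all servers. Take a preset number of 90 and 100 servers currently down as an example. Suppose the numbers of servers corresponding to the preset device attributes are 15 for CPU model 1, 20 for CPU model 2, 50 for software 1, and 15 for software 2; the management server can then calculate their proportions in turn as 15%, 20%, 50%, and 15%. Suppose the rated proportions of these preset device attributes are 30%, 25%, 40%, and 20% respectively; the management server can determine that the preset device attribute whose proportion exceeds its rated proportion (the target device attribute) is software 1. The management server can then generate a notification message containing the target device attribute and its proportion, such as "Hello, at XX:XX in month XX of year XX, 100 servers are down, of which software 1 accounts for 50%, exceeding its rated proportion. Please handle this promptly. Thank you!", and send the notification message to the vendor's management personnel, for example by email or instant messaging software. The management personnel can then find the corresponding software 1 from the content of the feedback message and handle it accordingly. It should be noted that the rated proportion of each preset device attribute can be set and adjusted according to the actual downtime situation, which is not limited in this embodiment.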
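The worked example above can be reproduced in a few lines. The counts, rated proportions, and preset number are taken from the text; the function name and the dictionary layout are assumptions made for the sketch.

```python
def find_target_attributes(counts, total_down, rated, preset_number):
    """Return {attribute: proportion} for attributes whose share exceeds its rated share.

    Nothing is reported unless the number of down servers exceeds preset_number.
    """
    if total_down <= preset_number:
        return {}
    targets = {}
    for attr, count in counts.items():
        proportion = count / total_down
        if proportion > rated[attr]:
            targets[attr] = proportion
    return targets


counts = {"CPU model 1": 15, "CPU model 2": 20, "software 1": 50, "software 2": 15}
rated = {"CPU model 1": 0.30, "CPU model 2": 0.25, "software 1": 0.40, "software 2": 0.20}
targets = find_target_attributes(counts, total_down=100, rated=rated, preset_number=90)
```

Only software 1 is returned, since its 50% share is the only one above its rated proportion (40%).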
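Optionally, the calculating of the proportion of each preset device attribute among all down servers and the determining of whether there is a target device attribute whose proportion is greater than its rated proportion may specifically be processed as follows: calculate the proportions of the preset device attributes one at a time and determine whether the proportion is greater than the rated proportion of the corresponding preset device attribute; if so, determine that the preset device attribute is a target device attribute; otherwise, calculate the proportion of the next preset device attribute.
In implementation, the management server can calculate the proportion of each preset device attribute in turn and determine whether that attribute is a target device attribute. Specifically, again taking the device attributes CPU model 1, CPU model 2, software 1, and software 2 as an example, the management server can first calculate the proportion of CPU model 1 and determine whether it is greater than the rated proportion. If the proportion is greater than the rated proportion of that preset device attribute, the management server can determine the attribute to be a target device attribute and store the target device attribute and its proportion. If the proportion is less than the rated proportion of that preset device attribute, the management server can determine that the attribute is not a target device attribute, skip it, calculate the proportion of the next preset device attribute, CPU model 2, and repeat the above process until every preset device attribute has been checked.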
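Optionally, this embodiment also provides a notification method, which may be processed as follows: every preset period, obtain the downtime change value of the servers that went down in the current preset period and determine whether it is greater than the preset change value, where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully; if so, calculate the proportion of each preset device attribute among all servers that went down in the current preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel; if not, obtain the proportion of each preset device attribute among all servers that went down in the previous preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel.
In implementation, to guard against large-scale server downtime, the management server can, every preset period, obtain the sum of the number of servers that newly went down in the current preset period and the number of servers that restarted successfully (the downtime change value), and set an upper limit on that value (the preset change value). Specifically, taking a preset period of 24 hours and a preset change value of 50 servers as an example, and considering that fewer users use network services in the early morning, the management server can be set to obtain the downtime change value for the past 24 hours at 24:00 every day. Assuming that 30 servers newly went down and 40 servers restarted successfully, the management server can calculate that the downtime change value over the past 24 hours is 70, which is greater than the preset change value of 50. The management server can then calculate the proportions of the device attributes of the servers that went down in the past 24 hours, determine the target device attributes, generate a feedback message containing the preset period, each preset device attribute and its proportion, and send it to the vendor's management personnel.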
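The periodic check reduces to one sum and one comparison. In the sketch below, the figures 30, 40, and 50 come from the example in the text; the function names and the use of placeholder strings for the per-period statistics are hypothetical.

```python
def downtime_change_value(newly_down, restarted):
    """Sum of servers that newly went down and servers that restarted successfully."""
    return newly_down + restarted


def pick_report_stats(change_value, preset_change_value, current_stats, previous_stats):
    """Choose which period's attribute proportions go into the feedback message."""
    if change_value > preset_change_value:
        return current_stats
    return previous_stats


change = downtime_change_value(newly_down=30, restarted=40)
report = pick_report_stats(change, 50, current_stats="current", previous_stats="previous")
```

With a change value of 70 against a preset change value of 50, the proportions for the current period are reported; a value at or below the limit would fall back to the previous period's proportions.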
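It is worth mentioning that the management server can obtain the downtime change value for the past 24 hours at fixed intervals, for example every 3 or 4 hours. Specifically, the management server can calculate the proportions of the preset device attributes of the servers that went down in each interval, determine the target device attributes, generate a feedback message containing the preset period, each preset device attribute and its proportion, and send it to the vendor's management personnel, for example by email or instant messaging software; the content of the feedback message may be text or charts. The management personnel can then find the corresponding servers from the content of the feedback message, carry out maintenance, and apply deployment (racking) constraints to safeguard the quality of the network services the vendor provides. In this way, the frequency of server downtime can be reduced at its source and the service quality of each server improved.
The beneficial effects of the technical solutions provided by the embodiments of the present invention are:
In this embodiment, when the target server is detected to be down, the target computer room in which it is located is determined; based on the statistical parameters of previous notifications to the target computer room, the priorities of all notification methods corresponding to the target computer room are adjusted, where the statistical parameters include at least a notification time and a restart success rate; and a restart message for the target server is sent to the target computer room using the notification method with the highest priority. In this way, the priorities of all notification methods corresponding to a computer room can be adjusted flexibly according to the statistical parameters of previous notifications to that computer room, and the restart message can then be sent using the notification method that has the highest priority after adjustment. Each restart message is thus sent by the notification method best suited to the current situation, that is, one with a shorter notification time and a higher restart success rate, so the notification is very likely to succeed on the first attempt without being resent by other notification methods. This saves system resources, allows the down server to be restarted in time, and effectively improves the service quality of the automatic notification system. In addition, notifying the management personnel of the device attributes, downtime times, and downtime frequency of the down servers in all computer rooms makes it easier for them to analyze the causes and trends of server downtime, so that they can maintain the servers and apply corresponding deployment (racking) constraints in a targeted way, reducing the frequency of server downtime at its source and improving the service quality of each server.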
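Based on the same technical concept, an embodiment of the present invention also provides a downtime notification device. As shown in FIG. 2, the device includes:
a data recording module 201, configured to determine, when a target server is detected to be down, the target computer room in which the target server is located;
a data processing module 202, configured to adjust the priorities of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room;
an automatic notification module 203, configured to send a restart message for the target server to the target computer room using the notification method with the highest priority.
Further, the data processing module 202 is specifically configured to:
calculate the weight values of all the notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate;
adjust the priorities of all the notification methods according to their weight values.
Further, the data processing module 202 is also specifically configured to:
when the weight values of all the notification methods differ, set the notification method with the smallest weight value as the notification method with the highest priority;
or, when at least two notification methods share the smallest weight value, set, among those notification methods, the one with the highest restart success rate as the notification method with the highest priority;
or, when at least two notification methods share the smallest weight value and have the same restart success rate, set, among those notification methods, the one with the shortest notification time as the notification method with the highest priority.
Further, the data processing module 202 is also configured to:
determine whether the target server is the first server in the target computer room to go down;
if so, obtain the default notification method of the target computer room;
the automatic notification module 203 is also configured to:
send a restart message for the target server to the target computer room using the default notification method.
Further, the data processing module 202 is also configured to:
if the number of times the target server has gone down within a preset duration exceeds a preset count, mark the target server as a faulty server and generate a feedback message recording the location information of the faulty server;
the automatic notification module 203 is also configured to:
send the feedback message to the target computer room.
Further, the data processing module 202 is also configured to:
if the number of down servers exceeds a preset number, calculate the proportion of each preset device attribute among all down servers, and determine whether there is a target device attribute whose proportion is greater than its rated proportion;
if so, generate a feedback message recording the target device attribute and its proportion, and send the feedback message to the management personnel.
Further, the data recording module 201 is also configured to:
every preset period, obtain the downtime change value of the servers that went down in the current preset period;
the data processing module 202 is also configured to:
determine whether the downtime change value is greater than a preset change value, where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully;
if so, calculate the proportion of each preset device attribute among all servers that went down in the current preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel;
if not, obtain the proportion of each preset device attribute among all servers that went down in the previous preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel.
Further, as shown in FIG. 3, the device also includes a data display module 204, configured to:
obtain and display the location information of the target server and the restart progress of the target server.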
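FIG. 4 is a schematic structural diagram of a management server provided by an embodiment of the present invention. The management server 400 may vary considerably with configuration or performance, and may include one or more central processing units 422 (for example, one or more processors), a memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444. The memory 432 and the storage medium 430 may provide transient or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the scheduling device. Further, the central processing unit 422 may be configured to communicate with the storage medium 430 and to execute, on the management server 400, the series of instruction operations in the storage medium 430.
The management server 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, and/or one or more operating systems 431, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The management server 400 may include a memory and one or more programs, where the one or more programs are stored in the memory, are configured to be executed by one or more processors, and include instructions for performing the above downtime notification.
A person of ordinary skill in the art can understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.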

Claims (16)

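  1. A downtime notification method, characterized in that the method comprises:
    when a target server is detected to be down, determining the target computer room in which the target server is located;
    adjusting, based on statistical parameters of previous notifications to the target computer room, the priorities of all notification methods corresponding to the target computer room, where the statistical parameters include at least a notification time and a restart success rate;
    sending a restart message for the target server to the target computer room using the notification method with the highest priority.
  2. The method according to claim 1, characterized in that the adjusting, based on the statistical parameters of previous notifications to the target computer room, of the priorities of all notification methods corresponding to the target computer room comprises:
    calculating the weight values of all the notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate;
    adjusting the priorities of all the notification methods according to their weight values.
  3. The method according to claim 2, characterized in that the adjusting of the priorities of all the notification methods according to their weight values comprises:
    when the weight values of all the notification methods differ, setting the notification method with the smallest weight value as the notification method with the highest priority;
    or, when at least two notification methods share the smallest weight value, setting, among those notification methods, the one with the highest restart success rate as the notification method with the highest priority;
    or, when at least two notification methods share the smallest weight value and have the same restart success rate, setting, among those notification methods, the one with the shortest notification time as the notification method with the highest priority.
  4. The method according to claim 1, characterized in that, after the determining of the target computer room in which the target server is located when the target server is detected to be down, the method further comprises:
    determining whether the target server is the first server in the target computer room to go down;
    if so, obtaining the default notification method of the target computer room and sending a restart message for the target server to the target computer room using the default notification method;
    if not, adjusting the priorities of all the notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room.
  5. The method according to claim 1, characterized in that, after the determining of the target computer room in which the target server is located when the target server is detected to be down, the method further comprises:
    if the number of times the target server has gone down within a preset duration exceeds a preset count, marking the target server as a faulty server, generating a feedback message recording the location information of the faulty server, and sending the feedback message to the target computer room.
  6. The method according to claim 1, characterized in that, after the target server is detected to be down, the method further comprises:
    if the number of down servers exceeds a preset number, calculating the proportion of each preset device attribute among all down servers, and determining whether there is a target device attribute whose proportion is greater than its rated proportion;
    if so, generating a feedback message recording the target device attribute and its proportion, and sending the feedback message to the management personnel.
  7. The method according to claim 1, characterized in that the method further comprises:
    every preset period, obtaining the downtime change value of the servers that went down in the current preset period and determining whether the downtime change value is greater than a preset change value, where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully;
    if so, calculating the proportion of each preset device attribute among all servers that went down in the current preset period, generating a feedback message recording the current preset period, each preset device attribute and its proportion, and sending the feedback message to the management personnel;
    if not, obtaining the proportion of each preset device attribute among all servers that went down in the previous preset period, generating a feedback message recording the current preset period, each preset device attribute and its proportion, and sending the feedback message to the management personnel.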
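  8. A downtime notification device, characterized in that the device comprises:
    a data recording module, configured to determine, when a target server is detected to be down, the target computer room in which the target server is located;
    a data processing module, configured to adjust the priorities of all notification methods corresponding to the target computer room based on the statistical parameters of previous notifications to the target computer room;
    an automatic notification module, configured to send a restart message for the target server to the target computer room using the notification method with the highest priority.
  9. The device according to claim 8, characterized in that the data processing module is specifically configured to:
    calculate the weight values of all the notification methods corresponding to the target computer room based on the respective preset weight ratios of the notification time and the restart success rate;
    adjust the priorities of all the notification methods according to their weight values.
  10. The device according to claim 9, characterized in that the data processing module is also specifically configured to:
    when the weight values of all the notification methods differ, set the notification method with the smallest weight value as the notification method with the highest priority;
    or, when at least two notification methods share the smallest weight value, set, among those notification methods, the one with the highest restart success rate as the notification method with the highest priority;
    or, when at least two notification methods share the smallest weight value and have the same restart success rate, set, among those notification methods, the one with the shortest notification time as the notification method with the highest priority.
  11. The device according to claim 8, characterized in that the data processing module is also configured to:
    determine whether the target server is the first server in the target computer room to go down;
    if so, obtain the default notification method of the target computer room;
    the automatic notification module is also configured to:
    send a restart message for the target server to the target computer room using the default notification method.
  12. The device according to claim 8, characterized in that the data processing module is also configured to:
    if the number of times the target server has gone down within a preset duration exceeds a preset count, mark the target server as a faulty server and generate a feedback message recording the location information of the faulty server;
    the automatic notification module is also configured to:
    send the feedback message to the target computer room.
  13. The device according to claim 8, characterized in that the data processing module is also configured to:
    if the number of down servers exceeds a preset number, calculate the proportion of each preset device attribute among all down servers, and determine whether there is a target device attribute whose proportion is greater than its rated proportion;
    if so, generate a feedback message recording the target device attribute and its proportion, and send the feedback message to the management personnel.
  14. The device according to claim 8, characterized in that the data recording module is also configured to:
    every preset period, obtain the downtime change value of the servers that went down in the current preset period;
    the data processing module is also configured to:
    determine whether the downtime change value is greater than a preset change value, where the downtime change value comprises the number of servers that went down in the current preset period and the number of servers that restarted successfully;
    if so, calculate the proportion of each preset device attribute among all servers that went down in the current preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel;
    if not, obtain the proportion of each preset device attribute among all servers that went down in the previous preset period, generate a feedback message recording the current preset period, each preset device attribute and its proportion, and send the feedback message to the management personnel.
  15. A management server, characterized in that the management server includes a processor and a memory, the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the downtime notification method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the downtime notification method according to any one of claims 1 to 7.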
PCT/CN2019/071879 2018-12-18 2019-01-16 Method and device for notifying downtime WO2020124721A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US17/042,908 US20210021460A1 (en) 2018-12-18 2019-01-16 Method and device for notifying downtime
EP19897731.6A EP3896904A4 (en) 2018-12-18 2019-01-16 METHOD AND MECHANISM FOR NOTIFYING DOWNTIME

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811555329.9 2018-12-18
CN201811555329.9A CN109639490B (zh) 2018-12-18 2018-12-18 Method and device for notifying downtime

Publications (1)

Publication Number Publication Date
WO2020124721A1 true WO2020124721A1 (zh) 2020-06-25

Family

ID=66075435

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/071879 WO2020124721A1 (zh) 2018-12-18 2019-01-16 Method and device for notifying downtime

Country Status (4)

Country Link
US (1) US20210021460A1 (zh)
EP (1) EP3896904A4 (zh)
CN (1) CN109639490B (zh)
WO (1) WO2020124721A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111142977B (zh) * 2019-12-26 2023-08-18 深圳前海环融联易信息科技服务有限公司 一种定时任务的处理方法、装置、计算机设备及存储介质
CN112000556B (zh) * 2020-07-06 2023-04-28 广州西山居网络科技有限公司 客户端程序宕机显示方法、装置及可读介质
CN111930594A (zh) * 2020-07-28 2020-11-13 滁州惠科光电科技有限公司 目标机台宕机的监控方法、装置及可读存储设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040056971A (ko) * 2002-12-24 2004-07-01 한국전자통신연구원 테라라우터 시스템에서의 장애 처리 방법
CN102111310A (zh) * 2010-12-31 2011-06-29 网宿科技股份有限公司 Cdn设备状态监控方法和系统
CN104009863A (zh) * 2013-02-27 2014-08-27 联想(北京)有限公司 一种服务器系统、及自动获取服务器编号的方法
CN105094030A (zh) * 2015-08-06 2015-11-25 上海卓佑计算机技术有限公司 机房环境数据管理及实时分析处理系统
CN108156329A (zh) * 2018-01-25 2018-06-12 维沃移动通信有限公司 消息发送的方法、移动终端及计算机可读存储介质
CN108737132A (zh) * 2017-04-14 2018-11-02 优酷信息技术(北京)有限公司 一种告警信息处理方法及装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6306499B2 (ja) * 2014-12-25 2018-04-04 クラリオン株式会社 障害情報提供サーバ、障害情報提供方法
CN104965727B (zh) * 2015-04-29 2018-10-26 无锡天脉聚源传媒科技有限公司 一种重启服务器的方法及装置
CN106571972B (zh) * 2015-10-10 2021-02-12 北京国双科技有限公司 服务器的监控方法及装置
US9622180B1 (en) * 2015-12-23 2017-04-11 Intel Corporation Wearable device command regulation
CN106131680B (zh) * 2016-06-28 2019-12-03 青岛海信电器股份有限公司 电视通知显示时长调整方法、装置及电视系统
CN106209599A (zh) * 2016-07-26 2016-12-07 深圳天珑无线科技有限公司 一种信息的通知方法和终端
EP3542272B1 (en) * 2016-11-21 2024-01-31 Everbridge, Inc. Systems and methods for providing a notification system architecture
CN107453906A (zh) * 2017-08-01 2017-12-08 郑州云海信息技术有限公司 一种存储管理系统监控告警的设置方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3896904A4 *

Also Published As

Publication number Publication date
CN109639490B (zh) 2020-09-18
US20210021460A1 (en) 2021-01-21
EP3896904A4 (en) 2022-01-19
CN109639490A (zh) 2019-04-16
EP3896904A1 (en) 2021-10-20

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19897731; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2019897731; Country of ref document: EP; Effective date: 20210714)