CN111290918A - Server running state monitoring method and device and computer readable storage medium

Publication number
CN111290918A
Authority
CN
China
Prior art keywords
server
bmc
error
information
running state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010121452.0A
Other languages
Chinese (zh)
Other versions
CN111290918B (en)
Inventor
王相宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010121452.0A
Publication of CN111290918A
Application granted
Publication of CN111290918B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a method and a device for monitoring the running state of a server, and a computer-readable storage medium. A check information query interface is set on the BMC in advance. When an error is detected in the running state of the server, the current network transmission state is obtained. If the error is of a type that can recover automatically without a manually initiated error correction operation, and the network transmission delay is not less than a preset delay threshold, an alarm mail carrying check information is sent to a target mailbox bound to the server in advance. After receiving the alarm mail, the user determines whether to perform the error correction operation based on whether the check information in the mail is consistent with the original check information obtained through the check information query interface. This addresses the misoperation, and even loss of server data, that can occur when a fault has already recovered automatically by the time a delayed error report is handled; it reduces the probability of user misoperation and improves the stability and reliability of server operation.

Description

Server running state monitoring method and device and computer readable storage medium
Technical Field
The present application relates to the field of BMC management control technologies, and in particular, to a method and an apparatus for monitoring a server running state, and a computer-readable storage medium.
Background
The BMC (Baseboard Management Controller) is a remote management controller for servers and is widely used in large-scale centralized server management. The BMC can monitor the running state of each server in the system in real time, and during monitoring, errors in the running state of a server are inevitable.
In the related art, when the BMC detects an error in the running state of a server, it reports and displays the error promptly. However, operation and maintenance personnel or other staff may not see the reported error, or may not receive the error information in time, and some running state errors recover automatically after they occur. If staff then act on the stale error report, the result may be misoperation and even loss of server data.
In view of this, how to avoid misoperation, and even loss of server data, caused by delayed processing of error reports is a problem to be solved by those skilled in the art.
Disclosure of Invention
The application provides a server running state monitoring method and device and a computer-readable storage medium, which solve the problem of misoperation, and even loss of server data, caused by delayed processing of error reports, reduce the probability of user misoperation, and improve the stability and reliability of server operation.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
In one aspect, an embodiment of the present invention provides a method for monitoring a server running state, in which a check information query interface is set on a BMC in advance, the method including:
when an error is detected in the running state of the server, acquiring the current network transmission state;
if the running state error type of the server is a preset target error type and the network transmission delay is not less than a preset delay threshold, sending an alarm mail carrying check information to a target mailbox pre-bound to the server, so that a user can determine whether to perform an error correction operation based on whether the check information is consistent with original check information queried through the check information query interface;
wherein the target error type is a running state error type that can recover automatically through a specific system operation, without a manually initiated error correction operation; and the check information is determined according to the operating parameters involved in automatically recovering from the server running state error.
Optionally, sending the alarm mail carrying the check information to the target mailbox pre-bound to the server includes:
using the start time of the BMC and the number of restarts of the operating system as the check information, and sending them to the target mailbox together with the alarm mail.
Optionally, enabling the user to determine whether to perform the error correction operation based on the consistency between the check information and the original check information queried from the BMC includes:
after the alarm mail is received, obtaining the check BMC start time and the check restart count from the alarm mail;
obtaining the current BMC start time and the current restart count of the operating system from the BMC through the check information query interface;
if the current restart count is inconsistent with the check restart count, not performing the error correction operation on the server;
if the difference between the current BMC start time and the check BMC start time is greater than a preset time difference, not performing the error correction operation on the server;
if the current restart count is consistent with the check restart count, and the difference between the current BMC start time and the check BMC start time is not greater than the preset time difference, performing the error correction operation on the server based on the alarm information in the alarm mail.
Optionally, after the error in the running state of the server is detected, the method further includes:
packing the running state log information at the moment the running state error is detected to generate a fault detection log packet;
sending the fault detection log packet to the target mailbox and setting a timestamp for the fault detection log packet, the timestamp being the moment at which the running state error of the server was detected.
Optionally, setting the check information query interface on the BMC in advance includes:
defining an IPMI interface and/or a RESTful interface on the BMC in advance as the check information query interface.
Another aspect of the embodiments of the present invention provides a server operation state monitoring apparatus, including:
a query interface predefining module, configured to set a check information query interface on the BMC;
a network delay information acquisition module, configured to acquire the current network transmission state when an error is detected in the running state of the server;
an alarm module, configured to send an alarm mail carrying check information to a target mailbox pre-bound to the server if the running state error type of the server is a preset target error type and the network transmission delay is not less than a preset delay threshold, so that a user can determine whether to perform an error correction operation based on whether the check information is consistent with original check information queried through the check information query interface; the target error type is a running state error type that can recover automatically through a specific system operation, without a manually initiated error correction operation; the check information is determined according to the operating parameters involved in automatically recovering from the server running state error.
Optionally, the alarm module is specifically configured to use the start time of the BMC and the number of restarts of the operating system as the check information and send them to the target mailbox together with the alarm mail.
Optionally, the apparatus further comprises a log packing module, configured to pack the running state log information at the moment the running state error of the server is detected to generate a fault detection log packet, send the fault detection log packet to the target mailbox, and set a timestamp for the fault detection log packet, the timestamp being the moment at which the running state error of the server was detected.
The embodiment of the present invention further provides a server operation state monitoring apparatus, including a processor, where the processor is configured to implement the steps of the server operation state monitoring method according to any one of the preceding items when executing the computer program stored in the memory.
Finally, an embodiment of the present invention provides a computer-readable storage medium on which a server operation state monitoring program is stored; when the program is executed by a processor, the steps of the server operation state monitoring method according to any one of the foregoing embodiments are implemented.
The advantage of the technical solution provided by the application is that, after an error in the server running state is detected, both the running state error type and the current network transmission state are taken into account when sending the mail alarm. For running state error types that can recover automatically through other system operations without a manually initiated error correction operation, and in scenarios where the network is delayed, check information is carried in the alarm mail. The user can use the consistency between the check information in the mail and the check information queried from the BMC as a reference for deciding what action to take on the alarm. This addresses the situation in the related art where a fault has already recovered automatically by the time a delayed error report is handled, leading to misoperation and even loss of server data; it reduces the probability of user misoperation and improves the stability and reliability of server operation.
In addition, the embodiment of the invention also provides a corresponding implementation device and a computer readable storage medium for the server running state monitoring method, so that the method has higher practicability, and the device and the computer readable storage medium have corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the related art, the drawings required to be used in the description of the embodiments or the related art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for monitoring a server operating state according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a specific embodiment of a server operation state monitoring apparatus according to an embodiment of the present invention;
Fig. 3 is a structural diagram of another specific embodiment of a server operation state monitoring apparatus according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.
Having described the technical solutions of the embodiments of the present invention, various non-limiting embodiments of the present application are described in detail below.
Referring to fig. 1, fig. 1 is a schematic flow chart of a server operation state monitoring method according to an embodiment of the present invention, where the embodiment of the present invention may include the following:
S101: A check information query interface is set on the BMC in advance.
It is understood that one or more interfaces, for example an IPMI interface or a RESTful interface, may be defined in the BMC in advance. The interface is exposed externally as the check information query interface, so that a user can read information held inside the BMC through it; the information read is the check information. The check information is determined according to the operating parameters involved in automatically recovering from the server running state error. For example, if a server crashes and reports an error, and the error recovers automatically once the server restarts with no other measures needed, the check information can be the number of restarts and the restart time of the server. The check information depends on the type of server running error: each error type corresponds to one item or one group of check information, which those skilled in the art can preset according to the actual application scenario. If different check information is used for different error types in a given application scenario, the user needs to supply the server running error type when querying, so that the matching check information is returned.
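A minimal sketch of the lookup described above, in which the caller supplies an error type and receives the matching check information. The field names, the error type strings, and the `query_check_info` helper are hypothetical illustrations, not part of the disclosed interface; a real BMC would expose this through IPMI or a RESTful endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckInfo:
    """Check information held by the BMC (field names are illustrative)."""
    bmc_boot_time: float    # BMC start time, seconds since the epoch
    os_restart_count: int   # number of OS restarts recorded by the BMC


# Hypothetical mapping from error type to the check fields that apply to it.
CHECK_FIELDS_BY_ERROR_TYPE = {
    "cpu_internal_error": ("bmc_boot_time", "os_restart_count"),
    "memory_uncorrectable_error": ("bmc_boot_time", "os_restart_count"),
}


def query_check_info(current: CheckInfo, error_type: str) -> dict:
    """Return the check fields that match the given error type."""
    fields = CHECK_FIELDS_BY_ERROR_TYPE.get(error_type)
    if fields is None:
        raise KeyError(f"no check information defined for {error_type!r}")
    return {name: getattr(current, name) for name in fields}
```

Keeping the mapping in one table is one way to let each error type carry its own group of check information without changing the query function.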
S102: When an error is detected in the running state of the server, the current network transmission state is acquired.
In the present application, a server running error may, for example, be identified by certain keywords; any other method of detecting server running state errors described in the related art may also be used, and the present application does not limit this. Once the current network transmission state has been acquired, the network transmission delay can be determined from it.
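Keyword-based identification could look roughly like the following sketch. The keyword table and error type names are assumptions made for illustration; the text does not specify which keywords are used.

```python
# Hypothetical keyword table; a real deployment would define its own.
ERROR_KEYWORDS = {
    "cpu_internal_error": ("CPU IERR", "internal error"),
    "memory_uncorrectable_error": ("uncorrectable ECC",),
}


def classify_log_line(line: str):
    """Return the first error type whose keyword occurs in the line, else None."""
    lowered = line.lower()
    for error_type, keywords in ERROR_KEYWORDS.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return error_type
    return None
```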
S103: Judge whether the running state error type of the server is a preset target error type and whether the network transmission delay is not less than a preset delay threshold; if so, execute S104; if not, execute S105.
The target error type in the present application refers to a running state error type that can recover automatically through a specific system operation, without a manually initiated error correction operation. There may be one target error type or several, which does not affect the implementation of the present application. The preset delay threshold may be determined according to the actual application scenario, for example 200 ms, and is not limited in this application.
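The S103 decision can be sketched as a single predicate. The set of target error types is an assumed example; the 200 ms threshold is the example value given in the text.

```python
# Assumed examples of error types that can recover automatically.
TARGET_ERROR_TYPES = {"cpu_internal_error", "memory_uncorrectable_error"}
DELAY_THRESHOLD_MS = 200.0  # example threshold from the text


def should_attach_check_info(error_type: str, delay_ms: float) -> bool:
    """True selects S104 (mail with check information); False selects S105."""
    return error_type in TARGET_ERROR_TYPES and delay_ms >= DELAY_THRESHOLD_MS
```

Note that "not less than" makes the comparison inclusive, so a delay of exactly 200 ms takes the S104 branch.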
S104: Send the alarm mail carrying the check information to the target mailbox pre-bound to the server, so that a user can determine whether to perform an error correction operation based on whether the check information is consistent with the original check information queried through the check information query interface.
In this embodiment, the BMC monitors the running state of the server and can send alarms to a designated mailbox address via SMTP (Simple Mail Transfer Protocol); when the designated mailbox receives an alarm, operation and maintenance personnel or remote automation take corresponding measures. Some alarms, such as a CPU internal error or an uncorrectable memory error, trigger a server crash and, in some cases, a restart. Some CPU and memory faults recover automatically after the restart, so acting on the alarm information without verification may cause misoperation or even loss of server data. If the network fluctuates heavily, delivery of the alarm mail may be delayed, so the mail that operation and maintenance personnel or other staff receive may describe an alarm raised several minutes ago or even earlier; taking measures at that point without verification may have serious consequences.
The original check information obtained through the check information query interface is updated in real time in the BMC, that is, it accurately reflects the current BMC state, whereas the check information carried in the alarm mail is the BMC state at the moment the alarm mail was sent. The two names are used only to distinguish the two concepts and to express the technical solution of the application clearly and unambiguously. The rule for performing the error correction operation may be set in advance in terms of the original check information and the check information, that is, the error correction operation is performed when the two satisfy a preset condition: for example, when they are consistent, or when the difference between a value in the original check information and the corresponding value in the check information is not greater than a certain value.
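As an illustration of an alarm mail that carries check information, the sketch below builds (but does not send) a message with Python's standard `email` package. The `X-BMC-Boot-Time` and `X-OS-Restart-Count` header names, the subject line, and the `build_alarm_mail` helper are assumptions; the text does not specify how the check information is encoded in the mail.

```python
from email.message import EmailMessage


def build_alarm_mail(sender, recipient, alarm_text,
                     bmc_boot_time=None, os_restart_count=None):
    """Build an alarm mail; attach check information only when both values are given."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "BMC alarm"
    if bmc_boot_time is not None and os_restart_count is not None:
        # Hypothetical custom headers carrying the BMC state at send time.
        msg["X-BMC-Boot-Time"] = str(bmc_boot_time)
        msg["X-OS-Restart-Count"] = str(os_restart_count)
    msg.set_content(alarm_text)
    return msg
```

Calling the helper without the two optional values yields the plain alarm mail of S105; the resulting message could then be handed to `smtplib.SMTP.send_message` for delivery.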
S105: Send the alarm mail directly to the target mailbox.
In the technical solution provided by the embodiment of the invention, after an error in the running state of the server is detected, both the running state error type and the current network transmission state are taken into account when sending the mail alarm. For running state error types that can recover automatically through other system operations without a manually initiated error correction operation, and in scenarios where the network is delayed, check information is carried in the alarm mail. The user can use the consistency between the check information in the mail and the check information queried from the BMC as a reference for deciding what action to take on the alarm. This addresses the situation in the related art where a fault has already recovered automatically by the time a delayed error report is handled, leading to misoperation and even loss of server data; it reduces the probability of user misoperation and improves the stability and reliability of server operation.
The foregoing embodiment does not limit the check information. This embodiment provides one implementation, in which the start time of the BMC and the number of restarts of the operating system are used as the check information and sent to the target mailbox in the alarm mail, and which may include the following:
In this embodiment of the invention, a check of the Operating System (OS) restart count and the BMC start time is added to the BMC. The BMC records its own start time and the number of OS starts/restarts, exposes IPMI externally as the check information query interface, and, when an alarm occurs, sends the BMC start time and the OS restart count along with the alarm mail. Rules can then be customized by comparing the BMC start time and OS restart count in the alarm mail with those obtained from the BMC interface. For example: if the OS restart counts are inconsistent, no restart command is sent, which avoids repeated restarts; if the two BMC start times differ by more than 100 seconds, the alarm is considered to be excessively delayed and therefore invalid.
This embodiment sets two check parameters, the Operating System (OS) restart count and the BMC start time, from which a series of rules can be defined to identify whether an alarm mail is valid and whether any operation is needed, thereby reducing misoperation. How the rules for identifying a valid alarm mail are formulated is not limited; the application also provides one implementation, which may include:
after the alarm mail is received, obtaining the check BMC start time and the check restart count from the alarm mail;
obtaining the current BMC start time and the current restart count of the operating system from the BMC through the check information query interface;
if the current restart count is inconsistent with the check restart count, not performing the error correction operation on the server;
if the difference between the current BMC start time and the check BMC start time is greater than the preset time difference, not performing the error correction operation on the server;
if the current restart count is consistent with the check restart count, and the difference between the current BMC start time and the check BMC start time is not greater than the preset time difference, performing the error correction operation on the server based on the alarm information in the alarm mail.
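The rules listed above can be sketched as one function. The function and parameter names are hypothetical; the default of 100 seconds is the example value given earlier in the description, and taking the absolute value of the time difference is an assumption, since the text only speaks of "the difference".

```python
def should_perform_error_correction(mail_boot_time: float,
                                    mail_restart_count: int,
                                    current_boot_time: float,
                                    current_restart_count: int,
                                    max_boot_time_diff_s: float = 100.0) -> bool:
    """Compare check information from the alarm mail with the BMC's current values."""
    # Rule 1: a changed restart count means the OS has restarted since the alarm.
    if current_restart_count != mail_restart_count:
        return False
    # Rule 2: BMC start times too far apart mean the alarm is treated as stale.
    if abs(current_boot_time - mail_boot_time) > max_boot_time_diff_s:
        return False
    # Both checks passed: act on the alarm information in the mail.
    return True
```

For example, a mail recorded with restart count 3 is rejected once the BMC reports 4, because the fault may already have cleared with that restart.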
As another optional implementation, to help operation and maintenance personnel locate a fault more quickly, after an error in the running state of the server is detected, the running state log information at the moment of detection can be packed to generate a fault detection log packet, which prevents later log entries from overwriting useful ones. The fault detection log packet is then sent to the target mailbox, and a timestamp is set for it, the timestamp being the moment at which the running state error of the server was detected.
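One possible way to pack the logs and stamp the packet with the detection time is sketched below using the standard `tarfile` module. The archive format, the file naming scheme, and the `pack_fault_logs` helper are assumptions for illustration; the text does not prescribe a packet format.

```python
import io
import tarfile
from datetime import datetime, timezone


def pack_fault_logs(logs: dict, detected_at: datetime):
    """Pack log texts into an in-memory .tar.gz; return (file_name, archive_bytes)."""
    # Encode the detection time (converted to UTC) in the packet name.
    stamp = detected_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in logs.items():
            data = text.encode("utf-8")
            member = tarfile.TarInfo(name=name)
            member.size = len(data)
            # Each member also carries the detection time as its mtime.
            member.mtime = int(detected_at.timestamp())
            archive.addfile(member, io.BytesIO(data))
    return f"fault-detection-{stamp}.tar.gz", buffer.getvalue()
```

The returned bytes could be attached to the alarm mail, so the snapshot survives even if later log entries overwrite the originals on the server.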
It should be noted that the steps in the present application have no strict order of execution; as long as the logical order is respected, they may be executed simultaneously or in some preset order. Fig. 1 is only an example and does not mean that this is the only possible execution order.
The embodiment of the invention also provides a corresponding apparatus for the server running state monitoring method, making the method more practical. The apparatus can be described from the perspective of functional modules and from the perspective of hardware. The server operation state monitoring apparatus provided by the embodiment of the present invention is introduced below; the apparatus described below and the method described above may be referred to in correspondence with each other.
From the perspective of functional modules, referring to Fig. 2, which is a structural diagram of a specific implementation of the server operation state monitoring apparatus according to an embodiment of the present invention, the apparatus may include:
A query interface predefining module 201, configured to set a check information query interface on the BMC.
A network delay information obtaining module 202, configured to obtain the current network transmission state when an error is detected in the running state of the server.
An alarm module 203, configured to send an alarm mail carrying the check information to a target mailbox pre-bound to the server if the running state error type of the server is the preset target error type and the network transmission delay is not less than the preset delay threshold, so that a user can determine whether to perform an error correction operation based on whether the check information is consistent with the original check information queried through the check information query interface; the target error type is a running state error type that can recover automatically through a specific system operation, without a manually initiated error correction operation; the check information is determined according to the operating parameters involved in automatically recovering from the server running state error.
Optionally, in some implementations of this embodiment, the apparatus may further include a log packing module, configured to pack the running state log information at the moment the running state error of the server is detected to generate a fault detection log packet, send the fault detection log packet to the target mailbox, and set a timestamp for the fault detection log packet, the timestamp being the moment at which the running state error was detected.
In other implementations of this embodiment, the alarm module 203 may be specifically configured to use the start time of the BMC and the number of restarts of the operating system as the check information and send them to the target mailbox together with the alarm mail.
As another optional implementation, the alarm module may be further specifically configured to:
after the alarm mail is received, obtain the check BMC start time and the check restart count from the alarm mail;
obtain the current BMC start time and the current restart count of the operating system from the BMC through the check information query interface;
if the current restart count is inconsistent with the check restart count, not perform the error correction operation on the server;
if the difference between the current BMC start time and the check BMC start time is greater than the preset time difference, not perform the error correction operation on the server;
if the current restart count is consistent with the check restart count, and the difference between the current BMC start time and the check BMC start time is not greater than the preset time difference, perform the error correction operation on the server based on the alarm information in the alarm mail.
The functions of the functional modules of the server operation state monitoring apparatus in this embodiment of the present invention may be implemented according to the method in the foregoing method embodiment. For the specific implementation process, refer to the description of the method embodiment; details are not repeated here.
Therefore, this embodiment of the present invention addresses the problem of misoperation, and even loss of server data, caused by delayed handling of error information. It reduces the probability of user misoperation and improves the stability and reliability of server operation.
The server operation state monitoring apparatus above is described from the perspective of functional modules. The present application further provides a server operation state monitoring apparatus described from the perspective of hardware. Fig. 3 is a structural diagram of another server operation state monitoring apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus comprises a memory 30 for storing a computer program;
and a processor 31, configured to implement the steps of the server operation state monitoring method according to any of the above embodiments when executing the computer program.
The processor 31 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 31 may be implemented in at least one of the following hardware forms: a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 31 may also include a main processor and a coprocessor. The main processor, also called a Central Processing Unit (CPU), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 31 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 31 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 30 may include one or more computer-readable storage media, which may be non-transitory. Memory 30 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 30 stores at least a computer program 301 which, after being loaded and executed by the processor 31, implements the relevant steps of the server operation state monitoring method disclosed in any of the foregoing embodiments. In addition, the resources stored by the memory 30 may also include an operating system 302, data 303, and the like, and the storage may be transient or permanent. Operating system 302 may include Windows, Unix, Linux, etc. Data 303 may include, but is not limited to, data corresponding to test results.
In some embodiments, the server operation status monitoring device may further include a display screen 32, an input/output interface 33, a communication interface 34, a power source 35, and a communication bus 36.
Those skilled in the art will appreciate that the configuration shown in fig. 3 does not limit the server operation state monitoring device, which may include more or fewer components than those shown, for example sensors 37.
It is to be understood that, if the server operation state monitoring method in the above embodiments is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application may be embodied, in whole or in part, in the form of a software product that is stored in a storage medium and that, when run, executes all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, a register, a hard disk, a removable magnetic disk, a CD-ROM, a magnetic or optical disk, and various other media capable of storing program code.
Based on this, an embodiment of the present invention further provides a computer-readable storage medium storing a server operation state monitoring program which, when executed by a processor, implements the steps of the server operation state monitoring method according to any of the above embodiments.
The functions of the functional modules of the computer-readable storage medium in this embodiment of the present invention may be implemented according to the method in the foregoing method embodiment. For the specific implementation process, refer to the description of the method embodiment; details are not repeated here.
Therefore, this embodiment of the present invention addresses the problem of misoperation, and even loss of server data, caused by delayed handling of error information. It reduces the probability of user misoperation and improves the stability and reliability of server operation.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and for the same or similar parts the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The method, the apparatus and the computer-readable storage medium for monitoring the operating state of the server provided by the present application are described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the description of the embodiments is intended only to assist in understanding the method and its core concepts. It should be noted that those skilled in the art may make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (10)

1. A method for monitoring the running state of a server, characterized in that a check information query interface is set on a BMC in advance, the method comprising:
when a running state error of the server is detected, acquiring the current network transmission state;
if the running state error type of the server is a preset target error type and the network transmission delay is not less than a preset delay threshold, sending an alarm mail carrying check information to a target mailbox pre-bound to the server, so that a user determines whether to perform an error correction operation based on the consistency between the check information and original check information obtained by querying through the check information query interface;
wherein the target error type is a running state error type that can be recovered automatically through a specific system operation, without an error correction operation being performed manually; and the check information is determined according to the operation parameters used to automatically recover from the server running state error.
2. The method for monitoring the operating state of the server according to claim 1, wherein sending the alarm mail carrying the check information to the target mailbox pre-bound to the server comprises:
using the boot time of the BMC and the number of restarts of the operating system as the check information, and sending the check information to the target mailbox synchronously in the alarm mail.
3. The method for monitoring the operating status of the server according to claim 2, wherein enabling the user to determine whether to perform the error correction operation based on the consistency between the check information and the original check information queried from the BMC comprises:
after the alarm mail is received, acquiring the check BMC boot time and the check restart count carried in the alarm mail;
acquiring the current BMC boot time and the current restart count of the operating system from the BMC through the check information query interface;
if the current restart count is inconsistent with the check restart count, performing no error correction operation on the server;
if the difference between the current BMC boot time and the check BMC boot time is greater than a preset time difference, performing no error correction operation on the server;
and if the current restart count is consistent with the check restart count, and the difference between the current BMC boot time and the check BMC boot time is not greater than the preset time difference, performing the error correction operation on the server based on the alarm information in the alarm mail.
4. The server operation state monitoring method according to any one of claims 1 to 3, further comprising, after the running state error of the server is detected:
packing the running state log information captured at the moment the running state error of the server is detected to generate a fault detection log packet;
and sending the fault detection log packet to the target mailbox, and setting a timestamp for the fault detection log packet, wherein the timestamp is the time at which the running state error of the server was detected.
5. The method for monitoring the operating status of the server according to claim 4, wherein setting the check information query interface on the BMC in advance comprises:
defining in advance, on the BMC, an IPMI interface and/or a RESTful interface as the check information query interface.
6. A server operation state monitoring apparatus, comprising:
a query interface predefining module, configured to set a check information query interface on the BMC;
a network delay information acquisition module, configured to acquire the current network transmission state when a running state error of the server is detected;
an alarm module, configured to send an alarm mail carrying check information to a target mailbox pre-bound to the server if the running state error type of the server is a preset target error type and the network transmission delay is not less than a preset delay threshold, so that a user determines whether to perform an error correction operation based on the consistency between the check information and original check information obtained by querying through the check information query interface; wherein the target error type is a running state error type that can be recovered automatically through a specific system operation, without an error correction operation being performed manually; and the check information is determined according to the operation parameters used to automatically recover from the server running state error.
7. The device for monitoring the operation status of the server according to claim 6, wherein the alarm module is specifically configured to use the boot time of the BMC and the number of restarts of the operating system as the check information and to send the check information to the target mailbox synchronously in the alarm mail.
8. The server operation state monitoring device according to claim 6 or 7, further comprising a log packing module, wherein the log packing module is configured to pack the running state log information captured at the moment the running state error of the server is detected to generate a fault detection log packet, to send the fault detection log packet to the target mailbox, and to set a timestamp for the fault detection log packet, wherein the timestamp is the time at which the running state error of the server was detected.
9. A server operation state monitoring apparatus comprising a processor for implementing the steps of the server operation state monitoring method according to any one of claims 1 to 5 when executing a computer program stored in a memory.
10. A computer-readable storage medium, wherein a server operation state monitoring program is stored on the computer-readable storage medium, and when executed by a processor, the server operation state monitoring program implements the steps of the server operation state monitoring method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010121452.0A CN111290918B (en) 2020-02-26 2020-02-26 Server running state monitoring method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010121452.0A CN111290918B (en) 2020-02-26 2020-02-26 Server running state monitoring method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111290918A (en) 2020-06-16
CN111290918B (en) 2022-12-27

Family

ID=71017255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010121452.0A Active CN111290918B (en) 2020-02-26 2020-02-26 Server running state monitoring method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111290918B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111722954A (en) * 2020-06-30 2020-09-29 曙光信息产业(北京)有限公司 Server abnormity positioning method and device, storage medium and server
CN111858239A (en) * 2020-06-30 2020-10-30 浪潮电子信息产业股份有限公司 Server hard disk monitoring method, device, equipment and medium
CN111984490A (en) * 2020-09-28 2020-11-24 苏州浪潮智能科技有限公司 Warning device, method, equipment and medium for illegal operating system starting item
CN111994137A (en) * 2020-09-04 2020-11-27 深圳科安达电子科技股份有限公司 Alarm analysis method based on railway signal centralized monitoring
CN112434904A (en) * 2020-10-23 2021-03-02 国网山东省电力公司日照供电公司 Electric power data communication verification system in electric power network
CN112764991A (en) * 2021-01-19 2021-05-07 苏州浪潮智能科技有限公司 Method, system, device and medium for managing BMC based on image identification
CN113220358A (en) * 2021-04-25 2021-08-06 山东英信计算机技术有限公司 Multi-platform BIOS information storage method, system and medium
CN113778780A (en) * 2020-11-27 2021-12-10 北京京东尚科信息技术有限公司 Application stability determination method and device, electronic equipment and storage medium
CN114328104A (en) * 2021-12-25 2022-04-12 深圳市锐宝智联信息有限公司 Industrial control complete machine health state monitoring method, system, equipment and storage medium
CN114518972A (en) * 2022-02-14 2022-05-20 海光信息技术股份有限公司 Memory error processing method and device, memory controller and processor
CN115114212A (en) * 2022-06-30 2022-09-27 苏州浪潮智能科技有限公司 VPD (virtual private display) flashing method, device, equipment and medium
CN115333970A (en) * 2022-07-22 2022-11-11 苏州浪潮智能科技有限公司 Method and device for evaluating connection stability of equipment, computer equipment and storage medium
CN117076212A (en) * 2023-10-17 2023-11-17 北京卡普拉科技有限公司 Consistency check method, device, medium and equipment for MPI communication data content

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981943A (en) * 2012-10-29 2013-03-20 新浪技术(中国)有限公司 Method and system for monitoring application logs
CN109617733A (en) * 2018-12-24 2019-04-12 浪潮电子信息产业股份有限公司 A kind of mail alarm method, device, server and computer readable storage medium
CN110674005A (en) * 2019-08-30 2020-01-10 苏州浪潮智能科技有限公司 Method and device for monitoring server memory and readable medium

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858239B (en) * 2020-06-30 2022-06-10 浪潮电子信息产业股份有限公司 Server hard disk monitoring method, device, equipment and medium
CN111858239A (en) * 2020-06-30 2020-10-30 浪潮电子信息产业股份有限公司 Server hard disk monitoring method, device, equipment and medium
CN111722954A (en) * 2020-06-30 2020-09-29 曙光信息产业(北京)有限公司 Server abnormity positioning method and device, storage medium and server
CN111994137A (en) * 2020-09-04 2020-11-27 深圳科安达电子科技股份有限公司 Alarm analysis method based on railway signal centralized monitoring
CN111984490A (en) * 2020-09-28 2020-11-24 苏州浪潮智能科技有限公司 Warning device, method, equipment and medium for illegal operating system starting item
CN112434904A (en) * 2020-10-23 2021-03-02 国网山东省电力公司日照供电公司 Electric power data communication verification system in electric power network
CN113778780B (en) * 2020-11-27 2024-05-17 北京京东尚科信息技术有限公司 Application stability determining method and device, electronic equipment and storage medium
CN113778780A (en) * 2020-11-27 2021-12-10 北京京东尚科信息技术有限公司 Application stability determination method and device, electronic equipment and storage medium
CN112764991A (en) * 2021-01-19 2021-05-07 苏州浪潮智能科技有限公司 Method, system, device and medium for managing BMC based on image identification
CN113220358B (en) * 2021-04-25 2023-08-08 山东英信计算机技术有限公司 Multi-platform BIOS information storage method, system and medium
CN113220358A (en) * 2021-04-25 2021-08-06 山东英信计算机技术有限公司 Multi-platform BIOS information storage method, system and medium
CN114328104A (en) * 2021-12-25 2022-04-12 深圳市锐宝智联信息有限公司 Industrial control complete machine health state monitoring method, system, equipment and storage medium
CN114328104B (en) * 2021-12-25 2023-05-16 深圳市锐宝智联信息有限公司 Method, system, equipment and storage medium for monitoring health state of industrial control complete machine
CN114518972A (en) * 2022-02-14 2022-05-20 海光信息技术股份有限公司 Memory error processing method and device, memory controller and processor
CN115114212A (en) * 2022-06-30 2022-09-27 苏州浪潮智能科技有限公司 VPD (virtual private display) flashing method, device, equipment and medium
CN115114212B (en) * 2022-06-30 2023-08-04 苏州浪潮智能科技有限公司 VPD (virtual private digital) refreshing method, device, equipment and medium
CN115333970A (en) * 2022-07-22 2022-11-11 苏州浪潮智能科技有限公司 Method and device for evaluating connection stability of equipment, computer equipment and storage medium
CN115333970B (en) * 2022-07-22 2023-08-11 苏州浪潮智能科技有限公司 Device connection stability evaluation method and device, computer device and storage medium
CN117076212A (en) * 2023-10-17 2023-11-17 北京卡普拉科技有限公司 Consistency check method, device, medium and equipment for MPI communication data content
CN117076212B (en) * 2023-10-17 2024-02-23 北京卡普拉科技有限公司 Consistency check method, device, medium and equipment for MPI communication data content

Also Published As

Publication number Publication date
CN111290918B (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN111290918B (en) Server running state monitoring method and device and computer readable storage medium
CN110224858B (en) Log-based alarm method and related device
CN112948157B (en) Server fault positioning method, device and system and computer readable storage medium
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
AU2014399227B2 (en) Fault Processing Method, Related Apparatus and Computer
KR20150033711A (en) Run-time error repairing method, device and system
CN112506702B (en) Disaster recovery method, device, equipment and storage medium for data center
CN108845912B (en) Service interface calls the alarm method of failure and calculates equipment
CN110427303A (en) A kind of fault alarming method and device
CN110609778A (en) Method and system for storing server downtime log
CN105404581A (en) Database evaluation method and device
CN114356499A (en) Kubernetes cluster alarm root cause analysis method and device
CN112529223A (en) Equipment fault repair method and device, server and storage medium
EP2860633A1 (en) Method for maintaining file system of computer system
CN110011854A (en) MDS fault handling method, device, storage system and computer readable storage medium
CN114020509A (en) Method, device and equipment for repairing work load cluster and readable storage medium
CN111124724B (en) Node fault testing method and device of distributed block storage system
CN114116282B (en) Method and device for reporting and repairing network additional storage faults
CN115314361B (en) Server cluster management method and related components thereof
CN115794486A (en) Robot information acquisition method, system, device and readable medium
CN114461341A (en) Method, device and medium for preventing brain crack of cloud platform virtual machine
CN112068935A (en) Method, device and equipment for monitoring deployment of kubernets program
CN111444032A (en) Computer system fault repairing method, system and equipment
CN111400094A (en) Method, device, equipment and medium for restoring factory settings of server system
CN116484373B (en) Abnormal process checking and killing method, system, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant