CN114816022A

CN114816022A - Server power supply abnormity monitoring method, system and storage medium

Info

Publication number: CN114816022A
Application number: CN202210463541.2A
Authority: CN
Inventors: 于淏宁
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-04-28
Filing date: 2022-04-28
Publication date: 2022-07-29
Anticipated expiration: 2042-04-28
Also published as: CN114816022B

Abstract

The invention discloses a server power supply abnormality monitoring method, system and storage medium, and relates to the technical field of computers. The method comprises the following steps: starting a server, and starting power-on according to a power-on sequence; in the starting process of the server, a CPLD continuously monitors the power-on state of the server; when the power-on state is abnormal, recording power-on abnormal information and detecting the starting state of the BMC; and determining the storage position of the power-on abnormal information according to the starting state of the BMC. The invention can record the power supply state at the moment an abnormality occurs, so that the cause of the problem can subsequently be analyzed and located quickly.

Description

Server power supply abnormity monitoring method, system and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a server power supply abnormity monitoring method, a server power supply abnormity monitoring system and a storage medium.

Background

In the big data era, data centers carry massive amounts of operational data, servers are deployed ever more densely, and the requirements on server stability and reliability keep rising. Because servers must run continuously around the clock, the factors that can cause a server to fail accumulate as service time increases. When a server deployed in a data center suffers an abnormal power failure, the power state at that moment needs to be recorded for subsequent analysis by an engineer, so that the cause of the failure can be located quickly. Developing a fast, accurate and stable fault recording mechanism is therefore an urgent technical problem for those skilled in the art.

As shown in fig. 1, the fault detection and recording mechanism adopted in the prior art is implemented by a CPLD (complex programmable logic device) working together with a BMC (baseboard management controller). The key power supply signals of the server are connected to the CPLD through hardware, and when the server is started the CPLD pulls these signals up or down in turn according to a pre-designed power-on time sequence, thereby completing the power-on process of the server. While the server starts and operates, the CPLD continuously monitors the state of the connected power signals, including signals such as EN, PWRGD and Alert of devices such as the CPU (central processing unit), the PSU (power supply unit), the DIMM (dual in-line memory module) and the intelligent network card (OCP), and reports the real-time status to the BMC via the I2C bus. When an abnormality occurs, a technician can locate the fault cause from the log recorded by the BMC. However, this monitoring process has the following problems:

(1) if an abnormality occurs during power-on startup, such as a power-on timeout or an abnormal power failure of a key power supply signal, the CPLD can detect it, but the BMC may not yet have started successfully or may still be starting, so it cannot receive the abnormal state reported by the CPLD and the cause of the abnormality is not recorded. If the fault occurs only intermittently, locating its cause afterwards is very difficult;

(2) while the server is running, an abnormality in certain power signals can cause the BMC to hang, so the abnormality cannot be recorded;

(3) some customers require the server to be powered off immediately after an abnormal power failure to protect the equipment, so the server is powered off before the BMC has acquired the power signal states from the CPLD, and the BMC cannot record the abnormality.

Disclosure of Invention

In order to solve at least one of the problems mentioned in the background art, the present invention provides a server power supply abnormality monitoring method, system and storage medium, in which a CPLD records the power supply state when an abnormality occurs, so that the cause of the problem can subsequently be analyzed and located quickly.

The embodiment of the invention provides the following specific technical scheme:

in a first aspect, a server power supply abnormality monitoring method includes:

starting a server, and starting power-on according to a power-on sequence;

in the starting process of the server, the CPLD continuously monitors the power-on state of the server;

when the power-on state is abnormal, recording power-on abnormal information and detecting the starting state of the BMC;

and determining the storage position of the power-on abnormal information according to the starting state of the BMC.

Further, the determining the storage position of the power-on abnormal information according to the starting state of the BMC includes:

if the BMC is started, storing the power-on abnormal information in a first storage module so that the BMC can read and locate a fault reason;

and if the BMC is not started, storing the power-on abnormal information in a first flash memory module so that the BMC can read and locate the fault reason after starting.

Further, the method also comprises the following steps:

and if the power-on state is not abnormal, storing the power-on state data in a second storage module, reading the power-on state data from the second storage module after the BMC is started, and checking whether the abnormality exists.

Further, the method further comprises:

after the server is started, the CPLD continuously monitors abnormal power failure information of the server;

and determining the storage position of the abnormal power failure information according to the influence of the abnormal power failure information on the BMC.

Further, the determining the storage location of the abnormal power failure information according to the influence of the abnormal power failure information on the BMC includes:

if the abnormal power failure information does not cause the BMC to be hung up, recording the abnormal power failure information in a third storage module for the BMC to read and locate a fault reason;

and if the abnormal power failure information can cause the BMC to be hung, the abnormal power failure information is simultaneously stored in the third storage module and the second flash memory module so that the BMC can read and locate the fault reason after starting.

Further, if the abnormal power failure information does not cause the BMC to be hung up, recording the abnormal power failure information in a third storage module, further comprising:

reading the abnormal power failure information and sending a clearing instruction through the BMC;

and the CPLD can clear the abnormal power failure information in the third storage module and close the server according to the clearing instruction.

Further, the method also comprises the following steps:

and if the abnormal power failure information does not appear, storing the data after the server is started in a second storage module, reading the data after the server is started from the second storage module after the BMC is started, and checking whether the abnormality exists.

In a second aspect, a server power supply abnormality monitoring system is provided, the system including:

the control module is used for starting the server and starting power-on according to a power-on sequence;

the power-on monitoring module is used for continuously monitoring the power-on state of the server by the CPLD in the starting process of the server, recording power-on abnormal information and detecting the starting state of the BMC when the power-on state is abnormal, and determining the storage position of the power-on abnormal information according to the starting state of the BMC;

and the operation monitoring module is used for continuously monitoring abnormal power failure information of the server by the CPLD after the server is started, and determining the storage position of the abnormal power failure information according to the influence of the abnormal power failure information on the BMC.

In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer program:

starting a server, and starting power-on according to a power-on sequence;

In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

starting a server, and starting power-on according to a power-on sequence;

The embodiment of the invention has the following beneficial effects:

1. according to the invention, the power supply of the server is monitored by the CPLD. When a power-on abnormality occurs during power-on, or an abnormal power failure occurs after the server has started, the abnormal server state data are recorded in the CPLD; after the BMC restarts, it reads the abnormal data from the CPLD and locates the fault cause. This prevents abnormal data from going unrecorded because the BMC has not started or is still starting, reduces the time needed to reproduce the problem, and improves the efficiency of fault cause location and fault handling;

2. when the BMC has not started during power-on, or the BMC hangs because of an abnormal power failure while the server is running, the CPLD records the abnormal state data in the first flash memory module and the second flash memory module. Because these flash memory modules (for example, UFM) are non-volatile, the BMC can still read the abnormal data from them after the server is powered off and restarted.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram for embodying fault detection in the background art;

FIG. 2 is a schematic diagram of the overall architecture for embodying the monitoring method of the present application;

FIG. 3 is a detailed flow chart for embodying the monitoring method in the present application;

FIG. 4 is an internal structural diagram of a computer apparatus for embodying the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the big data era, data centers carry massive amounts of operational data, servers are deployed ever more densely, and the requirements on server stability and reliability keep rising. Because servers must run continuously around the clock, the factors that can cause a server to fail accumulate as service time increases. When a server deployed in a data center suffers an abnormal power failure, the power state at that moment needs to be recorded for subsequent analysis by an engineer, so that the cause of the failure can be located quickly. Developing a fast, accurate and stable fault recording mechanism is therefore an urgent technical problem for those skilled in the art. The existing monitoring process has the following problems: if an abnormality occurs during power-on startup, such as a power-on timeout or an abnormal power failure of a key power supply signal, the CPLD can detect it, but the BMC may not yet have started successfully or may still be starting, so it cannot receive the abnormal state reported by the CPLD and the cause of the abnormality is not recorded; if the fault occurs only intermittently, locating its cause afterwards is very difficult. While the server is running, an abnormality in certain power signals can cause the BMC to hang, so the abnormality cannot be recorded. Some customers require the server to be powered off immediately after an abnormal power failure to protect the equipment, so the server is powered off before the BMC has acquired the power signal states from the CPLD, and the BMC cannot record the abnormality.

Based on the above problems, the present application provides a server power supply abnormality monitoring method, system and storage medium, which can record the power supply state when an abnormality occurs, so that the cause of the problem can subsequently be analyzed and located quickly.

Example one

A server power supply abnormality monitoring method, as shown in fig. 2 and 3, includes the following steps:

Step S1: starting a server, and starting power-on according to a power-on sequence; in the starting process of the server, the CPLD continuously monitors the power-on state of the server.

The key power supply signals of the server are connected to the CPLD through hardware. When the server is started, the CPLD pulls these signals up or down in turn according to a pre-designed power-on time sequence to complete the power-on process. Specifically, the EN signal and the PG signal of the CPU in the server are connected to the CPLD; the PWROK signal and the PG signal of the PSU are connected to the CPLD; the PWRGD signal of the DIMM is connected to the CPLD; and the EN signal and the PWRGD signal of the OCP are connected to the CPLD. The CPLD continuously monitors the state of the connected power supply signals and records the related data.
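As an illustration only, the sequencing and monitoring described for step S1 could be modeled in software roughly as follows. This is a minimal sketch under stated assumptions, not the CPLD logic of the application: the `Rail` and `PowerOnFault` types, the per-rail timeout values and the `run_power_on_sequence` function are hypothetical names introduced here, and only the timeout case is modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rail:
    name: str        # device name from the description, e.g. "PSU" or "CPU"
    timeout_us: int  # hypothetical limit for power-good to assert after enable


@dataclass
class PowerOnFault:
    rail: str
    kind: str        # only "timeout" is modeled in this sketch
    elapsed_us: int


def run_power_on_sequence(
    rails: List[Rail],
    power_good_delay_us: Dict[str, Optional[int]],
) -> Optional[PowerOnFault]:
    """Enable rails in order and return the first fault, or None on success.

    power_good_delay_us maps a rail name to the simulated time until its
    power-good signal asserts; None (or a missing key) means it never asserts.
    """
    for rail in rails:
        delay = power_good_delay_us.get(rail.name)
        if delay is None or delay > rail.timeout_us:
            # Sequencing stops at the first rail that fails to come up in time.
            return PowerOnFault(rail.name, "timeout", rail.timeout_us)
    return None
```

Calling `run_power_on_sequence` with a rail whose delay is `None` returns a `PowerOnFault` naming that rail, which stands in for the power-on abnormal information handled in the next step.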

Step S2: when the power-on state is abnormal, recording power-on abnormal information and detecting the starting state of the BMC; and determining the storage position of the power-on abnormal information according to the starting state of the BMC.

When the power-on state is abnormal, specifically when the power-on process times out or an abnormal power failure occurs, the CPLD judges whether the BMC has started according to the heartbeat signal of the BMC. If the BMC has started, the power-on abnormal information is recorded in the first storage module, the BMC reads the power-on abnormal information from the first storage module through the I2C bus, and after the reading is finished the CPLD powers off and shuts down the server, thereby achieving power-off protection of the server. The first storage module includes but is not limited to a FIFO module, and is arranged in the CPLD.
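The text states only that the CPLD judges the BMC state from its heartbeat signal; it does not define the heartbeat format. The sketch below assumes, purely for illustration, that the heartbeat is a pin that toggles while the BMC is running and that the BMC counts as started if a toggle was seen within a recent window. `HeartbeatMonitor` and `window_ticks` are hypothetical names.

```python
class HeartbeatMonitor:
    """Tracks a toggling heartbeat pin, sampled once per clock tick."""

    def __init__(self, window_ticks: int) -> None:
        self.window_ticks = window_ticks  # assumed maximum gap between toggles
        self._last_level = None           # last sampled pin level
        self._ticks_since_toggle = None   # None until the first toggle is seen

    def sample(self, level: int) -> None:
        """Record one sample of the heartbeat pin (0 or 1)."""
        if self._last_level is not None and level != self._last_level:
            self._ticks_since_toggle = 0
        elif self._ticks_since_toggle is not None:
            self._ticks_since_toggle += 1
        self._last_level = level

    @property
    def bmc_alive(self) -> bool:
        """True if a toggle was observed within the last window_ticks samples."""
        return (
            self._ticks_since_toggle is not None
            and self._ticks_since_toggle <= self.window_ticks
        )
```

A pin that never toggles (BMC not yet booted) and a pin that stopped toggling (BMC hung) both yield `bmc_alive == False`, so the same check could serve during power-on and during operation.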

At this time, from the data stored in the first storage module, the CPLD can judge the position of the abnormality so as to determine the cause of the fault. Specifically, the first preset time period is any time period within 1-200 microseconds.

If the BMC has not started, the power-on abnormal information is recorded in the first flash memory module. The first flash memory module includes but is not limited to a UFM module and is arranged in the CPLD. The server is then powered off and restarted; once the BMC has started after the restart, the BMC reads the power-on abnormal information from the UFM and analyzes and locates the fault cause.

If the power-on state is not abnormal, the power-on state data is stored in the second storage module. After the BMC has started, it reads the server state data from the second storage module through the I2C bus and checks whether any abnormality occurred during power-on, so as to judge the state of the server power supply. The second storage module includes but is not limited to a FIFO module, and the FIFO is arranged in the CPLD.
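The three power-on cases of step S2 (abnormal with the BMC started, abnormal with the BMC not started, and no abnormality) amount to a small routing decision. A minimal sketch follows, assuming the FIFOs can be modeled as bounded deques and the UFM as a list; `CpldStorage`, `store_power_on_record` and the FIFO depth are hypothetical.

```python
from collections import deque


class CpldStorage:
    """Software stand-in for the storage modules inside the CPLD."""

    def __init__(self, fifo_depth: int = 16) -> None:
        self.first_fifo = deque(maxlen=fifo_depth)   # first storage module
        self.second_fifo = deque(maxlen=fifo_depth)  # second storage module
        self.first_ufm = []                          # first flash memory module


def store_power_on_record(
    storage: CpldStorage, record: dict, abnormal: bool, bmc_started: bool
) -> str:
    """Store one power-on record and return the name of the chosen module."""
    if not abnormal:
        # Normal power-on: keep state data for the BMC to check later.
        storage.second_fifo.append(record)
        return "second_fifo"
    if bmc_started:
        # BMC is up: it can read the record over I2C right away.
        storage.first_fifo.append(record)
        return "first_fifo"
    # BMC not up: the record must survive the power-off and restart.
    storage.first_ufm.append(record)
    return "first_ufm"
```

The only non-volatile path is the one taken when the BMC is not available, which matches the stated purpose of the flash memory module.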

After the power-on process completes successfully, the server starts to operate, and the following step handles abnormalities that occur during operation.

Step S3: after the server is started, the CPLD continuously monitors abnormal power failure information of the server; and determining the storage position of the abnormal power failure information according to the influence of the abnormal power failure information on the BMC.

If the running state of the server is abnormal, specifically when an abnormal power failure occurs, the CPLD receives abnormal power failure information, which includes an abnormal signal, and judges whether the abnormal power failure information can cause the BMC to hang. If the abnormal signal cannot cause the BMC to hang, the abnormal power failure information is recorded in a third storage module. The third storage module includes but is not limited to a FIFO module, and the data recorded in the FIFO module at this time at least includes: the power supply state data before the abnormal operation occurs, the power supply state data at the moment of the abnormal operation, and the power supply state data within a second preset time period after the abnormal operation occurs.
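Capturing state from before, at and after the abnormality implies that recent samples are kept in a rolling buffer and frozen when the abnormality is detected. The following is one possible sketch, not the recording logic of the application: `SnapshotRecorder`, `pre_samples` and `post_samples` are hypothetical, and the second preset time period is expressed as a sample count rather than a duration.

```python
from collections import deque
from typing import Optional


class SnapshotRecorder:
    """Keeps a rolling history and freezes it around the first abnormality."""

    def __init__(self, pre_samples: int, post_samples: int) -> None:
        self._history = deque(maxlen=pre_samples)  # rolling pre-fault window
        self._post_samples = post_samples          # stand-in for the preset period
        self.capture: Optional[dict] = None        # filled once a fault is seen

    def sample(self, state: int, abnormal: bool = False) -> None:
        """Feed one sampled power-state word, e.g. a bitmask of PWRGD pins."""
        if self.capture is None:
            if abnormal:
                self.capture = {
                    "before": list(self._history),
                    "at": state,
                    "after": [],
                }
            else:
                self._history.append(state)
        elif len(self.capture["after"]) < self._post_samples:
            # Keep recording until the post-fault window is full.
            self.capture["after"].append(state)
```

Once `capture` is populated, its three fields correspond to the three kinds of power supply state data listed above.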

The BMC then reads the server state data of the abnormal operation through the I2C bus and sends a clearing instruction. On receiving the clearing instruction, the CPLD clears the server state data of the abnormal operation from the third storage module and shuts down the server, thereby achieving power-off protection of the server.
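The ordering in this handshake is what keeps the record from being lost: the server is shut down only after the BMC has read the data and issued the clearing instruction. A minimal sketch of that ordering, assuming the I2C transactions can be modeled as method calls; `CpldRuntime`, `i2c_read`, `i2c_clear` and `bmc_handle_fault` are hypothetical names, not a real I2C API.

```python
from typing import List


class CpldRuntime:
    """Stand-in for the CPLD side of the read-then-clear handshake."""

    def __init__(self) -> None:
        self.third_fifo: List[dict] = []  # third storage module
        self.server_on = True

    def record(self, info: dict) -> None:
        self.third_fifo.append(info)

    def i2c_read(self) -> List[dict]:
        """Models the BMC reading the FIFO contents over I2C."""
        return list(self.third_fifo)

    def i2c_clear(self) -> None:
        """Models the clearing instruction: clear the FIFO, then power off."""
        self.third_fifo.clear()
        self.server_on = False


def bmc_handle_fault(cpld: CpldRuntime, bmc_log: List[dict]) -> int:
    """BMC side: persist the records first, then allow the shutdown."""
    records = cpld.i2c_read()
    bmc_log.extend(records)
    cpld.i2c_clear()
    return len(records)
```

Because `i2c_clear` is the only call that powers the server off, the shutdown cannot happen before the BMC log holds the records.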

If the abnormal signal can cause the BMC to hang, the abnormal power failure information is recorded in the third storage module and the second flash memory module at the same time, so that the fault cause can be read and located after the BMC has started. Specifically, the CPLD records the abnormal power failure information in the UFM and powers the server off and restarts it; once the BMC has started after the restart, the BMC reads the abnormal power failure information from the UFM and analyzes and locates the fault cause.
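The text does not say how the CPLD decides that an abnormal signal can hang the BMC. The sketch below assumes a fixed set of BMC-critical signals and models the power-off restart as clearing the FIFO while the UFM is retained; both are assumptions made for illustration, and `BMC_CRITICAL_SIGNALS`, the signal names and `CpldRuntimeStore` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical: signals whose loss is assumed to hang the BMC.
BMC_CRITICAL_SIGNALS = frozenset({"BMC_STANDBY_PG"})

Record = Tuple[str, int]


class CpldRuntimeStore:
    """Stand-in for the third storage module and second flash memory module."""

    def __init__(self) -> None:
        self.third_fifo: List[Record] = []  # assumed lost on power cycle
        self.second_ufm: List[Record] = []  # non-volatile, survives restart

    def on_abnormal_power_failure(self, signal: str, level: int) -> bool:
        """Store the record; return True if it was also written to the UFM."""
        record = (signal, level)
        self.third_fifo.append(record)
        if signal in BMC_CRITICAL_SIGNALS:
            # BMC may hang and miss the FIFO, so keep a non-volatile copy.
            self.second_ufm.append(record)
            return True
        return False

    def power_cycle(self) -> None:
        """Models the power-off restart: volatile contents are discarded."""
        self.third_fifo = []


def bmc_read_after_restart(store: CpldRuntimeStore) -> List[Record]:
    """After reboot the BMC can only recover what was written to the UFM."""
    return list(store.second_ufm)
```

Writing to both modules in the hang case means the record is available whether or not the BMC recovers before the restart.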

If no abnormal power failure information appears while the server is running, the data collected after server startup is stored in the second storage module; after the BMC has started, it reads this data from the second storage module and checks whether any abnormality exists.

With this arrangement, the first storage module, the second storage module, the third storage module, the first flash memory module and the second flash memory module in the CPLD temporarily store the abnormal states that arise during power-on and operation, so that technicians can use this information to locate the fault cause quickly after the BMC recovers. This shortens the time needed to reproduce the problem and improves the efficiency of fault cause location.

Example two

Corresponding to the foregoing embodiment, the present application provides a server power supply abnormality monitoring system, including:

the control module is used for starting the server and starting power-on according to a power-on time sequence;

the operation monitoring module is used for continuously monitoring abnormal power failure information of the server by the CPLD after the server is started, and determining the storage position of the abnormal power failure information according to the influence of the abnormal power failure information on the BMC;

the first checking module is used for storing the power-on state data in the second storage module when the power-on state is not abnormal, reading the power-on state data from the second storage module after the BMC is started and checking whether the abnormality exists;

and the second check module is used for storing the data after the server is started in the second storage module when the abnormal power failure information does not appear, reading the data after the server is started from the second storage module after the BMC is started, and checking whether the abnormality exists.

In a preferred embodiment, the power-on monitoring module is further configured so that, when the power-on state is abnormal, the CPLD judges whether the BMC has started according to the heartbeat signal of the BMC. If the BMC has started, the power-on abnormal information is recorded in the first storage module, the BMC reads the power-on abnormal information from the first storage module through the I2C bus, and after the reading is completed the CPLD powers off and shuts down the server, thereby achieving power-off protection of the server. If the BMC has not started, the power-on abnormal information is recorded in the first flash memory module. The first flash memory module includes but is not limited to a UFM module and is arranged in the CPLD. The server is then powered off and restarted; once the BMC has started after the restart, the BMC reads the power-on abnormal information from the UFM and analyzes and locates the fault cause.

In a preferred embodiment, if the power-on state is not abnormal, the power-on state data is stored in the second storage module. After the BMC has started, it reads the server state data from the second storage module through the I2C bus and checks whether any abnormality occurred during power-on, so as to determine the state of the server power supply. The second storage module includes but is not limited to a FIFO module, and the FIFO is arranged in the CPLD.

In a preferred embodiment, from the data stored in the first storage module, the CPLD can judge the position of the abnormality so as to determine the cause of the fault.

In a preferred embodiment, the operation monitoring module is further configured to judge, when the running state is abnormal, whether the abnormal signal can cause the BMC to hang. If the abnormal signal cannot cause the BMC to hang, the abnormal power failure information is recorded in a third storage module; the BMC reads the abnormal power failure information through the I2C bus and sends a clearing instruction, and on receiving the clearing instruction the CPLD clears the abnormal server state data from the third storage module and shuts down the server, thereby achieving power-off protection of the server. If the abnormal signal can cause the BMC to hang, the server state data of the abnormal operation is recorded in the third storage module and the second flash memory module at the same time, so that the fault cause can be read and located after the BMC has started. The CPLD records the abnormal power failure information in the UFM and powers the server off and restarts it; once the BMC has started after the restart, the BMC reads the abnormal power failure information from the UFM and analyzes and locates the fault cause.

In a preferred embodiment, the data recorded in the third storage module at this time at least includes: the power supply state data before the abnormal operation occurs, the power supply state data at the moment of the abnormal operation, and the power supply state data in a second preset time period after the abnormal operation occurs.

Example three

There is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

step 101: starting a server, and starting power-on according to a power-on sequence;

Step 102: in the starting process of the server, the CPLD continuously monitors the power-on state of the server;

step 103: when the power-on state is abnormal, recording power-on abnormal information and detecting the starting state of the BMC;

determining the storage position of the power-on abnormal information according to the starting state of the BMC;

step 104: after the server is started, the CPLD continuously monitors abnormal power failure information of the server;

In a preferred embodiment, step 103 further includes judging whether the BMC has started when the power-on state is abnormal. If the BMC has started, the power-on abnormal information is recorded in the first storage module, the BMC reads the power-on abnormal information from the first storage module through the I2C bus to locate the fault cause, and after the reading is completed the CPLD powers off and shuts down the server, thereby achieving power-off protection of the server.

If the BMC has not started, the power-on abnormal information is recorded in the first flash memory module and the server is powered off and restarted; once the BMC has started after the restart, the BMC reads the power-on abnormal information from the UFM and analyzes and locates the fault cause.

In a preferred embodiment, step 104 further includes, after the server has started and when the running state is abnormal, judging whether the abnormal power failure information can cause the BMC to hang. If the abnormal power failure information cannot cause the BMC to hang, it is recorded in a third storage module; the BMC reads the abnormal power failure information through the I2C bus and sends a clearing instruction, and on receiving the clearing instruction the CPLD clears the abnormal server state data from the third storage module and shuts down the server, thereby achieving power-off protection of the server.

If the abnormal power failure information can cause the BMC to hang, the server state data of the abnormal operation is recorded in the third storage module and the second flash memory module at the same time. The CPLD records the abnormal power failure information in the UFM and powers the server off and restarts it; once the BMC has started after the restart, the BMC reads the server state data of the abnormal operation from the UFM and analyzes and locates the fault cause.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing abnormal data from the power-on process and the operation process.

The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a server power anomaly monitoring method.

Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

Example four

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the following steps:

step 201: starting a server, and starting power-on according to a power-on sequence;

step 202: in the starting process of the server, the CPLD continuously monitors the power-on state of the server;

step 203: when the power-on state is abnormal, recording power-on abnormal information and detecting the starting state of the BMC;

step 204: after the server is started, the CPLD continuously monitors abnormal power failure information of the server;

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A server power supply abnormity monitoring method is characterized by comprising the following steps:

starting a server, and starting power-on according to a power-on sequence;

2. The method for monitoring the power supply abnormality of the server according to claim 1, wherein the determining the storage location of the abnormality information according to the boot state of the BMC includes:

3. The server power supply abnormality monitoring method according to claim 1 or 2, characterized by further comprising:

4. The server power supply abnormality monitoring method according to claim 3, characterized by further comprising:

5. The server power supply abnormality monitoring method according to claim 4, wherein the determining the storage location of the abnormal power failure information according to the influence of the abnormal power failure information on the BMC includes:

and if the abnormal power failure information can cause the BMC to be hung, storing the abnormal power failure information in the third storage module and the second flash memory module simultaneously so as to read and locate the fault reason after the BMC is started.

6. The method for monitoring the abnormal power supply of the server according to claim 5, wherein if the abnormal power failure information does not cause the BMC to hang up, the method further includes, after recording the abnormal power failure information in a third storage module:

7. The server power supply abnormality monitoring method according to claim 6, characterized by further comprising:

8. A server power anomaly monitoring system, the system comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 7 are implemented when the computer program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.